[PDB-4020] Puppetserver should handle 503s from PuppetDB Created: 2018/06/08 Updated: 2018/10/24 Resolved: 2018/09/19
|Affects Version/s:||PDB 5.2.4|
|Fix Version/s:||PDB 5.2.6, PDB 6.0.0|
|Remaining Estimate:||Not Specified|
|Time Spent:||Not Specified|
|Original Estimate:||Not Specified|
|Method Found:||Customer Feedback|
|CS Frequency:||2 - 5-25% of Customers|
|CS Severity:||3 - Serious|
|CS Business Value:||3 - $$$$|
|CS Impact:||Customers using HA will expect that all data is preserved and that the catalog compilation is also function.|
|Release Notes:||Bug Fix|
|Release Notes Summary:|| Prior to this fix, the http submission with command_broadcast enabled
always returned the last response. As a result, a failure would be shown if
the last connection produced a 503 response even though there was
previously a successful PuppetDB response and the minimum successful
responses have been met. This issue does not occur with responses that
raised an exception. Since the puppet http_pool does not raise 503
as an exception, this issue can be seen when the PuppetDB is in
This fix changes the behavior to send the last successful response
when the minimum successful submissions have been met.
As a Puppet user I expect the puppetserver to gracefully handle a transient PuppetDB maintenance mode state without reporting agent failures when there are multiple PuppetDB instances configured.
When one of the PuppetDB instances is in maintenance mode, it returns a 503 to the Puppetserver which should be handled when command_broadcast = true and min_successful_submissions = 1. However, puppetserver sees this as a failure and sends a 500 error to the agent. Since the other PuppetDB instance is available, and min_successful_submissions = 1, the 503 should be ignored and the command to the other PuppetDB should be successful. The actual result is that Puppetserver sends a 500 back to the agent and the agent runs are all failures for the duration that any of the PuppetDB nodes are in maintenance mode.
Steps to reproduce:
From the puppetserver.log
From the agent, which gets a 500 instead of a 503.
I suspect this is coming from an unhandled or unsent exception in https://github.com/puppetlabs/puppetdb/blob/master/puppet/lib/puppet/util/puppetdb/http.rb#L162-L192
|Comment by Justin Stoller [ 2018/06/08 ]|
If we were to implement the retry in the server, I believe would be actually be in the PDB terminus helpers here (this team may very well still do that work).
The work to relay to the agent would probably be the Puppet code base proper & the terminus.
I'm super curious about the pros and cons of relaying the 503 to the agent vs trying within the server.
|Comment by Maggie Dreyer [ 2018/06/08 ]|
With nothing to back this up except my musings:
|Comment by Jarret Lavallee [ 2018/06/08 ]|
Another item to consider about relaying to the agent is that different API queries to PuppetDB can send the 503, depending on the timing. The in the example above is a replace catalog, but I have seen it with replacing facts and store report. It may not be appropriate to relay a 503 from a store report to the agent, whereas it may for a replace facts.
|Comment by Charlie Sharpsteen [ 2018/06/08 ]|
For immediate workarounds, the PuppetDB terminus has a soft_write_failure option that allows Puppet Server to ignore these errors if enabled:
Re-trying the report submission is a good idea as we should avoid loosing data in transit. However, JRubies pausing and retrying will start to bottleneck other agent requests. Triggering an agent-side retry would free the JRuby worker up to handle non-PuppetDB requests such as classification or file metadata.
Additionally, PuppetDB should probably keep the /command API available during database maintenance. If Postgres is down or undergoing a migration, that should not prevent PuppetDB from accepting and enqueuing work. The only time the /command API needs to be unavailable is if the queue its self is undergoing a migration (i.e. ActiveMQ -> Stockpile).
|Comment by Sean Millichamp [ 2018/07/19 ]|
soft_write_failure isn't really an acceptable option in our environment - I need to know that it was written to at least one of the PuppetDBs. On the other hand, I wouldn't expect it to fail if it can write successfully to at least one of the PuppetDBs. The expectation would be that as long as one is online and fully in service it could and would receive the command and the data would get synchronized to the other PuppetDBs via PuppetDB synchronization as soon as they were able (within the sync interval).
Personally, I consider this a Puppetserver and/or PuppetDB terminus issue.
|Comment by Amy Sahli [ 2018/08/16 ]|
Will take a look at this after Puppet Server 6.
|Comment by Jarret Lavallee [ 2018/08/22 ]|
The core of this problem seems to be that when one of the PuppetDB instances returns a 503, the puppetserver throws an error back to the agent. In an HA setup, command_broadcast=true is set, so both PuppetDB instances should have received the query. Puppetserver returns the 503 to the agent instead even though min_successful_submissions = 1 and the primary PuppetDB instance would be able to handle the query.
A workaround to this is to set command_broadcast=false, so that the puppetserver does not send the command to all of the puppetdb instances in the server_urls, however, this would put more stress on transferring the items during the sync.
|Comment by Charlie Sharpsteen [ 2018/08/23 ]|
This could actually be a PDB bug as I suspect the guilty code lives under https://github.com/puppetlabs/puppetdb/tree/master/puppet .
|Comment by Jarret Lavallee [ 2018/08/24 ]|
Charlie Sharpsteen I think you are right about where this is occurring. I have moved this to be a PDB ticket.
|Comment by Jarret Lavallee [ 2018/08/30 ]|
Opened https://github.com/puppetlabs/puppetdb/pull/2571 for this.