[PDB-4020] Puppetserver should handle 503s from PuppetDB Created: 2018/06/08  Updated: 2018/10/24  Resolved: 2018/09/19

Status: Closed
Project: PuppetDB
Component/s: PuppetDB
Affects Version/s: PDB 5.2.4
Fix Version/s: PDB 5.2.6, PDB 6.0.0

Type: Bug Priority: Normal
Reporter: Jarret Lavallee Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: maintenance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
Relates
Template:
Team: PuppetDB
Method Found: Customer Feedback
CS Priority: Major
CS Frequency: 2 - 5-25% of Customers
CS Severity: 3 - Serious
CS Business Value: 3 - $$$$
CS Impact: Customers using HA will expect that all data is preserved and that catalog compilation continues to function.
Release Notes: Bug Fix
Release Notes Summary: Prior to this fix, HTTP command submission with command_broadcast enabled
always returned the last response. As a result, a failure was reported if
the last connection produced a 503 response, even though an earlier
PuppetDB response had succeeded and the minimum number of successful
submissions had been met. This issue does not occur with responses that
raise an exception. Since the puppet http_pool does not raise a 503
as an exception, the problem shows up when a PuppetDB instance is in
maintenance mode.

This fix changes the behavior to return the last successful response
once the minimum number of successful submissions has been met.

 Description   

As a Puppet user I expect the puppetserver to gracefully handle a transient PuppetDB maintenance mode state without reporting agent failures when there are multiple PuppetDB instances configured. 

When one of the PuppetDB instances is in maintenance mode, it returns a 503 to the Puppetserver, which should be tolerated when command_broadcast = true and min_successful_submissions = 1. However, Puppetserver treats this as a failure and sends a 500 error to the agent. Since the other PuppetDB instance is available and min_successful_submissions = 1, the 503 should be ignored and the command submitted to the other PuppetDB should succeed. The actual result is that Puppetserver sends a 500 back to the agent, and agent runs fail for as long as any of the PuppetDB nodes is in maintenance mode.

Steps to reproduce:

  1. Configure PuppetDB replication
  2. Configure command_broadcast = true in the puppetdb.conf on the master
  3. Run a puppet agent while one of the PuppetDBs is in maintenance mode and the other one is available.
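
For reference, the terminus configuration from step 2 would look roughly like this in puppetdb.conf on the master (the hostnames, and spelling out the min_successful_submissions default of 1, are illustrative):

[main]
server_urls = https://puppetdb1.example.com:8081,https://puppetdb2.example.com:8081
command_broadcast = true
min_successful_submissions = 1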

Logs:

 

From the puppetserver.log

 

2018-06-08T10:35:23.831-07:00 WARN [qtp2042713953-1209] [puppetserver] Puppet Error connecting to pe-201810-agent-replica.puppetdebug.vlan on 8081 at route /pdb/cmd/v1?checksum=c36ef6428032c548e53a17e5c22e5b4447748574&version=9&certname=pe-201810-agent.puppetdebug.vlan&command=replace_catalog&producer-timestamp=1528479323, error message received was ''. Failing over to the next PuppetDB server_url in the 'server_urls' list
2018-06-08T10:35:23.835-07:00 ERROR [qtp2042713953-1209] [puppetserver] Puppet [503 ] PuppetDB is currently down. Try again later.
2018-06-08T10:35:23.835-07:00 ERROR [qtp2042713953-1209] [puppetserver] Puppet /opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/util/puppetdb/command.rb:82:in `submit'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/util/puppetdb.rb:62:in `block in submit_command'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/util/profiler/around_profiler.rb:58:in `profile'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/util/profiler.rb:51:in `profile'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/util/puppetdb.rb:99:in `profile'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/util/puppetdb.rb:59:in `submit_command'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/indirector/catalog/puppetdb.rb:14:in `block in save'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/util/profiler/around_profiler.rb:58:in `profile'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/util/profiler.rb:51:in `profile'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/util/puppetdb.rb:99:in `profile'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/indirector/catalog/puppetdb.rb:11:in `save'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/indirector/store_configs.rb:24:in `save'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/indirector/indirection.rb:204:in `find'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/network/http/api/indirected_routes.rb:121:in `do_find'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/network/http/api/indirected_routes.rb:48:in `block in call'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/context.rb:65:in `override'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet.rb:260:in `override'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/network/http/api/indirected_routes.rb:47:in `call'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/network/http/route.rb:82:in `block in process'
org/jruby/RubyArray.java:1735:in `each'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/network/http/route.rb:81:in `process'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/network/http/route.rb:87:in `process'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/network/http/route.rb:87:in `process'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/network/http/handler.rb:64:in `block in process'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/util/profiler/around_profiler.rb:58:in `profile'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/util/profiler.rb:51:in `profile'
/opt/puppetlabs/puppet/lib/ruby/vendor_ruby/puppet/network/http/handler.rb:62:in `process'
uri:classloader:/puppetserver-lib/puppet/server/master.rb:42:in `handleRequest'

 

From the agent, which gets a 500 instead of a 503.

 

 -> "HTTP/1.1 500 Server Error\r\n"
-> "Date: Fri, 08 Jun 2018 17:35:33 GMT\r\n"
-> "Content-Type: application/json;charset=utf-8\r\n"
-> "X-Puppet-Version: 5.5.1\r\n"
-> "Content-Length: 108\r\n"
-> "Server: Jetty(9.4.z-SNAPSHOT)\r\n"
-> "\r\n"
reading 108 bytes...
-> "{\"message\":\"Server Error: [503 ] PuppetDB is currently down. Try again later.\",\"issue_kind\":\"RUNTIME_ERROR\"}"
read 108 bytes
Conn keep-alive
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: [503 ] PuppetDB is currently down. Try again later.
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

 

I suspect this is coming from an unhandled or unsent exception in https://github.com/puppetlabs/puppetdb/blob/master/puppet/lib/puppet/util/puppetdb/http.rb#L162-L192
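
Without having traced it fully, here is a minimal Ruby sketch of the suspected shape of that code; the method name, the bare Net::HTTP.post call, and the variable names are illustrative, not the real terminus helpers:

require 'net/http'
require 'uri'

# Illustrative only -- a broadcast loop shaped like this always hands back
# whatever the *last* server returned, which is the failure mode seen above.
def broadcast_command(server_urls, min_successful_submissions)
  successes = 0
  last_response = nil

  server_urls.each do |url|
    uri = URI(url)
    last_response = Net::HTTP.post(uri, '{}', 'Content-Type' => 'application/json')
    successes += 1 if last_response.is_a?(Net::HTTPSuccess)
  end

  # Even when successes >= min_successful_submissions, last_response still
  # holds the final server's reply, so a trailing 503 from a node in
  # maintenance mode is reported as a failure. The fix would be to remember
  # the last *successful* response and return that once the minimum is met.
  last_response
end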



 Comments   
Comment by Justin Stoller [ 2018/06/08 ]

If we were to implement the retry in the server, I believe it would actually be in the PDB terminus helpers here (this team may very well still do that work).

The work to relay to the agent would probably be in the Puppet code base proper & the terminus.

 

I'm super curious about the pros and cons of relaying the 503 to the agent vs retrying within the server.

/cc Maggie Dreyer, Charlie Sharpsteen.

Comment by Maggie Dreyer [ 2018/06/08 ]

With nothing to back this up except my musings:
1) Seems like relaying to the agent might be more straightforward, given we already have the retry logic.
2) It does seem more correct, perhaps, to just have the server resubmit, since that's where the request originally came from.
3) If the server did the retrying, would the agent just pause waiting while that was going on? If so, that seems somewhat opaque from the agent's perspective, but if these requests are generally fast or the retry interval was sufficiently short, it probably wouldn't cause a problem.

Comment by Jarret Lavallee [ 2018/06/08 ]

Another item to consider about relaying to the agent is that, depending on the timing, different API calls to PuppetDB can receive the 503. The example above is a replace catalog, but I have seen it with replace facts and store report. It may not be appropriate to relay a 503 from a store report to the agent, whereas it may be for a replace facts.

Comment by Charlie Sharpsteen [ 2018/06/08 ]

For immediate workarounds, the PuppetDB terminus has a soft_write_failure option that allows Puppet Server to ignore these errors if enabled:

https://puppet.com/docs/puppetdb/5.2/puppetdb_connection.html#softwritefailure
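
For example, something along these lines in puppetdb.conf on the master (hostnames are placeholders) lets catalog compilation continue when a command submission fails; the failed write is only logged:

[main]
server_urls = https://puppetdb1.example.com:8081,https://puppetdb2.example.com:8081
command_broadcast = true
soft_write_failure = true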

Retrying the report submission is a good idea, as we should avoid losing data in transit. However, JRubies pausing and retrying will start to bottleneck other agent requests. Triggering an agent-side retry would free the JRuby worker up to handle non-PuppetDB requests such as classification or file metadata.

Additionally, PuppetDB should probably keep the /command API available during database maintenance. If Postgres is down or undergoing a migration, that should not prevent PuppetDB from accepting and enqueuing work. The only time the /command API needs to be unavailable is if the queue itself is undergoing a migration (e.g. ActiveMQ -> Stockpile).

Comment by Sean Millichamp [ 2018/07/19 ]

soft_write_failure isn't really an acceptable option in our environment - I need to know that the data was written to at least one of the PuppetDBs. On the other hand, I wouldn't expect it to fail if it can write successfully to at least one of the PuppetDBs. The expectation would be that, as long as one is online and fully in service, it would receive the command, and the data would get synchronized to the other PuppetDBs via PuppetDB synchronization as soon as they were able (within the sync interval).

Personally, I consider this a Puppetserver and/or PuppetDB terminus issue.

Comment by Amy Sahli [ 2018/08/16 ]

Will take a look at this after Puppet Server 6.

Comment by Jarret Lavallee [ 2018/08/22 ]

The core of this problem seems to be that when one of the PuppetDB instances returns a 503, the puppetserver throws an error back to the agent. In an HA setup, command_broadcast=true is set, so both PuppetDB instances should have received the query. Puppetserver returns the 503 to the agent instead even though min_successful_submissions = 1 and the primary PuppetDB instance would be able to handle the query.

A workaround is to set command_broadcast=false so that the puppetserver does not send the command to all of the PuppetDB instances in server_urls; however, this puts more load on PuppetDB sync, which then has to transfer the data between instances.
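
In puppetdb.conf terms, that workaround is roughly (hostnames are placeholders):

[main]
server_urls = https://puppetdb1.example.com:8081,https://puppetdb2.example.com:8081
command_broadcast = false

Only the first reachable server_url then receives each command, and PuppetDB sync has to replicate it to the other instance afterwards.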

Comment by Charlie Sharpsteen [ 2018/08/23 ]

This could actually be a PDB bug as I suspect the guilty code lives under https://github.com/puppetlabs/puppetdb/tree/master/puppet .

Comment by Jarret Lavallee [ 2018/08/24 ]

Charlie Sharpsteen I think you are right about where this is occurring. I have moved this to be a PDB ticket.

Comment by Jarret Lavallee [ 2018/08/30 ]

Opened https://github.com/puppetlabs/puppetdb/pull/2571 for this.
