As the pull request for this issue has demonstrated, retrying agent HTTP requests is not a cut-and-dried problem; there are a number of different request types that we may (or may not) want to retry, each with different behaviors. In addition there are a number of cases where retrying may be necessary, and each one calls for different retry rates and backoff periods.
Retrying request types
First off, we have any number of HTTP request endpoints that may need to be retried.
File metadata and content are very straightforward; both are read-only and fetched only via GET, so these requests are universally safe to retry.
File bucket contents are less straightforward. It's always safe to retry a GET of a file_bucket_file, but retrying a POST of a file_bucket_file is more up in the air: POST requests cannot safely be retried automatically, but these POSTs back up files, which is something we probably want to ensure happens. I'm on the fence about this.
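The GET/POST distinction above boils down to idempotence. A minimal sketch of how a retry layer could encode that rule, with an explicit opt-in for the non-idempotent file bucket backup case (the helper name and option are hypothetical, not Puppet's actual API):

```ruby
require 'net/http'

# Idempotent HTTP methods are safe to retry automatically; POSTs
# (file bucket backups, reports) are not, so they are only retried
# when the caller explicitly opts in.
IDEMPOTENT_METHODS = %w[GET HEAD OPTIONS].freeze

def retryable?(request, allow_non_idempotent: false)
  IDEMPOTENT_METHODS.include?(request.method) || allow_non_idempotent
end

get  = Net::HTTP::Get.new('/puppet/v3/file_content/modules/foo')
post = Net::HTTP::Post.new('/puppet/v3/file_bucket_file/md5/abc123')

retryable?(get)                              # => true, GET is idempotent
retryable?(post)                             # => false, unsafe by default
retryable?(post, allow_non_idempotent: true) # => true, explicit opt-in for backups
```

This keeps the default conservative while still letting us force retries for the backup case if we decide they're important enough.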
Retrying catalog requests is not as straightforward as you might think, because fetching a catalog might use GET or it might use POST. The difference comes down to the size of the facts being uploaded as part of the request: if all of the facts fit as query parameters, GET is used; otherwise POST is used. If we only retry GET requests, we will be inconsistent about whether we retry catalog requests. And as a coworker pointed out, catalogs have a built-in retry mechanism: if catalog retrieval fails and we need to retry later when the network is less congested or the master is less overloaded, the runinterval of the Puppet agent means we'll naturally retry the catalog compilation on the next run.
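To make the GET-or-POST split concrete, here is a rough sketch of the decision: encode the facts into the query string, and fall back to POST when the resulting URI would be too long. The MAX_URI_LENGTH constant and method name are illustrative assumptions, not Puppet's actual values:

```ruby
require 'uri'

# Assumed maximum request-URI length before we fall back to POST;
# the real limit depends on server/proxy configuration.
MAX_URI_LENGTH = 2048

# Decide whether a catalog request for this node can send its facts
# as GET query parameters or must POST them in the request body.
def catalog_request_method(node, facts_json)
  query = URI.encode_www_form(facts: facts_json)
  uri   = "/puppet/v3/catalog/#{node}?#{query}"
  uri.length <= MAX_URI_LENGTH ? :get : :post
end

catalog_request_method('agent.example.com', '{"os":"linux"}') # => :get (facts fit in the URI)
```

This is exactly why "retry GETs only" gives inconsistent behavior: the same logical request flips between GET and POST based purely on fact size.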
Reports are POSTed at the end of a catalog application, and since they contain the status of the run we probably want to make sure they're submitted. Once again, though, this is a POST, so by default we wouldn't retry it.
Specific use cases that we want to handle
I would like to know exactly what problems we're trying to solve here and solve those, rather than implementing a solution aimed at a number of non-specific cases. So far we have a handful of use cases, and each one warrants different configuration for request retries.
- AWS ELBs don't gracefully remove Puppet masters from the LB pool, and active agent connections are forcefully reset at the TCP/IP level. In this sort of failure we don't receive anything as nice as an HTTP 503; the agent should immediately retry the given request. In this scenario a single retry should be sufficient and no backoff is really necessary. (From the original issue description.) (Additional note: it looks like Amazon added connection draining, so this exact use case might be moot. https://aws.amazon.com/about-aws/whats-new/2014/03/20/elastic-load-balancing-supports-connection-draining/)
- Multiple Puppet masters behind an ELB are being updated, and if the masters are updated at different times an agent may fetch a catalog from one master, try to fetch files from another master that isn't yet up to date, and fail. To quote the original issue, "Retrying wouldn't be an ideal solution for this scenario as a retry could just hit that same out-of-date master again, but it could possibly work." In this case we can retry until the cows come home and hopefully things will get into a consistent state.
- A Puppet agent is connecting to a master over an unreliable network, and transient failures are causing HTTP requests to fail. In this case it makes sense to retry a set number of times, and use a backoff algorithm in case network congestion is the root cause.
- A Puppet agent is connecting to a load balancer whose Puppet master workers are overloaded and the LB responds with a 503. In this case the agent should retry and either use a backoff algorithm or use the value of the HTTP Retry-After header.
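The last two scenarios can share one delay policy: exponential backoff with a cap for flaky or congested networks, with an explicit Retry-After value from a 503 taking precedence when the server supplies one. A minimal sketch, where all names are illustrative rather than Puppet's actual API:

```ruby
# Compute how long to sleep before retry number `attempt` (0-based).
# If the server sent a Retry-After header (e.g. on a 503), honor it;
# otherwise use capped exponential backoff.
def retry_delay(attempt, retry_after: nil, base: 2.0, max_delay: 60.0)
  # An explicit Retry-After from the server wins over computed backoff.
  return retry_after.to_f if retry_after
  # Exponential backoff: base^attempt, capped at max_delay.
  [base**attempt, max_delay].min
end

retry_delay(0)                   # => 1.0,  near-immediate first retry (ELB reset case)
retry_delay(3)                   # => 8.0,  backing off on an unreliable network
retry_delay(10)                  # => 60.0, capped so we never wait unboundedly
retry_delay(2, retry_after: 120) # => 120.0, the server told us when to come back
```

The ELB-reset case falls out as a degenerate configuration (one attempt, minimal delay), while the congested-network and overloaded-LB cases just tune the base, cap, and attempt count.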
So those are the sorts of requests we may want to retry and the scenarios in which we want to retry them. What is the specific use case we're trying to solve? There are a bunch of edge cases across the scenarios I've outlined; can we pick one specific case and solve it first?