  Puppet / PUP-2526

puppet agent retry failed http requests

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Normal
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Community
    • Labels:
    • Template:

      Description

      It would be nice if puppet agent had the ability to retry failed http requests to the puppet master.

      I have multiple puppet masters sitting behind a load balancer in AWS (ELB). Whenever we update a puppet master, the node is removed from the load balancer while the update occurs. Unfortunately the AWS ELB does not allow quiescent removal of nodes (such that existing connections are allowed to close gracefully). Instead it just severs the connections immediately. This causes errors for agents which are in the middle of making requests to that master.
      Another related scenario is when you're updating multiple puppet masters. The masters might be in the middle of updating, so some masters have newer code than others. A puppet agent gets a catalog from one master, which says a certain file should exist, but when the agent goes to fetch that file, it fails because the master it tried to fetch from hasn't updated yet. Retrying wouldn't be an ideal solution for this scenario, as a retry could just hit that same out-of-date master again, but it could possibly work. Yes, the ideal solution here is session persistence, but the AWS ELB does not support it.

      It might even be useful to allow a configurable backoff (failure; sleep 2; failure; sleep 5; failure; abort...), though a single retry would be sufficient for the first scenario described above.
      If a backoff is implemented, I think it should only be done once in the whole puppet run, so that if 100 different http requests have to be made to the puppet master, you don't pay the backoff wait 100+ times.
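
      As a concrete illustration of the schedule above, here is a minimal Ruby sketch; the helper, the schedule constant, and the run_state hash are hypothetical, not existing Puppet settings or APIs:

        BACKOFF_SCHEDULE = [2, 5].freeze  # seconds: sleep 2, then sleep 5, then abort

        # Hypothetical helper: the whole backoff is allowed only once per agent run,
        # so 100 failing requests don't each pay the wait.
        def with_single_backoff(run_state)
          attempts = 0
          begin
            yield
          rescue StandardError
            first_failure = attempts.zero?
            raise if (first_failure && run_state[:backoff_spent]) || attempts >= BACKOFF_SCHEDULE.length
            run_state[:backoff_spent] = true
            sleep(BACKOFF_SCHEDULE[attempts])
            attempts += 1
            retry
          end
        end

        # run_state = { backoff_spent: false }
        # with_single_backoff(run_state) { http.get(file_metadata_path) }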

        Attachments

          Issue Links

            Activity

            pdrake Peter Drake added a comment -

            I did some work on this here: https://github.com/puppetlabs/puppet/pull/2870

            andy Andrew Parker added a comment -

            As commented in the pull request, I'd like some more information about the exact failures being seen that this issue is supposed to resolve. Ideally this would be output from puppet or details about the actual network requests that were made.

            rb2k Marc Seeger added a comment - - edited

            Here is some copypasta from the PR:

            I would love to see this in as well.
            We run thousands of machines per puppet master and see this on almost every release.

            Some things that caused these failures in the past:

            • Machines being in a different region and having a bit of a flaky connection (e.g. machines from Asia calling to a US puppet master)
            • Security groups being rewritten in EC2 and network connections just dropping
            • "network connectivity issues" in AWS
            • Running releases on too many machines in parallel and thus putting too much load on the puppet master

            Some of these issues can be resolved by upsizing or load balancing the puppet master instance, but the network failures are something a user might not be able to control.

            and

            It's usually network problems, so it's arbitrary requests that fail (file_metadata, file_metadatas, file_content, ...).

            adrien Adrien Thebo added a comment -

            As the pull request for this issue has demonstrated, retrying agent HTTP requests is not a cut-and-dried problem; there are a number of different request types that we may (or may not) want to retry, each with different behaviors. In addition, there are a number of cases where retrying may be necessary, and each one calls for different retry rates and backoff periods.

            Retrying request types

            First off, we have any number of HTTP request endpoints that may need to be retried.

            file_metadata, file_content

            These are very straightforward; file metadata and content are read-only and fetched only via GET, so these are universally safe to retry.

            file_bucket_file

            File bucket contents are less straightforward. It's always safe to GET file_bucket_file contents, but retrying a POST of a file_bucket_file is more up in the air - POST requests cannot automatically be retried, but this request backs up files, so it is something we probably want to ensure happens. I'm on the fence about this.

            catalog

            Retrying catalog requests is not as straightforward as you would think, because fetching a catalog might use GET or it might use POST. The reason for the different cases is the size of the facts being uploaded as part of the request: if all facts can fit as query parameters, GET is used; otherwise POST is used. If we only retry GET requests, then we will be inconsistent in whether we retry catalog requests. And as a coworker pointed out, catalogs have a built-in retry mechanism - if catalog retrieval fails and we need to retry later when the network is less congested or the master is less overloaded, the runinterval of the Puppet agent means we'll naturally retry the catalog compilation.
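
            For illustration, a rough sketch of that GET/POST decision; the length limit and the method name are assumptions for discussion, not Puppet's actual implementation:

              require 'uri'

              MAX_URI_LENGTH = 2048  # illustrative limit only; the real threshold is not documented here

              def catalog_request_method(base_uri, encoded_facts)
                # If the encoded facts still fit in the query string, GET can be used;
                # otherwise the facts travel in the request body via POST.
                uri = "#{base_uri}?facts=#{URI.encode_www_form_component(encoded_facts)}"
                uri.length <= MAX_URI_LENGTH ? :get : :post
              end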

            report

            Reports are POSTed at the end of a catalog application, and since they contain the status of the run, we probably want to make sure they are submitted. Once again, though, this is a POST, so by default we wouldn't retry it.

            Specific use cases that we want to handle

            I would like to know exactly what problems we're trying to solve here and solve those, rather than trying to implement a solution that will solve a number of non-specific cases. So far we have a handful of use cases, and each case warrants different configuration for request retries.

            • AWS ELBs don't gracefully remove Puppet masters from the LB pool, and active agent connections are forcefully reset at the TCP/IP level. In this sort of failure we do not receive anything as nice as an HTTP 503, but should immediately retry the given request. In this scenario a single retry should be sufficient and no backoff is really necessary. (From the original issue description.) (Additional note: it looks like Amazon added connection draining, so this exact use case might be moot. https://aws.amazon.com/about-aws/whats-new/2014/03/20/elastic-load-balancing-supports-connection-draining/)
            • Multiple Puppet masters are updated behind an ELB simultaneously, and if the different masters are updated at different intervals an agent may fetch a catalog from one master, try to fetch files from a master that isn't yet up to date, and fail. To quote the original issue, "Retrying wouldn't be an ideal solution for this scenario as a retry could just hit that same out-of-date master again, but it could possibly work." In this case we can retry until the cows come home and hopefully things will get into a consistent state.
            • A Puppet agent is connecting to a master over an unreliable network, and transient failures are causing HTTP requests to fail. In this case it makes sense to retry a set number of times, and use a backoff algorithm in case network congestion is the root cause.
            • A Puppet agent is connecting to a load balancer whose Puppet master workers are overloaded and the LB responds with a 503. In this case the agent should retry and either use a backoff algorithm or honor the HTTP Retry-After header (see the sketch after this list).
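
            A minimal sketch of that last case in plain Net::HTTP; the retry policy, names, and URL are illustrative assumptions, not existing agent behavior:

              require 'net/http'

              # Hypothetical helper: retry a GET when the server answers 503, preferring
              # the server-supplied Retry-After delay over a fixed fallback.
              def get_with_retry_after(uri, max_retries: 3, fallback_delay: 5)
                retries = 0
                loop do
                  response = Net::HTTP.get_response(uri)
                  return response unless response.is_a?(Net::HTTPServiceUnavailable)
                  raise "gave up after #{retries} retries" if retries >= max_retries

                  delay = response['Retry-After'].to_i
                  sleep(delay.positive? ? delay : fallback_delay)
                  retries += 1
                end
              end

              # get_with_retry_after(URI('https://puppetmaster.example.com:8140/some/read_only/path'))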

            Solution

            So these are the sorts of requests we may want to retry and the scenarios we want to retry in. What is the specific use case that we're trying to solve? There are a bunch of edge cases in the different scenarios I've outlined; can we pick a specific case and solve that one first?

            pdrake Peter Drake added a comment -

            The specific cases outlined here, while they represent real-world situations where problems are encountered, seem too specific and narrow. I think a better focus would be making the puppet agent a more resilient REST client, with a focus on handling the types of requests, responses and error conditions that any REST client may encounter. These primarily fall into the three categories listed below. A fairly reasonable first attempt at dealing with these three forms of errors is to retry idempotent requests (those we can reliably retry without affecting the result). Typically, GET and HEAD requests fall into this category unless they are being made against a time- or request-count-limited URL. I am not aware of any such restrictions in Puppet.

            Problem types

            • Valid HTTP error response: Responses such as 500, 502, 503, 504 received from the server fall into this category. HTTP 503 indicates that a request may be safely retried and is expected to succeed at a later time. The Retry-After header may or may not be included in the response, but should be respected by the client if it is. For certain other HTTP 5xx responses, we may not want to retry at all.
            • Invalid HTTP response: A broken HTTP response or an unexpected/unhandled HTTP response code (e.g. HTTP 418). This could occur due to a network problem after the server processed the request or because the server had a problem mid-response (e.g. a segfault). Whether or not to retry requests in this case is not obvious to me, but it should be safe to do so.
            • TCP/Network error: Typically presents in Ruby as a SystemCallError, SocketError or Timeout::Error. This usually means that something beneath Ruby experienced an error, such as TCP packet loss. We cannot assume these requests were not received by the server. These requests should be retried with some amount of caution as this is the main case which could cause increased load on the server if the network interruption occurred after the request was delivered to the server.
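
            For discussion, a small sketch mapping a request outcome onto the three problem types above; the category names are just labels, not anything in Puppet:

              require 'socket'
              require 'timeout'

              def classify_failure(response: nil, error: nil)
                if error.is_a?(SystemCallError) || error.is_a?(SocketError) || error.is_a?(Timeout::Error)
                  :network_error          # type 3: TCP/network error below the HTTP layer
                elsif response && [500, 502, 503, 504].include?(response.code.to_i)
                  :http_error_response    # type 1: valid HTTP error response
                elsif response
                  :unexpected_response    # type 2: unhandled status, e.g. 418, or a broken response
                else
                  :unknown
                end
              end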

            Solution

            Retry all idempotent requests (GET, HEAD) that fall into the first and third category, with a reasonable delay between requests (either static such as the 3s delay in my original PR, or dynamic such as an exponential delay).
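
            A minimal sketch of that solution, shown here only for the network-error case; the wrapper, the rescued error classes, and the delay schedule are assumptions for discussion, not the behavior of any existing Puppet setting:

              require 'socket'
              require 'timeout'

              IDEMPOTENT_METHODS = [:get, :head].freeze

              # Hypothetical wrapper: retry idempotent requests on network-level failures
              # (problem type three above) with an exponential delay, e.g. 3s, 6s, 12s.
              def request_with_retries(method, retry_limit: 3, base_delay: 3)
                attempts = 0
                begin
                  yield
                rescue SystemCallError, SocketError, Timeout::Error
                  raise unless IDEMPOTENT_METHODS.include?(method) && attempts < retry_limit
                  sleep(base_delay * (2**attempts))
                  attempts += 1
                  retry
                end
              end

              # request_with_retries(:get) { http.request_get(file_metadata_path) }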

            adrien Adrien Thebo added a comment - - edited

            Peter Drake as noted in my previous comment, when requesting a catalog a Puppet agent may use either a GET or a POST request. If we only retry GET and HEAD requests, then a Puppet agent may seemingly at random stop retrying HTTP requests when fetching a catalog, which would be very surprising behavior for someone depending on HTTP request retries. Are you proposing that this case isn't relevant?

            pdrake Peter Drake added a comment -

            It was not my intent to suggest there are no idempotent requests other than GET/HEAD, rather that all GET/HEAD requests are by nature idempotent. POST requests used to request a catalog may well be idempotent and if so should be retried.

            ffrank Felix Frank added a comment -

            Speaking for myself, I find the feature intriguing, but frankly, I wouldn't want most of my agents to behave that way.

            The majority of my agents are on the same, reliable network, along with the master. I propose that this is a very common setup. If requests time out, there is probably a rather severe reason for that. I wouldn't want the whole network of agents to potentially contribute to the mayhem by generating more traffic than necessary.

            So, if Puppet adopts this behavior, I would be most grateful for the ability to turn it off through an agent setting.

            Can such an option be added?

            rb2k Marc Seeger added a comment - - edited

            "my agents are on the same, reliable network, along with the master"

            You hit number one on the list.

            ffrank Felix Frank added a comment -

            Look, I didn't join this discussion to argue semantics or raise this issue on some pedestal of IT Best Practices.

            Experience mandates certain requirements in the software I use, and I hold at least that standard to the projects I devote my time to. I will not stand for a change in Puppet that may have a negative impact on the user experience.

            Snide attacks on my argument aside, is there anything disadvantageous about setting this on an opt-in basis?

            andy Andrew Parker added a comment -

            Marc Seeger, Peter Drake, please help us to move this forward. Arguing from what the HTTP RFC says is only one piece of the information needed for this change; it is not the only consideration, nor necessarily the most important one. Unfortunately puppet doesn't always adhere to the desired HTTP practices when it communicates (sometimes what should be a GET is performed as a POST). That means we would really like to know exactly what kinds of problems are being encountered. I believe that you have already seen several cases of this occurring and determined that retries would help, but the rest of us need to know and understand what those cases are. What I have requested is that you provide some examples (logs of the failure would be ideal) of the kinds of failures that you are trying to resolve with this change. I'd really like to fix this, but I need to know what exactly is going to be fixed.

            To Felix Frank's point, this behavior has to be opt in because the puppet HTTP client code is reused in many different contexts. Some of those contexts are particular network setups used for master/agent communication, others are third party plugins that use the client code to communicate with completely different systems where the assumptions about the idempotency of GET/HEAD and retries cannot be made.

            rb2k Marc Seeger added a comment - - edited

            Any sporadic networking problem during a puppet run (dropped packets, routing issues, DNS issues, ...) can lead to failures like these:

            web-7638: 2014-01-10T11:04:12+00:00 daemon.err web-xxxx puppet-agent[28071]: (/Stage[main]/Apt/File[sources.list]) Could not evaluate: Connection timed out - connect(2) Could not retrieve file metadata for puppet:///modules/..../......./eu-west-1.sources.list-lucid: Connection timed out - connect(2)

            Since puppet doesn't currently retry these failures, the rest of the run will error out with "failed dependency" warnings.
            Especially when pushing out a combination of "new binary and new config file", this can get slightly problematic.

            Among lots of other events, this could for example happen whenever Amazon does something like this:

            10:22 AM PDT We are investigating network connectivity issues for some instances in the US-EAST-1 Region.
            11:09 AM PDT Between 10:03 AM and 10:18 AM PDT we experienced connectivity issues for some instances in the US-EAST-1 Region. The issue has been resolved and the service is operating normally.

            I think most other cloud providers will at some point have the same kind of networking problems.

            kylo Kylo Ginsberg added a comment -

            Marc Seeger, Peter Drake it seems like there's a need here in some cases, but not everyone wants the behavior.

            Any thoughts on making this opt-in behavior?

            ffrank Felix Frank added a comment -

            In the case of file resources failing due to WAN restrictions, I wonder whether retries are even the best way forward.

            Wouldn't it be nicer if the agent could be configured to transparently use a nearby blob storage when resolving puppet:/// URLs? Sure, that may have other problems, but it might be worthwhile to at least investigate this line of thought.

            Marc Seeger, would you care to open a new ticket for the file via WAN issue?

            glennpratt Glenn Pratt added a comment -

            Felix Frank this can and does happen in the same datacenter, in fact it just did:

            "Error: Could not set 'file' on ensure: Connection timed out - connect(2) at 75:/etc/puppet/modules/monitor/manifests/custom_plugins.pp"
            

            Retrying idempotent HTTP requests is relatively simple and extremely common.

            http://docs.aws.amazon.com/general/latest/gr/api-retries.html
            http://msdn.microsoft.com/en-us/library/dn589788.aspx

            Opt-in by the caller seems fine; this is how excon works:

            connection.request(:idempotent => true, :retry_limit => 6)
            

            https://github.com/excon/excon#options
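
            For context, a slightly fuller version of that excon call; the host and path are placeholders:

              require 'excon'

              # With :idempotent => true, excon will retry the request on socket-level
              # errors, governed by :retry_limit.
              connection = Excon.new('https://puppetmaster.example.com:8140')
              response = connection.request(
                :method      => :get,
                :path        => '/some/read_only/path',   # placeholder
                :idempotent  => true,
                :retry_limit => 6
              )
              puts response.status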

            Are we asking for more opt-in, like a configuration flag for end users? If so, why?

            branan Branan Riley added a comment -

            Thank you for filing this issue. While we agree this is likely an improvement, addressing this issue would require a substantial architecture change that we do not anticipate being able to undertake due to other issues taking precedence. As such, this ticket will be closed as “Won’t Fix”. We may revisit this at a later time, and if so, will re-open this ticket.

            Outside of the boilerplate:
            There's an easy workaround of just running puppet again, and the discussion in this ticket shows that we are unlikely to come to a good solution for all requests the agent might make in all situations.


              People

              • Assignee: pdrake Peter Drake
              • Reporter: redmine.exporter redmine.exporter
              • QA Contact: Eric Thompson
              • Votes: 1
              • Watchers: 10
