Puppet Server / SERVER-1767

Add ability to re-splay agents during thundering herd events

    Details

    • Template:
    • Team:
      Systems Engineering
    • Sub-team:
    • Story Points:
      3
    • Sprint:
      Server 2017-04-19, Server 2017-07-11
    • Release Notes:
      New Feature
    • Release Notes Summary:
      Puppet Server can optionally return 503 responses for incoming requests when the backlog of outstanding requests for JRuby instances exceeds a configurable limit. These responses can be configured to include a {{Retry-After}} header indicating a randomized amount of time that the requester should sleep before retrying the request. Both of these behaviors can be configured through the new {{max-queued-requests}} and {{max-retry-delay}} settings in the {{jruby-puppet}} configuration.
    • QA Risk Assessment:
      Needs Assessment

      Description

      When a group of agents start their Puppet runs at the same time, they form a "thundering herd" that can exceed server resources. This results in a growing backlog of requests from Puppet agents that are waiting for a JRuby instance to become free. If this backlog exceeds the size of the Jetty thread pool, other requests, such as status checks, will start timing out. The agent herd tends to persist until a human manually remediates the situation with a rolling restart that spaces out the agents involved.

      When it is over capacity, Puppet Server should signal agents to back off for a random period of time before resuming requests. This would allow a thundering herd to be re-splayed automatically, without human intervention.
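
As a rough illustration of the behavior requested above, the server-side decision could look like the following Python sketch. This is not the Puppet Server implementation: the function name `shed_load` and the dict-shaped response are hypothetical, and the parameters `max_queued_requests` and `max_retry_delay` only mirror the setting names from the release note. Treating a value of 0 as "disabled" and drawing the delay from 1 to `max_retry_delay` seconds are assumptions made for the example.

```python
import random


def shed_load(queued_requests, max_queued_requests, max_retry_delay, rng=random):
    """Return a 503 response dict when the backlog exceeds the limit, else None.

    Hypothetical helper for illustration only.
    """
    # Assumption: a limit of 0 (or less) means load shedding is switched off.
    if max_queued_requests <= 0 or queued_requests <= max_queued_requests:
        return None

    headers = {}
    # Assumption: a delay of 0 (or less) means no Retry-After header is sent.
    if max_retry_delay > 0:
        # A random delay per response is what spreads the herd back out.
        headers["Retry-After"] = str(rng.randint(1, max_retry_delay))

    return {"status": 503, "headers": headers, "body": "Service Unavailable"}
```

Because each rejected request draws its own random delay, agents that arrived together are told to come back at different times.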

            Activity

            chuck Charlie Sharpsteen added a comment -

            My initial thought for a solution is to add a Ring handler to the Puppet Server routes that compares the current request backlog to a configured limit. If the limit has been exceeded, a 503 Service Unavailable response is returned to the agent with the Retry-After header set to a random number of seconds. On the agent (PUP) side, the HTTP client could be updated to respond to a 503 with Retry-After by sleeping for the specified number of seconds; this would splay out an agent herd.
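
The agent half of this idea could be sketched roughly as below, in Python for illustration only (the Puppet agent's actual HTTP client is Ruby and is not shown here). The function `request_with_retry_after`, the `send` callable, the `max_attempts` cap, and the dict-shaped response are all hypothetical. Only the integer-seconds form of Retry-After is handled; the HTTP-date form is deliberately left out of the sketch.

```python
import time


def request_with_retry_after(send, max_attempts=3, sleep=time.sleep):
    """Call send() and re-send after sleeping when the server answers
    503 with a Retry-After header given in seconds.

    Hypothetical helper for illustration only.
    """
    response = send()
    for _ in range(max_attempts - 1):
        retry_after = response["headers"].get("Retry-After")
        # Only a 503 that carries Retry-After triggers a back-off.
        if response["status"] != 503 or retry_after is None:
            break
        try:
            delay = int(retry_after)
        except ValueError:
            # HTTP-date values are not handled in this sketch.
            break
        sleep(max(delay, 0))
        response = send()
    return response
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a caller would normally leave it at the default.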

            chuck Charlie Sharpsteen added a comment -

            After some good discussion in the PR, we've agreed this is a good first approach for server-side mitigation of thundering herds. I'll clean up the code in the PR so that it is production-ready and have filed PUP-7451 for having the agent respect Retry-After headers.

            jeremy.barlow Jeremy Barlow added a comment -

            Merged to puppetserver#master at 2bd9eb.

            karen Karen Van der Veer added a comment -

            Does this need release notes? (cc Charlie Sharpsteen, Jeremy Barlow)

            chuck Charlie Sharpsteen added a comment -

            Karen Van der Veer: release note added. This is a new feature, but a bit tricky since it requires PUP-7451 on the agent side to be useful. I'm hoping to get a PR for that up in the next couple of days.


              People

              • Assignee:
                chuck Charlie Sharpsteen
              • Reporter:
                chuck Charlie Sharpsteen