Puppet Server / SERVER-1767

Add ability to re-splay agents during thundering herd events

    Details

    • Template:
    • Team:
      Systems Engineering
    • Sub-team:
    • Story Points:
      3
    • Sprint:
      Server 2017-04-19, Server 2017-07-11
    • Release Notes:
      New Feature
    • Release Notes Summary:
      Puppet Server can optionally return 503 responses for incoming requests when the backlog of outstanding requests for JRuby instances exceeds a configurable limit. These responses can be configured to include a {{Retry-After}} header indicating a randomized amount of time that the requester should sleep before retrying the request. Both of these behaviors can be configured through the new {{max-queued-requests}} and {{max-retry-delay}} settings in the {{jruby-puppet}} configuration.
    • QA Risk Assessment:
      Needs Assessment
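
The behavior in the Release Notes Summary can be sketched roughly as follows. This is an illustrative Python model, not Puppet Server's actual implementation: the function name `overload_response` is hypothetical, and the treatment of a limit of 0 as "disabled", the exact boundary of the comparison, and the range of the random delay are assumptions. Only the setting names `max-queued-requests` and `max-retry-delay`, the 503 status, and the `Retry-After` header come from the ticket.

```python
import random


def overload_response(queued, max_queued_requests, max_retry_delay, rng=random):
    """Decide whether a request waiting for a JRuby instance should be rejected.

    Returns None when the request may be queued as usual, otherwise a
    (status, headers) tuple describing the 503 response.
    """
    # Assumption: a limit of 0 (or less) disables the backlog check.
    if max_queued_requests <= 0:
        return None
    # The ticket says the backlog must "exceed" the limit; the exact
    # boundary used by the real server is an assumption here.
    if queued <= max_queued_requests:
        return None
    headers = {}
    # Assumption: a delay of 0 means the Retry-After header is omitted.
    if max_retry_delay > 0:
        # Randomized delay in seconds, so rejected agents do not all
        # come back at the same moment.
        headers["Retry-After"] = str(rng.randint(1, max_retry_delay))
    return 503, headers
```

Because each rejected request draws its own delay, agents that arrived together are told to return at different times, which is what spreads the herd back out.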

      Description

      When a group of agents start their Puppet runs at the same time, they form a "thundering herd" that can exceed server resources. The result is a growing backlog of agent requests waiting for a JRuby instance to become free. If this backlog exceeds the size of the Jetty thread pool, other requests, such as status checks, start timing out. The herd tends to persist until a human manually remediates the situation, using a rolling restart to space out the agents involved.

      When Puppet Server is over capacity, it should signal agents to back off for a random period of time before resuming requests. This would allow a thundering herd to be re-splayed automatically, without human intervention.
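
On the requesting side, the back-off could look roughly like the sketch below. It is a minimal Python model and not the Puppet agent's actual code: `request_with_backoff`, the `send` callable, and the `max_attempts` limit are hypothetical names introduced for illustration. It assumes the server answers an overloaded request with a 503 and a `Retry-After` header holding a delay in seconds.

```python
import time


def request_with_backoff(send, max_attempts=3, sleep=time.sleep):
    """Call send() until it succeeds or max_attempts (>= 1) is reached.

    send is a hypothetical callable returning (status, headers, body).
    Returns the (status, body) of the last response received.
    """
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        retry_after = headers.get("Retry-After")
        # Stop on any response that is not a 503 carrying a retry hint,
        # or once the attempt budget is used up.
        if status != 503 or retry_after is None or attempt == max_attempts:
            return status, body
        # Sleep for the server-chosen delay. This sketch only handles the
        # delay-in-seconds form of Retry-After, not the HTTP-date form.
        sleep(int(retry_after))
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a caller can substitute a function that merely records the requested delays.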

              People

              Assignee:
              chuck Charlie Sharpsteen
              Reporter:
              chuck Charlie Sharpsteen
