  1. Puppet Server
  2. SERVER-1767

Add ability to re-splay agents during thundering herd events

    Details

    • Team:
      Systems Engineering
    • Story Points:
      3
    • Sprint:
      Server 2017-04-19, Server 2017-07-11
    • Release Notes:
      New Feature
    • Release Notes Summary:
      Puppet Server can optionally return 503 responses for incoming requests when the backlog of outstanding requests for JRuby instances exceeds a configurable limit. These responses can be configured to include a {{Retry-After}} header indicating a randomized amount of time that the requester should sleep before retrying the request. Both of these behaviors can be configured through the new {{max-queued-requests}} and {{max-retry-delay}} settings in the {{jruby-puppet}} configuration.
    • QA Risk Assessment:
      Needs Assessment

      Description

      When a group of agents start their Puppet runs at the same time, they form a "thundering herd" that can exceed server resources. The result is a growing backlog of requests from Puppet agents waiting for a JRuby instance to become free. If this backlog exceeds the size of the Jetty thread pool, other requests, such as status checks, start timing out. The herd tends to persist until a human manually remediates the situation by using a rolling restart to space out the agents involved.

      When it is over capacity, Puppet Server should signal agents to back off for a random period of time before resuming requests. This would allow a thundering herd to be re-splayed automatically, without human intervention.
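
      The back-off signal described above can be sketched roughly as follows. This is an illustrative Python model, not Puppet Server's actual implementation: the function names `overload_response` and `agent_delay_seconds` are hypothetical, and the treatment of `0` as "disabled" for both limits is an assumption made for the sketch. The parameter names mirror the `max-queued-requests` and `max-retry-delay` settings from the release notes summary.

```python
import random
from typing import Dict, Optional, Tuple


def overload_response(
    queued_requests: int,
    max_queued_requests: int,
    max_retry_delay: int,
    rng: Optional[random.Random] = None,
) -> Optional[Tuple[int, Dict[str, str]]]:
    """Decide whether to reject a request because the JRuby backlog is too large.

    Returns None when the request should be queued as usual, or a
    (status, headers) pair describing a 503 rejection.
    Assumption: a limit of 0 disables the corresponding behavior.
    """
    rng = rng if rng is not None else random.Random()

    # Backlog within the limit (or limit disabled): handle the request normally.
    if max_queued_requests <= 0 or queued_requests <= max_queued_requests:
        return None

    headers: Dict[str, str] = {}
    if max_retry_delay > 0:
        # A randomized delay spreads the herd out instead of letting every
        # agent retry at the same moment.
        headers["Retry-After"] = str(rng.randint(0, max_retry_delay))
    return 503, headers


def agent_delay_seconds(status: int, headers: Dict[str, str]) -> int:
    """Hypothetical agent-side handling: seconds to sleep before retrying."""
    if status == 503 and "Retry-After" in headers:
        return int(headers["Retry-After"])
    return 0


if __name__ == "__main__":
    # Simulate 5 agents arriving while the backlog is over the limit.
    rng = random.Random()
    for agent in range(5):
        result = overload_response(
            queued_requests=150, max_queued_requests=100,
            max_retry_delay=1800, rng=rng,
        )
        if result is not None:
            status, headers = result
            print(agent, status, agent_delay_seconds(status, headers))
```

      Because each rejected agent receives its own random delay, the retries are spread across the `max_retry_delay` window, which is the re-splay effect the ticket asks for.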

                People

                • Assignee:
                  Charlie Sharpsteen (chuck)
                • Reporter:
                  Charlie Sharpsteen (chuck)
                • Votes:
                  0
                • Watchers:
                  9
