  1. Puppet
  2. PUP-7517

Add logic for restarting hung catalog application

    Details

    • Team:
      Platform Core
    • Story Points:
      1
    • Sprint:
      Platform Core KANBAN
    • CS Priority:
      Major
    • CS Frequency:
      4 - 50-90% of Customers
    • CS Severity:
      3 - Serious
    • CS Business Value:
      4 - $$$$$
    • CS Impact:
      A large number of customers have brought this to support, and it's likely that a similarly large number of users have had this problem without realizing it. If the agent isn't running, you aren't getting the value of Puppet on those nodes and could see unexpected drift. The workaround requires either costly manual intervention or the expertise to use automated solutions to both find and fix impacted nodes.
    • Release Notes:
      New Feature
    • Release Notes Summary:
      A new runtimeout setting has been added, which can be used to ensure that Puppet agent runs are cancelled when a specified time limit is exceeded. The setting defaults to 0, which preserves the existing behavior of allowing agent runs an unlimited amount of time to complete.
    • QA Risk Assessment:
      Automate
    • QA Risk Assessment Reason:
      tests added with code change

      Description

      Occasionally, a Puppet agent can end up waiting indefinitely on an operation that will never return or terminate. Some examples:

      • An HTTP connection that was established, but then broken (http_read_timeout defaults to infinity...)
      • An I/O read that won't return (for example, networked file system that got interrupted)
      • A subprocess that was executed without a timeout and that isn't going to return.
      • Module or other plugin code that is susceptible to hangs and contains no defensive timeout or other guard logic.
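
      As an illustration of the kind of defensive timeout the subprocess example lacks, here is a minimal Python sketch. It is not Puppet code: the helper name run_with_timeout and the one-second limit are hypothetical, chosen only to show a child process being abandoned once it exceeds a deadline.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd and return its exit code, or None if it exceeded the limit.

    Hypothetical helper for illustration; not part of Puppet.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising,
        # so no hung process is left behind.
        return None
    return completed.returncode


# A child that would sleep for 30 seconds is given up on after 1 second.
hung_cmd = [sys.executable, "-c", "import time; time.sleep(30)"]
result = run_with_timeout(hung_cmd, timeout_seconds=1)
```

      Without the timeout argument, the call would block for as long as the child does, which is the failure mode described in the list above.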

      When this happens, further agent runs are blocked because the hung run is still holding the required catalog locks. Manual remediation is often required to restart the hung agents.

      In situations where hangs can occur often due to transient environment issues (such as flaky networks), it would be useful for the Puppet Daemon to have logic for automatically determining when a hung run should be terminated so that a new one can be started.

      For example: if the previous run has been holding the catalog lock for longer than n times the runinterval, kill it and start a new one.
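
      The rule proposed above can be sketched as follows. This is a rough Python illustration, not the implementation that shipped. It assumes the lock file's modification time marks when the hung run acquired the lock, and the names lock_is_stale, run_interval_seconds, and n are hypothetical.

```python
import os
import time


def lock_is_stale(lock_path, run_interval_seconds, n, now=None):
    """Return True if the lock has been held for more than n run intervals.

    Assumption: the lock file's mtime is the time the lock was taken.
    """
    if now is None:
        now = time.time()
    try:
        held_since = os.path.getmtime(lock_path)
    except FileNotFoundError:
        # No lock file means no run is in progress, so nothing is hung.
        return False
    return (now - held_since) > n * run_interval_seconds
```

      A supervising process could call this before each scheduled run and terminate the lock holder only when it returns True. How the holder is identified and killed is deliberately left out of the sketch.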

              People

              • Assignee:
                Justin Stoller
              • Reporter:
                Charlie Sharpsteen
              • QA Contact:
                Eric Delaney
              • Votes:
                4
              • Watchers:
                16
