Puppet / PUP-7517

Add logic for restarting hung catalog application

Details

    • Major
    • 4 - 50-90% of Customers
    • 3 - Serious
    • 4 - $$$$$
    • A large number of customers have brought this to support, and it's likely that a similarly large number of users have had this problem and not realized it. If the agent isn't running, you aren't getting the value of Puppet on those nodes and could see unexpected drift. The workaround requires either costly manual intervention or the expertise to use automated solutions to both find and fix impacted nodes.
    • New Feature
    • A new runtimeout setting has been added, which can be used to ensure Puppet agent runs are cancelled when a specified time limit is exceeded. The setting defaults to 0, which preserves the existing behavior of allowing agent runs an unlimited amount of time to complete.
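
The "0 means unlimited" semantics of the setting can be illustrated with a minimal sketch. This is not Puppet's implementation; the function name run_with_runtimeout and the use of a child process are assumptions made purely for illustration.

```python
import subprocess
import sys


def run_with_runtimeout(cmd, runtimeout=0):
    """Run cmd and cancel it if it exceeds runtimeout seconds.

    Hypothetical helper: a runtimeout of 0 means "no limit", mirroring
    the default described in the release note above.
    Returns the exit code, or None if the run was cancelled.
    """
    # Map 0 to None, which subprocess.run treats as "wait forever".
    limit = runtimeout if runtimeout > 0 else None
    try:
        result = subprocess.run(cmd, timeout=limit)
        return result.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising on timeout,
        # so nothing is left holding resources at this point.
        return None
```

For example, run_with_runtimeout([sys.executable, "-c", "import time; time.sleep(5)"], runtimeout=1) gives up after about one second, while the same call with the default of 0 waits for the child to finish.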
    • Automate
    • tests added with code change

    Description

      On occasion, a puppet agent can end up waiting indefinitely on some process that will never return or terminate. Some examples:

      • An HTTP connection that was established, but then broken (http_read_timeout defaults to infinity...)
      • An I/O read that won't return (for example, networked file system that got interrupted)
      • A subprocess that was executed without a timeout and that isn't going to return.
      • Module or other plugin code that is susceptible to hangs and contains no defensive timeout or other guard logic.
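
As an illustration of the kind of defensive timeout the cases above lack, here is a minimal sketch of a guarded read on a connection that is established but silent. The helper name read_with_timeout is hypothetical, and the example uses a plain socket rather than anything Puppet-specific.

```python
import socket


def read_with_timeout(sock, nbytes, seconds):
    """Return up to nbytes from sock, or None if nothing arrives in time.

    Hypothetical helper: without settimeout(), recv() on a connection
    whose peer has gone quiet would block indefinitely.
    """
    sock.settimeout(seconds)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # The peer sent nothing within the limit; report it instead of hanging.
        return None
```

With a connected pair from socket.socketpair(), a read on one end returns None after the limit when the other end stays silent, and returns the data once the other end sends something.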

      When this situation occurs, further agent runs are blocked because the hung run is still holding the required catalog lock. Often, manual remediation is required to restart the hung agents.

      In situations where hangs can occur often due to transient environment issues (such as flaky networks), it would be useful for the Puppet Daemon to have logic for automatically determining when a hung run should be terminated so that a new one can be started.

      For example: if the previous run has been holding the catalog lock for longer than n times the runinterval, kill it and start a new one.
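
The proposed rule can be sketched as a simple staleness check. This is only an illustration of the decision, not Puppet's logic: the function name lock_is_stale, the use of the lock file's modification time as "held since", and the default multiplier n=2 are all assumptions.

```python
import os
import time


def lock_is_stale(lock_path, runinterval, n=2, now=None):
    """Return True if the lock has been held longer than n * runinterval.

    Hypothetical helper. Assumes the lock file's modification time marks
    the moment the run acquired the lock; runinterval is in seconds.
    """
    try:
        held_since = os.path.getmtime(lock_path)
    except FileNotFoundError:
        # No lock file means no run is in progress, so nothing is stale.
        return False
    now = time.time() if now is None else now
    return (now - held_since) > n * runinterval
```

A daemon could call this before each scheduled run and, when it returns True, terminate the process holding the lock and start a fresh run. Those two steps are left out of the sketch because they depend on how the lock records its owner.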

            People

              Justin Stoller
              Charlie Sharpsteen
              Eric Delaney
              Votes: 4
              Watchers: 16
