Puppet / PUP-9111

Restarting services using launchd service provider is flaky


Details

    • Platform OS
    • Platform OS Kanban
    • Needs Assessment
    • Bug Fix
    • Fix a race condition between puppet and launchd when restarting services on OSX
    • Needs Assessment

    Description

      Puppet Version: 5.4.0
      OS Name/Version: macOS 10.13.6

      Restarting launchd jobs frequently fails. When restarting a service, the launchd provider calls stop and then start. The stop method executes launchctl unload -w <plist path> and then calls enable. The enable method reads the disabled overrides plist (/var/db/com.apple.xpc.launchd/disabled.plist) and modifies it.

      However, there is a race. While the launchd process is changing state, the disabled overrides plist is sometimes zero-length. When this occurs, read_plist (which is just Puppet::Util::Plist.read_plist_file) fails:

      Info: /Stage[main]/Main/File[/tmp/foo]: Scheduling refresh of Service[keepalive]
      Warning: Cannot read file /var/db/com.apple.xpc.launchd/disabled.plist; Puppet is skipping it.\nDetails: Execution of '/usr/bin/plutil -convert xml1 -o - /var/db/com.apple.xpc.launchd/disabled.plist' returned 1: /var/db/com.apple.xpc.launchd/disabled.plist: Property List error: Cannot parse a NULL or zero-length data / JSON error: No value.
      Error: /Stage[main]/Main/Service[keepalive]: Failed to call refresh: undefined method `[]=' for nil:NilClass
      Error: /Stage[main]/Main/Service[keepalive]: undefined method `[]=' for nil:NilClass 
      

      This leaves the service stopped and disabled, and the puppet run is unsuccessful. A subsequent puppet run will restart and re-enable the job.
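
The NoMethodError in the log is what happens when the overrides value is nil and the code still assigns into it. As a minimal sketch (the helper name, its arguments, and the 'keepalive' label are hypothetical and not taken from the provider), a guard like the following would at least replace that error with a clearer message:

```ruby
# Hypothetical helper, not the provider's actual code.
# `overrides` is whatever the plist read returned: a Hash on success,
# or nil when the file was unreadable (e.g. zero-length).
def set_disabled_override(overrides, label, disabled)
  # Fail with a descriptive error instead of NoMethodError on nil.
  raise ArgumentError, 'overrides plist could not be read' if overrides.nil?

  overrides[label] = disabled
  overrides
end
```

This only improves the error message; it does not address the race itself.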

      A simple workaround would be to loop and sleep while reading the overrides plist. I have a small commit that does this.
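
The loop-and-sleep idea could look roughly like the sketch below. This is not the commit mentioned above; the helper name, the `reader` argument, and the default attempt count and delay are assumptions made for illustration. It assumes the reader returns nil when the file cannot be parsed, as the log output suggests.

```ruby
# Hypothetical helper: retry a plist read that may transiently fail.
# `reader` is any callable that takes a path and returns a Hash on
# success or nil on failure.
def read_plist_with_retry(path, reader, attempts: 5, delay: 0.1)
  attempts.times do |i|
    plist = reader.call(path)
    return plist unless plist.nil?

    # Give launchd a moment to finish rewriting the file before retrying.
    sleep(delay) if i < attempts - 1
  end
  raise "Could not read #{path} after #{attempts} attempts"
end
```

Inside the provider, the reader might be something like Puppet::Util::Plist.method(:read_plist_file), assuming that method returns nil on the zero-length file. Retrying narrows the race window but does not remove it.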

      There is a larger architectural issue with the launchd provider, however. We only get into this error state because the provider is trying to reconcile a service resource that should be ensure => running and enable => true. When it calls stop, the resource that should be running/enabled is now stopped/disabled, and the stop method helpfully tries to fix the disabled bit by calling enable. But the enable (and disable) methods use direct modification of the overrides plist to do their work, and this is not a supported method.

      From the launchctl man page:

      Overrides the Disabled key and sets it to false or true for the load and unload subcommands respectively. In previous versions, this option would modify the configuration file. Now the state of the Disabled key is stored elsewhere on-disk in a location that may not be directly manipulated by any process other than launchd.

      I think a larger re-thinking of how to manage launchd services with this provider is needed.

      Desired Behavior:

      The service successfully restarts.

      Actual Behavior:

      The service is left stopped and disabled until a subsequent puppet run.

      Steps to reproduce:
      Trigger a refresh on a launchd service. This doesn't happen on every service restart, but if you do several in a row eventually this will crop up.

      I've attached a keepalive.sh. Put this in /tmp and make it executable. Put keepalive.plist in /Library/LaunchDaemons/. Run this repeatedly until puppet fails:

      sudo rm -f /tmp/foo; sudo puppet apply --verbose --color=true --trace keepalive.pp

      Attachments

        1. keepalive.plist
          0.6 kB
        2. keepalive.pp
          0.1 kB
        3. keepalive.sh
          0.1 kB

          People

            Unassigned Unassigned
            ccaviness Clay Caviness