Affects Version/s: PUP 5.4.0
Sprint:Platform OS Kanban
Method Found:Needs Assessment
Release Notes:Bug Fix
Release Notes Summary:Fix a race condition between puppet and launchd when restarting services on OS X
QA Risk Assessment:Needs Assessment
Puppet Version: 5.4.0
OS Name/Version: macOS 10.13.6
Restarting jobs will frequently fail. When restarting a service, the launchd provider calls stop and then start. The stop method executes launchctl unload -w <plist path>, and then calls enable. The enable method reads the disabled overrides plist (/var/db/com.apple.xpc.launchd/disabled.plist) and modifies it.
However, there is a race. While the launchd process is changing states, the disabled overrides plist is sometimes zero-length. When this occurs, read_plist (which is just Puppet::Util::Plist.read_plist_file) fails:
This leaves the service stopped and disabled, and the puppet run is unsuccessful. A subsequent puppet run will restart and re-enable the job.
A simple workaround would be to loop and sleep while reading the overrides plist. I have a small commit that does this.
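A minimal sketch of the loop-and-sleep idea, not the actual commit: read_with_retry, attempts, and delay are hypothetical names chosen for illustration, and the plist reader is passed in as a block so the sketch runs without Puppet loaded. It only narrows the race window, since the file could still be truncated between the size check and the read.

```ruby
# Hypothetical helper (not part of Puppet): wait for a file that launchd may
# briefly leave zero-length, then hand the path to the caller's reader block.
def read_with_retry(path, attempts: 5, delay: 0.1)
  attempts.times do |i|
    # File.size? returns nil when the file is missing or zero-length.
    return yield(path) if File.size?(path)
    # Sleep between attempts, but not after the final one.
    sleep(delay) unless i == attempts - 1
  end
  raise IOError, "#{path} was still empty after #{attempts} attempts"
end
```

Inside the provider, the block would presumably wrap the existing reader, for example read_with_retry(overrides_path) { |p| Puppet::Util::Plist.read_plist_file(p) }, where overrides_path stands in for the disabled.plist path above.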
There is a larger architectural issue with the launchd provider, however. We only get into this error state because the provider is trying to reconcile a service resource that should be ensure => running and enable => true. When it calls stop, the resource that should be running/enabled is now stopped/disabled, and the stop method helpfully tries to fix the disabled bit by calling enable. But the enable (and disable) methods use direct modification of the overrides plist to do their work, and this is not a supported method.
From the launchctl man page:
Overrides the Disabled key and sets it to false or true for the load and unload subcommands respectively. In previous versions, this option would modify the configuration file. Now the state of the Disabled key is stored elsewhere on-disk in a location that may not be directly manipulated by any process other than launchd.
I think a larger re-thinking of how to manage launchd services with this provider is needed.
Desired behavior:
The service successfully restarts.
Actual behavior:
The service is left stopped and disabled until a subsequent puppet run.
Steps to reproduce:
Trigger a refresh on a launchd service. The failure doesn't happen on every service restart, but if you do several restarts in a row it will eventually crop up.
I've attached a keepalive.sh. Put this in /tmp and make it executable. Put keepalive.plist in /Library/LaunchDaemons/. Run this repeatedly until puppet fails: