[PUP-2987] Puppet service hangs when attempting to stop on Windows Created: 2014/07/29  Updated: 2017/03/24  Resolved: 2014/08/21

Status: Closed
Project: Puppet
Component/s: None
Affects Version/s: None
Fix Version/s: PUP 3.7.0

Type: Bug Priority: Critical
Reporter: Ethan Brown Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: windows
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by PUP-1911 Windows installer downgrade fails to ... Closed
Relates
relates to PUP-3064 Prevent race condition in Windows ser... Accepted
relates to MODULES-3757 [puppet_agent] Agent Upgrade - instal... Closed
relates to PUP-7392 Acceptance: Create test verifying tha... Closed
relates to MCO-743 acceptance: test service stop/start/r... Closed
relates to PUP-3126 Update win32-service gem to 0.8.6+ Closed
relates to PUP-4416 acceptance: windows: puppet service s... Closed
Template:
Story Points: 2
Sprint: 2014-08-20, 2014-09-03
QA Contact: Eric Thompson

 Description   

There is a race condition in the Puppet service code that prevents it from properly stopping.

Generally speaking this can be reproduced by installing from the MSI, then attempting to upgrade / remove the MSI shortly thereafter.

In the wild, we've noticed it can be either puppet / pe-puppet or mcollective / pe-mcollective, so the service code for puppet and mcollective should both be addressed.

This has also been experienced in the Azure environment, where the Azure extension handler code addresses this by force stopping the service process by PID if the service doesn't exit gracefully.
https://github.com/puppetlabs/azure-extension-handler/blob/master/installer/windows.ps1#L137-L149



 Comments   
Comment by Ethan Brown [ 2014/08/14 ]

Current patch merged into win32-service gem (PR # 21) addresses the hang by ensuring that even where threads abort early or exceptions are raised, that events will be signaled / SetTheServiceStatus will be called to stop the service.

This still doesn't address why there is a race under some circumstances and the thread Service_Main can be prematurely aborted.... but it at least prevents the service from hanging while trying to STOP in a STOP_PENDING state.

Comment by Josh Cooper [ 2014/08/20 ]

Merged to master in 2c8dfc34

Comment by Josh Cooper [ 2014/08/20 ]

For review, download x86 and x64 MSIs from http://jenkins-release.delivery.puppetlabs.net/view/Windows%20Builds/job/Windows%20polling%20puppet%20master/ (note you have to follow the downstream job, and then grab the archived build artifacts.)

For each MSI, install puppet, and from the command line ensure you can start and stop the service without hanging. It is helpful to start the service with debugging enabled sc start puppet --debug --logtofile and in a different windows tail -f C:\ProgramData\PuppetLabs\puppet\var\log\windows.log, to see the service state transitions.

Comment by Rob Reynolds [ 2014/08/21 ]

Verified with http://builds.puppetlabs.lan/puppet/34e55daba11653c079b8277c9dd8ad7347dfd512/artifacts/windows/.

ServiceRecycle.ps1:

for ($i = 1; $i -le 500; $i++) {
	Write-Host "== Iteration '$i' - Puppet Service Recyle =="
	&sc.exe stop puppet | out-null
	sleep 1
	&sc.exe query puppet
	&sc.exe start puppet --debug --logtofile | out-null
	sleep 1
	&sc.exe query puppet
}

Note: Unfortunately the sleep time was necessary b/c Windows...

Produced STOPPED, STARTED messages like the following:

== Iteration '25' - Puppet Service Recyle ==
 
SERVICE_NAME: puppet
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 1  STOPPED
        WIN32_EXIT_CODE    : 0  (0x0)
        SERVICE_EXIT_CODE  : 0  (0x0)
        CHECKPOINT         : 0x0
        WAIT_HINT          : 0x0
 
SERVICE_NAME: puppet
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 4  RUNNING
                                (STOPPABLE, PAUSABLE, ACCEPTS_SHUTDOWN)
        WIN32_EXIT_CODE    : 0  (0x0)
        SERVICE_EXIT_CODE  : 0  (0x0)
        CHECKPOINT         : 0x0
        WAIT_HINT          : 0x0

The log produced this output:

2014-08-21 19:57:49 +0000 Puppet (notice): Service started
2014-08-21 19:57:49 +0000 Puppet (notice): Service stopped
2014-08-21 19:57:52 +0000 Puppet (debug): Using 'C:/Program Files/Puppet Labs/Puppet/bin/puppet.bat'
2014-08-21 19:57:52 +0000 Puppet (notice): Service stopping / killing worker thread
2014-08-21 19:57:53 +0000 Puppet (notice): Service started
2014-08-21 19:57:53 +0000 Puppet (notice): Service stopped
2014-08-21 19:57:56 +0000 Puppet (debug): Using 'C:/Program Files/Puppet Labs/Puppet/bin/puppet.bat'
2014-08-21 19:57:56 +0000 Puppet (notice): Service stopping / killing worker thread

Comment by Kurt Wall [ 2017/03/23 ]

Ethan Brown To test this with a beaker test, would it be sufficient to adopt Josh Cooper's method: this the following sequence of commands some number of times, verifying that the stop, start, and query options succeed each time (no hang, obviously, but also no warning/error) and make appropriate entries in the log?

sc stop puppet
sc query puppet
sc start puppet
sc query puppet

Comment by Ethan Brown [ 2017/03/24 ]

Kurt Wall that would probably be sufficient, though you might run into some issues here, so I'm not sure it's worth the effort:

  • We install the agent with the service stopped, intentionally. That's so Beaker can call puppet agent or puppet apply and not have to worry about interfering with a background service (so if something in such a test went wrong for some reason and left the service running, there's a potential for the rest of the test run to have problems as a result)
  • The log will tell you some things, but it's not really a formal contract - you'd be better served by getting a PID for the service (which could be a little tricky).
  • What will you do to clean-up a hung service if the test doesn't pass? (Kill the PID?)

Was there a reason that such an old ticket is being revisited out of curiosity? Has there been a recent regression?

Comment by Kurt Wall [ 2017/03/24 ]

Ethan Brown Good questions. Answers in matching bullets:

  • I don't mind starting and stopping the services – it lets me test the commands. I can add a teardown step to make sure the service is stopped.
  • I'm good to try to get the PID, but I have an approach that seems to be working.
  • I hadn't thought of this – I should try to kill the PID.

There has been no regression of which I'm aware. Agent team QA is finally in a position to implement a backlog of unimplemented test cases, which is maintained in Test Rail. We created this backlog when we triaged, literally, thousands of closed Jira tickets looking for test cases and test holes. This is part of a long arc of work that's been going on behind the scenes for about the last 3 1/2 years.

Generated at Sun Oct 20 19:41:02 PDT 2019 using JIRA 7.7.1#77002-sha1:e75ca93d5574d9409c0630b81c894d9065296414.