[PUP-3064] Prevent race condition in Windows service code from prematurely exiting win32-service Service_Main Created: 2014/08/14  Updated: 2020/03/04

Status: Accepted
Project: Puppet
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Normal
Reporter: Ethan Brown Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: daemon, platform-os, windows
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
relates to MCO-802 MCollective service fails to either s... Closed
relates to PA-1855 Custom compiled Ruby version for Wind... Accepted
relates to PUP-2987 Puppet service hangs when attempting ... Closed
Template:
Team: Night's Watch
Story Points: 2
QA Contact: Erik Dasher

 Description   

A workaround has been put in place as part of PUP-2987 to fix issues discovered in the win32-service gem. It also has slightly restructured the Windows service daemon.

This should resolve deadlocks experienced when terminating the service.

However, the reason that this works is due to the fact that Service_Main now contains an ensure block that makes certain to call SetTheServiceStatus.call(SERVICE_STOPPED, NO_ERROR, 0, 0) if the thread prematurely aborts or an exception is otherwise generated.

In local testing under windbg, as soon as the SCM signals the service (inside of Service_CtrlEx) that it has requested a stop, it appears that the Service_Main completely aborts, despite the fact that it is currently waiting on while(WaitForSingleObject(@@hStopEvent, 1000) != WAIT_OBJECT_0) do.

It should continue to wait on while(WaitForSingleObject(@@hStopCompletedEvent, 1000) != WAIT_OBJECT_0) do, which allows for the service code to perform proper clean up, but this never happens.

Instead the service stops abruptly without following the orchestration set forward by the signaled events.

For our particular code, this might not be an issue – but should the daemon code ever become more complicated, it's possible that resources may not be freed and the situation could escalate quickly.

Some ideas of what to look at with this issue:

  • FFI calls can have a :blocking parameter set, which can control how they behave with respect to the Ruby GIL. Some of the native API calls specify this, but perhaps incorrectly? https://github.com/ffi/ffi/wiki/Callbacks
  • The win32-service gem uses class level variables to store state that is shared between Ruby threads, Win32 threads, and Win32 callbacks created by FFI. This feels dangerous, and should potentially be rewritten to avoid using Ruby objects (which might? translate to weird GC issues across threads that Ruby doesn't own) in lieu of passing unmanaged pointers / values where possible. The use of closures to capture class variables feels wrong.
  • Note that SetServiceStatus must always be called after any cleanup operations -> http://msdn.microsoft.com/en-us/library/windows/desktop/ms686241(v=vs.85).aspx

Other helpful tips:

  • I prefer to install the latest x64 MSI built by Jenkins, then to modify the vendored win32-service code that should live in c:\program files\Puppet Labs\Puppet\sys\ruby\lib\ruby\gems\2.0.0\gems\win32-service-0.8.5\lib\win32
  • The Puppet service can be launched from the command line with sc start puppet --debug --trace --verbose
  • I typically wait for Puppet to fail to resolve a bogus master name before attaching to the process. Once attached sc stop puppet to observe the services shutdown behavior.
  • The cv2pdb application can be used to extract PDB format debug files from the Ruby that we build - see http://blog.morlad.at/blah/mingw_postmortem. Despite what this article says, it would appear that windbg searches for a ruby.dbg or ruby.pdb that lives alongside ruby.exe in the c:\program files\puppet labs\puppet\sys\ruby\bin directory
  • Windbg x64 must be used to attach to the ruby.exe service process. Start windbg from the command line with &'C:\Program Files (x86)\Windows Kits\8.0\Debuggers\x64\windbg.exe' -g -G -Q -sflags 0x80030377 -y 'C:\Program Files\Puppet Labs\Puppet\sys\ruby\bin' This allows windbg to load the ruby symbols even though they don't have the same hash as expected.
  • It can be convenient to add event log code to the vendored daemon.rb from the win32-service gem like:

    def self.report_windows_event(type,id,message)
      begin
        eventlog = nil
        eventlog = Win32::EventLog.open("Application")
        eventlog.report_event(
          :source      => "Puppet",
          :event_type  => type,   # EVENTLOG_ERROR_TYPE, etc
          :event_id    => id,     # 0x01 or 0x02, 0x03 etc.
          :data        => message # "the message"
        )
      rescue Exception => e
        # Ignore all errors
      ensure
        if (!eventlog.nil?)
          eventlog.close
        end
      end
    end
 
## sample usage for sprinkling around the code to log useful stuff
report_windows_event(EVENTLOG_INFORMATION_TYPE,0x01,"Win32::Daemon - Signaling @@hStopEvent #{@@hStopEvent} from SetTheServiceStatus")

  • Similarly, it can be useful to add attach_pfunc :GetCurrentThreadId, [], :dword to the win32-service gems functions.rb so that GetCurrentThreadId.to_s(16) can be added in logging statements. This becomes useful with windbg attached to understand which threads are Ruby vs native, etc.
  • When logging handle values use .to_s(16) as windbg handles are in hex. To view all data for the handle (like an event), use a windbg command like !handle 2bf ff. !htrace 2bf should also be able to retrieve the id as well. Note that using !handle will bump the HandleCount by 1 and the PointerCount by 2, so take that into account when reading this data.
  • A really useful command in windbg is sxe et to automatically break whenever a thread terminates. When event logs with useful info on handles / thread ids are combined with this, the ordering of threads exiting can be observed.
  • Trying to decipher parameters passed to native functions in windbg on x64 is painful because the only call convention is fastcall. Parameters are passed through specific registers as defined in http://msdn.microsoft.com/en-us/library/9z1stfyw.aspx In my experience the call stack UI in windbg doesn't properly show values (like handles) passed to APIs like WaitForSingleObject properly. Here be dragons (which is why logging values can be more convenient)
  • For windbg beginners, there's a great chart maintained at http://windbg.info/doc/1-common-cmds.html


 Comments   
Comment by Ethan Brown [ 2014/08/20 ]

During the last observation of the service code termination, the following threads were observed prior to the SCM stop request, and then were observed after the stop request was initiated, the order tracked as they shut down.

Those labelled with TOAST were active threads prior to the STOP request, but no longer running once threads began to exit. Those labelled DEAD were killed in the order given - there are notes about how each thread maps to the Ruby code, and any additional info. Therefore, this can be used as a key in any additional debugging. The important thing to note is that the thread exit order here is not what is expected.

code
TOAST ef8
ntdll!ZwDelayExecution+0xa
KERNELBASE!SleepEx+0xab
x64_msvcrt_ruby200!rb_mutex_unlock+0x186
x64_msvcrt_ruby200!ruby_thread_has_gvl_p+0x30b
x64_msvcrt_ruby200!rb_vm_localjump_error+0x55ba
x64_msvcrt_ruby200!rb_vm_localjump_error+0x5be9
x64_msvcrt_ruby200!rb_funcall+0xd1e
x64_msvcrt_ruby200!rb_funcall+0xeb0
x64_msvcrt_ruby200!rb_vm_invoke_proc+0x27
x64_msvcrt_ruby200!rb_thread_lock_destroy+0x7bc
x64_msvcrt_ruby200!rb_thread_lock_destroy+0xa6c
kernel32!BaseThreadInitThunk+0xd
ntdll!RtlUserThreadStart+0x1d

TOAST - f3c
ntdll!NtWaitForWorkViaWorkerFactory+0xa
ntdll!TppWorkerThread+0x2c9
kernel32!BaseThreadInitThunk+0xd
ntdll!RtlUserThreadStart+0x1d

e7c - DEAD - 1
ntdll!NtWaitForMultipleObjects+0xa
KERNELBASE!WaitForMultipleObjectsEx+0xe8
kernel32!WaitForMultipleObjects+0xb0
x64_msvcrt_ruby200!rb_mutex_trylock+0x125
x64_msvcrt_ruby200!rb_mutex_trylock+0x3b0
x64_msvcrt_ruby200!rb_thread_kill+0x30f
x64_msvcrt_ruby200!rb_last_status_get+0xd3a
x64_msvcrt_ruby200!rb_error_arity+0x125
x64_msvcrt_ruby200!rb_f_send+0x5a1
x64_msvcrt_ruby200!rb_vm_localjump_error+0x1cf8
x64_msvcrt_ruby200!rb_vm_localjump_error+0x5be9
x64_msvcrt_ruby200!rb_funcall+0xd1e
x64_msvcrt_ruby200!rb_funcall+0xeb0
x64_msvcrt_ruby200!rb_vm_invoke_proc+0x27
x64_msvcrt_ruby200!rb_thread_lock_destroy+0x7bc
x64_msvcrt_ruby200!rb_thread_lock_destroy+0xa6c
kernel32!BaseThreadInitThunk+0xd
ntdll!RtlUserThreadStart+0x1d

94c - DEAD - 2
ntdll!NtWaitForSingleObject+0xa
KERNELBASE!WaitForSingleObjectEx+0x79
ffi_c!Init_ffi_c+0x51b4
x64_msvcrt_ruby200!rb_thread_polling+0x103
ffi_c!Init_ffi_c+0x502a
x64_msvcrt_ruby200!rb_thread_lock_destroy+0x60a
x64_msvcrt_ruby200!rb_thread_lock_destroy+0xa6c
kernel32!BaseThreadInitThunk+0xd
ntdll!RtlUserThreadStart+0x1d

edc – ThreadProc, which should be native thread based on using CreateThread to create it - DEAD - 3
ntdll!NtWaitForSingleObject+0xa
KERNELBASE!WaitForSingleObjectEx+0x79
sechost!ScSendResponseReceiveControls+0x13b
sechost!ScDispatcherLoop+0x121
sechost!StartServiceCtrlDispatcherA+0x6d
ffi_c!Init_ffi_c+0x90b7
0x2baea10
0x563c60

850 - DEAD - 4
ntdll!NtWaitForSingleObject+0xa
KERNELBASE!WaitForSingleObjectEx+0x79
x64_msvcrt_ruby200!rb_thread_list+0x11af
kernel32!BaseThreadInitThunk+0xd
ntdll!RtlUserThreadStart+0x1d

ea0 - thr object busyloop inside of Ruby thread in mainloop - DEAD - 5
ntdll!NtWaitForSingleObject+0xa
KERNELBASE!WaitForSingleObjectEx+0x79
ffi_c!Init_ffi_c+0x55ac
ffi_c!Init_ffi_c+0x8b7a
ffi_c!Init_ffi_c+0x900f
0x430000
0x31aff30

8cc - DEAD - 6
ntdll!NtWaitForWorkViaWorkerFactory+0xa
ntdll!TppWorkerThread+0x2c9
kernel32!BaseThreadInitThunk+0xd
ntdll!RtlUserThreadStart+0x1d

ee4 – mainloop, owner of events, and thread handle - DEAD - 7
ntdll!NtWaitForSingleObject+0xa
KERNELBASE!WaitForSingleObjectEx+0x79
ffi_c!Init_ffi_c+0x55ac
ffi_c!Init_ffi_c+0x8b7a
ffi_c!Init_ffi_c+0x900f
0x440000
0x279ff60

9fc – line 215 of Service_Main – part that ends up dying prematurely... DEAD - 8
ntdll!NtWaitForMultipleObjects+0xa
KERNELBASE!WaitForMultipleObjectsEx+0xe8
kernel32!WaitForMultipleObjects+0xb0
x64_msvcrt_ruby200!rb_mutex_trylock+0x125
x64_msvcrt_ruby200!rb_mutex_trylock+0x3b0
x64_msvcrt_ruby200!rb_thread_kill+0x30f
x64_msvcrt_ruby200!rb_last_status_get+0xd3a
x64_msvcrt_ruby200!rb_error_arity+0x125
x64_msvcrt_ruby200!rb_vm_localjump_error+0x1cf8
x64_msvcrt_ruby200!rb_vm_localjump_error+0x5be9
x64_msvcrt_ruby200!rb_funcall+0xd1e
x64_msvcrt_ruby200!rb_funcall+0xeb0
x64_msvcrt_ruby200!rb_vm_invoke_proc+0x27
x64_msvcrt_ruby200!rb_block_lambda+0x7b3
x64_msvcrt_ruby200!rb_vm_localjump_error+0x6147
x64_msvcrt_ruby200!rb_funcall2+0x8f
ffi_c!Init_ffi_c+0x5ccb
x64_msvcrt_ruby200!rb_rescue2+0x18a
ffi_c!Init_ffi_c+0x548f
x64_msvcrt_ruby200!rb_thread_lock_destroy+0x60a

5dc
ntdll!NtWaitForMultipleObjects+0xa
KERNELBASE!WaitForMultipleObjectsEx+0xe8
kernel32!WaitForMultipleObjects+0xb0
x64_msvcrt_ruby200!rb_mutex_trylock+0x125
x64_msvcrt_ruby200!rb_mutex_trylock+0x522
x64_msvcrt_ruby200!rb_thread_sleep_forever+0x24e
x64_msvcrt_ruby200!rb_thread_sleep_forever+0x44a
x64_msvcrt_ruby200!rb_ensure+0xda
x64_msvcrt_ruby200!rb_thread_list+0x1019
x64_msvcrt_ruby200!rb_thread_list+0x1157
x64_msvcrt_ruby200!rb_error_arity+0x125
x64_msvcrt_ruby200!rb_f_send+0x5a1
x64_msvcrt_ruby200!rb_vm_localjump_error+0x1cf8
x64_msvcrt_ruby200!rb_vm_localjump_error+0x5be9
x64_msvcrt_ruby200!rb_iseq_eval_main+0x1e0
x64_msvcrt_ruby200!rb_check_copyable+0x3266
x64_msvcrt_ruby200!ruby_run_node+0x51
ruby!main+0x54
ruby!__tmainCRTStartup+0x257
ruby!mainCRTStartup+0x18
code

Generated at Mon Sep 28 15:45:27 PDT 2020 using Jira 8.5.2#805002-sha1:a66f9354b9e12ac788984e5d84669c903a370049.