Uploaded image for project: 'Puppet'
  1. Puppet
  2. PUP-3064

Prevent race condition in Windows service code from prematurely exiting win32-service Service_Main

    Details

    • Type: Bug
    • Status: Accepted
    • Priority: Normal
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Template:
    • Team:
      Platform OS
    • Story Points:
      2

      Description

      A workaround has been put in place as part of PUP-2987 to fix issues discovered in the win32-service gem. It also has slightly restructured the Windows service daemon.

      This should resolve deadlocks experienced when terminating the service.

      However, the reason that this works is due to the fact that Service_Main now contains an ensure block that makes certain to call SetTheServiceStatus.call(SERVICE_STOPPED, NO_ERROR, 0, 0) if the thread prematurely aborts or an exception is otherwise generated.

      In local testing under windbg, as soon as the SCM signals the service (inside of Service_CtrlEx) that it has requested a stop, it appears that the Service_Main completely aborts, despite the fact that it is currently waiting on while(WaitForSingleObject(@@hStopEvent, 1000) != WAIT_OBJECT_0) do.

      It should continue to wait on while(WaitForSingleObject(@@hStopCompletedEvent, 1000) != WAIT_OBJECT_0) do, which allows for the service code to perform proper clean up, but this never happens.

      Instead the service stops abruptly without following the orchestration set forward by the signaled events.

      For our particular code, this might not be an issue – but should the daemon code ever become more complicated, it's possible that resources may not be freed and the situation could escalate quickly.

      Some ideas of what to look at with this issue:

      • FFI calls can have a :blocking parameter set, which can control how they behave with respect to the Ruby GIL. Some of the native API calls specify this, but perhaps incorrectly? https://github.com/ffi/ffi/wiki/Callbacks
      • The win32-service gem uses class level variables to store state that is shared between Ruby threads, Win32 threads, and Win32 callbacks created by FFI. This feels dangerous, and should potentially be rewritten to avoid using Ruby objects (which might? translate to weird GC issues across threads that Ruby doesn't own) in lieu of passing unmanaged pointers / values where possible. The use of closures to capture class variables feels wrong.
      • Note that SetServiceStatus must always be called after any cleanup operations -> http://msdn.microsoft.com/en-us/library/windows/desktop/ms686241(v=vs.85).aspx

      Other helpful tips:

      • I prefer to install the latest x64 MSI built by Jenkins, then to modify the vendored win32-service code that should live in c:\program files\Puppet Labs\Puppet\sys\ruby\lib\ruby\gems\2.0.0\gems\win32-service-0.8.5\lib\win32
      • The Puppet service can be launched from the command line with sc start puppet --debug --trace --verbose
      • I typically wait for Puppet to fail to resolve a bogus master name before attaching to the process. Once attached sc stop puppet to observe the services shutdown behavior.
      • The cv2pdb application can be used to extract PDB format debug files from the Ruby that we build - see http://blog.morlad.at/blah/mingw_postmortem. Despite what this article says, it would appear that windbg searches for a ruby.dbg or ruby.pdb that lives alongside ruby.exe in the c:\program files\puppet labs\puppet\sys\ruby\bin directory
      • Windbg x64 must be used to attach to the ruby.exe service process. Start windbg from the command line with &'C:\Program Files (x86)\Windows Kits\8.0\Debuggers\x64\windbg.exe' -g -G -Q -sflags 0x80030377 -y 'C:\Program Files\Puppet Labs\Puppet\sys\ruby\bin' This allows windbg to load the ruby symbols even though they don't have the same hash as expected.
      • It can be convenient to add event log code to the vendored daemon.rb from the win32-service gem like:

          def self.report_windows_event(type,id,message)
            begin
              eventlog = nil
              eventlog = Win32::EventLog.open("Application")
              eventlog.report_event(
                :source      => "Puppet",
                :event_type  => type,   # EVENTLOG_ERROR_TYPE, etc
                :event_id    => id,     # 0x01 or 0x02, 0x03 etc.
                :data        => message # "the message"
              )
            rescue Exception => e
              # Ignore all errors
            ensure
              if (!eventlog.nil?)
                eventlog.close
              end
            end
          end
       
      ## sample usage for sprinkling around the code to log useful stuff
      report_windows_event(EVENTLOG_INFORMATION_TYPE,0x01,"Win32::Daemon - Signaling @@hStopEvent #{@@hStopEvent} from SetTheServiceStatus")
      

      • Similarly, it can be useful to add attach_pfunc :GetCurrentThreadId, [], :dword to the win32-service gems functions.rb so that GetCurrentThreadId.to_s(16) can be added in logging statements. This becomes useful with windbg attached to understand which threads are Ruby vs native, etc.
      • When logging handle values use .to_s(16) as windbg handles are in hex. To view all data for the handle (like an event), use a windbg command like !handle 2bf ff. !htrace 2bf should also be able to retrieve the id as well. Note that using !handle will bump the HandleCount by 1 and the PointerCount by 2, so take that into account when reading this data.
      • A really useful command in windbg is sxe et to automatically break whenever a thread terminates. When event logs with useful info on handles / thread ids are combined with this, the ordering of threads exiting can be observed.
      • Trying to decipher parameters passed to native functions in windbg on x64 is painful because the only call convention is fastcall. Parameters are passed through specific registers as defined in http://msdn.microsoft.com/en-us/library/9z1stfyw.aspx In my experience the call stack UI in windbg doesn't properly show values (like handles) passed to APIs like WaitForSingleObject properly. Here be dragons (which is why logging values can be more convenient)
      • For windbg beginners, there's a great chart maintained at http://windbg.info/doc/1-common-cmds.html

        Attachments

          Issue Links

            Activity

              jsd-sla-details-panel

                People

                • Assignee:
                  Unassigned
                  Reporter:
                  ethan Ethan Brown
                  QA Contact:
                  Erik Dasher
                • Votes:
                  0 Vote for this issue
                  Watchers:
                  3 Start watching this issue

                  Dates

                  • Created:
                    Updated:

                    Zendesk Support