Uploaded image for project: 'Puppet Server'
  1. Puppet Server
  2. SERVER-2193

Provide an API for returning Ruby-level stacktraces for JRuby instances

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: SERVER 5.3.1
    • Fix Version/s: SERVER 6.6.0
    • Component/s: None
    • CS Priority:
      Normal
    • CS Frequency:
      2 - 5-25% of Customers
    • CS Severity:
      3 - Serious
    • CS Business Value:
      4 - $$$$$
    • CS Impact:
      Hide
      Long running Jrubies are very difficult to identify the root cause for. Some customers have been satisfied with just restarting puppetserver but that will not always be the case. The cost of not having this is a customer confidence issue when we cannot explain the root cause of their issue, only provide the workaround of restarting.

      Support has reached the extent of how deeply they can dig without this additional functional.

      The only additional option is debug mode which has significant performance impacts and is not suitable for running in production. It also requires a restart which means you may be waiting for an extended period to see the issue recur.
      Show
      Long running Jrubies are very difficult to identify the root cause for. Some customers have been satisfied with just restarting puppetserver but that will not always be the case. The cost of not having this is a customer confidence issue when we cannot explain the root cause of their issue, only provide the workaround of restarting. Support has reached the extent of how deeply they can dig without this additional functional. The only additional option is debug mode which has significant performance impacts and is not suitable for running in production. It also requires a restart which means you may be waiting for an extended period to see the issue recur.
    • Release Notes:
      New Feature
    • Release Notes Summary:
      Hide
      Added an endpoint to Puppet Server, `/puppet-admin-api/v1/jruby-pool/thread-dump`, that will return a thread dump of running JRuby instances, if `jruby.management.enabled` has been set to true in the JVM running puppetserver. See https://github.com/puppetlabs/puppetserver/blob/master/documentation/admin-api/v1/jruby-pool.markdown#get-puppet-admin-apiv1jruby-poolthread-dump for details.
      Show
      Added an endpoint to Puppet Server, `/puppet-admin-api/v1/jruby-pool/thread-dump`, that will return a thread dump of running JRuby instances, if `jruby.management.enabled` has been set to true in the JVM running puppetserver. See https://github.com/puppetlabs/puppetserver/blob/master/documentation/admin-api/v1/jruby-pool.markdown#get-puppet-admin-apiv1jruby-poolthread-dump for details.
    • QA Risk Assessment:
      Needs Assessment

      Description

      As an administrator of Puppet Infrastructure, I need to determine why a Puppet Server request using a JRuby instance has stalled before I restart the service. This can happen if the request has triggered methods from core Puppet code or a Ruby plugin that will never return.

      Reproduction Case

      • Install and start Puppet Server on a CentOS 7 VM:

      rpm -Uvh http://yum.puppetlabs.com/puppet5/puppet5-release-el-7.noarch.rpm
      yum install -y puppetserver
       
      /opt/puppetlabs/puppet/bin/puppet config set --section main server "$(hostname -f)"
      systemctl start puppetserver
      

      • Configure site.pp to sleep for an absurd amount of time in response to a catalog request:

      cat <<"EOM" > /etc/puppetlabs/code/environments/production/manifests/site.pp
      node default {
        inline_template('<% Kernel.sleep(1800) %>')
        # I feel much better after a 30 minute catnap!
      }
      EOM
      

      • Kick off a puppet run: /opt/puppetlabs/bin/puppet agent -t
      • Send the node's hostname to your coworker on Friday afternoon and tell them to figure out what is going on without looking at site.pp (because this could be a problem with 1 out of 100s of installed modules). Leave for a two week vacation.

      Current Options

      Below are listed two current options for checking the state of stalled JRuby requests. Both can help identify that a request is indeed stalled, but neither provides visibility into what Ruby code is responsible.

      Checking the status of the compiler pool shows that a catalog request is hung, but gives no info as to where the request is stalled:

      # curl -k https://localhost:8140/status/v1/services/jruby-metrics?level=debug|python -m json.tool
      {
          "active_alerts": [],
          "detail_level": "debug",
          "service_name": "jruby-metrics",
          "service_status_version": 1,
          "service_version": "5.3.1",
          "state": "running",
          "status": {
              "experimental": {
                  "jruby-pool-lock-status": {
                      "current-state": ":not-in-use",
                      "last-change-time": "2018-04-20T18:24:02.904Z"
                  },
                  "metrics": {
                      "average-borrow-time": 82,
                      "average-free-jrubies": 0.000271409884577374,
                      "average-lock-held-time": 0,
                      "average-lock-wait-time": 0,
                      "average-requested-jrubies": 0.0,
                      "average-wait-time": 0,
                      "borrow-count": 5,
                      "borrow-retry-count": 0,
                      "borrow-timeout-count": 0,
                      "borrowed-instances": [
                          {
                              "duration-millis": 470045,
                              "reason": {
                                  "request": {
                                      "request-method": "post",
                                      "route-id": "puppet-v3-catalog-/*/",
                                      "uri": "/puppet/v3/catalog/tvtyjvubmmaixqi.delivery.puppetlabs.net"
                                  }
                              },
                              "time": 1524248690580
                          }
                      ],
                      "num-free-jrubies": 0,
                      "num-jrubies": 1,
                      "num-pool-locks": 0,
                      "queue-limit-hit-count": 0,
                      "queue-limit-hit-rate": 0.0,
                      "requested-count": 5,
                      "requested-instances": [],
                      "return-count": 4
                  }
              }
          }
      }
      

      Pulling a thread dump using jstack after installing java-1.8.0-openjdk-devel shows a JRuby thread waiting (on RubyKernel.sleep, but thats a lucky coincidence), but the stack trace is at the Java and JRuby AST level which gives no indications as to what Ruby code caused it:

      # sudo -u puppet jstack $(cat /var/run/puppetlabs/puppetserver/puppetserver.pid)
      ...
      "qtp281635025-65" #65 prio=5 os_prio=0 tid=0x00007f30f00c4000 nid=0x9f3 in Object.wait() [0x00007f30edaa4000]
         java.lang.Thread.State: TIMED_WAITING (on object monitor)
      	at java.lang.Object.wait(Native Method)
      	- waiting on <0x00000000f568d888> (a org.jruby.RubyThread)
      	at org.jruby.RubyThread.sleep(RubyThread.java:1002)
      	- locked <0x00000000f568d888> (a org.jruby.RubyThread)
      	- locked <0x00000000f568d888> (a org.jruby.RubyThread)
      	at org.jruby.RubyKernel.sleep(RubyKernel.java:803)
      	at org.jruby.RubyKernel$INVOKER$s$0$1$sleep.call(RubyKernel$INVOKER$s$0$1$sleep.gen)
      	at org.jruby.internal.runtime.methods.JavaMethod$JavaMethodN.call(JavaMethod.java:667)
      	at org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:206)
      	at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:326)
      	at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:170)
      	at org.jruby.ast.CallOneArgNode.interpret(CallOneArgNode.java:57)
      	at org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)
      	at org.jruby.ast.BlockNode.interpret(BlockNode.java:71)
      	at org.jruby.ast.RootNode.interpret(RootNode.java:129)
      	at org.jruby.evaluator.ASTInterpreter.INTERPRET_EVAL(ASTInterpreter.java:95)
       
      ...skipping a lot of JRuby AST calls for brevity...
       
      puppetlabs.services.request_handler.request_handler_core$jruby_request_handler$fn__31686.invoke(request_handler_core.clj:273)
      	at puppetlabs.puppetserver.jruby_request$wrap_with_jruby_instance$fn__31597.invoke(jruby_request.clj:48)
      	at puppetlabs.puppetserver.jruby_request$wrap_with_error_handling$fn__31593.invoke(jruby_request.clj:33)
      	at puppetlabs.services.request_handler.request_handler_service$reify__31694$service_fnk__5266__auto___positional$reify__31709.handle_request(request_handler_service.clj:47)
      	at puppetlabs.services.protocols.request_handler$eval27407$fn__27408$G__27399__27411.invoke(request_handler.clj:3)
      	at puppetlabs.services.protocols.request_handler$eval27407$fn__27408$G__27398__27415.invoke(request_handler.clj:3)
      	at clojure.core$partial$fn__4759.invoke(core.clj:2515)
      	at puppetlabs.trapperkeeper.authorization.ring_middleware$eval39879$wrap_authorization_check__39888$fn__39891$fn__39892.invoke(ring_middleware.clj:284)
      	at puppetlabs.ring_middleware.core$eval21857$wrap_bad_request__21866$fn__21869$fn__21874.invoke(core.clj:168)
      	at puppetlabs.ring_middleware.core$eval21953$wrap_uncaught_errors__21962$fn__21965$fn__21970.invoke(core.clj:211)
      	at puppetlabs.ring_middleware.core$eval21584$wrap_request_logging__21589$fn__21590$fn__21592.invoke(core.clj:45)
      	at puppetlabs.i18n.core$locale_negotiator$fn__124.invoke(core.clj:357)
      	at puppetlabs.ring_middleware.core$eval21613$wrap_response_logging__21618$fn__21619$fn__21620.invoke(core.clj:51)
      	at puppetlabs.puppetserver.ringutils$wrap_with_puppet_version_header$fn__17331.invoke(ringutils.clj:83)
      	at puppetlabs.services.master.master_core$eval38154$v3_ruby_routes__38159$fn__38160$fn__38177.invoke(master_core.clj:738)
      	at bidi.ring$eval22897$fn__22898.invoke(ring.cljc:25)
      	at bidi.ring$eval22876$fn__22877$G__22867__22886.invoke(ring.cljc:16)
      	at puppetlabs.comidi$make_handler$fn__25007.invoke(comidi.clj:245)
      	at puppetlabs.metrics.http$eval37332$wrap_with_request_metrics__37337$fn__37341$fn__37343$fn__37344$fn__37345.invoke(http.clj:152)
      	at puppetlabs.metrics.http.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
      	at com.codahale.metrics.Timer.time(Timer.java:101)
      	at puppetlabs.metrics.http$eval37332$wrap_with_request_metrics__37337$fn__37341$fn__37343$fn__37344.invoke(http.clj:152)
      	at puppetlabs.metrics.http.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
      	at com.codahale.metrics.Timer.time(Timer.java:101)
      	at puppetlabs.metrics.http$eval37332$wrap_with_request_metrics__37337$fn__37341$fn__37343.invoke(http.clj:148)
      	at puppetlabs.comidi$eval25068$wrap_with_route_metadata__25073$fn__25074$fn__25076.invoke(comidi.clj:332)
      	at puppetlabs.trapperkeeper.services.webserver.jetty9_core$ring_handler$fn__35219.invoke(jetty9_core.clj:433)
      	at puppetlabs.trapperkeeper.services.webserver.jetty9_core.proxy$org.eclipse.jetty.server.handler.AbstractHandler$ff19274a.handle(Unknown Source)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
      	at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
      	at puppetlabs.trapperkeeper.services.webserver.normalized_uri_helpers$eval34795$normalize_uri_handler__34800$fn__34801$fn__34802.invoke(normalized_uri_helpers.clj:74)
      	at puppetlabs.trapperkeeper.services.webserver.normalized_uri_helpers.proxy$org.eclipse.jetty.server.handler.HandlerWrapper$ff19274a.handle(Unknown Source)
      	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
      	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:190)
      	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
      	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:170)
      	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
      	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
      	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
      	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
      	at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:521)
      	at org.eclipse.jetty.server.handler.RequestLogHandler.handle(RequestLogHandler.java:56)
      	at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:169)
      	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
      	at org.eclipse.jetty.server.Server.handle(Server.java:564)
      	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:317)
      	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
      	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)
      	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:110)
      	at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:278)
      	at org.eclipse.jetty.io.ssl.SslConnection$3.succeeded(SslConnection.java:148)
      	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:110)
      	at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124)
      	at org.eclipse.jetty.util.thread.Invocable.invokePreferred(Invocable.java:128)
      	at org.eclipse.jetty.util.thread.Invocable$InvocableExecutor.invoke(Invocable.java:222)
      	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:294)
      	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:199)
      	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:672)
      	at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:590)
      	at java.lang.Thread.run(Thread.java:748)
      ...
      

        Attachments

          Issue Links

            Activity

              jsd-sla-details-panel

                People

                • Assignee:
                  chuck Charlie Sharpsteen
                  Reporter:
                  chuck Charlie Sharpsteen
                • Votes:
                  0 Vote for this issue
                  Watchers:
                  10 Start watching this issue

                  Dates

                  • Created:
                    Updated:
                    Resolved:

                    Zendesk Support