[SERVER-2193] Provide an API for returning Ruby-level stacktraces for JRuby instances Created: 2018/04/20  Updated: 2019/09/16  Resolved: 2019/09/16

Status: Resolved
Project: Puppet Server
Component/s: None
Affects Version/s: SERVER 5.3.1
Fix Version/s: SERVER 6.6.0

Type: New Feature Priority: Major
Reporter: Charlie Sharpsteen Assignee: Charlie Sharpsteen
Resolution: Done Votes: 0
Labels: resolved-issue-added
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2018-05-16 at 11.30.37 PM.png    
Issue Links:
Blocks
is blocked by SERVER-2611 Update JRuby to 9.2.8.0 Resolved
Relates
Template:
Epic Link: Improved JRuby Debugging UX
Team: Server
CS Priority: Normal
CS Frequency: 2 - 5-25% of Customers
CS Severity: 3 - Serious
CS Business Value: 4 - $$$$$
CS Impact: Long running Jrubies are very difficult to identify the root cause for. Some customers have been satisfied with just restarting puppetserver but that will not always be the case. The cost of not having this is a customer confidence issue when we cannot explain the root cause of their issue, only provide the workaround of restarting.

Support has reached the extent of how deeply they can dig without this additional functional.

The only additional option is debug mode which has significant performance impacts and is not suitable for running in production. It also requires a restart which means you may be waiting for an extended period to see the issue recur.
Release Notes: New Feature
Release Notes Summary: Added an endpoint to Puppet Server, `/puppet-admin-api/v1/jruby-pool/thread-dump`, that will return a thread dump of running JRuby instances, if `jruby.management.enabled` has been set to true in the JVM running puppetserver. See https://github.com/puppetlabs/puppetserver/blob/master/documentation/admin-api/v1/jruby-pool.markdown#get-puppet-admin-apiv1jruby-poolthread-dump for details.
QA Risk Assessment: Needs Assessment

 Description   

As an administrator of Puppet Infrastructure, I need to determine why a Puppet Server request using a JRuby instance has stalled before I restart the service. This can happen if the request has triggered methods from core Puppet code or a Ruby plugin that will never return.

Reproduction Case

  • Install and start Puppet Server on a CentOS 7 VM:

rpm -Uvh http://yum.puppetlabs.com/puppet5/puppet5-release-el-7.noarch.rpm
yum install -y puppetserver
 
/opt/puppetlabs/puppet/bin/puppet config set --section main server "$(hostname -f)"
systemctl start puppetserver

  • Configure site.pp to sleep for an absurd amount of time in response to a catalog request:

cat <<"EOM" > /etc/puppetlabs/code/environments/production/manifests/site.pp
node default {
  inline_template('<% Kernel.sleep(1800) %>')
  # I feel much better after a 30 minute catnap!
}
EOM

  • Kick off a puppet run: /opt/puppetlabs/bin/puppet agent -t
  • Send the node's hostname to your coworker on Friday afternoon and tell them to figure out what is going on without looking at site.pp (because this could be a problem with 1 out of 100s of installed modules). Leave for a two week vacation.

Current Options

Below are listed two current options for checking the state of stalled JRuby requests. Both can help identify that a request is indeed stalled, but neither provides visibility into what Ruby code is responsible.

Checking the status of the compiler pool shows that a catalog request is hung, but gives no info as to where the request is stalled:

# curl -k https://localhost:8140/status/v1/services/jruby-metrics?level=debug|python -m json.tool
{
    "active_alerts": [],
    "detail_level": "debug",
    "service_name": "jruby-metrics",
    "service_status_version": 1,
    "service_version": "5.3.1",
    "state": "running",
    "status": {
        "experimental": {
            "jruby-pool-lock-status": {
                "current-state": ":not-in-use",
                "last-change-time": "2018-04-20T18:24:02.904Z"
            },
            "metrics": {
                "average-borrow-time": 82,
                "average-free-jrubies": 0.000271409884577374,
                "average-lock-held-time": 0,
                "average-lock-wait-time": 0,
                "average-requested-jrubies": 0.0,
                "average-wait-time": 0,
                "borrow-count": 5,
                "borrow-retry-count": 0,
                "borrow-timeout-count": 0,
                "borrowed-instances": [
                    {
                        "duration-millis": 470045,
                        "reason": {
                            "request": {
                                "request-method": "post",
                                "route-id": "puppet-v3-catalog-/*/",
                                "uri": "/puppet/v3/catalog/tvtyjvubmmaixqi.delivery.puppetlabs.net"
                            }
                        },
                        "time": 1524248690580
                    }
                ],
                "num-free-jrubies": 0,
                "num-jrubies": 1,
                "num-pool-locks": 0,
                "queue-limit-hit-count": 0,
                "queue-limit-hit-rate": 0.0,
                "requested-count": 5,
                "requested-instances": [],
                "return-count": 4
            }
        }
    }
}

Pulling a thread dump using jstack after installing java-1.8.0-openjdk-devel shows a JRuby thread waiting (on RubyKernel.sleep, but thats a lucky coincidence), but the stack trace is at the Java and JRuby AST level which gives no indications as to what Ruby code caused it:

# sudo -u puppet jstack $(cat /var/run/puppetlabs/puppetserver/puppetserver.pid)
...
"qtp281635025-65" #65 prio=5 os_prio=0 tid=0x00007f30f00c4000 nid=0x9f3 in Object.wait() [0x00007f30edaa4000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x00000000f568d888> (a org.jruby.RubyThread)
	at org.jruby.RubyThread.sleep(RubyThread.java:1002)
	- locked <0x00000000f568d888> (a org.jruby.RubyThread)
	- locked <0x00000000f568d888> (a org.jruby.RubyThread)
	at org.jruby.RubyKernel.sleep(RubyKernel.java:803)
	at org.jruby.RubyKernel$INVOKER$s$0$1$sleep.call(RubyKernel$INVOKER$s$0$1$sleep.gen)
	at org.jruby.internal.runtime.methods.JavaMethod$JavaMethodN.call(JavaMethod.java:667)
	at org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:206)
	at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:326)
	at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:170)
	at org.jruby.ast.CallOneArgNode.interpret(CallOneArgNode.java:57)
	at org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)
	at org.jruby.ast.BlockNode.interpret(BlockNode.java:71)
	at org.jruby.ast.RootNode.interpret(RootNode.java:129)
	at org.jruby.evaluator.ASTInterpreter.INTERPRET_EVAL(ASTInterpreter.java:95)
 
...skipping a lot of JRuby AST calls for brevity...
 
puppetlabs.services.request_handler.request_handler_core$jruby_request_handler$fn__31686.invoke(request_handler_core.clj:273)
	at puppetlabs.puppetserver.jruby_request$wrap_with_jruby_instance$fn__31597.invoke(jruby_request.clj:48)
	at puppetlabs.puppetserver.jruby_request$wrap_with_error_handling$fn__31593.invoke(jruby_request.clj:33)
	at puppetlabs.services.request_handler.request_handler_service$reify__31694$service_fnk__5266__auto___positional$reify__31709.handle_request(request_handler_service.clj:47)
	at puppetlabs.services.protocols.request_handler$eval27407$fn__27408$G__27399__27411.invoke(request_handler.clj:3)
	at puppetlabs.services.protocols.request_handler$eval27407$fn__27408$G__27398__27415.invoke(request_handler.clj:3)
	at clojure.core$partial$fn__4759.invoke(core.clj:2515)
	at puppetlabs.trapperkeeper.authorization.ring_middleware$eval39879$wrap_authorization_check__39888$fn__39891$fn__39892.invoke(ring_middleware.clj:284)
	at puppetlabs.ring_middleware.core$eval21857$wrap_bad_request__21866$fn__21869$fn__21874.invoke(core.clj:168)
	at puppetlabs.ring_middleware.core$eval21953$wrap_uncaught_errors__21962$fn__21965$fn__21970.invoke(core.clj:211)
	at puppetlabs.ring_middleware.core$eval21584$wrap_request_logging__21589$fn__21590$fn__21592.invoke(core.clj:45)
	at puppetlabs.i18n.core$locale_negotiator$fn__124.invoke(core.clj:357)
	at puppetlabs.ring_middleware.core$eval21613$wrap_response_logging__21618$fn__21619$fn__21620.invoke(core.clj:51)
	at puppetlabs.puppetserver.ringutils$wrap_with_puppet_version_header$fn__17331.invoke(ringutils.clj:83)
	at puppetlabs.services.master.master_core$eval38154$v3_ruby_routes__38159$fn__38160$fn__38177.invoke(master_core.clj:738)
	at bidi.ring$eval22897$fn__22898.invoke(ring.cljc:25)
	at bidi.ring$eval22876$fn__22877$G__22867__22886.invoke(ring.cljc:16)
	at puppetlabs.comidi$make_handler$fn__25007.invoke(comidi.clj:245)
	at puppetlabs.metrics.http$eval37332$wrap_with_request_metrics__37337$fn__37341$fn__37343$fn__37344$fn__37345.invoke(http.clj:152)
	at puppetlabs.metrics.http.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
	at com.codahale.metrics.Timer.time(Timer.java:101)
	at puppetlabs.metrics.http$eval37332$wrap_with_request_metrics__37337$fn__37341$fn__37343$fn__37344.invoke(http.clj:152)
	at puppetlabs.metrics.http.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
	at com.codahale.metrics.Timer.time(Timer.java:101)
	at puppetlabs.metrics.http$eval37332$wrap_with_request_metrics__37337$fn__37341$fn__37343.invoke(http.clj:148)
	at puppetlabs.comidi$eval25068$wrap_with_route_metadata__25073$fn__25074$fn__25076.invoke(comidi.clj:332)
	at puppetlabs.trapperkeeper.services.webserver.jetty9_core$ring_handler$fn__35219.invoke(jetty9_core.clj:433)
	at puppetlabs.trapperkeeper.services.webserver.jetty9_core.proxy$org.eclipse.jetty.server.handler.AbstractHandler$ff19274a.handle(Unknown Source)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
	at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
	at puppetlabs.trapperkeeper.services.webserver.normalized_uri_helpers$eval34795$normalize_uri_handler__34800$fn__34801$fn__34802.invoke(normalized_uri_helpers.clj:74)
	at puppetlabs.trapperkeeper.services.webserver.normalized_uri_helpers.proxy$org.eclipse.jetty.server.handler.HandlerWrapper$ff19274a.handle(Unknown Source)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:190)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:170)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
	at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:521)
	at org.eclipse.jetty.server.handler.RequestLogHandler.handle(RequestLogHandler.java:56)
	at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:169)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.eclipse.jetty.server.Server.handle(Server.java:564)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:317)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:110)
	at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:278)
	at org.eclipse.jetty.io.ssl.SslConnection$3.succeeded(SslConnection.java:148)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:110)
	at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124)
	at org.eclipse.jetty.util.thread.Invocable.invokePreferred(Invocable.java:128)
	at org.eclipse.jetty.util.thread.Invocable$InvocableExecutor.invoke(Invocable.java:222)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:294)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:199)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:672)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:590)
	at java.lang.Thread.run(Thread.java:748)
...



 Comments   
Comment by Charlie Sharpsteen [ 2018/04/20 ]

Two places where adding this might make sense:

  • As an additional output option for the status API, perhaps at a TRACE level: status/v1/services/jruby-metrics?level=trace
  • As a dedicated endpoint under /puppet-admin-api

The second might make more sense as it requires authentication via a certificate — given stacktraces show up on security audits as sensitive information that should be access controlled.

Comment by Justin Stoller [ 2018/04/27 ]

Discussed at our recent grooming session.

We like the idea of being able to easily pin point the cause of a hung process and then prioritize fixing the cause of the hang before adding in a global timeout to JRuby that may or may not leave the system in a good state.

We were unsure about an actual HTTP api for it though and Patrick Carlisle believes there's a why to get a meaningful JRuby stack trace out of Puppet Server using linux's perf tool (and maybe reifyclasses?). Right now we're going to look into that further so that we can hopefully provide a faster response to the current customer escalation that Charlie mentioned.

We will do that work in PE-24082, share it more widely if successful, and reprioritize this and SERVER-1997 in relation to how successful that work is.

Comment by Charlie Sharpsteen [ 2018/04/27 ]

Tools like JStack won't correlate a JRuby thread with the request that is being processed by Puppet Server — I don't know if perf would be any more helpful in that regard. An API could return the JRuby instance ID and borrow reason which makes it very easy to find the right stacktrace.

If we could get the info from perf that would be awesome and a useful option to have. However, I think an API would save us time in the long run by providing better context and also cutting through the red tape that can come with installing debugging tools on production servers.

Comment by Justin Stoller [ 2018/05/07 ]

I had a minute between collecting stacktraces for my last ticket and I wanted to investigate this a little further. I think we can get a stack trace from a JRuby scripting container calling the below, though I haven't verified it's a Ruby vs JRuby interpreter stacktrace:

(-> scripting-container .getProvider .getRuntime .getCurrentContext (.createBacktrace2 100 true))

Or probably the java equivalent implemented on our subclass of scripting container. Access to each scripting container should be available even when checked out by the pools getRegisteredElements method. Doing this from the jruby-pool-manager impl in clojure should allow us to display checkout reason and stacktrace.

If we don't want to open this up as an http api we might still want to register a signal handler?

Comment by Justin Stoller [ 2018/05/07 ]

Also, fwiw, my current hunch is that a Ruby stack trace won't help the previously mentioned support escalation (though if my hunch is correct it'd obviously eliminate JRuby as a culprit).

Comment by Justin Stoller [ 2018/05/16 ]

Whell, turns out you can already do this!

The steps to do this are:

  1. Add -Djruby.management.enabled=true to the JAVA_ARGS.
  2. Uncomment the metrics.metrics-webservice.servlet-init-params.policyLocation setting in /etc/puppetlabs/puppetserver/conf.d/metrics.conf (should be a file path to /etc/puppetlabs/puppetserver/jolokia-access.xml.
  3. Create a file in the location specified by policyLocation above with the content:
    1. ## <?xml version="1.0" encoding="utf-8"?>
       
      <restrict>
       
        <allow>
          <mbean>
            <name>org.jruby:type=Runtime,service=Runtime</name>
            <operation>threadDump</operation>
          </mbean>
        </allow>
       
      </restrict>
      

  4. Restart the server
  5. Go to https://<master>/metrics/v2/list and find the JRuby instance whose stacktrace you want
    1. The full mbean name will be org.jruby:name=<internal jruby instance identifier>,service=Runtime,type=Runtime
  6. Go to https://<master>/metrics/v2/exec/<full mbean name>/threadDump to see the current thread dump for that Jruby instance.

 
Note: the Jolokia exec functionality that this leverages is super powerful. Users should be very weary of turning it on as it allows remote code execution by design. There are ways in the jolokia-access.xml file to further limit access (like by ip range) that they should also take advantage of.

Comment by Justin Stoller [ 2018/05/16 ]

I've attached a screenshot of a query against a local vagrant vm running Puppet Server 5.3.1.

Comment by Charlie Sharpsteen [ 2018/05/17 ]

Oh nice! Yeah, the JMX exec functionality that Jolokia exposes is a very, very sharp tool which is exactly why I locked it down by default

But, given the heavy lifting seems to be already available via JMX, perhaps it would be straight-forward to add an Puppet Server admin API endpoint that takes a JRuby instance ID and returns the borrow reason, borrow time and JMX stacktrace. This would solve two shortcomings of the JMX approach:

  • The internal JRuby instance identifier isn't directly mappable to the JRuby pool id which identifies "the instance that has been stuck for 24 hours doing something".
  • The /metrics/v2 API has no auth, while /puppet-admin-api requires a certificate.

I think we'll get the most use out if this if it is exposed in a way that can be enabled by default and automatically polled by the PE support script. Otherwise, if someone has to jump through hoops to properly secure JMX exec and then re-start the service (which clears the hang that we need to debug), then we'll have to put in more effort to walk users through it along with a lower success rate.

Comment by Justin Stoller [ 2018/05/17 ]

Below is a threadDump from a VM stuck in the clojure slurp call that we're seeing in PE-24082. You might notice that the JRuby thread dump printer does a great job of filtering out JVM symbols and then pretty printing the non-Java calls, making the clojure stack easier to read as well (with the exception that functions names are all invoke).

Oddly, JRuby prints out stack traces for all threads its VM knows about, which includes jetty threads that have checked it out previously. You can see both JRuby instances' threads in the below thread dump (the calls to io.clj are where the issue leaves clojure and/or ruby and becomes a pure java problem).

All threads known to Ruby instance 1555721580
 
Thread: qtp1105564375-63
Stack:
jruby_internal.clj:133:in `invoke': thread dump (RuntimeError)
	from jruby_internal.clj:130:in `invoke'
	from core.clj:2515:in `invoke'
	from jruby_internal.clj:271:in `invoke'
	from jruby_internal.clj:264:in `invoke'
	from jruby_internal.clj:312:in `invoke'
	from jruby_internal.clj:301:in `invoke'
	from jruby_core.clj:212:in `invoke'
	from jruby_core.clj:199:in `invoke'
	from jruby_request.clj:46:in `invoke'
	from jruby_request.clj:33:in `invoke'
	from request_handler_service.clj:47:in `handle_request'
	from request_handler.clj:3:in `invoke'
	from request_handler.clj:3:in `invoke'
	from core.clj:2515:in `invoke'
	from ring_middleware.clj:284:in `invoke'
	from core.clj:168:in `invoke'
	from core.clj:211:in `invoke'
	from core.clj:45:in `invoke'
	from core.clj:357:in `invoke'
	from core.clj:51:in `invoke'
	from ringutils.clj:83:in `invoke'
	from master_core.clj:742:in `invoke'
	from ring.cljc:25:in `invoke'
	from ring.cljc:16:in `invoke'
	from comidi.clj:245:in `invoke'
	from http.clj:152:in `invoke'
	from http.clj:152:in `invoke'
	from http.clj:148:in `invoke'
	from comidi.clj:332:in `invoke'
	from jetty9_core.clj:433:in `invoke'
	from normalized_uri_helpers.clj:74:in `invoke'
 
 
Thread: clojure-agent-send-pool-0
Stack:
thread dump (RuntimeError)
 
 
Thread: qtp1105564375-59
Stack:
jruby_internal.clj:133:in `invoke': thread dump (RuntimeError)
	from jruby_internal.clj:130:in `invoke'
	from core.clj:2515:in `invoke'
	from jruby_internal.clj:271:in `invoke'
	from jruby_internal.clj:264:in `invoke'
	from jruby_internal.clj:312:in `invoke'
	from jruby_internal.clj:301:in `invoke'
	from jruby_core.clj:212:in `invoke'
	from jruby_core.clj:199:in `invoke'
	from jruby_request.clj:46:in `invoke'
	from jruby_request.clj:33:in `invoke'
	from request_handler_service.clj:47:in `handle_request'
	from request_handler.clj:3:in `invoke'
	from request_handler.clj:3:in `invoke'
	from core.clj:2515:in `invoke'
	from ring_middleware.clj:284:in `invoke'
	from core.clj:168:in `invoke'
	from core.clj:211:in `invoke'
	from core.clj:45:in `invoke'
	from core.clj:357:in `invoke'
	from core.clj:51:in `invoke'
	from ringutils.clj:83:in `invoke'
	from master_core.clj:742:in `invoke'
	from ring.cljc:25:in `invoke'
	from ring.cljc:16:in `invoke'
	from comidi.clj:245:in `invoke'
	from http.clj:152:in `invoke'
	from http.clj:152:in `invoke'
	from http.clj:148:in `invoke'
	from comidi.clj:332:in `invoke'
	from jetty9_core.clj:433:in `invoke'
	from normalized_uri_helpers.clj:74:in `invoke'
 
 
Thread: async-dispatch-2
Stack:
thread dump (RuntimeError)
 
 
Thread: qtp1105564375-61
Stack:
jruby_internal.clj:133:in `invoke': thread dump (RuntimeError)
	from jruby_internal.clj:130:in `invoke'
	from core.clj:2515:in `invoke'
	from jruby_internal.clj:271:in `invoke'
	from jruby_internal.clj:264:in `invoke'
	from jruby_internal.clj:312:in `invoke'
	from jruby_internal.clj:301:in `invoke'
	from jruby_core.clj:212:in `invoke'
	from jruby_core.clj:199:in `invoke'
	from jruby_request.clj:46:in `invoke'
	from jruby_request.clj:33:in `invoke'
	from request_handler_service.clj:47:in `handle_request'
	from request_handler.clj:3:in `invoke'
	from request_handler.clj:3:in `invoke'
	from core.clj:2515:in `invoke'
	from ring_middleware.clj:284:in `invoke'
	from core.clj:168:in `invoke'
	from core.clj:211:in `invoke'
	from core.clj:45:in `invoke'
	from core.clj:357:in `invoke'
	from core.clj:51:in `invoke'
	from ringutils.clj:83:in `invoke'
	from master_core.clj:738:in `invoke'
	from ring.cljc:25:in `invoke'
	from ring.cljc:16:in `invoke'
	from comidi.clj:245:in `invoke'
	from http.clj:152:in `invoke'
	from http.clj:152:in `invoke'
	from http.clj:148:in `invoke'
	from comidi.clj:332:in `invoke'
	from jetty9_core.clj:433:in `invoke'
	from normalized_uri_helpers.clj:74:in `invoke'
 
 
Thread: qtp1105564375-62
Stack:
jruby_internal.clj:133:in `invoke': thread dump (RuntimeError)
	from jruby_internal.clj:130:in `invoke'
	from core.clj:2515:in `invoke'
	from jruby_internal.clj:271:in `invoke'
	from jruby_internal.clj:264:in `invoke'
	from jruby_internal.clj:312:in `invoke'
	from jruby_internal.clj:301:in `invoke'
	from jruby_core.clj:212:in `invoke'
	from jruby_core.clj:199:in `invoke'
	from jruby_request.clj:46:in `invoke'
	from jruby_request.clj:33:in `invoke'
	from request_handler_service.clj:47:in `handle_request'
	from request_handler.clj:3:in `invoke'
	from request_handler.clj:3:in `invoke'
	from core.clj:2515:in `invoke'
	from ring_middleware.clj:284:in `invoke'
	from core.clj:168:in `invoke'
	from core.clj:211:in `invoke'
	from core.clj:45:in `invoke'
	from core.clj:357:in `invoke'
	from core.clj:51:in `invoke'
	from ringutils.clj:83:in `invoke'
	from master_core.clj:721:in `invoke'
	from ring.cljc:25:in `invoke'
	from ring.cljc:16:in `invoke'
	from comidi.clj:245:in `invoke'
	from http.clj:152:in `invoke'
	from http.clj:152:in `invoke'
	from http.clj:148:in `invoke'
	from comidi.clj:332:in `invoke'
	from jetty9_core.clj:433:in `invoke'
	from normalized_uri_helpers.clj:74:in `invoke'
 
 
Thread: qtp1105564375-66
Stack:
io.clj:337:in `invokeStatic': thread dump (RuntimeError)
	from io.clj:334:in `invoke'
	from io.clj:406:in `invokeStatic'
	from io.clj:391:in `doInvoke'
	from core.clj:6665:in `invokeStatic'
	from core.clj:6657:in `doInvoke'
	from request_handler_core.clj:78:in `invokeStatic'
	from request_handler_core.clj:54:in `invoke'
	from request_handler_core.clj:86:in `invokeStatic'
	from request_handler_core.clj:81:in `invoke'
	from request_handler_core.clj:269:in `invoke'
	from jruby_request.clj:48:in `invoke'
	from jruby_request.clj:33:in `invoke'
	from request_handler_service.clj:47:in `handle_request'
	from request_handler.clj:3:in `invoke'
	from request_handler.clj:3:in `invoke'
	from core.clj:2515:in `invoke'
	from ring_middleware.clj:284:in `invoke'
	from core.clj:168:in `invoke'
	from core.clj:211:in `invoke'
	from core.clj:45:in `invoke'
	from core.clj:357:in `invoke'
	from core.clj:51:in `invoke'
	from ringutils.clj:83:in `invoke'
	from master_core.clj:738:in `invoke'
	from ring.cljc:25:in `invoke'
	from ring.cljc:16:in `invoke'
	from comidi.clj:245:in `invoke'
	from http.clj:152:in `invoke'
	from http.clj:152:in `invoke'
	from http.clj:148:in `invoke'
	from comidi.clj:332:in `invoke'
	from jetty9_core.clj:433:in `invoke'
	from normalized_uri_helpers.clj:74:in `invoke'
 
 
Thread: qtp1105564375-65
Stack:
io.clj:337:in `invokeStatic': thread dump (RuntimeError)
	from io.clj:334:in `invoke'
	from io.clj:406:in `invokeStatic'
	from io.clj:391:in `doInvoke'
	from core.clj:6665:in `invokeStatic'
	from core.clj:6657:in `doInvoke'
	from request_handler_core.clj:78:in `invokeStatic'
	from request_handler_core.clj:54:in `invoke'
	from request_handler_core.clj:86:in `invokeStatic'
	from request_handler_core.clj:81:in `invoke'
	from request_handler_core.clj:269:in `invoke'
	from jruby_request.clj:48:in `invoke'
	from jruby_request.clj:33:in `invoke'
	from request_handler_service.clj:47:in `handle_request'
	from request_handler.clj:3:in `invoke'
	from request_handler.clj:3:in `invoke'
	from core.clj:2515:in `invoke'
	from ring_middleware.clj:284:in `invoke'
	from core.clj:168:in `invoke'
	from core.clj:211:in `invoke'
	from core.clj:45:in `invoke'
	from core.clj:357:in `invoke'
	from core.clj:51:in `invoke'
	from ringutils.clj:83:in `invoke'
	from master_core.clj:738:in `invoke'
	from ring.cljc:25:in `invoke'
	from ring.cljc:16:in `invoke'
	from comidi.clj:245:in `invoke'
	from http.clj:152:in `invoke'
	from http.clj:152:in `invoke'
	from http.clj:148:in `invoke'
	from comidi.clj:332:in `invoke'
	from jetty9_core.clj:433:in `invoke'
	from normalized_uri_helpers.clj:74:in `invoke'

Comment by Eric Sorenson [ 2018/05/17 ]

That seems pretty nice. Charlie Sharpsteen you OK with the current state of stack dumps?

Comment by Charlie Sharpsteen [ 2018/05/17 ]

Eric Sorenson The output looks great and definitely something we'll try at the moment. However, I still see a need for a solution that doesn't involve discarding the state of the problem and has a lower bar for secure configuration.

This should be data that we can collect immediately as a first response to troubleshooting a hung Puppet Server, ideally as part of an automated diagnostic check in the PE support script.

Comment by Justin Stoller [ 2018/05/22 ]

Charlie Sharpsteen what's the best way for us to get the current abilities documented for customer success/customers? Do you think this would be valuable in the FOSS docs, or should this target some CS best practices docs?

Comment by Charlie Sharpsteen [ 2018/05/29 ]

Filed SUP-465 to get the JMX instructions written up as a KB article similar to what we have for using JStack to pull Java stack traces.

I'm also marking this ticket as re-opened as we really do need something that can be deployed as a first response without any re-configuration that discards the state we need to debug.

Comment by Charlie Sharpsteen [ 2019/08/15 ]

I took a stab at spiking this out, and it turned out to be fairly straight-forward:

diff --git a/src/clj/puppetlabs/services/puppet_admin/puppet_admin_core.clj b/src/clj/puppetlabs/services/puppet_admin/puppet_admin_core.clj
index a2d58be9..95d55e22 100644
--- a/src/clj/puppetlabs/services/puppet_admin/puppet_admin_core.clj
+++ b/src/clj/puppetlabs/services/puppet_admin/puppet_admin_core.clj
@@ -3,6 +3,12 @@
   (:require [puppetlabs.kitchensink.core :as ks]
             [puppetlabs.puppetserver.ringutils :as ringutils]
             [puppetlabs.services.protocols.jruby-puppet :as jruby-puppet]
+
+            [puppetlabs.services.jruby-pool-manager.jruby-core :as jruby-core]
+            [puppetlabs.trapperkeeper.services :as tk-services]
+            [puppetlabs.ring-middleware.utils :as tk-ring]
+            [clojure.java.jmx :as jmx]
+
             [puppetlabs.puppetserver.liberator-utils :as liberator-utils]
             [schema.core :as schema]
             [liberator.core :refer [defresource]]
@@ -76,7 +82,21 @@
       (environment-cache-resource jruby-service
         (get-in request [:query-params "environment"])))
     (comidi/DELETE "/jruby-pool" []
-      (handle-jruby-pool-flush jruby-service))))
+      (handle-jruby-pool-flush jruby-service))
+    (comidi/GET "/jruby-pool/thread-dump" []
+      (let [{:keys [pool-context]} (tk-services/service-context jruby-service)
+            jruby-instances (jruby-core/registered-instances pool-context)]
+        (tk-ring/json-response
+         200
+         (for [instance jruby-instances]
+           (let [mbean-name (-> (:scripting-container instance)
+                                .getProvider
+                                .getRuntime
+                                .getBeanManager
+                                .base
+                                (str "service=Runtime"))]
+             {:id (:id instance)
+              :thread-dump (jmx/invoke mbean-name :threadDump)})))))))
 
 (defn versioned-routes
   [jruby-service]

The code above requires -Djruby.management.enabled=true to be set on the process. The implementation probably belongs over in the jruby-utils project so that it can be used by other services that run JRuby.

Comment by Charlie Sharpsteen [ 2019/08/19 ]

Looks like a JRuby bug hit somewhere between 9.2.0.0 and 9.2.7.0 that broke the stack traces provides by jruby.management.enabled. Fixed in 9.2.8.0 by this commit:

https://github.com/jruby/jruby/commit/d2875a4

Comment by Justin Stoller [ 2019/08/19 ]

Ha, we want 9.2.8 in Lovejoy for the perf improvements. We were releasing 9.2.7 to prepare for the release not knowing exactly when 9.2.8 was coming out, of course it came out while we were in the middle of upgrading. I'll do the 9.2.8 upgrade now.

Comment by Charlie Sharpsteen [ 2019/08/19 ]

Put up a pair of pull requests. The first is to jruby-utils, and provides an API for gathering thread-dump info from a JRuby pool:

https://github.com/puppetlabs/jruby-utils/pull/71

The second is to Puppetserver and adds an REST API endpoint to /puppet-admin-api which retrieves the thread dumps from jruby-utils and formats them as JSON:

https://github.com/puppetlabs/puppetserver/pull/2154

Both require -Djruby.management.enabled=true to be set in JAVA_ARGS. But, once that is in place, thread dumps can be pulled in one REST API call vs. re-configuring JMX access rules, listing out JRuby mbeans, and then looping through the list and invoking the threadDump method.

Generated at Thu Nov 21 03:22:43 PST 2019 using JIRA 7.7.1#77002-sha1:e75ca93d5574d9409c0630b81c894d9065296414.