Puppet Server / SERVER-2860

JRubyInstance borrow timing-out on multithreaded mode


Details

    • Type: Bug
    • Status: Ready for Review
    • Priority: Normal
    • Resolution: Unresolved
    • Affects Version/s: SERVER 6.11.1
    • Fix Version/s: None
    • Component/s: Puppet Server
    • Template: PUP Bug Template
    • Team: Froyo
    • Needs Assessment
    • 41282
    • 1
    • Needs Assessment

    Description

      Puppet Version:
      Puppet Server Version:
      OS Name/Version:

      I'm testing the new multithreaded option on Puppet Server 6.11.1 running on a Kubernetes cluster. The main reason for going with this option is to reduce the amount of memory used by Puppet.

      In my tests I noticed that after some time (around 2 or 3 hours) the server freezes, and from that moment every single request produces a line similar to the following one:

      Error 503 on SERVER at /puppet/v3/file_metadatas/pluginfacts: Attempt to borrow a JRubyInstance from the pool timed out.
      
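The timeout itself is ordinary bounded-pool behavior. As a minimal sketch (not Puppet Server's actual implementation, which lives in Clojure/Java in puppetlabs/jruby-utils), instances sit in a bounded queue, and a borrow that gets nothing back within borrow-timeout is what the server surfaces as the 503 above:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;

// Minimal sketch of borrow-with-timeout semantics -- NOT Puppet Server's
// real code. A bounded queue holds the available instances; a borrow that
// times out returns null, which the caller maps to an HTTP 503.
public class BorrowSketch {
    static final ArrayBlockingQueue<String> pool = new ArrayBlockingQueue<>(1);

    static String borrow(long timeoutMs) throws InterruptedException {
        return pool.poll(timeoutMs, TimeUnit.MILLISECONDS); // null on timeout
    }

    static boolean demo() {
        try {
            pool.put("jruby-instance-1");          // one instance available
            boolean ok = borrow(10) != null;       // first borrow succeeds
            boolean timedOut = borrow(10) == null; // pool now empty: times out
            return ok && timedOut;
        } catch (InterruptedException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // true
    }
}
```

If instances are borrowed but never returned (as the metrics below suggest), every later borrow hits the timeout path.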

      Around the same time that I start seeing the errors, I see this line in the logs:

      Max borrows reached, but JRubyPool could not be flushed because lock could not be acquired. Will try again later.
      
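That log line suggests the flush path needs an exclusive lock that in-flight borrows also hold in shared mode. A hedged sketch of that pattern (class and field names invented here, not Puppet Server's): if a borrowed instance is never returned, the write-side tryLock keeps failing and the flush is postponed indefinitely:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hedged sketch (names invented, not Puppet Server's code) of the pattern
// behind that log line: borrows hold the shared (read) lock, while a pool
// flush needs the exclusive (write) lock. If any borrow never comes back,
// tryLock on the write side keeps failing and the flush never happens.
public class FlushLockSketch {
    static final ReentrantReadWriteLock poolLock = new ReentrantReadWriteLock();

    static boolean tryFlush() {
        if (!poolLock.writeLock().tryLock()) {
            // "JRubyPool could not be flushed because lock could not be acquired"
            return false;
        }
        try {
            // ... flush/refresh the instances here ...
            return true;
        } finally {
            poolLock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        poolLock.readLock().lock();     // simulate a borrow that is never returned
        System.out.println(tryFlush()); // false -> flush postponed
        poolLock.readLock().unlock();   // the borrow finally comes back
        System.out.println(tryFlush()); // true -> flush can proceed
    }
}
```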

      This is a consequence of using max-requests-per-instance, a setting we need because there are memory leaks.
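For reference, the setup described here corresponds to a jruby-puppet section in puppetserver.conf along these lines (values taken from this report; borrow-timeout shown at its default):

```hocon
# puppetserver.conf (excerpt) -- values from this report
jruby-puppet: {
    multithreaded: true
    max-active-instances: 5          # maximum number of JRuby threads
    max-requests-per-instance: 3000  # recycle an instance after 3000 requests
                                     # (0 disables recycling)
    borrow-timeout: 1200000          # default: 20 minutes, in milliseconds
}
```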

      I'm monitoring all the relevant JRuby metrics available. These plots show the evolution of a pod running Puppet Server from the moment it is spawned to the moment it is killed:

      The plot on the left shows the borrow count, retries, and timeouts. At some point it's no longer possible to borrow more threads. The middle one shows threads requested and returned; both are quite similar, so it seems threads are successfully returned. The one on the right shows the total number of threads and the number of free ones. In this setup I'm setting the maximum number of threads to 5 and, as you can see, at some point only one thread is free (even though Puppet sometimes reports that there are none). Whether the server has traffic or not, the number of free threads doesn't change.

      I have max-requests-per-instance set to 3000, and at the moment requests start failing the borrow count is at 6K. Failing requests have a backend time of ~1,200,000 ms (20 minutes), which matches the default value for borrow-timeout.

      My latest test was to set max-requests-per-instance to 0, to see if the problem was related to the lock acquired when refreshing an instance. With this configuration the server was able to run for 7 hours, but the number of free threads, with and without traffic, was still 1 most of the time.

      The following plots cover the whole range from when I spawned the pod (with traffic already pointing at the server) until several minutes after I stopped it. As you can see, the number of free threads drops to 0 or 1 after 2 hours of running. After that, it never goes above 1.

      A thread dump of the Java process (taken with jstack) shows a deadlock. You can find the whole dump attached to this ticket.
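Aside from jstack, the same check can be done programmatically. This small standalone example (not part of Puppet Server) asks the running JVM for deadlocked threads, which is the same information jstack prints under "Found one Java-level deadlock":

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Standalone example (not part of Puppet Server): query the current JVM
// for Java-level deadlocks via the platform ThreadMXBean.
public class DeadlockCheck {
    static long[] findDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        return mx.findDeadlockedThreads(); // null when no threads are deadlocked
    }

    public static void main(String[] args) {
        long[] ids = findDeadlocks();
        System.out.println(ids == null
            ? "no deadlock"
            : ids.length + " deadlocked thread(s)");
    }
}
```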

      Attachments

        1. dump.txt
          762 kB
          David Moreno García
        2. image-2020-08-21-14-37-25-015.png
          93 kB
          David Moreno García
        3. image-2020-08-21-14-43-24-300.png
          317 kB
          David Moreno García

        Issue Links

          Activity

            People

              Assignee: maggie Maggie Dreyer
              Reporter: davidmogar David Moreno García
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated:
