[SERVER-389] Bump default-borrow-timeout to something significantly higher Created: 2015/02/23  Updated: 2016/09/27  Resolved: 2015/03/09

Status: Closed
Project: Puppet Server
Component/s: None
Affects Version/s: None
Fix Version/s: SERVER 1.0.8, SERVER 2.0.0

Type: Improvement Priority: Normal
Reporter: Jeremy Barlow Assignee: Erik Dasher
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
relates to SERVER-246 change macro 'borrow's to 'borrow-wit... Closed
relates to SERVER-391 Improve error message for JRuby pool ... Closed
relates to SERVER-408 Expose configurable `borrow-timeout` ... Closed
Template:
Epic Link: Green: Puppet 4.0 Changes
Sub-team: emerald
Story Points: 1
Sprint: Server Emerald 2015-03-04, Server Emerald 2015-03-18
QA Contact: Erik Dasher

 Description   

The default-borrow-timeout for JRuby instances is currently set at 60 seconds. See:

https://github.com/puppetlabs/puppet-server/blob/cd659b95ac557502f3bab2eea878fa7e6a131b6f/src/clj/puppetlabs/services/jruby/jruby_puppet_service.clj#L13

It doesn't seem unrealistic for this timeout to be hit in production. For example, all of the allocated JRubies could be concurrently tied up with expensive catalog compilations, each of which could take longer than 60 seconds to complete. We should consider a significantly larger default - ideally one with a comfortable buffer above the maximum time a realistic, expensive catalog compilation would take to complete. Not sure what that number is yet; we may need to solicit some outside feedback to come up with a good value...
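
As a rough back-of-the-envelope sketch of the failure mode (illustrative numbers only, not Puppet Server code):

```clojure
;; Back-of-the-envelope sketch with illustrative numbers, not Puppet Server
;; code: if every JRuby instance in the pool is busy with an expensive
;; compile, a newly arrived request may wait roughly one full compile time
;; before an instance frees up - easily longer than the 60-second default.
(let [compile-seconds        180   ; hypothetical expensive catalog compile
      default-borrow-timeout 60]   ; current default, in seconds
  {:worst-case-wait-seconds compile-seconds
   :borrow-times-out?       (> compile-seconds default-borrow-timeout)})
;; => {:worst-case-wait-seconds 180, :borrow-times-out? true}
```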

There's quite a bit of discussion in the thread below around what the timeout for JRuby pool borrowing should be. We ultimately settled on 20 minutes. The number is somewhat arbitrary in that we don't have much real-world data on catalog compilation and report processing times to suggest a specific value. An agent which has successfully made a socket connection with a master effectively uses an infinite timeout on the read it performs while waiting for the response to a request. Considering that, we also discussed the possibility of reverting to the "infinite" timeout that is in use in currently released puppetserver code. Ultimately, though, the majority was concerned that not bounding the wait time could lead to thread "hangs" on the server, ultimately manifested as "hangs" in the response back to a client. 20 minutes was a compromise that seems "long enough" to account for the upper bound on worst-case catalog compilations / report processing while not leaving the server vulnerable to these "hangs".



 Comments   
Comment by Chris Price [ 2015/02/23 ]

Note that in production, virtually all of the meaningful uses of the JRuby pool go through this macro:

https://github.com/puppetlabs/puppet-server/blob/cd659b95ac557502f3bab2eea878fa7e6a131b6f/src/clj/puppetlabs/services/jruby/jruby_puppet_service.clj#L96

which uses the `borrow` signature that doesn't specify a timeout. Therefore, I believe that in the catalog compilation scenario you're describing, we are already avoiding this issue. Bumping the default timeout might still be worthwhile for testing purposes, but I don't believe the change you're suggesting would have any impact on production scenarios.

Whether or not we should be using a timeout in the typical use case is a different question.

Comment by Jeremy Barlow [ 2015/02/23 ]

The macro doesn't specify a timeout, but it calls into a JRuby service function, `borrow-instance`, which makes a call to `borrow-from-pool-with-timeout` with a timeout value:

https://github.com/puppetlabs/puppet-server/blob/cd659b95ac557502f3bab2eea878fa7e6a131b6f/src/clj/puppetlabs/services/jruby/jruby_puppet_service.clj#L53

That timeout value may come from the `borrow-timeout` key in the `jruby-puppet` configuration section or, if not specified there, from the `default-borrow-timeout`, which is 60 seconds.
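
Roughly, the resolution logic looks like the sketch below. The helper name is hypothetical; only the `borrow-timeout` key, the `jruby-puppet` section, and the 60-second default come from the linked code (the unit is shown as milliseconds for illustration).

```clojure
;; Hypothetical sketch of the timeout resolution described above; not the
;; actual Puppet Server implementation.
(def default-borrow-timeout 60000) ; 60 seconds, shown in milliseconds

(defn borrow-timeout-from-config
  "Return the configured borrow timeout, falling back to the default."
  [config]
  (get-in config [:jruby-puppet :borrow-timeout] default-borrow-timeout))

;; With no explicit setting, every production borrow waits at most 60 seconds:
(borrow-timeout-from-config {:jruby-puppet {}})                        ;; => 60000
(borrow-timeout-from-config {:jruby-puppet {:borrow-timeout 1200000}}) ;; => 1200000
```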

The code on the stable branch doesn't have this change: calls made to the JRuby service to borrow an instance with no timeout - basically any call in a production code path - effectively use an infinite timeout there. The change that enforces a timeout on each JRuby borrow request came into the master branch with https://github.com/puppetlabs/puppet-server/pull/329.

Maybe it makes sense to rework this such that calls to the JRubyPuppet service and CI don't require a timeout.

Comment by Chris Price [ 2015/02/24 ]

Crap, you're right, sorry. I've been spending a lot more time in the stable branch lately.

This is obviously not good the way that it is; we can't ship with this. I guess we need to put SERVER-389 and maybe SERVER-266 into the Puppet Server 2.0 epic, and probably talk through the options before anyone takes on the work.

Comment by Jeremy Barlow [ 2015/02/25 ]

Chris Price and I talked about this and agreed that it's probably still better to have a timeout for borrowing JRuby pool instances - at least in the context of incoming HTTP requests, like from an agent - than to revert back to the state where the timeout is infinite. Based on some limited, empirical data we have, we agreed that bumping the default up to 20 minutes would be reasonable.

Comment by Kevin Corcoran [ 2015/02/25 ]

20 minutes sounds extremely high to me. You mention some data that led you to that number - could you elaborate?

Taking a step back, I don't see why this timeout should be increased at all. From the perspective of an agent, and especially of a user interacting with the HTTP API directly, if the server can't handle my request within 60 seconds because the system is so heavily loaded, then an HTTP 503 (triggered by a timeout) is exactly what should happen.

On that note, I wonder if Jetty and/or the agent's HTTP client time out connections after a period of inactivity, which would render this change useless anyway?

Comment by Jeremy Barlow [ 2015/02/25 ]

Chris Price could give some more perspective on some of the real-world-ish compilation times he's seen with some fairly complex catalogs. In some cases it was on the order of minutes, not seconds.

I believe a keep-alive connection - where TCP keep-alive packets may periodically be sent to keep lower-level transports from killing the connection due to inactivity - would only hit an idle timeout that results in Jetty terminating the connection if no HTTP request were outstanding. So for an HTTP keep-alive request to which no response has yet been sent, Jetty's default idle timeout of 30 seconds wouldn't cause the connection to be killed.

I'm not as familiar with timeouts on the Puppet client side of things. I believe the client uses different timeouts per request type - for example, file requests and configuration requests use different values. Given the potentially long response time for a catalog to be provided, I'm not sure if the client enforces an upper bound for catalog requests. Josh Cooper probably knows that.

Comment by Kevin Corcoran [ 2015/02/25 ]

It was on the order of some number of minutes, not seconds in some cases.

Wow, that is wild.

Okay, so let's say that all of the JRubies are occupied for several minutes, and one more request comes in - in this case, it still seems like a timeout after a minute is reasonable. The work of compiling a catalog hasn't even been started, and there's no way of knowing when it will be.

I also wonder about the potential impact on a heavily loaded system of letting things block for 20 minutes ... I think increasing this timeout also increases the resource consumption of the system under heavy load - now you have more connections open, more threads blocked waiting, etc.

Comment by Jeremy Barlow [ 2015/02/25 ]

Since JRubies are such an expensive resource - and therefore only a handful actually exist in a given server installation - I expect the case where all of the JRubies are checked out from the pool to be not all that uncommon. Combine that with realistic catalog compilation times, and it seems like a server with a 60-second borrow timeout could hit these failure conditions fairly frequently. From the agent's perspective, I'm not exactly sure what that means. If an agent had some retry intelligence built in, it seems like the same connection build-up situation you're mentioning would happen around the retry attempts. If an agent were dumber and just skipped the catalog compilation - deferring to a stale catalog - then getting agents, which typically run on 30-minute intervals, onto the latest catalog could be delayed significantly. Having the agent wait a hopefully much smaller amount of time on a pending HTTP catalog request, rather than waiting out a full default run interval before getting the latest catalog, would seem to mitigate this concern better.

I agree with you that we shouldn't put the server into a situation where the load it is taking on is "unbounded". However, I think tuning the maximum number of connections / requests the server should allow would be better done at the server level rather than on a per-request-type basis. And I think we need to put more thought into what the reasonable defaults at that layer need to be.

If we can't come to an agreement on this thread, we might be getting to the point soon where a few of us need to get together directly to hash this out...

Comment by Chris Price [ 2015/02/25 ]

I keep waffling on this one. It seems like maybe what we should do is determine what the agent uses for a timeout on the catalog request, and just set it to a value very slightly higher than that?

Comment by Chris Price [ 2015/02/25 ]

Sorry to thrash on this Jeremy Barlow, Nate Wolfe. Two things:

1. I was wrong this morning when I said that my gatling simulations with some customer data were showing a 3m catalog compile time. They're actually less than a minute. (I don't think these are necessarily the most expensive compiles a user would ever see but they've got some complexity to them, so they're not a terrible benchmark.)
2. Looking at the puppet docs:

https://docs.puppetlabs.com/references/latest/configuration.html#configtimeout

I think this implies that by default an agent will time out its HTTP request after 2 minutes. In which case it seems like it's probably pointless for us to wait much longer than that on the borrow. Jeremy Barlow, does that sound logical to you?

Comment by Jeremy Barlow [ 2015/02/25 ]

2. Looking at the puppet docs:
https://docs.puppetlabs.com/references/latest/configuration.html#configtimeout
I think this implies that by default an agent will timeout its HTTP request after 2 minutes. In which case it seems like it's probably pointless for us to wait much longer than that on the borrow

If this "configtimeout" is what the agent uses for the catalog request, then I'd agree with you. Josh Cooper - can you confirm that?

Comment by Kevin Corcoran [ 2015/02/25 ]

A 2-minute timeout sounds reasonable to me.

Comment by Jeremy Barlow [ 2015/02/25 ]

If we go with a much more moderate timeout, I think we also need to think about how this is apparently affecting our current Travis CI runs on the master branch. I have no idea what is causing those to run so slowly that basic requests for a pool instance are timing out. Maybe we need to add some sort of test-time-only construct that allows most of the tests - or at least the batch that seems most directly affected by this problem - to do borrows with a much higher or infinite timeout? I'd really like to avoid having tests fail periodically because of this issue.

Comment by Chris Price [ 2015/02/25 ]

Agreed, but if we decide 2m is the number, let's see how that goes with Travis and then react accordingly.

Comment by Kevin Corcoran [ 2015/02/25 ]

Jeremy Barlow - Agreed that making Travis runs stop failing is important. However, I think the discussion on this ticket so far has been concerned with production configuration. If Travis is still failing on 2-minute timeouts, perhaps we just need to make that timeout configurable and bump it up for Travis.

Comment by Jeremy Barlow [ 2015/02/25 ]

Agree with Chris Price and Kevin Corcoran on the wait and see approach with Travis and the 2 minute timeout, assuming we end up doing that.

If Travis is still failing on 2-minute timeouts, perhaps we just need to make that timeout configurable and bump it up for Travis.

The timeout is already configurable at the TK service level, via `borrow-timeout` under the `jruby-puppet` section. If we need to do something custom for testing, we might be able to just tweak it through that.
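
For example (a sketch only, shown as the Clojure map the Trapperkeeper config service would hand to the JRuby service; the millisecond unit is an assumption based on the discussion above):

```clojure
;; Hypothetical test-time override; the millisecond unit is an assumption.
;; Only the `jruby-puppet` section and `borrow-timeout` key come from the
;; discussion above.
{:jruby-puppet {:borrow-timeout 1200000}} ; 20 minutes
```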

Comment by Josh Cooper [ 2015/02/25 ]

If this "configtimeout" is what the agent uses for the catalog request, then I'd agree with you.

The agent no longer uses `configtimeout` because it conflated too many issues. The agent currently uses two network timeouts: `http_connect_timeout`, which defaults to 2m, and `http_read_timeout`, which defaults to infinite. The connect timeout handles the TCP handshake. The read timeout applies any time puppet makes a blocking read request.

Catalog compilation is one area where the client will block for a while, with an idle connection that might be closed by Jetty or a load balancer. The other is report processing, and that's the one we had the escalation about. It was taking 90 seconds to process the report, and the load balancer was killing the idle connection.

Comment by Jeremy Barlow [ 2015/03/03 ]

Didn't see any failures in recent CI that seem to be related to this - https://jenkins.puppetlabs.com/job/platform_puppet-server_integration-system_full-master/197 - so moving to testing.

Comment by Erik Dasher [ 2015/03/09 ]

Discussed this issue with Jeremy Barlow; we view this modification as low risk. The previous value for this timeout was infinite. The case where this could start causing problems is where customers have operations that routinely take 20 minutes to complete, in which case they really ought to be optimizing.

Marking as Resolved.
