[PDB-716] HTTP API gets stuck Created: 2014/06/18  Updated: 2015/06/04  Resolved: 2015/05/14

Status: Closed
Project: PuppetDB
Component/s: None
Affects Version/s: PDB 2.0.0
Fix Version/s: PDB 2.3.3

Type: Bug Priority: Normal
Reporter: Erik Dalén Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: HTML File dump1 through dump10, File puppetdbdashboard.tiff
Issue Links:
Relates
relates to PDB-686 java.util.concurrent.TimeoutException... Closed

 Description   

Sometimes the HTTP API seems to get completely stuck. According to the logs, other parts of PuppetDB keep working.

Attaching 10 stack trace dumps taken a few seconds apart with jstack -F <pid>
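
For reference, a series of dumps like these can be collected with a loop along the following lines (a sketch only; the pidfile path is an assumption about the install):

# Sketch: take 10 forced thread dumps, a few seconds apart.
PID=$(cat /var/run/puppetdb/puppetdb.pid)   # pidfile path is an assumption
for i in $(seq 1 10); do
    jstack -F "$PID" > "/tmp/dump${i}.txt"
    sleep 5
done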



 Comments   
Comment by Ken Barber [ 2014/06/18 ]

Erik Dalén I don't suppose you can also supply your dashboard stats as an image so we can get an idea of the current load on that PDB host?

Comment by Ken Barber [ 2014/06/18 ]

Also your puppetdb conf.d stuff would be useful, scrubbed of course. Just so we can double check any tuning settings.

The only setting I can think of today is to adjust max-threads to a higher number, like 300 or 400. It might at least reduce the frequency of the problem, at the risk of causing higher load - but we can experiment with the number.
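
For clarity, that setting lives in the [jetty] section of the config, roughly like this (the value is just a starting point to experiment with):

[jetty]
# Cap on Jetty worker threads; try 300-400 and keep an eye on the load.
max-threads = 300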

Comment by Erik Dalén [ 2014/06/19 ]

The config is fairly standard:

/etc/puppetdb/conf.d# cat *.ini
# See README.md for more thorough explanations of each section and
# option.
 
[global]
# Store mq/db data in a custom directory
vardir = /var/lib/puppetdb
# Use an external logback config file
logging-config = /etc/puppetdb/logback.xml
 
 
[command-processing]
# How many command-processing threads to use, defaults to (CPUs / 2)
# threads = 4
 
# Maximum amount of disk space (in MB) to allow for ActiveMQ persistent message storage
# store-usage = 102400
 
# Maximum amount of disk space (in MB) to allow for ActiveMQ temporary message storage
# temp-usage = 51200
[database]
# For the embedded DB: org.hsqldb.jdbcDriver
# For PostgreSQL: org.postgresql.Driver
# Defaults to embedded DB
classname = org.postgresql.Driver
 
# For the embedded DB: hsqldb
# For PostgreSQL: postgresql
# Defaults to embedded DB
subprotocol = postgresql
 
# For the embedded DB: file:/path/to/database;hsqldb.tx=mvcc;sql.syntax_pgs=true
# For PostgreSQL: //host:port/databaseName
# Defaults to embedded DB located in <vardir>/db
subname = //lon3-puppetdbdb-b1.lon3.spotify.net:5432/puppetdb
 
# Connect as a specific user
# username = foobar
username = puppetdb
 
# Use a specific password
# password = foobar
password = ---
 
# How often (in minutes) to compact the database
# gc-interval = 60
gc-interval = 60
 
# Number of seconds before any SQL query is considered 'slow'; offending
# queries will not be interrupted, but will be logged at the WARN log level.
log-slow-statements = 10
report-ttl = 14d
syntax_pgs = true
node-ttl = 7d
node-purge-ttl = 1d
conn-keep-alive = 45
conn-lifetime = 0
conn-max-age = 60
[jetty]
# Hostname to listen on for clear-text HTTP.  Default is localhost
#host = localhost
host = localhost
# Port to listen on for clear-text HTTP.
port = 8080
 
ssl-host = lon3-puppetdb-a1.lon3.spotify.net
ssl-port = 8081
# keystore = /etc/puppetdb/ssl/keystore.jks
# truststore = /etc/puppetdb/ssl/truststore.jks
 
# key-password = ---
# trust-password = ---
ssl-key = /etc/puppetdb/ssl/private.pem
ssl-cert = /etc/puppetdb/ssl/public.pem
ssl-ca-cert = /etc/puppetdb/ssl/ca.pem
[repl]
# Set to true to enable the remote REPL
enabled = false
 
# Either nrepl or swank or telnet
type = nrepl
 
# What port the REPL should listen on
port = 8082

Comment by Erik Dalén [ 2014/06/19 ]

Screenshot of the dashboard attached. The load average on the host is around 3 (24-core host), and it drops to 0 when PuppetDB gets stuck.

Comment by Ken Barber [ 2014/07/04 ]

Erik Dalén, let's get some more info. Since I foresee that replicating this under load will be a hard task, I think finding a more precise way to reproduce it might be a good approach.

Now, I still want to explore first the possibility that a particular Ruby version is causing the HTTP lockup, since I see a lot of the same symptoms as in PDB-686. Can you provide me with a list of the Ruby versions that are talking directly to PuppetDB? If the list is small I should be able to double-check whether this problem exists for them, since replicating PDB-686 is pretty easy (basically, POST a command with raw JSON and watch it block).
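
For the record, the PDB-686 reproduction is roughly the following; this is only a sketch, and the v3 commands endpoint path and payload shape are from memory of the 2.x API, so they may need adjusting:

# Sketch: POST a command as raw JSON and watch whether the request blocks.
curl -v -X POST http://localhost:8080/v3/commands \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{"command": "replace facts", "version": 3, "payload": "{\"name\": \"node.example.com\", \"values\": {}}"}'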

At the very least I can rule out my concerns around PDB-686 being the cause, and we can focus on other ideas.

Comment by Erik Dalén [ 2014/07/14 ]

The only ruby versions used are ruby 1.8.7 (2010-08-16 patchlevel 302) [x86_64-linux] from Debian squeeze and ruby 2.0.0p384 (2014-01-12) [x86_64-linux-gnu] from Ubuntu Trusty.

Comment by Ken Barber [ 2014/07/30 ]

Some discussion today about this problem; the main points:

  • http://wiki.eclipse.org/Jetty/Howto/High_Load#Jetty_Tuning <- specifically mentions limiting the maxqueue, which we do not do.
  • We don't expose JMX for Jetty (PDB-671), but we could. This might help us glean more information when it fails. Obviously /metrics doesn't work here, but pure JMX does.
  • We are currently unable to upgrade Jetty beyond 9.1.0, though I would love to rule Jetty out. TK-40 covers the limitation blocking an upgrade to 9.2.x, and TK-32 covers a bug in Jetty 9.1.5 that prevents us from upgrading to that release.
  • We have only heard of this problem from users running at very large capacity. Erik Dalén is the obvious case, plus we had another case today, so this has only been reported in the wild twice. "Me toos" are welcome, with details on your capacity.
  • NCSA access logging would also be useful (PDB-477), but in lieu of that a reverse proxy is probably just as good.

So reproducibility is my desired goal here; failing that, access to an environment that is exhibiting the bug, along with a willingness to run a debugging release, would be just as good if not more expedient.

Comment by Ken Barber [ 2014/07/31 ]

Some users have reported entropy-starvation problems and proven it by showing that switching to /dev/urandom solves their issues. Whether the symptoms here are really the same is hard to judge without a lot more data, however. Anyone who hits this issue should ensure they have this setting in their JAVA_ARGS:

-Djava.security.egd=file:/dev/urandom

This setting might not have been applied correctly during a package upgrade, due to the way dpkg and rpm handle configuration files: the user usually gets prompted to replace these files, or at least sees a warning, but the default is generally not to overwrite them.
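
On Debian-based systems this usually means the defaults file, along these lines (the path and the -Xmx value below are the usual package defaults and only an assumption about your setup):

# /etc/default/puppetdb (or /etc/sysconfig/puppetdb on RPM-based systems)
JAVA_ARGS="-Xmx192m -Djava.security.egd=file:/dev/urandom"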

Comment by Ken Barber [ 2015/01/08 ]

Is this still happening, mate? Unfortunately I've yet to hear of anyone else getting this besides yourself.

Comment by Ken Barber [ 2015/01/21 ]

This has been open for a while; we've never seen it ourselves, nor have we had any other reports of this happening in the wild. If someone can show us how to reproduce it, please re-open the ticket.

Comment by Erik Dalén [ 2015/01/23 ]

Sorry for the slightly late reply, but I just looked into this and it still seems to be happening quite regularly on our production instances. We have a script that restarts PuppetDB every time it happens, though. We do have the urandom setting in JAVA_ARGS, so that is not the issue.

Actually, if anything this issue is more frequent nowadays, but it still only happens on the more heavily loaded instances. We migrated them to Ubuntu Trusty a while ago and I was hoping that might solve it, but nope.

Here is a log excerpt from one of the prod instances showing how regular the restarts are:

root@lon3-puppetdb-a4:~# grep 'Starting broker' /var/log/puppetdb/puppetdb.log|tail -n10
2015-01-22 12:02:06,669 INFO  [c.p.p.c.services] Starting broker
2015-01-22 14:42:15,950 INFO  [c.p.p.c.services] Starting broker
2015-01-22 18:42:01,802 INFO  [c.p.p.c.services] Starting broker
2015-01-22 21:37:18,014 INFO  [c.p.p.c.services] Starting broker
2015-01-22 22:57:00,666 INFO  [c.p.p.c.services] Starting broker
2015-01-23 00:47:02,552 INFO  [c.p.p.c.services] Starting broker
2015-01-23 02:42:16,459 INFO  [c.p.p.c.services] Starting broker
2015-01-23 05:32:16,629 INFO  [c.p.p.c.services] Starting broker
2015-01-23 07:32:14,278 INFO  [c.p.p.c.services] Starting broker
2015-01-23 09:02:15,508 INFO  [c.p.p.c.services] Starting broker

But is there any debugging command we could run from that script to save output when it happens? Do you want more up-to-date jstack -F output?

Comment by Ken Barber [ 2015/01/23 ]

Dude, thanks for getting back to me. If you go back through this comment thread you can see a whole bunch of things I had been asking for:

  • We have now enabled Jetty JMX monitoring just for this very ticket, so what I was looking for is a way to graph the Jetty thread usage from your end, so we can see what it is doing when it hangs. The thread pool is probably best monitored via JMX directly rather than via the HTTP /metrics endpoint, because if the problem is HTTP threads hanging we may lose that data (see the crude jstack-based sketch after this list).
  • More up-to-date thread dumps may also be a good idea if you can get them; last time I looked they didn't really help us.
  • Testing with 2.2.2 would be the best thing, but that's obvious.
  • Beyond that, I would really love some sort of more comprehensive dump - perhaps a YourKit production CPU/memory snapshot taken during the failure. YourKit is a commercial product, and I don't want to ask you to spend money to fix the problem, but it is probably inexpensive compared to the hassle this problem causes. If this is something you can explore, we have YourKit clients here so we can read such dumps - let me know if you want to go down that road, as I feel more information is the key here. I believe YourKit has a trial, and the headless version for the collector might actually be free.
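
As a crude stop-gap for the thread-usage graphing in the first point, something like this run from cron every minute or so would at least record the state of the Jetty thread pool over time (a sketch only; the pidfile path and the qtp thread-name prefix are assumptions):

# Sketch: append a timestamped count of Jetty (qtp*) threads per state.
PID=$(cat /var/run/puppetdb/puppetdb.pid)
{
  date -Is
  jstack "$PID" | grep -A1 '"qtp' | grep 'java.lang.Thread.State' | sort | uniq -c
} >> /var/log/puppetdb/qtp-states.log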

I really do mean it when I say I haven't been able to reproduce this or seen it anywhere else. I've seen symptoms like this before, but they turned out to be something completely different, which is frustrating. Incidentally, we have been talking about embedding a free YourKit headless profiler in our products, but that work hasn't happened yet.

Now that I've got your attention again, I'll reopen the ticket.

Comment by Ken Barber [ 2015/04/14 ]

Erik Dalén, we upgraded Jetty in a recent 2.3 release. With 2.3.3, are you seeing this issue resolved now, or is it still happening?

Comment by Erik Dalén [ 2015/04/14 ]

So far I've only upgraded the instances in the testing environment, where we didn't see this happening much anyway. I'll upgrade the production ones tomorrow and can check whether it happens on any of them after that.

Comment by Ken Barber [ 2015/04/14 ]

Erik Dalén sweet, thanks.

Comment by Erik Dalén [ 2015/05/14 ]

This issue is fixed now with the Jetty upgrades.

Comment by Ken Barber [ 2015/05/15 ]

Oh, that's excellent news. Thanks Erik Dalén.
