[PDB-707] The first request to PuppetDB after DB backend restart fails Created: 2014/06/13  Updated: 2016/04/27  Resolved: 2014/09/03

Status: Closed
Project: PuppetDB
Component/s: None
Affects Version/s: PDB 1.6.3
Fix Version/s: PDB 2.2.1

Type: Bug Priority: Normal
Reporter: Michal Ruzicka Assignee: Ken Barber
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Blocks
Relates
relates to PDB-743 PE testing Report retrieval failure d... Closed
Template:
Story Points: 3
Sprint: 20140813 to 20140827, 20140827 to 20140910

 Description   

The first request sent to PuppetDB after its backend database has been restarted fails; all subsequent requests are processed normally.

Steps to reproduce

service pe-postgresql restart
curl -i --cacert /opt/puppet/share/puppet-dashboard/certs/pe-internal-dashboard.ca_cert.pem --cert /opt/puppet/share/puppet-dashboard/certs/pe-internal-dashboard.cert.pem --key /opt/puppet/share/puppet-dashboard/certs/pe-internal-dashboard.private_key.pem 'https://<puppetdb host>:8081/v3/event-counts?query=%5B%22and%22%2C%5B%22and%22%2C%5B%22%3C%22%2C%22report-receive-time%22%2C%222014-06-13T13%3A24%3A05Z%22%5D%5D%2C%5B%22not%22%2C%5B%22%3D%22%2C%22containing-class%22%2Cnull%5D%5D%5D&distinct-resources=true&distinct-start-time=2014-06-13T12%3A54%3A05Z&distinct-end-time=2014-06-13T13%3A24%3A05Z&summarize-by=containing-class&count-by=certname&counts-filter=%5B%22%3E%22%2C%22failures%22%2C0%5D&limit=1000&offset=0&order-by=%5B%7B%22field%22%3A%22containing-class%22%2C%22order%22%3A%22ASC%22%7D%5D&include-total=true'

where <puppetdb host> is the host where PuppetDB is running

Additional information

The output of the curl command looks like this:

HTTP/1.1 500 Server Error
Date: Fri, 13 Jun 2014 13:41:17 GMT
Content-Length: 0
Connection: close
Server: Jetty(7.x.y-SNAPSHOT)

The /var/log/pe-puppetdb/pe-puppetdb.log log file contains the following information:

2014-06-13 06:41:17,177 ERROR [qtp820981039-41] [bonecp.ConnectionHandle] Database access problem. Killing off all remaining connect
ions in the connection pool. SQL State = 57P01
2014-06-13 06:41:17,180 WARN  [qtp820981039-41] [server.AbstractHttpConnection] /v3/event-counts?query=%5B%22and%22%2C%5B%22and%22%2
C%5B%22%3C%22%2C%22report-receive-time%22%2C%222014-06-13T13%3A24%3A05Z%22%5D%5D%2C%5B%22not%22%2C%5B%22%3D%22%2C%22containing-class
%22%2Cnull%5D%5D%5D&distinct-resources=true&distinct-start-time=2014-06-13T12%3A54%3A05Z&distinct-end-time=2014-06-13T13%3A24%3A05Z&
summarize-by=containing-class&count-by=certname&counts-filter=%5B%22%3E%22%2C%22failures%22%2C0%5D&limit=1000&offset=0&order-by=%5B%
7B%22field%22%3A%22containing-class%22%2C%22order%22%3A%22ASC%22%7D%5D&include-total=true
org.postgresql.util.PSQLException: This connection has been closed.
        at org.postgresql.jdbc2.AbstractJdbc2Connection.checkClosed(AbstractJdbc2Connection.java:822)
        at org.postgresql.jdbc2.AbstractJdbc2Connection.setAutoCommit(AbstractJdbc2Connection.java:769)
        at com.jolbox.bonecp.ConnectionHandle.setAutoCommit(ConnectionHandle.java:1063)
        at clojure.java.jdbc.internal$transaction_STAR_.invoke(internal.clj:222)
        at com.puppetlabs.jdbc$with_transacted_connection_fn$fn__2278.invoke(jdbc.clj:228)
        at clojure.java.jdbc.internal$with_connection_STAR_.invoke(internal.clj:186)
        at com.puppetlabs.jdbc$with_transacted_connection_fn.invoke(jdbc.clj:225)
        at com.puppetlabs.puppetdb.http.event_counts$produce_body.invoke(event_counts.clj:28)
        at com.puppetlabs.puppetdb.http.event_counts$routes$fn__7581$fn__7585$fn__7590.invoke(event_counts.clj:46)
        at com.puppetlabs.puppetdb.http.event_counts$routes$fn__7581$fn__7585.invoke(event_counts.clj:43)
        at com.puppetlabs.puppetdb.http.event_counts$routes$fn__7581.invoke(event_counts.clj:43)
        at com.puppetlabs.middleware$verify_accepts_content_type$fn__6056.invoke(middleware.clj:104)
        at com.puppetlabs.middleware$validate_query_params$fn__6061.invoke(middleware.clj:131)
        at com.puppetlabs.middleware$wrap_with_paging_options$fn__6051.invoke(middleware.clj:84)
        at com.puppetlabs.puppetdb.http.v3$v3_app$fn__7835.invoke(v3.clj:60)
        at net.cgrand.moustache$alter_request$fn__6453.invoke(moustache.clj:54)
        at com.puppetlabs.puppetdb.http.v3$v3_app.invoke(v3.clj:20)
        at com.puppetlabs.puppetdb.http.server$routes$fn__9305.invoke(server.clj:41)
        at net.cgrand.moustache$alter_request$fn__6453.invoke(moustache.clj:54)
        at com.puppetlabs.puppetdb.http.server$routes.invoke(server.clj:41)
        at ring.middleware.resource$wrap_resource$fn__9272.invoke(resource.clj:17)
        at ring.middleware.params$wrap_params$fn__6395.invoke(params.clj:55)
        at com.puppetlabs.middleware$wrap_with_authorization$fn__6035.invoke(middleware.clj:37)
        at com.puppetlabs.middleware$wrap_with_certificate_cn$fn__6039.invoke(middleware.clj:52)
        at com.puppetlabs.middleware$wrap_with_default_body$fn__6043.invoke(middleware.clj:59)
        at com.puppetlabs.middleware$wrap_with_metrics_STAR_$fn__6077$fn__6086.invoke(middleware.clj:183)
        at com.puppetlabs.utils.metrics$multitime_BANG__STAR_$fn__6014$fn__6015$fn__6016.invoke(metrics.clj:14)
        at com.puppetlabs.utils.metrics.proxy$java.lang.Object$Callable$f8c5758f.call(Unknown Source)
        at com.yammer.metrics.core.Timer.time(Timer.java:91)
        at com.puppetlabs.utils.metrics$multitime_BANG__STAR_$fn__6014$fn__6015.invoke(metrics.clj:14)
        at com.puppetlabs.utils.metrics$multitime_BANG__STAR_$fn__6014$fn__6015$fn__6016.invoke(metrics.clj:14)
        at com.puppetlabs.utils.metrics.proxy$java.lang.Object$Callable$f8c5758f.call(Unknown Source)
        at com.yammer.metrics.core.Timer.time(Timer.java:91)
        at com.puppetlabs.utils.metrics$multitime_BANG__STAR_$fn__6014$fn__6015.invoke(metrics.clj:14)
        at com.puppetlabs.utils.metrics$multitime_BANG__STAR_.invoke(metrics.clj:17)
        at com.puppetlabs.middleware$wrap_with_metrics_STAR_$fn__6077.invoke(middleware.clj:182)
        at com.puppetlabs.middleware$wrap_with_globals$fn__6047.invoke(middleware.clj:70)
        at com.puppetlabs.middleware$wrap_with_debug_logging$fn__6031.invoke(middleware.clj:24)
        at ring.adapter.jetty$proxy_handler$fn__5782.invoke(jetty.clj:18)
        at ring.adapter.jetty.proxy$org.eclipse.jetty.server.handler.AbstractHandler$0.handle(Unknown Source)
        at org.eclipse.jetty.server.handler.GzipHandler.handle(GzipHandler.java:267)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
        at org.eclipse.jetty.server.Server.handle(Server.java:349)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:452)
        at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:884)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:938)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:634)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
        at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:76)
        at org.eclipse.jetty.io.nio.SslConnection.handle(SslConnection.java:191)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:609)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:45)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
        at java.lang.Thread.run(Thread.java:744)



 Comments   
Comment by Ken Barber [ 2014/06/13 ]

Michal Ruzicka I can't replicate this in 2.0 on Debian:

https://gist.github.com/kbarber/18bbc392b8a1ae404380

And I can't replicate it on a local 1.6.x setup here either:

https://gist.github.com/kbarber/62c0fa6ebbe3dc8b89f9

Have you tested whether this works outside of PE at all? It looks fine to me. Perhaps the database is taking time to start up in your case. It smells like something specific to the PE installation rather than PDB itself, but I can't think what.

Comment by Ken Barber [ 2014/06/13 ]

Actually now I can replicate it:

ruby-1.9.3-p547@puppetdb (git)-[1.6.x]
[ken@kb puppetdb]# PGDATA=~/Library/Application\ Support/Postgres93/var/ pg_ctl restart -m fast
waiting for server to shut down.... done
server stopped
server starting
2014-06-13 15:54:39 BST [sess:539b10af.12378,pid:74616,vtid:,tid:0]LOG:  redirecting log output to logging collector process
2014-06-13 15:54:39 BST [sess:539b10af.12378,pid:74616,vtid:,tid:0]HINT:  Future log output will appear in directory "pg_log".
ruby-1.9.3-p547@puppetdb (git)-[1.6.x]
[ken@kb puppetdb]# curl 'http://localhost:8080/v3/facts/facterversion'
ruby-1.9.3-p547@puppetdb (git)-[1.6.x]
[ken@kb puppetdb]# curl 'http://localhost:8080/v3/facts/facterversion'
[ {
  "certname" : "anawdlxglhmtoaxywidwginvgyeeqwcslixvmerweuzsefktyk",
  "name" : "facterversion",
  "value" : "1.7.3"
} ]%

The trick is not to stop the database, attempt a connection, and then start it again - that sequence hides the error.

Comment by Michal Ruzicka [ 2014/06/13 ]

Ken Barber Yes, that is AFAICT equivalent to the steps I described in this ticket.

Comment by Ken Barber [ 2014/06/13 ]

A workaround for now is to set 'conn-keep-alive' very low, for example 1 minute. This forces an idle connection test every minute, surfacing the problem in the background before a client or user actually attempts to use the service.

Another option is to ping the PuppetDB API (/v3/nodes, for example) after postgresql restarts and before any other application uses it, to clear out the 57P01 state.

These workarounds do not solve the problem; they just mask it.
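
For reference, the conn-keep-alive workaround is a small database.ini change along these lines (a sketch only - the connection details shown are placeholders, and the value is in minutes):

[database]
classname = org.postgresql.Driver
subprotocol = postgresql
subname = //localhost:5432/pe-puppetdb
username = pe-puppetdb
password = <password>
# Test idle connections every minute so a connection killed by a
# backend restart is noticed before a client request hits it.
conn-keep-alive = 1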

Comment by Ken Barber [ 2014/06/13 ]

So a quick assessment of this problem:

  • Tried upgrading to bonecp 0.8.0-RELEASE, no change
  • setAcquireRetryAttempts doesn't fix it
  • setTransactionRecoveryPeriodInMinutes doesn't fix it
  • setCloseConnectionWatch true doesn't fix it
  • setLazyInit false doesn't fix it

In general it seems like it's just an incorrectness in how bonecp handles the postgresql exception. Fortunately bonecp does provide us with the ability to add hooks on connection state changes, so we can probably trap this exception and behave better (a rough sketch of the hook idea is below). The built-in behaviour is almost correct: it sees one connection in the 57P01 state and clears all the other connections, but unfortunately it doesn't re-acquire a connection for the current failing request.

I'm seeing messages about this problem all over the net, without a clear indication of how to fix it, so I don't believe we're alone.

Anyway - assuming the connection hook idea works, I'm estimating a 3 day effort to fix this. The suggested workaround for now is to reduce the 'conn-keep-alive' setting in database.ini to 1 minute, so that only API requests arriving within the first idle minute after a restart will exhibit this behaviour.
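
To make the hook idea concrete, here is a rough Clojure sketch - not the eventual fix - assuming BoneCP's ConnectionHook API and clojure.tools.logging on the classpath:

(require '[clojure.tools.logging :as log])
(import '[com.jolbox.bonecp BoneCPConfig]
        '[com.jolbox.bonecp.hooks AbstractConnectionHook])

;; Hypothetical hook: log when PostgreSQL reports SQL state 57P01
;; (admin shutdown), which is what we see right after a backend restart.
(defn fatal-state-hook []
  (proxy [AbstractConnectionHook] []
    (onConnectionException [connection state throwable]
      (when (= state "57P01")
        (log/warn "Backend DB restarted; pool connections will be cleared"))
      ;; Returning true keeps BoneCP's default behaviour of treating the
      ;; connection as broken; re-acquiring a connection for the in-flight
      ;; request is the part that still needs handling.
      true)))

(defn add-hook [^BoneCPConfig config]
  (doto config
    (.setConnectionHook (fatal-state-hook))))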

Comment by Aaron Armstrong [ 2014/06/13 ]

Ken Barber - PE doesn't need to pick this fix up until the next major release.

Comment by Ken Barber [ 2014/06/16 ]

Aaron Armstrong thanks for letting me know. It's still in the next sprint and I'll make sure I bring up the requirement in the next planning.

Comment by Kenn Hussey [ 2014/06/18 ]

Need input based on the results of trying Ken Barber's suggested workarounds.

Comment by Ken Barber [ 2014/08/06 ]

Something important to note: we're going to put in a workaround for PDB-743 to mask this issue for Puppet testing: https://github.com/kbarber/pe-puppet/commit/1ca3598e6f65e92f6ebcdc449ef20e79a23cfa60

When we fix PDB-707, we will need to remove the workaround for Puppet's tests.

Comment by Ken Barber [ 2014/09/01 ]

I have four ideas to solve this:

http://jolbox.com/forum/viewtopic.php?f=3&t=582

Number 1 already kind of works, and I can do it with some level of control on our end, but numbers 2, 3 and 4 are more about exploring cleverer ways of handling this without wrapping.

Comment by Ken Barber [ 2014/09/02 ]

I've ended up just catching the basic exception in BoneCP 0.7.1 and doing a retry on our end, and providing a patch upstream to bonecp core for the future:

https://github.com/puppetlabs/puppetdb/pull/1073
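
The shape of that retry, as a simplified sketch rather than the code in the pull request (the helper name and query-fn are hypothetical):

;; Simplified sketch of the retry idea; the real patch is in the pull
;; request above and is wired into PuppetDB's transaction handling.
(defn retry-if-connection-closed
  "Call f with no arguments; if it fails because the pooled connection
  was already closed (as happens right after a PostgreSQL restart),
  call it once more so it gets a fresh connection from the pool."
  [f]
  (try
    (f)
    (catch java.sql.SQLException e
      (if (re-find #"(?i)connection has been closed" (str (.getMessage e)))
        (f)
        (throw e)))))

;; Example usage (query-fn stands in for the real query code):
;; (retry-if-connection-closed query-fn)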

Comment by Brett Wooldridge [ 2014/10/22 ]

This would be fixed by PDB-992. HikariCP will never give you a closed connection.
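
For context, the PDB-992 direction means building the data source with HikariCP instead of BoneCP - a minimal, hypothetical Clojure sketch with placeholder connection details:

(import '[com.zaxxer.hikari HikariConfig HikariDataSource])

;; Hypothetical HikariCP setup; HikariCP validates a connection before
;; handing it out, so a request never receives an already-closed one.
(defn make-datasource []
  (let [config (doto (HikariConfig.)
                 (.setJdbcUrl "jdbc:postgresql://localhost:5432/puppetdb")
                 (.setUsername "puppetdb")
                 (.setPassword "puppetdb")
                 (.setConnectionTimeout 3000))] ;; milliseconds
    (HikariDataSource. config)))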

Comment by Ken Barber [ 2014/10/22 ]

Brett Wooldridge the status of the ticket already says it's fixed. See my previous comment.
