[PDB-868] Potential race condition causing API error in db_resilience/postgres_restart.rb Created: 2014/09/14  Updated: 2016/04/27  Resolved: 2014/09/17

Status: Closed
Project: PuppetDB
Component/s: None
Affects Version/s: PDB 2.2.1
Fix Version/s: PDB 2.2.1

Type: Bug Priority: Normal
Reporter: Ken Barber Assignee: Ken Barber
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Template:
Story Points: 2
Sprint: PuppetDB 2014-09-24

 Description   

Saw this again in upgrade tests today (https://jenkins.puppetlabs.com/job/PuppetDB%20Acceptance%20IN%20THE%20CLOUD/43/BEAKER_CONFIG=ec2-west-ubuntu1004-64mda-64a,BEAKER_OPTIONS=postgres,PUPPETDB_INSTALL_MODE=upgrade,label=beaker-ec2/consoleFull). Upgrade from 2.2.0 to 6bcf92428b55ca8ffd147f7789a865d8a3e5441c on lucid amd64 failed with the following error:

Errored Tests Cases:
  Test Case acceptance/tests/db_resilience/postgres_restart.rb reported: #<JSON::ParserError: A JSON text must at least contain two octets!>



 Comments   
Comment by Ken Barber [ 2014/09/15 ]

One thing that is strange: I can't replicate this in my local setup with 9.3. I'm testing this in a pretty tight loop but don't see a race. Maybe the tests are only failing on older PostgreSQL setups?

Comment by Ken Barber [ 2014/09/15 ]

Unable to replicate with PostgreSQL 8.4 on CentOS either. I've switched the acceptance tests now so that they preserve failed hosts; this way we can SSH in and inspect the failure. I've also changed the EC2 zombie killer to kill after 12 hours so we don't have the hosts deleted underneath us.

Comment by Ken Barber [ 2014/09/16 ]

It seems this is a double 57P01 (admin_shutdown, i.e. the backend terminated the connection during the restart):

2014-09-16 20:15:28,786 INFO  [c.p.p.command] [2abb8cfb-7dd0-4c86-a2c3-ba60103e9779] [replace facts] ubuntu-1004-64-2.us-west-2.compute.internal
2014-09-16 20:15:28,893 INFO  [c.p.p.command] [e84de777-5a7b-4b3b-80bb-64c241cc29b6] [replace catalog] ubuntu-1004-64-2.us-west-2.compute.internal
2014-09-16 20:15:29,314 INFO  [c.p.p.command] [a70f6e2a-9ac0-4030-a9f4-25ff1d4396cb] [store report] puppet v3.7.1 - ubuntu-1004-64-2.us-west-2.compute.internal
2014-09-16 20:15:47,209 ERROR [c.j.b.ConnectionHandle] Database access problem. Killing off all remaining connections in the connection pool. SQL State = 57P01
2014-09-16 20:15:47,215 ERROR [c.j.b.ConnectionHandle] Database access problem. Killing off all remaining connections in the connection pool. SQL State = 57P01
2014-09-16 20:15:47,215 WARN  [c.p.jdbc] Caught exception. Last attempt, throwing exception.
2014-09-16 20:15:47,226 ERROR [c.p.http] Error streaming response
org.postgresql.util.PSQLException: This connection has been closed.
        at org.postgresql.jdbc2.AbstractJdbc2Connection.checkClosed(AbstractJdbc2Connection.java:822) ~[puppetdb.jar:na]
        at org.postgresql.jdbc2.AbstractJdbc2Connection.setAutoCommit(AbstractJdbc2Connection.java:769) ~[puppetdb.jar:na]
        at com.jolbox.bonecp.ConnectionHandle.setAutoCommit(ConnectionHandle.java:1063) ~[puppetdb.jar:na]
        at clojure.java.jdbc.internal$transaction_STAR_.invoke(internal.clj:222) ~[na:na]
        at com.puppetlabs.jdbc$with_transacted_connection_fn$fn__6546$fn__6547.invoke(jdbc.clj:272) ~[na:na]
        at clojure.java.jdbc.internal$with_connection_STAR_.invoke(internal.clj:186) ~[na:na]
        at com.puppetlabs.jdbc$with_transacted_connection_fn$fn__6546.invoke(jdbc.clj:269) ~[na:na]
        at com.puppetlabs.jdbc$eval6524$retry_sql_STAR___6525$fn__6526$fn__6527.invoke(jdbc.clj:241) ~[na:na]
        at com.puppetlabs.jdbc$eval6524$retry_sql_STAR___6525$fn__6526.invoke(jdbc.clj:240) ~[na:na]
        at com.puppetlabs.jdbc$eval6524$retry_sql_STAR___6525.invoke(jdbc.clj:233) ~[na:na]
        at com.puppetlabs.jdbc$with_transacted_connection_fn.invoke(jdbc.clj:268) ~[na:na]
        at com.puppetlabs.puppetdb.http.nodes$produce_body$fn__12785$fn__12788.invoke(nodes.clj:30) ~[na:na]
        at com.puppetlabs.http$stream_json_response$fn__7260.invoke(http.clj:234) ~[na:na]
        at ring.util.io$piped_input_stream$fn__7124.invoke(io.clj:25) [na:na]
        at clojure.core$binding_conveyor_fn$fn__4145.invoke(core.clj:1910) [puppetdb.jar:na]
        at clojure.lang.AFn.call(AFn.java:18) [puppetdb.jar:na]
        at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_51]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_51]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
        at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]

So now I should look into retries and exponential back-off. I'm presuming this is due to the race I've described.

Generated at Wed Jun 26 23:42:34 PDT 2019 using JIRA 7.7.1#77002-sha1:e75ca93d5574d9409c0630b81c894d9065296414.