Uploaded image for project: 'PuppetDB'
  1. PuppetDB
  2. PDB-4760

Broadcast commands transiently fail with "Connection is closed"

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Normal
    • Resolution: Fixed
    • PDB 6.11.0
    • PDB 6.12.0
    • PuppetDB
    • None
      • Replace catalog and replace facts commands do not nest with-connection macros
      • retry-with-monitored-connection re-uses the same PostgreSQL connection.
    • HA
    • Manual Test
    • Bug Fix
    • Fixed an issue when command submissions would fail when PostgreSQL was under high write load.
    • Needs Assessment

    Description

      When running PuppetDB with two write databases configured, and running benchmark to simlate a large number of submissions, there will be occassional errors with "replace catalogs" and "replace facts" commands that fail and log errors like the following.

      Retrying after attempt 0 for host-2, due to: clojure.lang.ExceptionInfo: [3] [replace catalog] Unable to broadcast command for host-2 {:kind :puppetlabs.puppetdb.command/retry} [Ljava.lang.Throwable;@1186448b
      clojure.lang.ExceptionInfo: [3] [replace catalog] Unable to broadcast command for host-2
              at puppetlabs.puppetdb.command$broadcast_cmd.invokeStatic(command.clj:606)
              at puppetlabs.puppetdb.command$broadcast_cmd.invoke(command.clj:555)
              at puppetlabs.puppetdb.command$process_cmd$fn__39637$fn__39638.invoke(command.clj:631)
              at puppetlabs.puppetdb.command$process_cmd$fn__39637.invoke(command.clj:629)
              at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__37759$fn__37760$fn__37761.invoke(metrics.clj:14)
              at puppetlabs.puppetdb.utils.metrics.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
              at com.codahale.metrics.Timer.time(Timer.java:101)
              at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__37759$fn__37760.invoke(metrics.clj:14)
              at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__37759$fn__37760$fn__37761.invoke(metrics.clj:14)
              at puppetlabs.puppetdb.utils.metrics.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
              at com.codahale.metrics.Timer.time(Timer.java:101)
              at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__37759$fn__37760.invoke(metrics.clj:14)
              at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_.invokeStatic(metrics.clj:17)
              at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_.invoke(metrics.clj:6)
              at puppetlabs.puppetdb.command$process_cmd.invokeStatic(command.clj:626)
              at puppetlabs.puppetdb.command$process_cmd.invoke(command.clj:612)
              at puppetlabs.puppetdb.command$process_message.invokeStatic(command.clj:720)
              at puppetlabs.puppetdb.command$process_message.invoke(command.clj:677)
              at puppetlabs.puppetdb.command$message_handler$fn__39667.invoke(command.clj:752)
              at puppetlabs.puppetdb.threadpool$dochan$fn__39260$fn__39261.invoke(threadpool.clj:110)
              at puppetlabs.puppetdb.threadpool$gated_execute$fn__39224.invoke(threadpool.clj:64)
              at clojure.lang.AFn.run(AFn.java:22)
              at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
              at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
              at java.base/java.lang.Thread.run(Thread.java:834)
              Suppressed: java.sql.SQLException: Connection is closed
                      at com.zaxxer.hikari.pool.ProxyConnection$ClosedConnection.lambda$getClosedConnection$0(ProxyConnection.java:490)
                      at com.sun.proxy.$Proxy11.getAutoCommit(Unknown Source)
                      at com.zaxxer.hikari.pool.HikariProxyConnection.getAutoCommit(HikariProxyConnection.java)
                      at clojure.java.jdbc$db_transaction_STAR_.invokeStatic(jdbc.clj:761)
                      at clojure.java.jdbc$db_transaction_STAR_.invoke(jdbc.clj:741)
                      at puppetlabs.puppetdb.jdbc$with_transacted_connection_fn$fn__28084.invoke(jdbc.clj:528)
                      at puppetlabs.puppetdb.jdbc$eval28058$retry_sql_STAR___28063$fn__28064$fn__28065.invoke(jdbc.clj:500)
                      at puppetlabs.puppetdb.jdbc$eval28058$retry_sql_STAR___28063$fn__28064.invoke(jdbc.clj:499)
                      at puppetlabs.puppetdb.jdbc$eval28058$retry_sql_STAR___28063.invoke(jdbc.clj:490)
                      at puppetlabs.puppetdb.jdbc$with_transacted_connection_fn.invokeStatic(jdbc.clj:526)
                      at puppetlabs.puppetdb.jdbc$with_transacted_connection_fn.invoke(jdbc.clj:515)
                      at puppetlabs.puppetdb.command$exec_replace_catalog.invokeStatic(command.clj:295)
                      at puppetlabs.puppetdb.command$exec_replace_catalog.invoke(command.clj:292)
                      at puppetlabs.puppetdb.command$exec_command.invokeStatic(command.clj:444)
                      at puppetlabs.puppetdb.command$exec_command.invoke(command.clj:438)
                      at puppetlabs.puppetdb.command$attempt_exec_command$fn__39565.invoke(command.clj:513)
                      at puppetlabs.puppetdb.command$call_with_quick_retry$fn__39558.invoke(command.clj:495)
                      at puppetlabs.puppetdb.command$call_with_quick_retry.invokeStatic(command.clj:494)
                      at puppetlabs.puppetdb.command$call_with_quick_retry.invoke(command.clj:492)
                      at puppetlabs.puppetdb.command$attempt_exec_command.invokeStatic(command.clj:512)
                      at puppetlabs.puppetdb.command$attempt_exec_command.invoke(command.clj:509)
                      at puppetlabs.puppetdb.command$broadcast_cmd$attempt_exec__39600$fn__39603.invoke(command.clj:577)
                      at puppetlabs.puppetdb.command$broadcast_cmd$attempt_exec__39600.invoke(command.clj:576)
                      at puppetlabs.puppetdb.command$broadcast_cmd$fn__39623$fn__39624$fn__39625.invoke(command.clj:590)
                      ... 4 common frames omitted
      

      When turning on debug logging, I could see that the first attempt of each command would fail with a message like

      2020-05-29 14:12:31,273 DEBUG [cmd-broadcast-thread-13] [p.p.command] Exception throw in L1 retry attempt 1
      org.postgresql.util.PSQLException: ERROR: could not serialize access due to concurrent update
        Where: SQL statement "SELECT 1 FROM ONLY "public"."certnames" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
      

      and the subsequent retries would fail with

      2020-05-29 14:12:31,290 DEBUG [cmd-broadcast-thread-13] [p.p.command] Exception throw in L1 retry attempt 2
      java.sql.SQLException: Connection is closed
              at com.zaxxer.hikari.pool.ProxyConnection$ClosedConnection.lambda$getClosedConnection$0(ProxyConnection.java:515)
      

      until it finally threw the error in the first code block.

      This is because exec-replace-catalogs and exec-replace-facts end up nesting with-connection macros

      (jdbc/with-monitored-db-connection db status
        <catch and attempt retry>
        (jdbc/with-transacted-connection
          <attempt storage>))
      

      When the initial storage attempt fails, with-transacted-connection marks the connection as closed, but it is still bound by the wrapping with-monitored-db-connection, so it attempts to use the same connection, that is now marked as closed, for its retries.

      There is a similar issue in the macro retry-with-monitored-connection

      (retry-sql 5
                 (with-monitored-db-connection db-spec status
                   (with-db-transaction [:isolation isolation]
                     (f))))
      

      will get a new connection for each retry. This is used by "store report", which is why we don't see the transient ConnectionClosed errors with that command.

      Attachments

        Activity

          People

            rob.browning Rob Browning
            austin.blatt Austin Blatt
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Zendesk Support