Uploaded image for project: 'PuppetDB'
  1. PuppetDB
  2. PDB-4760

Broadcast commands transiently fail with "Connection is closed"

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Normal
    • Resolution: Fixed
    • Affects Version/s: PDB 6.11.0
    • Fix Version/s: PDB 6.12.0
    • Component/s: None
    • Labels:
      None
    • Template:
    • Acceptance Criteria:
      • Replace catalog and replace facts commands do not nest with-connection macros
      • retry-with-monitored-connection re-uses the same PostgreSQL connection.
    • Team:
      HA
    • Method Found:
      Manual Test
    • Release Notes:
      Bug Fix
    • Release Notes Summary:
      Fixed an issue when command submissions would fail when PostgreSQL was under high write load.
    • QA Risk Assessment:
      Needs Assessment

      Description

      When running PuppetDB with two write databases configured, and running benchmark to simlate a large number of submissions, there will be occassional errors with "replace catalogs" and "replace facts" commands that fail and log errors like the following.

      Retrying after attempt 0 for host-2, due to: clojure.lang.ExceptionInfo: [3] [replace catalog] Unable to broadcast command for host-2 {:kind :puppetlabs.puppetdb.command/retry} [Ljava.lang.Throwable;@1186448b
      clojure.lang.ExceptionInfo: [3] [replace catalog] Unable to broadcast command for host-2
              at puppetlabs.puppetdb.command$broadcast_cmd.invokeStatic(command.clj:606)
              at puppetlabs.puppetdb.command$broadcast_cmd.invoke(command.clj:555)
              at puppetlabs.puppetdb.command$process_cmd$fn__39637$fn__39638.invoke(command.clj:631)
              at puppetlabs.puppetdb.command$process_cmd$fn__39637.invoke(command.clj:629)
              at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__37759$fn__37760$fn__37761.invoke(metrics.clj:14)
              at puppetlabs.puppetdb.utils.metrics.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
              at com.codahale.metrics.Timer.time(Timer.java:101)
              at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__37759$fn__37760.invoke(metrics.clj:14)
              at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__37759$fn__37760$fn__37761.invoke(metrics.clj:14)
              at puppetlabs.puppetdb.utils.metrics.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
              at com.codahale.metrics.Timer.time(Timer.java:101)
              at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__37759$fn__37760.invoke(metrics.clj:14)
              at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_.invokeStatic(metrics.clj:17)
              at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_.invoke(metrics.clj:6)
              at puppetlabs.puppetdb.command$process_cmd.invokeStatic(command.clj:626)
              at puppetlabs.puppetdb.command$process_cmd.invoke(command.clj:612)
              at puppetlabs.puppetdb.command$process_message.invokeStatic(command.clj:720)
              at puppetlabs.puppetdb.command$process_message.invoke(command.clj:677)
              at puppetlabs.puppetdb.command$message_handler$fn__39667.invoke(command.clj:752)
              at puppetlabs.puppetdb.threadpool$dochan$fn__39260$fn__39261.invoke(threadpool.clj:110)
              at puppetlabs.puppetdb.threadpool$gated_execute$fn__39224.invoke(threadpool.clj:64)
              at clojure.lang.AFn.run(AFn.java:22)
              at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
              at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
              at java.base/java.lang.Thread.run(Thread.java:834)
              Suppressed: java.sql.SQLException: Connection is closed
                      at com.zaxxer.hikari.pool.ProxyConnection$ClosedConnection.lambda$getClosedConnection$0(ProxyConnection.java:490)
                      at com.sun.proxy.$Proxy11.getAutoCommit(Unknown Source)
                      at com.zaxxer.hikari.pool.HikariProxyConnection.getAutoCommit(HikariProxyConnection.java)
                      at clojure.java.jdbc$db_transaction_STAR_.invokeStatic(jdbc.clj:761)
                      at clojure.java.jdbc$db_transaction_STAR_.invoke(jdbc.clj:741)
                      at puppetlabs.puppetdb.jdbc$with_transacted_connection_fn$fn__28084.invoke(jdbc.clj:528)
                      at puppetlabs.puppetdb.jdbc$eval28058$retry_sql_STAR___28063$fn__28064$fn__28065.invoke(jdbc.clj:500)
                      at puppetlabs.puppetdb.jdbc$eval28058$retry_sql_STAR___28063$fn__28064.invoke(jdbc.clj:499)
                      at puppetlabs.puppetdb.jdbc$eval28058$retry_sql_STAR___28063.invoke(jdbc.clj:490)
                      at puppetlabs.puppetdb.jdbc$with_transacted_connection_fn.invokeStatic(jdbc.clj:526)
                      at puppetlabs.puppetdb.jdbc$with_transacted_connection_fn.invoke(jdbc.clj:515)
                      at puppetlabs.puppetdb.command$exec_replace_catalog.invokeStatic(command.clj:295)
                      at puppetlabs.puppetdb.command$exec_replace_catalog.invoke(command.clj:292)
                      at puppetlabs.puppetdb.command$exec_command.invokeStatic(command.clj:444)
                      at puppetlabs.puppetdb.command$exec_command.invoke(command.clj:438)
                      at puppetlabs.puppetdb.command$attempt_exec_command$fn__39565.invoke(command.clj:513)
                      at puppetlabs.puppetdb.command$call_with_quick_retry$fn__39558.invoke(command.clj:495)
                      at puppetlabs.puppetdb.command$call_with_quick_retry.invokeStatic(command.clj:494)
                      at puppetlabs.puppetdb.command$call_with_quick_retry.invoke(command.clj:492)
                      at puppetlabs.puppetdb.command$attempt_exec_command.invokeStatic(command.clj:512)
                      at puppetlabs.puppetdb.command$attempt_exec_command.invoke(command.clj:509)
                      at puppetlabs.puppetdb.command$broadcast_cmd$attempt_exec__39600$fn__39603.invoke(command.clj:577)
                      at puppetlabs.puppetdb.command$broadcast_cmd$attempt_exec__39600.invoke(command.clj:576)
                      at puppetlabs.puppetdb.command$broadcast_cmd$fn__39623$fn__39624$fn__39625.invoke(command.clj:590)
                      ... 4 common frames omitted
      

      When turning on debug logging, I could see that the first attempt of each command would fail with a message like

      2020-05-29 14:12:31,273 DEBUG [cmd-broadcast-thread-13] [p.p.command] Exception throw in L1 retry attempt 1
      org.postgresql.util.PSQLException: ERROR: could not serialize access due to concurrent update
        Where: SQL statement "SELECT 1 FROM ONLY "public"."certnames" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
      

      and the subsequent retries would fail with

      2020-05-29 14:12:31,290 DEBUG [cmd-broadcast-thread-13] [p.p.command] Exception throw in L1 retry attempt 2
      java.sql.SQLException: Connection is closed
              at com.zaxxer.hikari.pool.ProxyConnection$ClosedConnection.lambda$getClosedConnection$0(ProxyConnection.java:515)
      

      until it finally threw the error in the first code block.

      This is because exec-replace-catalogs and exec-replace-facts end up nesting with-connection macros

      (jdbc/with-monitored-db-connection db status
        <catch and attempt retry>
        (jdbc/with-transacted-connection
          <attempt storage>))
      

      When the initial storage attempt fails, with-transacted-connection marks the connection as closed, but it is still bound by the wrapping with-monitored-db-connection, so it attempts to use the same connection, that is now marked as closed, for its retries.

      There is a similar issue in the macro retry-with-monitored-connection

      (retry-sql 5
                 (with-monitored-db-connection db-spec status
                   (with-db-transaction [:isolation isolation]
                     (f))))
      

      will get a new connection for each retry. This is used by "store report", which is why we don't see the transient ConnectionClosed errors with that command.

        Attachments

          Activity

            People

            Assignee:
            rob.browning Rob Browning
            Reporter:
            austin.blatt Austin Blatt
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved:

                Zendesk Support