PuppetDB / PDB-4948

Improve report/resource_event GC coordination with in flight queries

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Normal
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: PDB 6.14.0, PDB 7.1.0
    • Component/s: None
    • Labels:
      None
    • Template:
    • Team:
      HA
    • Story Points:
      5
    • Sprint:
      HA 2021-01-27, HA 2021-02-10
    • Release Notes:
      Enhancement
    • Release Notes Summary:
      Added a query-bulldozer, which is spawned during periodic GC when PuppetDB attempts to drop partitioned tables. The bulldozer cancels any queries blocking the GC process from getting the AccessExclusiveLocks it needs to drop a partition. See the https://puppet.com/docs/puppetdb/latest/configure.html#experimental-environment-variables section of the docs for information on the PDB_GC_QUERY_BULLDOZER_TIMEOUT_MS setting, which allows users to disable the query-bulldozer if needed.
    • QA Risk Assessment:
      Needs Assessment

      Description

      In a hotfix targeted at 2019.8, we added an interrupter thread to help coordinate report/resource_event GC with sync queries in this PR. While we hope this change allows PDB sync to avoid full deadlocks with GC, as seen in PE-30087, it's still possible that GC could conflict with other long-running queries outside of sync. It's also possible that GC could get unlucky and need multiple tries to delete a partition, which could cause multiple errors in the logs while sync gets cancelled repeatedly.

      A more complete solution would be to allow GC to "bulldoze" other in-flight queries that are blocking the AccessExclusiveLock it needs to drop an old partition. This could be accomplished by using pg_cancel_backend(<pid>) in coordination with querying pg_locks to see which queries are blocking the lock GC needs. Doing something along these lines would protect against all queries, not just those performed by the local PDB during sync.
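
      The idea above could be sketched roughly as follows. This is only an illustration in Python, not PuppetDB's actual implementation: the function names (cancel_blockers, bulldoze), the polling interval, and the psycopg2-style %s placeholder are all assumptions. It also swaps the manual pg_locks lookup for PostgreSQL's built-in pg_blocking_pids() (available since 9.6), which returns the PIDs holding locks that a given backend is waiting on.

```python
import threading

# Both pg_blocking_pids() and pg_cancel_backend() are PostgreSQL built-ins.
# The %s placeholder assumes a psycopg2-style DB-API driver.
CANCEL_BLOCKERS_SQL = """
SELECT blocker AS pid, pg_cancel_backend(blocker) AS cancelled
FROM unnest(pg_blocking_pids(%s)) AS blocker
"""


def cancel_blockers(conn, gc_pid):
    """Cancel each backend blocking gc_pid; return the PIDs that were signalled.

    conn must be a separate connection from the one running GC, because the
    GC connection is itself stuck waiting for the AccessExclusiveLock.
    """
    cur = conn.cursor()
    try:
        cur.execute(CANCEL_BLOCKERS_SQL, (gc_pid,))
        return [pid for pid, cancelled in cur.fetchall() if cancelled]
    finally:
        cur.close()


def bulldoze(conn, gc_pid, stop_event, interval_s=0.5):
    """Hypothetical bulldozer loop: keep cancelling blockers until GC is done.

    stop_event is set by the GC thread once the partition has been dropped
    (or once it gives up); interval_s is an assumed polling interval.
    """
    cancelled = []
    while not stop_event.is_set():
        cancelled.extend(cancel_blockers(conn, gc_pid))
        stop_event.wait(interval_s)
    return cancelled
```

      In this sketch the GC thread would look up its own backend PID (for example with SELECT pg_backend_pid()), start bulldoze on a second connection in a background thread, issue the DROP TABLE for the old partition, and set stop_event when the drop returns.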

      If we do this, we'll want to audit the error handling/retry behavior of all queries we can think of in PDB and PE, because this could cause GC to kill any in-flight query.

            People

            Assignee:
            zachary.kent Zachary Kent
            Reporter:
            zachary.kent Zachary Kent
