Details
- Improvement
- Status: Resolved
- Normal
- Resolution: Fixed
- None
- None
Description
In a hotfix targeted at 2019.8, we added an interrupter thread to help coordinate report/resource_event GC with sync queries in this PR. We hope this change lets PDB sync avoid the full deadlocks with GC seen in PE-30087, but GC could still conflict with other long-running queries outside of sync. GC could also get unlucky and need multiple tries to delete a partition, which would cancel sync repeatedly and log multiple errors.
A more complete solution would be to allow GC to "bulldoze" other in-flight queries that are blocking the AccessExclusiveLock it needs to drop an old partition. This could be accomplished by querying pg_locks to see which queries are blocking the lock GC needs, then calling pg_cancel_backend(<pid>) on each of them. Doing something along these lines would protect against all blocking queries, not just those performed by the local PDB during sync.
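As a rough illustration of the "bulldoze" idea, here is a minimal Python sketch, not PDB's actual implementation. It assumes a second DB-API connection (for example psycopg2) whose cursor is passed in while the GC session, identified by the hypothetical parameter gc_pid, waits on its lock. Instead of joining pg_locks by hand it uses PostgreSQL's pg_blocking_pids() helper (available since 9.6), which is a substitution for the pg_locks query described above. The function name cancel_blockers is made up for this example.

```python
from typing import Any, List

# pg_blocking_pids(pid) returns the backend PIDs blocking the given session.
BLOCKERS_SQL = "SELECT unnest(pg_blocking_pids(%s))"
# pg_cancel_backend(pid) cancels that backend's current query; returns a boolean.
CANCEL_SQL = "SELECT pg_cancel_backend(%s)"


def cancel_blockers(cursor: Any, gc_pid: int) -> List[int]:
    """Cancel the current query of every backend blocking gc_pid.

    `cursor` is assumed to be a DB-API cursor on a connection *other than*
    the GC session, since that session is itself waiting on the lock.
    Returns the PIDs for which pg_cancel_backend reported success.
    """
    cursor.execute(BLOCKERS_SQL, (gc_pid,))
    # De-duplicate (a PID can be listed more than once) and skip PID 0,
    # which stands for a prepared transaction and cannot be cancelled.
    blocker_pids = sorted({row[0] for row in cursor.fetchall() if row[0]})

    cancelled = []
    for pid in blocker_pids:
        cursor.execute(CANCEL_SQL, (pid,))
        row = cursor.fetchone()
        if row and row[0]:
            cancelled.append(pid)
    return cancelled
```

In practice GC would likely call something like this in a loop with a short pause, because new blockers can queue up between the check and the cancel. The calling role also needs permission to signal the other backends (same role, superuser, or membership in pg_signal_backend).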
If we do this, we'll want to audit the error handling/retry behavior of every query we can think of in PDB and PE, because GC could then kill any in-flight query.
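To make the audit concrete, the sketch below shows the kind of retry behavior a caller might need once GC can cancel its queries. It is illustrative only and assumes a Python client. The names run_with_retry and is_query_canceled are hypothetical, and the attributes pgcode and sqlstate are where psycopg2 and psycopg 3 respectively expose the SQLSTATE. PostgreSQL reports a cancelled query with SQLSTATE 57014 (query_canceled).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# SQLSTATE that PostgreSQL raises when a query is cancelled (query_canceled).
QUERY_CANCELED = "57014"


def is_query_canceled(exc: BaseException) -> bool:
    """True if the driver exception carries SQLSTATE 57014."""
    code = getattr(exc, "pgcode", None) or getattr(exc, "sqlstate", None)
    return code == QUERY_CANCELED


def run_with_retry(
    run_query: Callable[[], T],
    max_attempts: int = 3,
    backoff_seconds: float = 0.5,
) -> T:
    """Call run_query, retrying only when the query was cancelled."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except Exception as exc:
            # Re-raise anything that is not a cancellation, or the last failure.
            if not is_query_canceled(exc) or attempt == max_attempts:
                raise
            time.sleep(backoff_seconds * attempt)  # simple linear backoff
    raise ValueError("max_attempts must be at least 1")
```

Because a cancelled statement aborts its enclosing transaction, run_query would need to roll back and restart the whole transaction rather than re-issue a single statement.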