PuppetDB / PDB-2415

PuppetDB Should Reduce Use of Cascading Deletes and Updates In PostgreSQL

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Normal
    • Resolution: Won't Do
    • Affects Version/s: PDB 3.2.2
    • Fix Version/s: PDB 4.1.0
    • Component/s: None
    • Labels:

      Description

      The Problem

      While using the following settings to clean up some nodes I had terminated, I ran into an extreme slowdown in performance.

      node-ttl = 12h 
      node-purge-ttl = 4h 
      

      I traced it back to multiple delete statements that were running, and I believe they were coming from foreign-key constraints on the certnames table.

      pe-puppetdb=# \d+ certnames
                                                                Table "public.certnames"
            Column      |           Type           |                       Modifiers                       | Storage  | Stats target | Description
      ------------------+--------------------------+-------------------------------------------------------+----------+--------------+-------------
       id               | bigint                   | not null default nextval('certname_id_seq'::regclass) | plain    |              |
       certname         | text                     | not null                                              | extended |              |
       latest_report_id | bigint                   |                                                       | plain    |              |
       deactivated      | timestamp with time zone |                                                       | plain    |              |
       expired          | timestamp with time zone |                                                       | plain    |              |
      Indexes:
          "certnames_transform_pkey" PRIMARY KEY, btree (id)
          "certnames_transform_certname_key" UNIQUE CONSTRAINT, btree (certname)
      Foreign-key constraints:
          "certnames_reports_id_fkey" FOREIGN KEY (latest_report_id) REFERENCES reports(id) ON DELETE SET NULL
      Referenced by:
          TABLE "catalogs" CONSTRAINT "catalogs_certname_fkey" FOREIGN KEY (certname) REFERENCES certnames(certname) ON DELETE CASCADE
          TABLE "factsets" CONSTRAINT "factsets_certname_fk" FOREIGN KEY (certname) REFERENCES certnames(certname) ON UPDATE CASCADE ON DELETE CASCADE
          TABLE "reports" CONSTRAINT "reports_certname_fkey" FOREIGN KEY (certname) REFERENCES certnames(certname) ON DELETE CASCADE
      Options: autovacuum_vacuum_scale_factor=0.20, autovacuum_analyze_scale_factor=0.10
      

      My understanding is that instead of a bulk deletion of reports belonging to certnames that are no longer active, we're simply deleting reports, catalogs, and factsets via a cascading delete from certnames.

      This means that if I have 200 certnames to delete, PostgreSQL will fire 200 separate queries to delete the reports, each looking something like

      DELETE FROM reports WHERE certname = $1;
      

      and another 200 queries to delete from catalogs, and another 200 more to delete from factsets.
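      The per-row cascade behavior can be sketched with a small, self-contained example. This uses SQLite via Python's sqlite3 module rather than PostgreSQL, and toy table definitions that keep only the certnames/reports relationship from the schema above; the node names and row counts are taken from the examples in this ticket, everything else is an assumption for illustration.

```python
import sqlite3

# Minimal sketch (SQLite, not PostgreSQL) of the ON DELETE CASCADE
# relationship between certnames and reports described above.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK enforcement by default
conn.execute("CREATE TABLE certnames (certname TEXT PRIMARY KEY)")
conn.execute("""
    CREATE TABLE reports (
        id INTEGER PRIMARY KEY,
        certname TEXT REFERENCES certnames(certname) ON DELETE CASCADE
    )""")

conn.execute("INSERT INTO certnames VALUES ('node-a'), ('node-b')")
# 77 reports per node, mirroring the counts observed later in this ticket
conn.executemany("INSERT INTO reports (certname) VALUES (?)",
                 [("node-a",)] * 77 + [("node-b",)] * 77)

# A single DELETE on the parent silently removes every child row as well;
# the database does the child deletions internally, row by row.
conn.execute("DELETE FROM certnames WHERE certname IN ('node-a', 'node-b')")
remaining = conn.execute("SELECT count(*) FROM reports").fetchone()[0]
print(remaining)  # 0 -- all 154 reports were deleted by the cascade
```

      The point is that none of the child deletions appear in the statement you issued; they surface only as trigger time in the query plan, which is what the EXPLAIN ANALYZE below shows.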

      I think I proved that to myself by getting the query plan for deleting two certnames:

      pe-puppetdb=# EXPLAIN ANALYZE DELETE from certnames where certname IN ( 'mono-scale-agent-61.us-west-2.compute.internal', 'mono-scale-agent-60.us-west-2.compute.internal');
                                                                             QUERY PLAN
      --------------------------------------------------------------------------------------------------------------------------------------------------------
       Delete on certnames  (cost=8.84..16.45 rows=2 width=6) (actual time=0.116..0.116 rows=0 loops=1)
         ->  Bitmap Heap Scan on certnames  (cost=8.84..16.45 rows=2 width=6) (actual time=0.050..0.053 rows=2 loops=1)
               Recheck Cond: (certname = ANY ('{mono-scale-agent-61.us-west-2.compute.internal,mono-scale-agent-60.us-west-2.compute.internal}'::text[]))
               Heap Blocks: exact=2
               ->  Bitmap Index Scan on certnames_transform_certname_key  (cost=0.00..8.84 rows=2 width=0) (actual time=0.045..0.045 rows=2 loops=1)
                     Index Cond: (certname = ANY ('{mono-scale-agent-61.us-west-2.compute.internal,mono-scale-agent-60.us-west-2.compute.internal}'::text[]))
       Planning time: 0.171 ms
       Trigger for constraint catalogs_certname_fkey on certnames: time=0.711 calls=2
       Trigger for constraint factsets_certname_fk on certnames: time=0.468 calls=2
       Trigger for constraint reports_certname_fkey on certnames: time=4063.079 calls=2
       Trigger for constraint catalog_resources_catalog_id_fkey on catalogs: time=159.127 calls=2
       Trigger for constraint factset_id_fk on factsets: time=1.617 calls=2
       Trigger for constraint certnames_reports_id_fkey on reports: time=491.102 calls=154
       Trigger for constraint resource_events_report_id_fkey on reports: time=197.813 calls=154
       Execution time: 4914.278 ms
      (15 rows)
      

      I first made sure those certnames existed and checked that they had reports.

      pe-puppetdb=# select count(*) from reports where certname = 'mono-scale-agent-60.us-west-2.compute.internal';
       count
      -------
          77
      (1 row)
      

      pe-puppetdb=# select count(*) from reports where certname = 'mono-scale-agent-61.us-west-2.compute.internal';
       count
      -------
          77
      (1 row)
      

      Assuming the delete fired by the constraint looks like the one below, we can see that the delete from reports also cascades to two more tables (back to certnames, and to resource_events).

      pe-puppetdb=# EXPLAIN ANALYZE DELETE from reports where certname = 'mono-scale-agent-62.us-west-2.compute.internal';
                                                                     QUERY PLAN
      ----------------------------------------------------------------------------------------------------------------------------------------
       Delete on reports  (cost=0.55..189.25 rows=46 width=6) (actual time=49.588..49.588 rows=0 loops=1)
         ->  Index Scan using reports_certname_idx on reports  (cost=0.55..189.25 rows=46 width=6) (actual time=0.064..0.423 rows=77 loops=1)
               Index Cond: (certname = 'mono-scale-agent-62.us-west-2.compute.internal'::text)
       Planning time: 0.068 ms
       Trigger for constraint certnames_reports_id_fkey: time=179.471 calls=77
       Trigger for constraint resource_events_report_id_fkey: time=333.978 calls=77
       Execution time: 563.123 ms
      (7 rows)
      

      While all of these execute quickly right now, that's likely because I currently have more memory than the size of the database. I can see how all of these small queries would grind to a halt if they had to go to disk for any reason.

      Suggestion:

      Remove all cascading DELETEs and UPDATEs and instead opt for one-time bulk queries that resolve the issues the constraints are meant to address.

      In some cases, such as reports and resource_events, the situation would resolve itself when the reports are deleted via report-ttl, and that would be an efficient bulk deletion.

      We could also move to a delete that looks something like this:

      DELETE FROM reports
      WHERE reports.id IN (
          SELECT reports.id
          FROM reports
          JOIN certnames ON reports.certname = certnames.certname
          WHERE (certnames.deactivated IS NOT NULL OR certnames.expired IS NOT NULL)
            AND (certnames.deactivated < now() - node_purge_ttl
                 OR certnames.expired < now() - node_purge_ttl)
      );
      

      We would need to create similar deletes for catalogs, resource_events, and the other tables that currently have a cascading delete.
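      The shape of that replacement can be sketched end to end: one bulk statement per child table, then one for the parent, with no per-row FK work left for the database to do. Again this is a SQLite/Python sketch, not PuppetDB code; the table and column names follow the schema above, and the node_purge_ttl comparison is simplified to a precomputed cutoff string.

```python
import sqlite3

# Sketch: explicit bulk deletes (children first, parent last) in place of the
# cascade. Sample rows and the cutoff date are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE certnames (certname TEXT PRIMARY KEY,
                            deactivated TEXT, expired TEXT);
    CREATE TABLE reports  (id INTEGER PRIMARY KEY, certname TEXT);
    CREATE TABLE catalogs (id INTEGER PRIMARY KEY, certname TEXT);
    CREATE TABLE factsets (id INTEGER PRIMARY KEY, certname TEXT);

    INSERT INTO certnames VALUES ('old-node', '2016-01-01', NULL),
                                 ('live-node', NULL, NULL);
    INSERT INTO reports (certname)  VALUES ('old-node'), ('live-node');
    INSERT INTO catalogs (certname) VALUES ('old-node'), ('live-node');
    INSERT INTO factsets (certname) VALUES ('old-node'), ('live-node');
""")

cutoff = "2016-01-27"  # stands in for now() - node_purge_ttl
purge = """
    SELECT certname FROM certnames
    WHERE (deactivated IS NOT NULL OR expired IS NOT NULL)
      AND (deactivated < ? OR expired < ?)
"""
# One bulk statement per child table, then the parent -- no per-row triggers.
for table in ("reports", "catalogs", "factsets"):
    conn.execute(f"DELETE FROM {table} WHERE certname IN ({purge})",
                 (cutoff, cutoff))
conn.execute(f"DELETE FROM certnames WHERE certname IN ({purge})",
             (cutoff, cutoff))

certs_left = conn.execute("SELECT count(*) FROM certnames").fetchone()[0]
reports_left = conn.execute("SELECT count(*) FROM reports").fetchone()[0]
print(certs_left, reports_left)  # 1 1 -- only the live node survives
```

      The ordering matters: deleting the child rows first means the final delete on certnames has no referencing rows left, so it behaves the same whether or not the FK constraints remain in place.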

      Other Thoughts

      I think this is likely uncommon, as I'm not sure how many people use node-purge-ttl, and those who do probably don't have hundreds of nodes to purge at a time. Even so, I think there are benefits to moving to bulk deletes and updates (or, in some cases, doing nothing at all) instead of letting these cascading deletes and updates happen, since they generate a lot of small queries that can back up when the database is busy.

      The Next Problem

      Upon further observation I realized that the bulk deletion from reports when rows hit the report-ttl is also causing cascading deletes to other tables, which means that in the very large environment I'm testing in, the delete takes 8-9 hours to remove 800,000 rows from reports.

      Here's an example; at the time, 1-27-2016 was just past the report-ttl.

      pe-puppetdb=# select count(*) FROM reports WHERE producer_timestamp < '1-27-2016';
       count
      --------
       822626
      (1 row)
       
      pe-puppetdb=# SELECT pid, age(query_start, clock_timestamp()), usename, query FROM pg_stat_activity WHERE query != '' AND query NOT ILIKE '%pg_stat_activity%' AND query ILIKE '%DELETE%' ORDER BY query_start desc;
       pid  |      age      | usename  |                       query
      ------+---------------+----------+---------------------------------------------------
       4910 | -07:27:47.787 | mYpdBu3r | DELETE FROM reports WHERE producer_timestamp < $1
      (1 row)
      
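      The same child-first idea would apply to report-ttl expiry: bulk-delete the resource_events for expired reports in one statement before deleting the reports themselves, so the per-row resource_events_report_id_fkey cascade has nothing left to do. A hedged SQLite/Python sketch, with invented sample rows:

```python
import sqlite3

# Sketch of applying the bulk-delete suggestion to report-ttl expiry.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reports (id INTEGER PRIMARY KEY, producer_timestamp TEXT);
    CREATE TABLE resource_events (id INTEGER PRIMARY KEY, report_id INTEGER);
    INSERT INTO reports VALUES (1, '2016-01-20'), (2, '2016-02-01');
    INSERT INTO resource_events (report_id) VALUES (1), (1), (2);
""")

cutoff = "2016-01-27"  # the report-ttl boundary from the example above
# Child table first, in one statement...
conn.execute("""
    DELETE FROM resource_events WHERE report_id IN
        (SELECT id FROM reports WHERE producer_timestamp < ?)""", (cutoff,))
# ...then the parent, which no longer triggers any per-row cascade work.
conn.execute("DELETE FROM reports WHERE producer_timestamp < ?", (cutoff,))

reports_left = conn.execute("SELECT count(*) FROM reports").fetchone()[0]
events_left = conn.execute(
    "SELECT count(*) FROM resource_events").fetchone()[0]
print(reports_left, events_left)  # 1 1 -- only the recent report and its event remain
```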

              People

      • Assignee: Unassigned
      • Reporter: nick.walker Nick Walker
