[PDB-1227] Test new table structure and fact path/value GC at scale Created: 2015/02/10  Updated: 2015/04/14  Resolved: 2015/03/31

Status: Closed
Project: PuppetDB
Component/s: None
Affects Version/s: None
Fix Version/s: PDB 2.3.1

Type: Bug Priority: Normal
Reporter: Ryan Senior Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
relates to PDB-1031 ERROR: update or delete on table "fac... Closed
relates to PDB-1224 Move fact path reference from the fac... Closed
Template:
Story Points: 2
Sprint: PuppetDB 2015-02-25, PuppetDB 2015-03-11, PuppetDB 2015-03-25

 Description   

Once we have PDB-1224 and PDB-1225 in place and know what the query will look like for PDB-1226 (even if the background process isn't complete), we should test the new structure and queries under heavy load. We can run the disassociated-value GC as a separate cronjob-like process for now, just to simulate the effect of the query on the running system.

We should be able to use Ken Barber's setup for this.
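For illustration, a minimal sketch of what such a cronjob-like GC pass might look like is below (Python/psycopg2; the fact_values/fact_paths table and column names, the interval, and the SQL are assumptions for the sake of the sketch, not the actual PuppetDB schema or GC queries):

```python
# Hypothetical cronjob-like GC pass: delete fact values (and paths) that are
# no longer referenced by any row in facts. Table and column names are
# assumptions for illustration only.
import time
import psycopg2

GC_INTERVAL_SECONDS = 300  # illustrative interval between passes

ORPHANED_VALUES_SQL = """
DELETE FROM fact_values fv
 WHERE NOT EXISTS (SELECT 1 FROM facts f WHERE f.fact_value_id = fv.id)
"""

ORPHANED_PATHS_SQL = """
DELETE FROM fact_paths fp
 WHERE NOT EXISTS (SELECT 1 FROM facts f WHERE f.fact_path_id = fp.id)
"""

def gc_pass(conn):
    # Each DELETE runs in its own transaction so a long value sweep does not
    # hold locks while the path sweep runs.
    for sql in (ORPHANED_VALUES_SQL, ORPHANED_PATHS_SQL):
        with conn, conn.cursor() as cur:
            cur.execute(sql)
            print("deleted %d rows" % cur.rowcount)

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=puppetdb user=puppetdb")
    while True:
        gc_pass(conn)
        time.sleep(GC_INTERVAL_SECONDS)
```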



 Comments   
Comment by Wyatt Alt [ 2015/03/27 ]

This patch looks great based on my performance testing.

On stable we can reproduce the fact_paths problem with 30,000 simulated nodes checking in every 30 minutes and a randomization percentage of 10. The problem does not occur on stable at 20,000 nodes, so the breaking point is somewhere between 20k and 30k.

The CPUs on the postgres server are pegged on deletes more or less as soon as commands switch from being processed with add-facts! to update-facts!, i.e. once the database stores 30k certnames. The GC issues at this point cause the queue size to explode, along with all kinds of other problems.

This patch, on the other hand, handles 30k and 50k nodes without breaking a sweat; in fact, using the -i and -n settings we have not been able to submit factsets fast enough to back up the queue. When we set -n to 100000, only 17 commands per second were being submitted, which suggests the benchmark tool itself was underperforming (or our metrics are wrong). Queue size remains at or around 0 during both additions and updates for all of those node counts.
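For context on that 17 commands/sec figure, a rough back-of-the-envelope check (assuming the same 30-minute run interval as the stable reproduction above; the numbers are illustrative, not measured):

```python
# With -n hosts each checking in once per run interval, the benchmark should
# submit roughly hosts / interval factsets per second. The 30-minute interval
# is an assumption carried over from the stable test setup.
num_hosts = 100_000
run_interval_seconds = 30 * 60

expected_factsets_per_second = num_hosts / run_interval_seconds
print(expected_factsets_per_second)  # ~55.6, well above the observed ~17/sec
```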

The -n and -N flags of the benchmark tool allow you to submit commands unthrottled. Doing this backs up the queue both here and on stable, but it is not a realistic scenario.

Migrating 30,000 nodes' worth of data from stable to this patch took two and a half minutes, which is fine. All in all, this seems to completely solve our problem.
