[PDB-1809] Slow response times to aggregate-event-counts endpoint queries Created: 2015/07/16  Updated: 2018/04/03  Resolved: 2015/07/28

Status: Closed
Project: PuppetDB
Component/s: None
Affects Version/s: PDB 3.0.0
Fix Version/s: PDB 3.0.2

Type: Bug Priority: Normal
Reporter: Michal Ruzicka Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Story Points: 3
Sprint: PuppetDB 2015-07-29


We see slow response times to the aggregate-event-counts endpoint queries. Here is how they clocked on a performance testing system with 3214057 records in the PDB's resource_events table:

[root@lei-puppetdb ~]# time curl -g -G http://localhost:8080/pdb/query/v4/aggregate-event-counts --data-urlencode 'distinct_resources=true' --data-urlencode 'distinct_start_time=2015-07-13T15:55:30Z' --data-urlencode 'distinct_end_time=2015-07-16T15:55:30Z' --data-urlencode 'query=["and",["<","report_receive_time","2015-07-16T15:55:30Z"],["not",["=","containing_class",null]]]' --data-urlencode 'summarize_by=containing_class'
[ {
  "summarize_by" : "containing_class",
  "successes" : 74,
  "failures" : 2,
  "noops" : 0,
  "skips" : 1,
  "total" : 74
} ]
real 0m22.236s
user 0m0.001s
sys 0m0.010s
[root@lei-puppetdb ~]# time curl -g -G http://localhost:8080/pdb/query/v4/aggregate-event-counts --data-urlencode 'distinct_resources=true' --data-urlencode 'distinct_start_time=2015-07-13T15:55:30Z' --data-urlencode 'distinct_end_time=2015-07-16T15:55:30Z' --data-urlencode 'query=["<","report_receive_time","2015-07-16T15:55:30Z"]' --data-urlencode 'summarize_by=certname'
[ {
  "summarize_by" : "certname",
  "successes" : 5022,
  "failures" : 4,
  "noops" : 0,
  "skips" : 3,
  "total" : 5022
} ]
real 5m26.177s
user 0m0.005s
sys 0m0.013s
[root@lei-puppetdb ~]# time curl -g -G http://localhost:8080/pdb/query/v4/aggregate-event-counts --data-urlencode 'distinct_resources=true' --data-urlencode 'distinct_start_time=2015-07-13T15:55:30Z' --data-urlencode 'distinct_end_time=2015-07-16T15:55:30Z' --data-urlencode 'query=["<","report_receive_time","2015-07-16T15:55:30Z"]' --data-urlencode 'summarize_by=resource'
[ {
  "summarize_by" : "resource",
  "successes" : 73596,
  "failures" : 94,
  "noops" : 0,
  "skips" : 1,
  "total" : 73598
} ]
real 5m26.542s
user 0m0.000s
sys 0m0.017s

Comment by Wyatt Alt [ 2015/07/20 ]

There's a patch up for this here:

Scott Walker Karel Brezina Michal RuzickaNicholas Smyth

This will require the CM team to change the SI query to align with this:

or, if you're uncomfortable dropping the not-null:

I would recommend dropping the check on receive_time since it seems redundant with end_time and I was seeing a perf hit due to it yesterday, but currently on my PC I'm not seeing a perf difference, so if we want to be conservative even this would be much better than the status quo:

If we go with that last query, that means the only required change will be to use the combined summarize_by param.

Comment by Michal Ruzicka [ 2015/07/22 ]

The ["not", ["=","containing_class",null]] test is only supposed to be applied to the query summarizing by containing class - the idea is to filter out events on resources not contained in any class (e.g. included in a node container in the site.pp). But according to my testing even such resources are 'contained' in a main class which is perhaps substituted here for the main stage?
I.e. omitting the filter does not really make a difference.

Generated at Tue Jan 21 20:45:44 PST 2020 using JIRA 7.7.1#77002-sha1:e75ca93d5574d9409c0630b81c894d9065296414.