[SERVER-1735] Performance testing with http client metrics enabled Created: 2017/02/28  Updated: 2017/06/28  Resolved: 2017/06/28

Status: Closed
Project: Puppet Server
Component/s: None
Affects Version/s: None
Fix Version/s: SERVER 5.0.0

Type: Task Priority: Normal
Reporter: Ruth Linehan Assignee: Ruth Linehan
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
relates to SERVER-1824 Audit profiler metrics Accepted
relates to TK-443 clj-http-client: Add setting to turn ... Resolved
relates to SERVER-1823 Remove certnames from http client met... Closed
relates to TK-442 Bump dropwizard/metrics version Closed
Template:
Epic Link: Http client metrics
Team: Systems Engineering
Sub-team: Server
Story Points: 2
Sprint: Server 2017-04-05, Server 2017-04-19, Server 2017-05-03, Server 2017-05-31
Release Notes: Not Needed
QA Risk Assessment: No Action

 Description   

Run a Gatling A/B performance test to ensure that adding http client metrics hasn't dramatically affected performance.



 Comments   
Comment by Ruth Linehan [ 2017/05/24 ]

For this ticket I did two types of testing: performance testing with Gatling and memory testing using a curl script.

For the performance testing I set up Gatling jobs with http-client-metrics enabled and disabled and ran them on our perf hardware. The two jobs are described in this branch: https://github.com/rlinehan/gatling-puppet-load-test/tree/SERVER-1735-http-client-metrics

The job reports can be seen on http://puppetserver-perf-driver68-dev.delivery.puppetlabs.net:8080/job/http-client-metrics/. I did two runs through the whole suite, which covered scenarios of 500, 1000, and 1250 agents. Overall, in each scenario the mean agent response times of the metrics-enabled and metrics-disabled runs were within one standard deviation of each other, so having the metrics enabled does not appear to affect performance.
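
For concreteness, here's a minimal sketch of the comparison criterion applied to each scenario. It assumes Gatling report directories whose js/global_stats.json exposes meanResponseTime and standardDeviation fields; the report layout and field names are assumptions to adjust for your Gatling version:

#!/usr/bin/env bash
# Usage: ./compare_runs.sh <report dir, metrics enabled> <report dir, metrics disabled>
# Reports whether the two runs' mean response times are within one standard
# deviation of each other.
read_stat() {
  # read_stat <report-dir> <field> -- pull a "total" value out of js/global_stats.json
  python -c "import json,sys; print(int(json.load(open(sys.argv[1]))[sys.argv[2]]['total']))" \
    "$1/js/global_stats.json" "$2"
}
m_on=$(read_stat "$1" meanResponseTime)
m_off=$(read_stat "$2" meanResponseTime)
sd=$(read_stat "$1" standardDeviation)
diff=$(( m_on > m_off ? m_on - m_off : m_off - m_on ))
if [ "$diff" -le "$sd" ]; then
  echo "means within one standard deviation (${diff}ms <= ${sd}ms)"
else
  echo "means differ by more than one standard deviation (${diff}ms > ${sd}ms)"
fi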

In addition, I did some memory testing by running a curl script after installing Puppet Server + PuppetDB on two separate vmpooler instances. On one I had http client metrics enabled; on the other, metrics were disabled.

On each machine, I ran the following script, which curls the master's /puppet/v3/catalog endpoint with a varying agent certname (agent$z) and, after every 500 requests, appends the output of the status service to a JSON file. The outer loop ran 150 times, for a total of 75,000 catalog requests.

 
# Drive 75,000 catalog requests at the master, checkpointing the status
# service's memory stats after every batch of 500.
for j in {1..150}
do
  for i in {1..500}
  do
    # Vary the certname in the request URI (agent$z) so per-certname http
    # client metrics accumulate over the run. The client authenticates with
    # the master's own cert, hence the auth.conf change noted below.
    z=$(($i*$j))
    curl --cacert /etc/puppetlabs/puppet/ssl/certs/ca.pem \
      --cert /etc/puppetlabs/puppet/ssl/certs/`hostname -f`.pem \
      --key /etc/puppetlabs/puppet/ssl/private_keys/`hostname -f`.pem \
      https://`hostname -f`:8140/puppet/v3/catalog/agent$z?environment=production &>/dev/null
  done
  echo $j
  # Append the status service's debug-level output (which includes JVM
  # heap metrics) to output.json after each batch.
  curl -k -s https://localhost:8140/status/v1/services/status-service?level=debug | python -m json.tool >> output.json
done
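
As a sketch of how the resulting data can be pulled back out: output.json ends up holding 150 concatenated JSON documents, which jq reads natively as a stream. The key path below is an assumption based on the jvm-metrics that trapperkeeper-status reports at the debug level; adjust it to match the fields actually present in your dump:

# Print one heap-usage sample (in bytes) per status snapshot, in order
jq '.status.experimental."jvm-metrics"."heap-memory".used' output.json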

(Note that for this to work you need to change the trapperkeeper auth.conf to make the /puppet/v3/catalog rule more permissive - by default a node is only allowed to retrieve its own catalog - e.g.

 
{
    # Allow any client to retrieve any node's catalog (the default rule
    # only lets a node fetch its own)
    match-request: {
        path: "/puppet/v3/catalog"
        type: path
        method: [get, post]
    }
    allow: "*"
    sort-order: 500
    name: "puppetlabs catalog"
}

)

Each catalog request in this script generates 3 http client metrics for its agent certname: 1) a with-metric-id metric, puppetdb.facts.find.<certname>; 2) a with-url metric, pdb/query/v4/nodes/<certname>/facts; 3) a with-url-and-method metric, pdb/query/v4/nodes/<certname>/facts.GET. (Note that if the catalog had actually compiled, additional url metrics would have been generated - e.g. for sending the catalog to puppetdb - but in these requests the catalog did not compile.)

In an actual agent run, 9 http client metrics are generated per certname - 5 with-metric-id, 2 with-url, and 2 with-url-and-method. The metric ids are:

[:classifier :nodes :<node name>] - POST /v1/classified/nodes/<certname>
[:puppetdb :facts :find :<node name>] - GET /pdb/query/v4/nodes/<certname>/facts
[:puppetdb :command :replace_catalog <certname>]
[:puppetdb :command :replace_facts <certname>]
[:puppetdb :command :store_report <certname>]

(The metric ids are the vectors; for the first two, the corresponding HTTP request is also shown.)
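
As a sketch, these metrics can be eyeballed on a live master by querying the master service's debug-level status; this assumes they are surfaced under status.experimental.http-client-metrics, so treat the key path as an assumption and adjust to the actual status output:

# List each http client metric's id (or its url, for the with-url metrics)
curl -k -s 'https://localhost:8140/status/v1/services/master?level=debug' \
  | jq '.status.experimental."http-client-metrics"[] | ."metric-id" // .url'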

Since each request in my simulation created 3 metrics and I ran 75,000 requests, the run produced 225,000 http client metrics - the number a fleet of 25,000 nodes would generate at 9 metrics per node. In other words, I was simulating the memory impact of http client metrics for 25,000 nodes.
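
A one-line sanity check of that node-equivalence arithmetic:

# 75,000 requests x 3 metrics each, at 9 metrics per real agent run
echo $(( 75000 * 3 / 9 ))   # => 25000 simulated nodes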

The data I collected can be found in a Google Docs spreadsheet here: https://docs.google.com/spreadsheets/d/1Yun7uuxMRGUl0T19MrXHuXNNHzchWTM73BMu94KtgXU

Ultimately, this data shows a few things: 1) heap memory usage increases by roughly 300-500 MB when http client metrics are enabled; 2) total GC time increases by about 22%; 3) GC CPU, averaged over the second half of the runs (because there was a lot of variation in the first half), is 9% with http client metrics enabled versus 6% with them disabled.

Initially, I thought that since we don't have any real use for the with-url metrics in puppetserver (we use the metric-id metrics instead), it might make sense to add to the http client library the ability to disable their automatic creation. Unfortunately, that turned out to be more difficult than I expected (still doable, but not the 2-hour change I was hoping for).

Eliminating those would remove 4 of the 9 http client metrics we currently create per certname. Another option for reducing the metric count would be to stop including the certname in the metric ids. Beyond the http client metrics, we also create quite a few other per-certname, per-resource, and per-PuppetDB-query metrics via the puppet profiler; it might be best to audit all of the profiler metrics we currently provide and whittle some of them out.

Comment by Ruth Linehan [ 2017/06/01 ]

I did some further memory testing and looked at the heap dumps with YourKit to see whether the additional memory used when http client metrics are turned on would ultimately be cleaned up under memory pressure - i.e., whether it is held via strong references, weak references, etc.

With the same setup as above, I ran the following script:

for j in {1..150}
do
  for i in {1..500}
  do
    z=$(($i*$j))
    curl --cacert /etc/puppetlabs/puppet/ssl/certs/ca.pem \
      --cert /etc/puppetlabs/puppet/ssl/certs/`hostname -f`.pem \
      --key /etc/puppetlabs/puppet/ssl/private_keys/`hostname -f`.pem \
      https://`hostname -f`:8140/puppet/v3/catalog/agent$z?environment=production &>/dev/null
  done
  echo $j
  curl -k -s https://localhost:8140/status/v1/services/status-service?level=debug | python -m json.tool >> output.json
done
# After all 75,000 requests, take a live heap dump ("live" dumps only
# reachable objects, triggering a full GC first)
runuser -l puppet -c 'jmap -dump:live,format=b,file=/tmp/<name>.hprof <pid>'

(I needed to modify the puppet user to give it a login shell first).

The generated hprofs can be found here; enabled.hprof is the dump from the run with http client metrics enabled.

With http client metrics enabled, 587 MB was used, with 496 MB (shallow size) / 584 MB (retained size) reachable via strong references.

With http client metrics disabled, 387 MB was used, with 293 MB (shallow size) / 383 MB (retained size) reachable via strong references.

This is unfortunately a pretty sizeable difference (roughly 200 MB more strongly-referenced memory), and looking at the diff between the two dumps, the additional objects do all appear to come from metrics (rather than from some fluke in the run).

In order to combat this additional memory usage while leaving http client metrics enabled by default, I have filed 4 tickets. The first two should be handled for Puppet Server 5; the latter two are later improvements:
1. SERVER-1823: remove certnames from http client metric ids (will reduce number of metrics created per-certname by 50%)
2. TK-442: bump dropwizard/metrics version (there's a memory leak in our version of metrics which we may be hitting)
3. TK-443: add a setting to clj-http-client to turn off url metrics (to further reduce the number of metrics created per-certname)
4. SERVER-1824: audit profiler metrics (because we create a lot by default and we may not have any use for many of them)
