[PUP-6344] metrics.time.total does not correlate to time reported in log Created: 2016/05/23  Updated: 2018/04/23  Resolved: 2018/04/17

Status: Closed
Project: Puppet
Component/s: None
Affects Version/s: None
Fix Version/s: PUP 5.5.1, PUP 6.0.0

Type: Bug Priority: Normal
Reporter: Adam Bottchen Assignee: Jacob Helwig
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
relates to PUP-7520 Puppet profiling timers do not use a ... Accepted
relates to PUP-8661 Total time metric is nil on failed ca... Closed
relates to PUP-920 Add catalog application information t... Closed
Template: PUP Bug Template
Acceptance Criteria:

The total time should represent the amount of time spent by the agent between waking up to start a run and submitting the report at the end of the run.

Epic Link: Puppet Performance
Sub-team: Coremunity
Team: Coremunity
Sprint: Platform Core KANBAN
Method Found: Customer Feedback
CS Priority: Reviewed
Release Notes: Bug Fix
Release Notes Summary: Total time now reports the measured time of the run instead of a sum of other run times.

 Description   

The time reported by a `puppet agent -t` run for "Applied catalog in" does not appear to correlate to the times reported in the Puppet::Util::Metric "time" array. The applied catalog time should be the time used to apply the resources in the catalog, and the times returned to the report processor in the Puppet::Util::Metric "time" array should be the amount of time spent applying each resource, yet the two do not add up. Sometimes the applied catalog line is larger and sometimes the total time is larger.

# cat lib/puppet/reports/myreport.rb
require 'puppet'
Puppet::Reports.register_report(:myreport) do
  desc "Testing"
 
  def process
    Puppet.notice "METRICS: Process Puppet Report"
    Puppet.notice "METRICS: self.logs.last.message: #{self.logs.last.message}"
    self.metrics['time'].values.each { |val|
      Puppet.notice "METRICS: #{val[1]} : #{val[2]}"
    }
  end
# puppet agent -t
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for pe-201612-master.puppetdebug.vlan
Info: Applying configuration version '1464042693'
Notice: Applied catalog in 9.18 seconds
# tail -n 21 /var/log/puppetlabs/puppetserver/puppetserver.log
2016-05-23 22:31:53,920 INFO  [qtp1308528397-63] [puppet-server] Puppet METRICS: Process Puppet Report
2016-05-23 22:31:53,920 INFO  [qtp1308528397-63] [puppet-server] Puppet METRICS: self.logs.last.message: Applied catalog in 9.18 seconds
2016-05-23 22:31:53,920 INFO  [qtp1308528397-63] [puppet-server] Puppet METRICS: Augeas : 1.017779943
2016-05-23 22:31:53,920 INFO  [qtp1308528397-63] [puppet-server] Puppet METRICS: Config retrieval : 10.645895659
2016-05-23 22:31:53,920 INFO  [qtp1308528397-63] [puppet-server] Puppet METRICS: Cron : 0.001115222
2016-05-23 22:31:53,920 INFO  [qtp1308528397-63] [puppet-server] Puppet METRICS: Exec : 0.169413854
2016-05-23 22:31:53,921 INFO  [qtp1308528397-63] [puppet-server] Puppet METRICS: File : 0.45819617799999984
2016-05-23 22:31:53,921 INFO  [qtp1308528397-63] [puppet-server] Puppet METRICS: Filebucket : 0.000103239
2016-05-23 22:31:53,921 INFO  [qtp1308528397-63] [puppet-server] Puppet METRICS: Group : 0.000411452
2016-05-23 22:31:53,921 INFO  [qtp1308528397-63] [puppet-server] Puppet METRICS: Package : 0.7939065209999999
2016-05-23 22:31:53,921 INFO  [qtp1308528397-63] [puppet-server] Puppet METRICS: Pe anchor : 7.1151e-05
2016-05-23 22:31:53,921 INFO  [qtp1308528397-63] [puppet-server] Puppet METRICS: Pe file line : 0.0023166209999999996
2016-05-23 22:31:53,921 INFO  [qtp1308528397-63] [puppet-server] Puppet METRICS: Pe hocon setting : 1.4407772629999998
2016-05-23 22:31:53,921 INFO  [qtp1308528397-63] [puppet-server] Puppet METRICS: Pe ini setting : 0.020410037
2016-05-23 22:31:53,921 INFO  [qtp1308528397-63] [puppet-server] Puppet METRICS: Pe ini subsetting : 0.0049442060000000005
2016-05-23 22:31:53,922 INFO  [qtp1308528397-63] [puppet-server] Puppet METRICS: Pe java ks : 1.454764838
2016-05-23 22:31:53,922 INFO  [qtp1308528397-63] [puppet-server] Puppet METRICS: Pe puppet authorization hocon rule : 1.8567533360000004
2016-05-23 22:31:53,922 INFO  [qtp1308528397-63] [puppet-server] Puppet METRICS: Schedule : 0.00041616200000000003
2016-05-23 22:31:53,922 INFO  [qtp1308528397-63] [puppet-server] Puppet METRICS: Service : 0.35225476499999997
2016-05-23 22:31:53,923 INFO  [qtp1308528397-63] [puppet-server] Puppet METRICS: Total : 18.221604979000002
2016-05-23 22:31:53,923 INFO  [qtp1308528397-63] [puppet-server] Puppet METRICS: User : 0.002074532

In this case, the catalog applied in 9.18 seconds, but the (Total) - (Config retrieval) is 7.57 even though the documentation would seem to indicate they are measuring the same thing.



 Comments   
Comment by Henrik Lindberg [ 2016/05/23 ]

This looks familiar to other reported discrepancies in reported profiling data.

Comment by Charlie Sharpsteen [ 2017/05/17 ]

Could be related to Puppet not using a monotonic clock when taking profiling measurements, so things like NTP syncs can skew the results (PUP-7520).

Comment by Josh Cooper [ 2018/03/12 ]

Also related Melissa Stone

Comment by Melissa Stone [ 2018/03/13 ]

Total here is simply the sum of all the time metrics we've recorded, so it doesn't actually report the total time. There are likely functions that aren't wrapped in a time block. I'm looking at fixing this

The first part of this ticket is a dup of PUP-6343, but the second is covered by PUP-8511. I'm going to close PUP-8511 in favor of this one, but keep PUP-6343 open to cover the work to add the catalog application time to the report.

Comment by Charlie Sharpsteen [ 2018/03/13 ]

Total here is simply the sum of all the time metrics we've recorded

If it truly is a sum, then it could be more than the actual total time as the following code:

Puppet.profile('outer') do
  Puppet.profiler('inner') do
    ...
  end
end

Would double-count the time taken for the inner operation since that time is also measured as part of the outer operation.

Comment by Melissa Stone [ 2018/03/13 ]

Yup, that is exactly why I opened PUP-8511, that, or it could be less than the actual time because there is the possibility there is a function we haven't collected time metrics for at all

Comment by Melissa Stone [ 2018/03/20 ]

Charlie Sharpsteen I'm digging into this, and I added a time catch for run_internal, but I'm getting a lot of unaccounted for time

$ time sudo bundle exec ./bin/puppet agent -t --summarize
.
.
.
Notice: Applied catalog in 5.00 seconds
Changes:
            Total: 1
Events:
          Success: 1
            Total: 1
Resources:
          Changed: 1
      Out of sync: 1
        Restarted: 1
            Total: 331
Time:
         Schedule: 0.00
          Package: 0.00
   Property list key: 0.00
          Printer: 0.01
             Cron: 0.01
             File: 0.07
          Service: 0.08
   Profile manager: 0.28
   Node retrieval: 0.32
   Convert catalog: 0.34
      Plugin sync: 0.94
             Exec: 1.65
   Internal run time: 15.89
         Last run: 1521584425
   Config retrieval: 4.06
   Transaction evaluation: 4.90
   Catalog application: 5.00
   Fact generation: 5.22
       Filebucket: 0.00
        Pe anchor: 0.00
            Total: 38.78
Version:
           Config: 1521584416
           Puppet: 5.5.1
 
real    0m30.481s
user    0m9.362s
sys     0m3.535s

Ignore total in the above summary, that's still the sum of all the times gathered. Internal run time is what I added, which is reporting 15.89 seconds, whereas the time wrapper around the command is giving me very different times. Do you have any other thoughts about where else to collect these times?

Comment by Maximilian Philipps [ 2018/03/23 ]

I have no idea how you all get such fast puppets runs, but I at least can confirm the issue.

Sometimes we see a large discreptancy between the time reported in the puppet-agent log and in the last_summery.yaml:

In this case here the last_summery.yaml reports a total of  2755.9764387860105 seconds, but the puppet-agent log claims to have "only" taken 920.03 seconds.

 

version:
  config: 5a84cbdc4c7277446118e5c5847bc7ffeee017f3-dirty
  puppet: 5.5.0
resources:
  changed: 0
  corrective_change: 0
  failed: 0
  failed_to_restart: 0
  out_of_sync: 0
  restarted: 0
  scheduled: 0
  skipped: 0
  total: 1971
time:
  acl: 0.18600726
  anchor: 0.000360885
  apt_key: 0.0013260910000000002
  augeas: 1.481265272
  catalog_application: 920.0311244819313
  concat_file: 0.000213852
  concat_fragment: 0.00025730999999999996
  config_retrieval: 31.165227114222944
  convert_catalog: 1.5596217680722475
  cron: 0.12194795899999997
  exec: 7.108916220000003
  fact_generation: 2.159287616610527
  file: 867.946438568999
  file_line: 0.0024020359999999998
  filebucket: 8.9651e-05
  group: 0.001327714
  node_retrieval: 0.30567236337810755
  package: 3.9270125899999986
  plugin_sync: 1.1844260040670633
  schedule: 0.0007071429999999999
  service: 0.8189130199999999
  ssh_authorized_key: 0.0034457330000000003
  total: 2755.9764387860105
  transaction_evaluation: 917.9692359287292
  user: 0.001212204
  last_run: 1521802258
changes:
  total: 0
events:
  failure: 0
  success: 0
  total: 0

It seems like catalog_application includes the time for transaction_evaluation, which includes the time for file.

 

Comment by Charlie Sharpsteen [ 2018/03/23 ]

Melissa Stone Seems like the question is whether the missing 5 seconds is occurring before the call to run_internal or after it. A quick way to measure would be adding something like puts Time.now.to_f to:

/opt/puppetlabs/puppet/bin/puppet

Adding one before Puppet is loaded via `require 'puppet'`, before CommandLine is executed, and then after would give some good indications along with similar puts statements around the run_internal method.

Comment by Kris Bosland [ 2018/03/27 ]

I think the missing 5 seconds is in the puppet requires, and maybe .5 seconds or so of ruby itself starting up.  It appears that the acceptance criteria of this ticket is to list the "total" time based on a certain calculation or measurement.  Charlie Sharpsteen, are you the right person to clarify the AC here or can you tell me who to ask?

Comment by Charlie Sharpsteen [ 2018/03/27 ]

Wow, looks like I misread Melissa Stone's results. It looks like we're missing 15 seconds, which is the difference between the "Internal run time" and the the 30.4 second wall time reported by Bash's time command. That's way too much to be Ruby requires, which should take a second or two at most.

We'll always miss a chunk of the run time that occurs after the report is finalized and submitted to the master, but it seems like there is a significant operation occurring that we aren't capturing by instrumenting Puppet::Configurer#run_internal.

Comment by Charlie Sharpsteen [ 2018/03/27 ]

And to clarify the acceptance criteria: Currently total is a simple sum of all the other times recorded to the report. This is misleading and unreliable as it double-counts and times that may have measured nested operations. For example, catalog_application, transaction_evaluation, and the per-resource type timings are all nested within each other, so the resources get counted against the total three times and the transaction_evaluation gets counted twice.

The total time should represent the amount of time spent by the agent between waking up to start a run and submitting the report at the end of the run.

Comment by Kris Bosland [ 2018/03/27 ]

About the "missing" seconds: I was getting 5 seconds with a very simple setup, and Josh Cooper mentioned to me that Facter can be called during the 'require' part, so with multiple environments and custom facts I'm sure there could be a lot of time spent there.

So it sounds like we want 'total' to represent as closely as possible what you would get with the wall clock, e.g. 0m30.481s in Melissa Stone's example?

Comment by Charlie Sharpsteen [ 2018/03/27 ]

So it sounds like we want 'total' to represent as closely as possible what you would get with the wall clock, e.g. 0m30.481s in Melissa Stone's example?

Yup! Although realistically, we won't capture the time for report submission at the end of the run — have to stop the clock in order to submit the time with the report

Comment by Kris Bosland [ 2018/03/27 ]

As I have been gathering information, I have heard that the puppet agent is a long running process which regularly executes the puppet run.  This will make it more challenging to match any reasonable time measurement to what you would see with a start-and-exit-process kind of run (some operations will be once per process work).  Another issue is the variety of ways that Puppet can arrive at a catalog load-and-apply, which the report needs to be consistent across.  At this point, I am planning to replace the incorrect "total" with what Melissa Stone is measuring above with Internal run time, which will be labeled "total".  Release notes will be added explaining what that total means, and I will see if Garrett Guillotte would like to make any relevant doc changes.

 

Comment by Kenn Hussey [ 2018/04/11 ]

Kris Bosland are you still working to get the PR for this merged by EOD today?

Comment by Kris Bosland [ 2018/04/11 ]

Yes Kenn Hussey, it is in review today.  Thanks.

Comment by Jacob Helwig [ 2018/04/11 ]

Merged to 5.5.x in 33cf907d55.

Comment by Josh Cooper [ 2018/04/13 ]

Merged to 5.5.1_release in https://github.com/puppetlabs/puppet/commit/4c12d149cefe20edfc55c86e7743e33ff85e480e

Comment by Josh Cooper [ 2018/04/15 ]

This fails CI, because puppet apply using a manifest with a circular dependency no longer exits with 1. To repro:

notify { 'foo':
  require => Notify['bar']
}
 
notify { 'bar':
  require => Notify['foo']
}

$ bundle exec puppet apply cycle.pp
Notice: Compiled catalog for XXX in environment production in 0.02 seconds
Error: Found 1 dependency cycle:
(Notify[bar] => Notify[foo] => Notify[bar])\nTry the '--graph' option and opening the resulting '.dot' file in OmniGraffle or GraphViz
Error: Failed to apply catalog: One or more resource dependency cycles detected in graph
$ echo $?
0

To repro acceptance test failure:

$ cd puppet
$ git checkout 5.5.1_release
$ cd acceptance
$ bundle install --path .bundle
$ GEM_SOURCE=http://rubygems.delivery.puppetlabs.net SHA=c47a97544f7126e20d63e857350f1d8f71d4823c SERVER_VERSION=5.1.5 OPTIONS='--test-tag-exclude=server' bundle exec rake ci:test:aio TEST_TARGET=ubuntu1604-64a TESTS=tests/reports/finalized_on_cycle.rb

Comment by Kris Bosland [ 2018/04/16 ]

Merged to 5.5.1_release in https://github.com/puppetlabs/puppet/commit/aae6717917be0446e3e3d966a5d8b4a7778c920c.

Generated at Mon Oct 21 08:43:23 PDT 2019 using JIRA 7.7.1#77002-sha1:e75ca93d5574d9409c0630b81c894d9065296414.