[PUP-3647] Puppet stores files forever in state.yaml Created: 2014/11/07  Updated: 2018/10/02  Resolved: 2018/09/26

Status: Closed
Project: Puppet
Component/s: None
Affects Version/s: None
Fix Version/s: PUP 5.5.7, PUP 6.0.1

Type: Bug Priority: Major
Reporter: Peter Bakkum Assignee: Jacob Helwig
Resolution: Fixed Votes: 4
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by PUP-4460 'statefile' is never pruned of files ... Closed
Relates
relates to PUP-3043 Recursive File resources operating on... Accepted
Template: PUP Bug Template
Epic Link: Puppet Performance
Team: Coremunity
Sprint: Platform Core KANBAN
Method Found: Customer Feedback
CS Priority: Major
CS Frequency: 3 - 25-50% of Customers
CS Severity: 3 - Serious
CS Business Value: 4 - $$$$$
CS Impact: Many users would like to use recurse. Having it lead to poor performance isn't good.
Release Notes: Bug Fix
Release Notes Summary: Previously, the state.yaml file could grow unbounded. The new `statettl` setting controls how long entries are cached (default: 32 days). If you use resource schedules, please see the `statettl` documentation to see how this setting interacts with the schedule type.

 Description   

We noticed the following today on one of our production clusters: https://cloudup.com/cgJm3eD4wu4. Essentially Puppet Agent has been using gradually more CPU and Memory over time and was taking longer and longer to execute. On some nodes its been using 2G during each run.

Digging in, we discovered why. We have configuration like the following:

file {
	"/var/db/":
		ensure => directory,
		group => "dbuser",
		owner => "dbuser",
		recurse => true;
}

It turns out that the /var/db directory on this machine sometimes includes temporary files, because of the data system using this directory. This means there are a few new files in /var/db every time we run puppet agent.

We discovered today that each time puppet sees a file, it writes a line to /var/lib/puppet/state/state.yaml. Since there were always new files each time puppet ran, this file grew and grew. Its about 125MB on some of our machines. With a state.yaml file of that size, puppet agent took this long:

real 6m31.792s
user 5m56.434s
sys 0m33.867s

After deleting /var/lib/puppet/state/state.yaml it dropped to:

real 0m25.232s
user 0m16.073s
sys 0m4.112s

We've removed our use of `recurse` for now, which wasn't really necessary in this case, but we consider this a dangerous situation and a bug in Puppet; it means that we need to be sure Puppet isn't touching new files over time. Happy to provide more details if its helpful.



 Comments   
Comment by Eric Sorenson [ 2017/06/26 ]

Hm, I filed the linked duplicate bug PUP-4460 where it seemed like this is a vestigial file but didn't figure out exactly what the consequences would result from ditching it entirely. Charlie Sharpsteen is this something you could put on your list for cryptoentomological exploration?

(If it turns out to actually be necessary, I'm not sure what the right approach would be to solve this problem specifically - seems like we'd need to put some pretty complicated logic in to prune a new state file of resources that used to be managed in (current_catalog - 1) but are no longer in (current_catalog)....)

This bit from lib/puppet/resource/catalog.rb appears to be the only usage of the statefile routines....

  def apply(options = {})
    Puppet::Util::Storage.load if host_config?
 
    transaction = create_transaction(options)
 
    begin
      transaction.report.as_logging_destination do
        transaction.evaluate
      end
    ensure
      # Don't try to store state unless we're a host config
      # too recursive.
      Puppet::Util::Storage.store if host_config?
    end
 
    yield transaction if block_given?
 
    transaction
  end

Comment by Charlie Sharpsteen [ 2017/06/26 ]

Looks like the ResourceHarness is the primary user of data stored in the statefile. Statefile data seems to come into play in two places:

Determining schedules. The :checked timestamp is used to determine whether a resource falls within a schedule:

https://github.com/puppetlabs/puppet/blob/4.10.4/lib/puppet/transaction/resource_harness.rb#L40-L45

Providing input for event generation. The cache is used to create historical_values:

https://github.com/puppetlabs/puppet/blob/4.10.4/lib/puppet/transaction/resource_harness.rb#L291-L301

Which are referenced in the event generation code:

https://github.com/puppetlabs/puppet/blob/4.10.4/lib/puppet/transaction/resource_harness.rb#L123-L127

I suspect the main use here may be to provide input for audit events.

A simple resolution to the growth of the file may be to enforce a time-to-live when the state is flushed out to disk:

https://github.com/puppetlabs/puppet/blob/4.10.4/lib/puppet/util/storage.rb#L80

For example: filtering out all entries that are older than some multiple of the agent runinterval.

Comment by Josh Cooper [ 2017/06/26 ]

Also copying a comment from Michael Smith from the duplicate ticket, we load the state.yaml file twice.

Comment by Henrik Lindberg [ 2018/03/14 ]

I started to read code to figure out what is going on - beginning with why the state file is loaded twice. The two loads (in the order they occur) come from:

  • configurer.rb in a call to client.init_storage
  • catalog.rb in its apply method does this Puppet::Util::Storage.load if host_config?

That raises the question - why is catalog doing this only if "host_config?" and why is it doing that? Is it a valid reload because it should start from the original stored state? Can someone explain what "host_config?" is?

A bit further down in catalog.rb we can read this:

      # Don't try to store state unless we're a host config
      # too recursive.
      Puppet::Util::Storage.store if host_config?

So, clearly the cache is never written by an apply unless it is a "host_config?"

The load caused by the configurer seems to be because it uses the cache, and attempt to rescue a bad situation:

  # Initialize and load storage
  def init_storage
      Puppet::Util::Storage.load
      @compile_time ||= Puppet::Util::Storage.cache(:configuration)[:compile_time]
  rescue => detail
    Puppet.log_exception(detail, _("Removing corrupt state file %{file}: %{detail}") % { file: Puppet[:statefile], detail: detail })
    begin
      Puppet::FileSystem.unlink(Puppet[:statefile])
      retry
    rescue => detail
      raise Puppet::Error.new(_("Cannot remove %{file}: %{detail}") % { file: Puppet[:statefile], detail: detail }, detail)
    end
  end

The line that sets @compile_time seems to be why it needs to read it from loaded state. Anyone know what this is about?

The code seems to be from 2007 written by Luke.

I think the two loads comes from first applying the settings catalog, and then applying the real catalog. I base this on seeing catalog.apply being called three times, and the first time host_config? is false, the second time it is true, third time it is false. It is a guess though. I also wonder what the third apply is for?

The handling of errors from the load that is done in configurer looks wrong - there is similar handling in Catalog that seems to do a better job - renaming the file with a "bad" so it can be examined, and also handling the issue of not being able to remove (say on windows).

Interestingly - if I brute force just return in configurer init_storage to this:

  # Initialize and load storage
  def init_storage
    return  ### SUPERHACK - WHAT HAPPENS
  end

It still passes all tests in the puppet repo. (Which makes me sad...)

Comment by Charlie Sharpsteen [ 2018/03/14 ]

That raises the question - why is catalog doing this only if "host_config?"

The biggiest effect of checking host_config? should be to avoid loading the state when applying the "settings catalogs" that Puppet uses to create cache directories and manage other files related to settings in puppet.conf, or to download plugins by applying a recursive file resource:

The whole reasoning behind the load in init_storage seems to be to retrieve the "compile_time", which appears to never be used, and to clean the state file out if it happens to be corrupted. If we're not using the compile_time, or interacting with the state in Puppet::Util::Storage until after a catalog has been retrieved from the Puppet Server (or loaded from cache), then it seems like we can get rid of the init_storage call in the configurer.

As for the state file its self, adding a time to live on the entries that is used to prune data when we write the file back at the end of a run seems like a good way to get rid of the bloat problem.

Comment by Henrik Lindberg [ 2018/03/15 ]

Charlie Sharpsteen Thanks for additional input. See if someone has any more to say about @compile_time.

For the second part of the problem - if it prunes based on write time it needs to touch all entries that are used - otherwise the cache would not have entries that originated from 2 x run-interval time ago. It would also mean that if a sequence of catalogs are used with some smaller intermediate catalogs then those would result in cache evictions. Is that acceptable?

Comment by Charlie Sharpsteen [ 2018/03/15 ]

I'm assuming that if a resource is in the catalog or generated agent-side as part of something like a recursive file resource, then it gets written to state.yaml with an updated :checked timestamp. If this is the case, then we could apply a simple filter during Puppet::Util::Storage.store that drops all entries where :checked is older than some time-to-live:

pruned_state = @@state.reject {|k, v| v[:checked] < Puppet[:statettl]}
 
Puppet::Util::Yaml.dump(pruned_state, Puppet[:statefile])

For :statettl, we could probably just pick something like "7 days" as a default value and solve 80% of the problem. Things that generate a lot of catalog entries, like recursive file resources that return 10s or 100s of thousands of directory entries, will still be a problem. But, these sorts of resources should be avoided anyway as they cause performance problems beyond the size of the statefile.

Comment by C Lang [ 2018/08/15 ]

We were just bitten by this one in a scenario where Tidy was used to purge a handful of temp files every run - after 6 months, the state file was 70M and the agent hammered the CPU for 2 minutes before doing anything.

I'm surprised this isn't a higher priority after four years.  If it can't be fixed, maybe Tidy should be removed or at least have a warning that it will eventually bring the agent to its knees?

Thanks.

Comment by Stefan Strandberg [ 2018/08/30 ]

This bit us where we were briefly using a Tidy to clear puppet reports older than 10 days on our puppetserver.  We realized that the Tidy was causing issues several months ago and changed it to just a cron job, which resolved an immediate issue, but puppet client runs on the puppetserver were taking >5 minutes still.  I finally spent time today diagnosing and ultimately ran into this thread and discovered that the state file was over 300MB despite the problematic Tidy being taken out a very long time ago.  After removing it, puppet runs are back down to sub-6 seconds of actual time, 16 seconds total.  Puppetserver load which had been pinned around 3-5 dropped immediately to <0.5 as well, improving performance across the board.

I'd like to suggest a higher priority for this as well because it's not inherently obvious that there is a state file, that it tracks every file puppet was ever aware of, and that it doesn't stop tracking files that it should no longer care about, leading it to grow forever and ever until it bogs down the system.  Our planned "workaround" is going to be a cron job that kills it once a month everywhere, but I worry about unintended consequences with that and would much prefer an internal solution as described above.

Thanks!

Comment by Jarret Lavallee [ 2018/09/19 ]

The Schedule resource uses the state file to determine if the schedule matches, so a short ttl could break longer schedules. If we implement a :statettl, please be sure to update the Schedule documentation to denote that it needs to be configured for a longer time. Alternatively, we could check to see if the resource has a schedule while pruning and skip those items.

https://github.com/puppetlabs/puppet/blob/master/lib/puppet/transaction/resource_harness.rb#L45
https://github.com/puppetlabs/puppet/blob/master/lib/puppet/type/schedule.rb#L237

Comment by Kris Bosland [ 2018/09/24 ]

Merged into 5.5.x at b4901b3.

Comment by Josh Cooper [ 2018/09/26 ]

Passed CI in e2f7ccfb7

Generated at Thu Dec 12 02:45:39 PST 2019 using JIRA 7.7.1#77002-sha1:e75ca93d5574d9409c0630b81c894d9065296414.