[PUP-3077] When puppet agent fails to retrieve catalog from master, cached catalog is used without regard for the puppet.conf-declared environment Created: 2014/08/18  Updated: 2016/06/15  Resolved: 2016/02/17

Status: Closed
Project: Puppet
Component/s: None
Affects Version/s: PUP 3.6.2
Fix Version/s: PUP 4.4.0

Type: Bug Priority: Normal
Reporter: David Gwilliam Assignee: Unassigned
Resolution: Fixed Votes: 3
Labels: TSE
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux, CentOS 6.2


Issue Links:
Relates
Template:
Epic Link: (Burnside) Direct Puppet: Client Static Catalog
Story Points: 3
Sprint: Client 2016-01-27, Client 2016-02-10, Client 2016-02-24
Release Notes: Bug Fix
Release Notes Summary: Puppet agent and apply will no longer cache its catalog during a noop run. This ensures the next non-noop run won't fallback to using the cached catalog and deploy code that was never meant to be "live". Also if the agent falls back to using its cached catalog and the catalog was compiled for a different environment than the agent is currently in, then the agent will fail the run. This way the server remains authoritative about which environment the agent is supposed to be in.

 Description   

Steps to reproduce:

  • do not disable pe-puppet service
  • successfully compile catalog from non-Production environment using puppet agent -t --noop --environment manage_meta, catalog is cached locally
  • puppet master fails to compile catalog in agent's assigned environment (e.g. because a new class is assigned for testing non-Production code)
  • on next scheduled run, agent enforces non-Production state contained in cached catalog in violation of environment = production explicitly declared in puppet.conf
  • This is accompanied by the warning message indicating that the ENC is overriding the agent as authoritative source for environment

Obviously, this could have very bad side effects and, while consistent and predictable, falls squarely under the category of undesirable (and unexpected) behavior, in my opinion.



 Comments   
Comment by Zack Smith [ 2014/08/28 ]

This is pretty bad as this is standard workflow and would cause code to be deployed that was meant to be noop'ed and cause and extremely unintuitive result to occur.

Comment by Daniel Dreier [ 2015/12/08 ]

Ops ran into this internally; ping me or [~rw] for more details if it would be useful and we can write it up.

Comment by Branan Riley [ 2015/12/08 ]

I agree this is unintuitive behavior, but it fits with what "--noop" actually implies in the agent, which is "set the default value of the noop metaparam to true for any resources that don't have it specified".

This ticket falls squarely into the "what does noop actually mean" bucket of questions, a topic I have been increasingly grumpy about the more I learn about our current behaviors. I don't think we can just "fix" this without re-evaluating what noop means and how it is implemented in the agent

Comment by Daniel Dreier [ 2015/12/09 ]

Branan Riley At risk of jumping into architecture ahead of time – my assumption had been that when the agent does a noop run, it adds a "noop => true" metaparameter to all resources in the catalog it writes. I believe that would allow us to skip the question of what noop should mean, while making behavior a bit more predictable in this scenario.

Comment by Lexa Whitehurst [ 2015/12/09 ]

Branan Riley to add to what Daniel Dreier said, puppet also probably shouldn't be applying a cached catalog for an environment different from the one it was attempting to get a catalog for. That didn't happen in our scenario, but the ticket does discuss that situation. Regardless, IMO puppet either shouldn't be caching a catalog for a --noop or should actually be modifying the catalog it caches to set the noop metaparameter. I can't immediately think of any reasonable workflow that would rely on the current behavior, but I can think of multiple that rely on the current behavior not happening.

Comment by Josh Cooper [ 2015/12/09 ]

I agree with what [~rw] said, "puppet ... shouldn't be applying a cached catalog for an environment different from the one it was attempting to get a catalog for". Either the cache needs to be made per-environment aware, or we need to treat mismatched environments as a cache miss. This is true regardless of whether noop is set or not.

Also agree that puppet should either not cache noop catalogs or it should preserve the catalog's noop-ness. I have a mild preference for the first, because of the way noop is implemented. Also I think of it more as associated with the transaction and not the catalog.

One use case I can think of for the current behavior is if you wanted to prime the agent's cache in noop mode, make sure nothing blows up, and have subsequent runs apply the cached catalog not in noop mode. But there are inherent race conditions, e.g. what if the scheduled run happens after you've cached the catalog, but before you've verified nothing blew up. I think it'd be safer to instead download a copy of the catalog, and run that in noop mode, so you don't leave the cache in a bad state, e.g.

$ bundle exec puppet catalog find --terminus rest > catalog.json
$ cat catalog.json | jq . | head
{
  "tags": [
    "settings",
    "foo",
    "class"
  ],
  "name": "...",
  "version": 1449690639,
  "code_id": null,
  "environment": "production",
$ bundle exec puppet apply --noop --catalog catalog.json --default_file_terminus rest
Notice: /Stage[main]/Foo/File[/tmp/foobar]/ensure: current_value absent, should be file (noop)
Notice: Class[Foo]: Would have triggered 'refresh' from 1 events
Notice: Stage[main]: Would have triggered 'refresh' from 1 events
Notice: Applied catalog in 0.07 seconds

So I'd be in favor of just not caching catalogs on a noop run.

Comment by Daniel Dreier [ 2015/12/09 ]

will this impact Direct Puppet? I haven't followed the details of how it's implemented - my concern is that a noop run on a node using direct puppet would cause that node to basically permanently switch to the noop cached catalog, until a new code version comes out. Eric Sorenson do you know if that's likely? Seems like it might change how this issue is prioritized.

Comment by Lexa Whitehurst [ 2015/12/11 ]

Apparently the following should execute a --noop run without caching the catalog:

puppet agent -t --noop --catalog_cache_terminus=''

Comment by Josh Cooper [ 2016/02/04 ]

Daniel Dreier [~rw] it would be great to get your feedback on this, now that it's merged to master.

Comment by Lexa Whitehurst [ 2016/02/04 ]

Josh Cooper based on the description and example posted in the PR, the behavior seems ideal to me.

Comment by Eric Thompson [ 2016/02/17 ]

validated on rhel7 at master SHA: 101a6d8f5b0550ae4be9ce2fd425a4cce14b1e26

[root@gkg3svq14qmxp8z environments]# puppet agent -t --noop --environment manage_meta --server $(hostname -f)
Info: Using configured environment 'manage_meta'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Notice: /File[/opt/puppetlabs/puppet/cache/lib/facter]/ensure: removed
Notice: /File[/opt/puppetlabs/puppet/cache/lib/puppet]/ensure: removed
Notice: /File[/opt/puppetlabs/puppet/cache/lib/puppet_x]/ensure: removed
Info: Applying configuration version '1455743771'
Notice: /Stage[main]/Main/Notify[manage_meta]/message: current_value absent, should be manage_meta (noop)
Notice: Class[Main]: Would have triggered 'refresh' from 1 events
Notice: Stage[main]: Would have triggered 'refresh' from 1 events
Notice: Applied catalog in 0.02 seconds
[root@gkg3svq14qmxp8z environments]# puppet agent -t --noop --environment manage_meta --server $(hostname -f)
Info: Using configured environment 'manage_meta'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Applying configuration version '1455743776'
Notice: /Stage[main]/Main/Notify[manage_meta]/message: current_value absent, should be manage_meta (noop)
Notice: Class[Main]: Would have triggered 'refresh' from 1 events
Notice: Stage[main]: Would have triggered 'refresh' from 1 events
Notice: Applied catalog in 0.01 seconds
[root@gkg3svq14qmxp8z environments]# cat /etc/puppetlabs/code/environments/production/manifests/site.pp
notify {"production":}
include motd
[root@gkg3svq14qmxp8z environments]# cat /etc/puppetlabs/code/environments/manage_meta/manifests/site.pp
notify{'manage_meta':}
[root@gkg3svq14qmxp8z environments]# puppet agent -t --noop --server $(hostname -f)
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Notice: /File[/opt/puppetlabs/puppet/cache/lib/facter]/ensure: created
Notice: /File[/opt/puppetlabs/puppet/cache/lib/facter/facter_dot_d.rb]/ensure: defined content as '{md5}878e161fc0c3682fb1a554fe28b8be60'
Notice: /File[/opt/puppetlabs/puppet/cache/lib/facter/package_provider.rb]/ensure: defined content as '{md5}1473dd2347de164f583bbf66dcfc956a'
Notice: /File[/opt/puppetlabs/puppet/cache/lib/facter/pe_version.rb]/ensure: defined content as '{md5}60d47406026c8201e51394227ddf780d'
[...]
Notice: /File[/opt/puppetlabs/puppet/cache/lib/puppet_x/puppetlabs/registry.rb]/ensure: defined content as '{md5}e81ba7665aedbd5cb519d75a4a8dbbd2'
Notice: /File[/opt/puppetlabs/puppet/cache/lib/puppet_x/puppetlabs/registry/provider_base.rb]/ensure: defined content as '{md5}7d511743b5cec6b375b684b88396838a'
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Evaluation Error: Error while evaluating a Function Call, Could not find class ::motd for gkg3svq14qmxp8z.delivery.puppetlabs.net at /etc/puppetlabs/code/environments/production/manifests/site.pp:2:1 on node gkg3svq14qmxp8z.delivery.puppetlabs.net
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

(did not use the cached catalog for the other environment)

Comment by Shane Madden [ 2016/06/14 ]

Thanks a lot for getting this fixed, but is there any word on whether this will be fixed in the 3.8 maintenance releases? We've had compatibility issues blocking the 4.x upgrade, so we're still forced to run with `ignorecache` as a workaround.

We were very surprised today when a catalog from an environment that had never been run without noop applied itself without warning or user input and took down a production system; confidence in the noop flag is extremely important for confidence in Puppet in general, so it'd be great if this fix made it into the 3.x releases (which will be around at least a little while longer).

Comment by Kylo Ginsberg [ 2016/06/14 ]

Shane Madden since 3.x is on maintenance only, this will probably take a community-driven backport to move it forward. Also, note that there is no scheduled 3.x release at this time (although, since as you say, it will be around for a bit longer, I won't be surprised if there does end up being another 3.x release at some point.)

Comment by Shane Madden [ 2016/06/15 ]

Kylo Ginsberg Gotcha, thanks for the info! Looking at the changes, I think I'd be able to muddle my way through backporting the code itself - but I'm afraid I don't really have my brain wrapped around the tests in puppet's core at all, I'd be useless on that half of the change.

Comment by Daniel Dreier [ 2016/06/15 ]

Josh Cooper based on the PR description, the new behavior (not saving cached catalogs during noop runs) seems like a reasonable fix to me.

Generated at Wed Sep 30 00:24:11 PDT 2020 using Jira 8.5.2#805002-sha1:a66f9354b9e12ac788984e5d84669c903a370049.