[SERVER-94] Environment Isolation Created: 2014/10/21  Updated: 2018/02/07  Resolved: 2018/02/06

Status: Closed
Project: Puppet Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Epic Priority: Normal
Reporter: Chris Price Assignee: Eric Sorenson
Resolution: Done Votes: 41
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File foreman_smartproxy.pp    
Issue Links:
Blocks
is blocked by SERVER-1200 Classifier synchronization can pull u... Closed
Relates
relates to SERVER-1697 Manual configuration of per-environme... Open
relates to MODULES-4357 puppetlabs-f5 virtualserver can't pro... Closed
relates to PUP-4885 Master and Agent should not share libdir Closed
relates to MODULES-5802 puppetlabs-concat : Server Error: no ... Resolved
relates to PUP-731 Masters cannot reliably distinguish b... Closed
Epic Name: Environment Isolation
Template:
Team/s:
Platform Core
Sub-team: Emerald, emerald
Epic Status: Done
CS Priority: Reviewed
QA Contact: Erik Dasher

 Description   

Puppet Environments provide a way to use different versions of the same modules for different populations of nodes. Our Code Management workflow maps git branches onto environments, allowing administrators to manage the flow of change towards their production estate with software development best practices like pull requests, code review, and merge-ups.

But there are a few problems with the current implementation of environments on the puppet server and catalog compiler. Code can "bleed through" from one environment to another, and compilation problems in one environment can affect others. Now that we have the power of Greyskull – errr, Jetty – at our disposal, we have an opportunity to address these problems; this epic covers improvements and bugfixes around separating environments from one another.



 Comments   
Comment by Susannah Axelrod [ 2015/06/06 ]

Eric Sorenson Please add a description to this ticket.

Comment by Eric Sorenson [ 2015/10/07 ]

Talking with Nan Liu about this - it's a big issue for the customers he's working with and is limiting their ability to implement puppetserver and modern PE versions. He'd be willing to sacrifice startup/instantiation time for a new jruby per environment, if that simplified pool management. Because they do a lot of ruby extensions, the puppet-server code starts to need testing, like reports, functions, and libraries that call into the underlying jruby, because there were implementation differences between MRI and JRuby.

Comment by Eric Sorenson [ 2016/01/08 ]

Original bug w/history of the problem

Comment by Eric Sorenson [ 2016/01/12 ]

Breaking this down a little, there are a couple of areas of work needed. First, some mapping of incoming requests to available jruby instances, and second a mapping of a given jruby to the underlying filesystem state which it uses to compile catalogs from.

Some spitball solution constraints, open for debate:

  • the solution must work for existing content on the Forge and in the world (that is, must not require a new type API)
  • the end user must not need to write a new configuration file to describe the jruby->environment mappings
  • the solution may require the user to enable a global runtime option, effectively running puppet-server in a different non-production mode
  • the end user may be required to hit a HTTP endpoint or run a command to signal the underlying environments have changed
Comment by Chris Price [ 2016/01/12 ]

Eric Sorenson can you elaborate on the 'new config file' part? Is it acceptable for there to be new (presumably optional) settings exposed in an existing configuration file that would be used for performance tuning related to different environments? (e.g. for a user to indicate that 'production' was an important environment that needs to be as fast as possible, while other for environments it is acceptable for there to be some startup costs to get a new JRuby running on-demand when a request comes in for that environment?)

Also, the last time I discussed this whole thing in the presence of Josh Cooper, he suggested that we were going to have to solve the "master adds agent libdir to ruby load path" problem before this was going to be tractable, so we need to sort that out and determine whether the user-facing implications of that are going to violate any of your constraints.

Comment by Henrik Lindberg [ 2016/01/13 ]

If we analyze each environment we can detect if they have "the same ruby content" - if so, they can share one JRuby instance. I imagine that the "no configuration file should be required" requirement would be met by that. Ideally this would be triggered as an environment is changed, comparing SHAs etc. but a simple HTTP endpoint would perhaps be enough for a first release; say disabling environments that have any change until they have been analyzed; which is manually triggered on demand (if the user has opted in to this behavior).

Eric Sorenson Would it be ok for users to have one file per environment that contains a "ruby code fingerprint" ? You run a tool to update it. The server runs everything with the same fingerprint in one JRuby instance. That would be the simplest possible thing I can imagine. You run testing etc. without the feature turned on. When you are done with testing, you bless that version by generating that fingerprint. You then use a production server with the feature turned on. The only thing the feature does is allocating an environment to a particular JRuby instance. There is no other management, no endpoints etc.
The fingerprint itself can go into environment.conf. Until there is a tool that generates the fingerprint, you can hobble along by manually changing it.

From there it is a matter of finding a good process, a good way to detect changes, and a way to produce the fingerprint as quickly as possible.

Comment by Jo Rhett [ 2016/01/13 ]

warning: outsider PoV

In general I like the idea that Henrik Lindberg proposed.

The fingerprint itself can go into environment.conf. Until there is a tool that generates the fingerprint, you can hobble along by manually changing it.

May I suggest a slight variation?

  • The fingerprint itself can go into a file location specified in environment.conf. Until there is a tool that generates the fingerprint, you can hobble along by manually changing it.

It would be trivial to add automation that populated a new file with this SHA. None of the companies I work with today regenerate environment.conf dynamically with each code push, and that would require some extra auditing and authorization hooks in several of those firms.

Comment by Chris Price [ 2016/01/13 ]

Eric Sorenson Would it be ok for users to have one file per environment that contains a "ruby code fingerprint" ? You run a tool to update it. The server runs everything with the same fingerprint in one JRuby instance.

Why would that need to go into a file at all? Couldn't the server just compute it and keep it in memory?

Also, we can't run everything with the same fingerprint in one JRuby instance because that would be a big bottleneck for environments like production that will have multiple requests in flight simultaneously.

And... I think we need to run down the story on this agent libdir issue before we get into the weeds of the implementation details for the JRuby instance management :/

Comment by Henrik Lindberg [ 2016/01/14 ]

Chris Price It could certainly be kept in memory when the server can compute the fingerprint. I was describing what I thought would be the absolute least to implement but still offer the key features (i.e. manually poking a fingerprint number into a file).

Also naturally; the fingerprint will tell the server which pool of JRuby instances to use rather than individual JRuby instance.

And, as you point out, how to find all the ruby code that matters for the fingerprint calculation is important.

Comment by Chris Price [ 2016/02/29 ]

A handful of folks have pinged me lately asking for notes about the implementation details of how we might do the JRuby pooling. We've put a decent amount of thought into it and I have a few ideas/proposals that I just need to put into writing. I think they are simpler than some of the things that have been batted around in the past, in this ticket and elsewhere.

I've been hoping to hold off until we have more of a concrete plan on PUP-4885 before spending the time to write up other proposals, because they may be impacted by it. That said, it's obvious that people are interested in more detail on this so I'll make sure it happens sometime soon.

Comment by Chris Price [ 2016/05/20 ]

I ran into a different kind of reproducer for this yesterday. I'm sure that others on this ticket already knew this, but it was new to me, so I'm capturing it for posterity because it will make a good acceptance test.

Discussion can be found here: https://github.com/voxpupuli/puppet-archive/issues/105

But basically, if you install the voxpupuli puppet-archive module into a non-production environment, and then try to do agent runs against that environment without ever having installed the module into the production environment, you will get errors like this:

Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Resource Statement, Could not autoload puppet/type/archive: Could not autoload puppet/provider/archive/curl: Could not find parent provider ruby of curl at /etc/puppetlabs/code/environments/develop/modules/archive/manifests/nexus.pp:53:3 at /etc/puppetlabs/code/environments/develop/modules/ui_broker/manifests/init.pp:33 on node rnode01.test.com
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

There are seemingly two problems at play here:

1. the server is trying to load the provider, which shouldn't be necessary, and presumably would be fixed by the proposal that Henrik Lindberg has been talking about.
2. if the server is going to try to load the provider (or any other ruby code, really), the autoloader should be smart enough to put the same directory on the load path that it attempted to load the child provider from, or it will never be able to load the parent provider. it's nuts that it would find the child but then not look in that directory any longer to find the parent.

Presumably this is one of the issues that gets "fixed" when the master does a pluginsync, assuming that the master's agent libdir is in the master's load path. (facepalm - PUP-4885).

It seems like we could look into some tweaks to the autoloader that would at least find the parent provider in this case. It wouldn't solve the whole environment isolation / ruby class re-loading issue, but in an r10k workflow where you're just trying to introduce a new module in a test environment and then promote that up to production, it would get things working.

Comment by John Simpson [ 2016/06/03 ]

Also ran into this with two environments using versions 4.6.0 and 4.7.0 of the puppetlabs/postgresql module. (We have Puppet environments corresponding to different versions of the software we provide to our clients.) The error message wasn't immediately obvious...

Error: Could not retrieve catalog from remote server: Error 400 on SERVER: no parameter named 'connect_settings' at /etc/puppetlabs/code/environments/env_with_4_7_0/modules/postgresql/manifests/server/role.pp:66 on Postgresql_psql[CREATE ROLE repl ENCRYPTED PASSWORD ****] at /etc/puppetlabs/code/environments/env_with_4_7_0/modules/postgresql/manifests/server/role.pp:66 on node hostname.domain.xyz

... but after some digging I was able to figure out that postgresql::server::role has a connect_settings parameter in 4.7.0 which didn't exist in 4.6.0, and it looked like part of the 4.7.0 module was calling code from the 4.6.0 module, even though they're not in the same environment.

I opened support ticket 19208 with Puppetlabs (using the same name for the product and the company is already causing confusion, grrr...) who identified the problem as this issue, and offered two possible work-arounds:

  • Set up a separate compile master for the environments using the older version
  • Upgrade all of our environments to use the same version

I'm already planning to look into using newer versions of the forge modules anyway, so I suspect we'll end up using 4.7.1 in all environments.

Comment by ITler [ 2016/06/07 ]

Referring to the last comment of John Simpson. Maybe this helps anyone facing same issue.
I've upgraded my puppet dev environment hosted in a local vagrant box. It uses puppetlabs-postgresql 4.7.1. The puppetserver box only serves one environment (dev). Agent run on a second vagrant box run without problems.
When putting my changes to test environment, I also run into problems with connect_settings, as mentioned above.

The confusing thing is, that my puppet test server is a separate machine also with only one environment (test). (There is a production folder within environmentspath, but it is only the skeleton). I've upgraded libraries to be on 4.7.1, too. (I'm always using librarian-puppet).

Only after restarting puppetserver service on test node, my (dedicated) test client machine was able to get and apply the catalogue without problems.

Comment by Trevor Vaughan [ 2016/07/25 ]

Eric Sorenson and Henrik Lindberg I'm following up on this based on some recently bygone conversations on the chat room.

We're seeing this crop up more and more as an issue and I was wondering how quickly a solution could be dropped that just spins up an isolated JRuby instance per environment and serves the appropriate hosts from there.

I get that this isn't a "good" solution due to system resources but it should be relatively quick and easy to test and FAR better than spinning up tons of compile masters.

Comment by Eric Sorenson [ 2016/07/27 ]

Note for the watchers of this ticket -

While we are continuing to investigate the solution described in this ticket (isolating environments by restricting a given jruby to only compile catalogs for one environment), the immediate solution for custom Types is taking a different approach. We're going to prevent the compiler from loading the Ruby types at all, by modifying its loader to look for environment-safe representations of the types first, and only fall back to loading Ruby if that description is not found. You can follow this work at PUP-5946.

A plea for your data: it would be extremely helpful for the continued jruby investigation to understand your usage of environments, and specifically: how many *active* environments do you have (agents checking in at least once an hour or so), and what is the distribution of agents across those environments? Chris made a little script to trawl the access logs for this info: https://gist.github.com/cprice404/e97c260c70ff3088a73dcfdfad0bfe7a

Feel free to anonymize your environment names before posting, or just email me at eric0@puppet.com – thanks!

Comment by Trevor Vaughan [ 2016/07/27 ]

Eric Sorenson Thanks for the update Eric. I would like to note that a lot of the issue is with custom functions as well. I understand that there are now functions in Puppet 4 but it will take a LONG time to port existing work to pure Puppet functions across the board and may never happen completely due to needing new features.

This is particularly critical since a lot of new users will jump into using functions prior to touching a custom type and, of course, will have your standard dev/test/prod/feature_branch in place according to currently advertised best practice.

Will functions be handled during this update as well, or just Types at this point?

Comment by Eric Sorenson [ 2016/07/27 ]

Trevor, we're just handling types since there is an existing environment-safe Function API.

If there are functions in Forge modules that fall into this category, can you let me know which ones? We'd be willing to devote development energy to getting them moved over to the new API and sending PRs to the module maintainers, both to help build some reference implementations of them as well as help the downstream users of the modules.

Comment by Trevor Vaughan [ 2016/07/27 ]

Eric Sorenson Hey Eric....I would start with these https://github.com/puppetlabs/puppetlabs-stdlib/tree/master/lib/puppet/parser/functions

If you pin two different environments to two different versions, very unexpected results could occur at runtime.

Comment by Chris Price [ 2016/07/27 ]

Henrik Lindberg I could have sworn that we reworked the underlying implementation of parser functions, even for the old API, to where they were stored in environment-specific maps and no longer suffered from this problem. Is that wrong?

Comment by Trevor Vaughan [ 2016/07/27 ]

Chris Price It's certainly possible. But, I haven't been able to find it authoritatively closed in any ticket and haven't been able to grok the code enough to pin it down solidly and we've seen things that are indicative of this in the field. However, it's possible that the code in use in various customer environments isn't...um...quite as up to speed with the Puppet 4 new car smell as they should be.

If nothing else, can we get a document that specifically calls out what does, and does not, work in the official Puppetserver docs.

Comment by Trevor Vaughan [ 2016/07/27 ]

Chris Price So, if you're using the new (non-backward compatible) function code, this almost works.

Per Henrik's post http://puppet-on-the-edge.blogspot.com/2015/01/the-puppet-4x-function-api.html

Helper Logic
 
You can have as many helper methods you like in the function - it is only the methods being
 dispatched to that are being used by the 4x function API. You are however not allowed to define 
nested ruby classes, modules, or introduce constants inside the function definition. If you have 
that much code, you should deliver that elsewhere and call that logic. Note that such external 
logic is static across all environments.

But, as clearly stated, external Ruby logic is static across all environments.

This is bad since I can no longer have production and any other environment on the same compile master that may involve external Ruby and/or testing.

Also, while I fully expect stdlibmageddon to hit come Jan 1 2017, porting and testing all existing functions to P4 is not trivial in many cases, particularly when you already have external Ruby helper logic (see above).

Comment by Henrik Lindberg [ 2016/07/28 ]

Chris Price As I understand 3.x functions they are loaded per environment (and since PUP-5813 also evicted, meaning that all versions 4.0 to 4.4.0 leaked all loaded functions memory wise). The 3.x functions are thought to be safe if they do not bind new Ruby constants, or Ruby require/load any code, but I cannot swear to that.

Comment by Chris Price [ 2016/07/28 ]

OK, I think I understand it now. 3.x functions are safe as long as they don't do any fancy Ruby things, but for any that do do fancy ruby things, they're still subject to all of the known issues.

Then, it follows that if any of the stdlib functions or common/popular modules are doing fancy Ruby things, then having the JRuby-per-environment pooling in place would still have a pretty high value proposition, even with PUP-5946 addressed.

Comment by Trevor Vaughan [ 2016/07/28 ]

Chris Price Anything that uses your PuppetX namespace is probably subject to this issue across the board.

I don't see it being used offhand in stdlib but I don't know if any of the other items in stdlib might be creating objects, etc...

Comment by Henrik Lindberg [ 2016/07/28 ]

Chris Price Agree there is a high value proposition for isolating environments with separate JRuby instances when "non puppet defined" (i.e. "safe to load") ruby code differs. It would be very nice if it was possible to boot a JRuby, then boot Puppet, and then snapshot for quick start.

Comment by Chris Price [ 2016/07/28 ]

I don't think the 'snapshot' part of that is possible, but I think there are other ways we can handle that. e.g. keeping a few "warm" instances ready, and, ultimately, improving the Puppet settings code, which I think could probably reduce the startup time by 50% or more.

Comment by Trevor Vaughan [ 2016/07/28 ]

Chris Price and Henrik Lindberg Passenger used to pre-spin back-end Rubies at start time and then start new ones and kill the old ones periodically to prevent memory issues (and other Ruby nonsense).

Could you spin up a back end Ruby per environment (to some sane/configurable limit) and then cycle them over time?

The place where you're really going to run into issues is places with 100+ environments (yes, we've seen them in the wild).

In this case, I guess you would want to have a weighted queue for them where the most used stay up and the least used cycle out based on the number of clients using them over time.

Comment by Chris Price [ 2016/07/28 ]

Trevor Vaughan yes, we already have max-requests-per-instance for cycling them over time. What we don't have yet, and what I had in mind when I originally created this ticket, was just to expand our JRuby pooling to have pools-per-environment rather than just a single global pool of JRubies.

What I would like to do is to allow the pooling strategy to be toggled via configuration. The default, for some period of time, would probably be the current strategy, so that we can get some testing / mileage on the new strategy/strategies before making them the default.

I'd envision that the first new "strategy" we might add support for would be a configuration-driven strategy. In this approach, the user would be responsible for setting up their configuration to list what the most common (and performance-sensitive) environments were, and probably tell us how many JRubies to keep around dedicated to those environments. Then, there'd be a sort of 'overflow' pool for uncommonly used environments. The user could also configure the size of this pool. Whenever we got an agent request from an environment that isn't on the 'dedicated' list, we'd use this overflow pool, which we'll flush instances out of using an LRU strategy. We can keep a couple of JRubies warm to make the process of cycling things in and out of this overflow pool faster.

The config might look something like this (credit to Nan Liu as this was his idea, probably like 2 years ago now):

{
  jruby-puppet:
    {pooling-strategy: configured-environments
     dedicated-environments: {
        production: {pool-size: 4}
        test: {pool-size: 4}
        staging: {pool-size: 4}
     }
     overflow-pool-size: 4
     pre-warmed-pool-size: 2
}     

The upside of this approach is that the implementation should be fairly simple and easy to debug, and the configurability should be sufficient for tuning for most users needs. The downside is that it will require some knowledge and configuration on the part of the user. But, since it should be faster to get out the door, and should serve as a valuable stepping stone towards any other strategies we want to implement, it seems like that'd probably be the best first step.

After that I can imagine us implementing a second new strategy that was purely LRU. That would work pretty much the way you're describing, @trevor. The initial configuration, from the user perspective, would be simpler, but the implementation will probably be trickier and it seems likely that for certain users it'd be hard to get the maximum performance, because you could easily end up in situations where JRubies are being flushed and recreated more frequently than would be ideal. I'd guess that it might take a few releases of tweaking the LRU strategy and getting feedback from users before we got to the point where we felt like we had this strategy nailed down.

So, those are the rough ideas that I've had in mind. Not sure when we might get approval to work on them, though.

Comment by Kendall Moore [ 2016/08/02 ]

Chris Price I just wanted to chime in and say that I like the solution you mentioned above. I agree that it's maybe going to require a bit more knowledge and configuration by the user but I think that's a worthwhile sacrifice to make, especially if it means getting a solution in place for this.

Eric Sorenson you mentioned an "immediate solution" for custom Types regarding this issue and gave us a ticket to track (very much appreciated). I see that this solution is slated for release in 5.y and is in development right now, but do you think you could shed some light on a more specific timeline for release if that's possible? This issue has really been popping up a lot for me lately, especially with customers who have the goals of combining Code Manager + File Sync + Multi-tenancy. Basically it's causing these customers to not use Code Manager + File Sync and revert back to pure r10k so that they can isolate environments to specific compile masters. In some case I've seen PS fail to start because a type got loaded into the wrong environment after File Sync put a feature environment on a compile master, and in other cases I have had customers just unwilling to even risk it because they view their production environments as too important to possibly have impacted by this. Whichever the case, I think it's worth addressing because it certainly seems like Code Manger + File Sync are the way of "the future" and it would be nice not to have to tell customers they can't use them right now.

Comment by Henrik Lindberg [ 2016/08/02 ]

Kendall Moore I will answer the part directed to Eric Sorenson - The intermediate solution is in an epic that is indeed targeting 5.y as the ambition longer term is to be able to write puppet resource types in the puppet language (when appropriate). We will ship an experimental feature that is one step in that direction in Puppet 4.6.0 - the tickets contained in that epic we plan to deliver have Puppet 4.6.0 as the target.

In brief, the support is that there is a tool "generate type" that writes out meta data in puppet language form about all the resource types in an environment. The puppet compiler will then use this information instead of the ruby implementation of the resource types. The loading of that meta data is guaranteed to be isolated to the particular environment using those resource types. The meta data is based on the puppet types system; which is widened to include the notion of "object", and we have given that part of puppet the name "pcore" as the data types (including resource types and all under kinds of "types") are the core of puppet.

I hope that helps navigating the tickets and providing a small piece of context around that feature set.
We will most likely release this new feature ("resource types described in pcore by the tool called 'generate type'") as experimental as it is a brand new feature and it needs to meet real code written by users to prove itself. For that reason the feature is opt-in (if you run the generate tool you have opted in).
More information should be available as we are getting closer to release.

Comment by Kendall Moore [ 2016/08/03 ]

Henrik Lindberg This certainly sheds some light on the direction for Puppet Types going forward. Frankly, I'm excited for this change as I think it will be a huge improvement. I'll definitely be on the "opt-in" side of that release.

Some follow-up questions more specific to earlier discussions about this ticket. I know that early on, one of the requirements for a solution was that it had to work with the current type system. Is that still a plan? I know people seem to think the solution Nan Liu posted would be a good one, but I haven't seen any follow-up tickets or comments confirming that would actually happen.

Second question: Does this solve the less common case of having bleed-over from Providers? I haven't read through everything with the new Puppet Type system yet, but it doesn't seem like it would.

My overall concern here is that even if I'm able to opt-in to writing Puppet Types in Puppet and not Ruby, I still can't control the code I'm getting in other updates, be it Forge modules or otherwise. This leaves me open to running into this issue no matter how careful I am on my own, and why I'm still hoping for a solution that is bigger picture.

Appreciate you all still being so responsive about this ticket as I'm sure everyone is excited to have it in the past.

Comment by Henrik Lindberg [ 2016/08/03 ]

Pardon my ignorance, but I am not aware of what Nan Liu proposed as a solution. Is there a pointer?

Clashing Providers can certainly also occur (as noted, this is less common). We will take the same approach with providers and generate meta data the compiler can use. Some during the 3.x series the compiler started needing to load providers. This had the unfortunate effect that providers not suitable for the master could end up being loaded, and if two environments had different versions of the one and same provider they would clash. Unfortunately we ran out of time to also fix the isolation problem for providers for the 4.6.0 release - expect that to come in a subsequent release. Essentially, the compiler only needs to know the set of providers available so it can validate its explicit use.

The long term ideas are: resource types are just part of the puppet type system and they as well as providers can be written in the puppet language. When needed, providers can be written in Ruby or some other language.

Kendall Moore you are right that even you are careful with your own code, you still need to consume other people's code. The solution we opted for, i.e. generating meta data and using that instead of the implementation should solve that as it generates the meta data for everything in the environment including other people's code.

Comment by Nicholas Fagerlund [ 2016/08/03 ]

Kendall Moore The stuff we're working on for 4.6 expects that NOBODY is writing types in the Puppet language yet — there's a tool you run out-of-band that analyzes all available Ruby types, and creates equivalent Puppet language objects for all of them. (Almost all. There are some less common features that can make Ruby types untranslatable.)

So, yes, this intermediate solution will work with the current type system. (That's what you meant, right?)

Comment by Chris Price [ 2016/08/03 ]

Also, the server (compiler) should really not need to be loading providers at all, and Josh Cooper said that he thinks we can fix that without too much trouble. If we can achieve that then most (all?) of the environment-related issues around providers should go away. IMO that should be the next thing we aim for after the current puppet type stuff lands.

Comment by Josh Cooper [ 2016/08/03 ]

the server (compiler) should really not need to be loading providers at all, and Josh Cooper said that he thinks we can fix that without too much trouble.

The problematic code is here. This code path is called by the agent and server. However, the server/compiler does not normally need to load providers at all. The server doesn't need to figure out the default provider, since the default provider will be different for different agents, e.g. windows vs redhat vs ubuntu. The agent will figure out the default provider when it converts the master (JSON) catalog into an agent (RAL) catalog.

Ideally we could just skip those methods when running on the server. But we can't quite make that change, because the server manages its internal settings using a "settings catalog". The server will create a catalog containing file resources to manage the owner, group, and mode for its file-based settings, e.g. ssldir. The server then applies the catalog, causing it to execute what are normally agent code paths.

So I think we need to figure out how to manage settings on the server without having it load providers for all types. Maybe we only load user, group and file providers on the master? That should be okay because those are core types (so we don't have different versions of them in different environments).

Or we figure out how to use all of the file settings once when the server starts and make Puppet.settings.use a noop after that? That might be generally a good thing for the server, since we really shouldn't be applying settings catalogs during catalog compilation for performance reasons.

Comment by Henrik Lindberg [ 2016/08/04 ]

When a pcore based resource type is used the problematic code path (that loads all providers) will not be executed. And if I read the code right (referenced by Josh in the previous comment), only the providers for the resource type in question (the resource type being loaded/defined) are loaded. Thus it seems that as a side effect of taking the pcore route to loading, there would be no problem with clashing providers (unless there is logic elsewhere that deals with the default provider). As we do not generate pcore for puppet's core resource type implementations (only those in modules) we may already in practice have isolation for providers...

Comment by Josh Cooper [ 2016/08/04 ]

Thus it seems that as a side effect of taking the pcore route to loading, there would be no problem with clashing providers (unless there is logic elsewhere that deals with the default provider).

I think pcore will definitely help, but I'm not sure it's sufficient. Since the master applies settings catalogs, it will call Catalog.to_ral, and that will create instances of Puppet::Type::{User,Group,File} and load their respective providers. The part I'm not sure about is whether we end up loading all providers for all types through some other side-effect?

Comment by Henrik Lindberg [ 2016/08/04 ]

Josh Cooper we worry about exactly the same thing:

The part I'm not sure about is whether we end up loading all providers for all types through some other side-effect?

We need more experiments and tests to assert what we think (hope) wrt. what is going on with providers during compilation and to_ral.

Comment by Trevor Vaughan [ 2016/09/19 ]

Out of curiosity, has there been any movement/another ticket created focusing on the brute force solution of spinning up multiple JRuby instances?

Comment by Chris Price [ 2016/09/27 ]

Trevor Vaughan AFAIK this is still the only ticket we have for that. I'd be happy to create a separate one but I Eric Sorenson is running the show on environment isolation.

Comment by Alexander Fisher [ 2016/10/04 ]

Would setting max-requests-per-instance = 1 mitigate this issue at all?

Comment by Henrik Lindberg [ 2016/10/04 ]

With max request per instance set to 1 it will be the slowest way to work around the problem.

Comment by Alexander Fisher [ 2016/10/04 ]

A colleague just had to run puppet 8 times in a row before hitting an instance that knew about a recently added type parameter. That was pretty slow too!
Where I am, most people are running puppet adhoc and not as a daemon. I have a pair of puppetservers (for redundancy) which are mostly idle. The performance hit might be acceptable.

Comment by Chris Price [ 2016/10/04 ]

I believe that the server ends up making a 'request' to the jruby instances during initialization (to read the settings from puppet.conf), so I think that if you set it to 1 it would flush the jrubies before they were even able to handle an agent request. Setting it to 2 might be more in line with what you're hoping to I achieve, but the performance penalty will be so severe that your server will not be able to handle any meaningful load.

If the puppet generate workflow isn't solving the issues for people then there really isn't a good workaround; we need to do the work to isolate the jrubies per environment in the puppetserver codebase.

Comment by Chris Price [ 2016/10/04 ]

Alexander Fisher well, you could certainly try it out! In the situation you described though, it seems like what you really want is to flush the JRuby pool when you've made a change to a type. You can do this via the admin API.

Comment by Alan Schwarzenberger [ 2016/10/11 ]

Here we have well over a thousand puppet environments ... I'd be happy to provide details (I saw mention of a log file scanning tool above) or discuss our use case further. We have very carefully controlled server side ruby as a result, and am very much looking forward to environment isolation coming one day.
Yes, we have a puppet class to flush the Jruby pool whenever it changes as Chris Price commented.

Comment by Chris Price [ 2016/10/11 ]

Alan Schwarzenberger thanks for the info. If you get a chance to try out the `puppet generate` stuff that Eric Sorenson linked in previous comments, it'd definitely be helpful to know what percentage of your issues that solves.

Comment by Alan Schwarzenberger [ 2016/10/11 ]

Note that we guarantee via a controlled build pipeline and automated release process that every one of these environments have precisely the same version of all modules that contain types or functions, ie that contain server side Ruby, as we put these modules onto the base_module_path and prevent any other modules being released which include server side ruby.
Will look back and run the tool on a test box when I can, likely a couple of weeks unless there is some urgency from your side.

Comment by Chris Price [ 2016/10/11 ]

Alan Schwarzenberger no urgency per se - if it solves most or all of your problems that would be awesome. If it doesn't, the info might help us make a case for implementing the jruby pooling isolation that is discussed in this ticket.

Comment by Joel Wilson [ 2016/10/14 ]

I was thinking about this issue a few weeks ago and thought that perhaps we could take advantage of the dynamic nature of Ruby itself (which I think is still even available within JRuby) to help with this. I saw mention of code fingerprints above, which is better than the solution I was thinking of, which was to have the developers create signature files of the modules with the version information in it. I guess the modulespec files do that already, but not everyone uses them.

Anyway, instantiate a generic Object for each module and then load into an instance hash a "singleton" instance for each version of the module. It would be used like public class methods instead of instance methods (in otherwords, not intended to store state), but could provide the versioned behavior depending on what version the calling manifest is expecting to answer (this implies some dispatcher intelligence).

However, there is a lot of flailing about to solve this and it largely has to do with the centralized nature of Puppet. Are the code artifacts to create the catalog so large that they couldn't just be sync'd down to the nodes themselves and the nodes just take care of themselves as if they weren't using Puppet Enterprise at all? Exported resources could even be sent down with that.

Comment by Trevor Vaughan [ 2016/10/20 ]

Eric Sorenson or Henrik Lindberg Is there a SERVER-94 test case reproducer out there? I think I've come up with a method for successfully working around the Ruby loading issue in the module space but would like to have a known repeatable test case for validation if possible.

Comment by Henrik Lindberg [ 2016/10/21 ]

Trevor Vaughan An acceptance test have been added to puppet for environment isolation - ping Eric Thompson (I don't have the ticket number handy atm).

Comment by Eric Thompson [ 2016/10/21 ]

PUP-6779 confirms env. isolation with a custom type with an added attribute in one environment. it ensures puppet generate alleviates this isolation issue. the test is in puppet master branch

Comment by Trevor Vaughan [ 2016/10/24 ]

At PuppetConf last week, I managed to work out something that appears to work around the issue for custom Functions, Types, and Providers that include external Ruby code (aka puppet_x).

Details over at https://www.onyxpoint.com/fixing-the-client-side-of-multi-tenancy-in-the-puppet-server/.

It seems to work from my preliminary testing.

Comment by Henrik Lindberg [ 2016/10/25 ]

Trevor Vaughan It is really hard to manage additional code that requires binding to ruby names. All points accessing them need to bind to them using some other lookup than simply referring via a Ruby symbol (that is indeed how the 4.x loaders work when loading functions). There is one additional issue when one environment is changed - i.e. new version of code in the same environment - meaning that bindings to constants will not work.

Last, but not least, if any additional code is needed it can clash.

Since modifying loading of Puppet_x as per your experiment is basically the same as loading a 4.x function - you can get the same effect by simply including the logic needed in a 4.x function (and following the rules documented in the 4.x function api specification).

Comment by Joel Wilson [ 2016/10/25 ]

I think the method that I described in comment-359539 may address some of these issues, but I don't know enough about the guts of the system to try testing it.

Comment by Tommy McNeely [ 2016/10/25 ]

I realize that people have moved away from the "passenger" architecture, with puppetserver, but I wonder if something similar could be setup in "java land" ...

One of my rails devs was working on implementing a rails5 feature called "cable" and sent me this: https://www.phusionpassenger.com/library/config/nginx/action_cable_integration/ ... which leads me to https://www.phusionpassenger.com/library/config/nginx/reference/#passenger_app_group_name ...while the example is a bit different than the issue we are experiencing, it does feel quite similar. This would effectively be the same thing as launching one "appserver" per environment with a little config magic, but with the added benefit of them launching on demand and having some sane total max process limit, probably even at a per-env level.

Again, I realize that we are not using passenger/ruby anymore, but passenger is opensource and maybe that feature/logic could be adapted to whatever is "dispatching" application requests in puppetserver? If not, and I am being completely dense, thats fine, go about your day and ignore this. Discussion about this at the contributor summit and this recent traffic just had me thinking about a way to implement it in our setup (which is actually passenger, for another few weeks at least).

~tommy

Comment by Chris Price [ 2016/10/25 ]

Tommy McNeely that actually maps pretty well to what I described above about pooling the JRuby instances and isolating each pool to correspond to a specific puppet environment. We'd need to decide what we want the config to look like but the resulting behavior, I think, would be very similar to what you've linked above.

Comment by Trevor Vaughan [ 2016/10/26 ]

Henrik Lindberg Thanks for the feedback. I get that the v4 API is supposed to fix some of this but I couldn't find a clear example on how to replace a complex Class with a function or set of functions.

Do you have any working examples of using the new function API to spawn helper classes/functions that can hold state over time that are simpler to use and understand than what I posted? I can take it for a whirl and update my post to reflect my new findings if they hold true.

Also, do you have a spec testing guide for the new API function calls with instance and/or class variables? I know how to test the Ruby mangling but I haven't a clue how to hold state in the new functions and also test them.

Also, I understand that we'll have issues with files being updated on disk. Rails wipes all autoloaded constants when that happens and perhaps that should be adopted here as well http://guides.rubyonrails.org/autoloading_and_reloading_constants.html#constant-reloading.

I was actually looking at auto-wiping my constants but there was no good way to know when we would be finished using them.

Comment by Henrik Lindberg [ 2016/10/26 ]

Trevor Vaughan As with all things global and no clear transaction boundaries it is impossible to manage global things correctly. Puppet once had slow and buggy code that tried to overcome that. It is simply not a path that leads to success.

The design criteria for 4.x functions did not include maintaining state so there is no nice API for that, but the function instance itself is a state that is managed. Need to do some experiments to come up with a safe (and understandable) way to do this - basically it should be possible to use instance variables in the function, but care has to be taken to not step on those already in use.

In general it is a bad idea to maintain state (many small caches/loaded data leads to high memory consumption that creates a worse performance problem than what the maintained state/cache tried to optimize). Naturally there are cases where it really matters. Specifically, a function's life cycle is not officially defined, and the life of a function instance is tied to the environment in which it is used, thus it may be used in multiple compilations in the same environment. Here we need to discuss what we can/should specify. (The specification is in the 4.x function API document in the specifications repo).

Lindsey Smith At this point the 4.x function's ability to contain additional logic should be documented in our official docs - I know there is 4.x function API documentation in the works. The rest needs a bit of work before being made official.

Comment by Chris Price [ 2016/10/26 ]

In general it is a bad idea to maintain state (many small caches/loaded data leads to high memory consumption that creates a worse performance problem than what the maintained state/cache tried to optimize). Naturally there are cases where it really matters. Specifically, a function's life cycle is not officially defined, and the life of a function instance is tied to the environment in which it is used, thus it may be used in multiple compilations in the same environment.

Also worth noting that in both Puppet Server and the old Passenger stack, multiple, isolated ruby interpreter instances are created in order to maximize performance and also to avoid some other problems that would arise related to global state. This means that if you due try to cache state (inside of a function definition, or anywhere else), that state is isolated and distinct per Ruby interpreter - so there will always be multiple instances of it in memory. It's not global to the entire Puppet Server (or Passenger) instance.

Comment by Trevor Vaughan [ 2016/10/26 ]

Chris Price That's OK. When I was looking at saving state, it was really only for the duration of a compile. Everything can be tossed after that. In general, instance variables work just fine. I've only had to resort to class variables when dealing with making things recursive that weren't really designed to be called that way (or I'm just doing it wrong :-D ).

Also, as painful as it is, we're still going to have to support some 3.X stacks into next year. A LOT of time has gone into just getting code migrated and ready in a lot of environments that I've visited.

In terms of saving global state, this goes back to the ticket that I keep meaning to work on but never have time. I do save state in some of my types and providers and a global state registry that is self-cleaning would be ideal to keep that sort of nonsense at bay. In theory, I'm OK since I'm explicitly killing my state variables as I move forward, but it really needs to be hooked into the language.

I suppose that the double resource hook could be used to get around this and have the second resource follow the first and gather everything from the guts of the first set of resources but...ew.

Comment by Henrik Lindberg [ 2016/10/26 ]

State on the agent (when doing an apply) is something very different (in types and providers) from the state maintained when compiling.

Have now looked into what would be a safe and useful way of handling state in a function. Will create a ticket for that - essentially the support consists of using a so called Data Adapter that associates a hash with the Compiler. This can be done manually today, but we want to make this available with an official API. Basically a 4.x. function (in Ruby, using the SystemFunction support) can ask for a cache to be injected into calls. That hash maintains the state through a compilation (and then disappears).

Trevor Vaughan Would that be sufficient for the use cases you were thinking about?

Comment by Trevor Vaughan [ 2016/10/26 ]

Henrik Lindberg That sounds exactly like what I want, yes.

Also, definitely post when there's a better description of how to use the v4 functions to replace the mixed in modules in multiple classes. The only thing that's coming to mind is a case where I end up with 500 functions of dubious naming quality and I'm 90% that I've got that wrong.

Thanks!

Comment by John Simpson [ 2016/12/13 ]

So does PE 2016.5 solve this issue? It lists "Environment Isolation" as one of the highlights of the release.

Comment by Eric Sorenson [ 2016/12/13 ]

John Simpson yes, it auto-generates the puppet language type descriptions, isolating different versions of the same type across environments. more detail here: https://docs.puppet.com/pe/latest/code_mgr.html#environment-isolation-metadata

Comment by John Simpson [ 2016/12/13 ]

Eric Sorenson Does this mean that, in order to have my environments be isolated from each other, I have to use this "code manager" thing? I ask because all of our PE servers are monolithic, so we've never had a real need for it (unless I'm totally missing the point of what it does - it just ensures that the environment directories are identical across multiple compile masters, yes?)

Also, we're using a custom r10k script to build complete environment directories... will these new *.pp files that we're not supposed to touch be created automatically somehow, or is there some other command that our custom r10k script needs to run in order to generate them after creating a new environment directory? Will it hurt anything if these files are deleted? (Our custom r10k script deletes and creates each environment directory from scratch whenever a branch for which an environment exists has been updated.)

Comment by Henrik Lindberg [ 2016/12/14 ]

John Simpson

  • the .pp files describing the resource types are created by the command puppet generate types - it syncs the descriptions with the resource type implementations in the environment the command is executed for.
  • the generated files can be removed and recreated at will (that is also what the command does when executed)
Comment by Alan Schwarzenberger [ 2016/12/14 ]

Is this feature PE only, or also available in Puppet open source? If it is available in open source, I would appreciate some guidance on how to integrate this into our enterprise scale system that has well over 5000 puppet environments.

Comment by Nate McCurdy [ 2016/12/14 ]

It is available in open-source Puppet as well. Here's the docs on using it:

https://docs.puppet.com/puppet/4.8/environment_isolation.html#enable-environment-isolation-in-open-source-puppet

Comment by Thomas Mueller [ 2016/12/14 ]

Alan Schwarzenberger I can confirm this works with OS Puppet. I've tested with the elasticsearch module in different versions in different environments which failed with puppet 3.8 and works with 4.8 and this puppet generate types operation.

Comment by Alan Schwarzenberger [ 2016/12/14 ]

Many thanks. Will create a story internally for someone to review the docs and plan a way forwards. Anyone who knows the internals of this see a reason this would not work with the number of puppet environments in the 5000 to 10000 range ?

Comment by Andreas Paul [ 2016/12/16 ]

After finally getting the puppet generate types to work on a normal Puppet client instead of a Puppetserver, because we use a central Puppet environment code rsync server in front of all Puppetservers.
Needed to add the environmentpath config parameter to the Puppet client puppet.conf:

  ini_setting { "add_environmentpath_for_generate":
    ensure    => present,
    path      => '/etc/puppetlabs/puppet/puppet.conf',
    section   => 'main',
    setting   => 'environmentpath',
    value     => "/var/data/puppet_sync/",
  }

Which, by the way, would have been way easier if puppet generate did support to configure this on the CLI with a parameter like --environmentpath or an environment variable of some sort.

Then we encountered the following problem with the https://github.com/theforeman/puppet-foreman module:

puppet parser validate /tmp/.resource_types/foreman_smartproxy.pp
Error: Could not parse for environment production: Syntax error at '/' at /tmp/.resource_types/foreman_smartproxy.pp:34:37

I attached the generate file to this ticket.
What I can gather is that the ruby internal URI.regexp pattern from the foreman module (https://github.com/theforeman/puppet-foreman/blob/master/lib/puppet/type/foreman_smartproxy.rb#L33) does create a pattern which the Puppetserver refuses to use.

Now I'm not sure if
(a) the puppet generate did something wrong with the URI.regexp or
(b) the guys from the foreman-puppet module should change something.

I will create a issue on their Github page and link it back here for anyone interested, but I'm unsure if they've even heard of puppet generate types yet.
Edit: Github issue link: https://github.com/theforeman/puppet-foreman/issues/512

I'm also wondering why puppet generate types does create something that the puppet parser validate can't validate in the first place.

So @Alan Schwarzenberger if you have full control of all your Puppet environment code and can disable the generated types for some parts as the official documentation mentions https://docs.puppet.com/puppet/4.8/environment_isolation.html#troubleshooting-environment-isolation.
Then yes it should be able to work with thousand of different environments.

In my case I can't generate the types for all environments if I can't be sure that the resulting file works or fails the Puppet run.
I could do a puppet parse validate for each generated file and only include those that looks good, but I have to think about that one, because we are removing more and more logic away from the Puppetserver.

Comment by Henrik Lindberg [ 2016/12/16 ]

The puppet generate types should not generate faulty .pp code. It should instead either back off from trying to generate something (if that is acceptable), or error and not generate the .pp file for that resource type.

Andreas Paul it would be great if you could log a separate (PUP) ticket with what you posted above and also include the generated .pp representation of the resource type.

On your other comment, regarding --environmentpath, since that is a setting, you should be able to set that on the command line just like any other setting.

Comment by Eric Sorenson [ 2016/12/21 ]

Folks - I'd like to ask you to take conversations to two places:

1. If you are having issues with the puppet generate types, please open a ticket in the PUP project with the "Generate types" component.
2. If you're interested in following the design and implementation of jruby pool management, please move over to SERVER-1697

Comment by Justin Stoller [ 2018/02/06 ]

As Eric says, please take additional discussion elsewhere.

Comment by Lucas Yamanishi [ 2018/02/07 ]

Does the closing of this ticket mean the bug is fixed?

Comment by Justin Stoller [ 2018/02/07 ]

It's fixed -ish.

 

tl;dr of this ticket is:

Improper environment isolation is an artifact of Puppet's architecture and fully fixing would require a re-write. We've gone a fair bit in that direction and have two options for folks who are affected by it:

  1. Use new Puppet 4 style function interfaces written within the Puppet language and puppet generate types which combined allow Puppet to process environment code without actually loading Ruby code that modifies the Ruby runtime in ways that bleeds through the environments
  2. Set the max-requests-per-instance setting in Puppet Server to 1. Which means we'll trash the Ruby runtime between every request, ensuring every request has a clean environment, but will cause terrible performance implications.

Discussion of improving or changing those two approaches can should follow the avenues that Eric mentions in his last comment.

Generated at Wed Nov 13 18:13:01 PST 2019 using JIRA 7.7.1#77002-sha1:e75ca93d5574d9409c0630b81c894d9065296414.