[SERVER-1213] Improve UX for upgrades when users have modified bootstrap.cfg Created: 2016/03/16  Updated: 2016/12/13  Resolved: 2016/08/15

Status: Closed
Project: Puppet Server
Component/s: Puppet Server
Affects Version/s: SERVER 2.y
Fix Version/s: SERVER 2.5.0

Type: Epic Priority: Normal
Reporter: James Ralston Assignee: Unassigned
Resolution: Fixed Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
relates to SERVER-1238 Prevent users from modifying webrouti... Closed
relates to SERVER-1240 Establish pattern for deploying non-u... Closed
relates to SERVER-1542 Unable to start puppetserver after up... Resolved
Epic Name: bootstrap.cfg upgrade UX
Sub-team: Emerald, emerald
Epic Status: Done
Release Notes: Bug Fix
Release Notes Summary: Joe will be adding some notes on this to the docs in the repo; we created SERVER-1470 to capture the write-up.


NOTE: repurposing this ticket as an epic to capture smaller chunks of work. Modifying the original description to capture our proposed path forward and some acceptance criteria. Original description follows afterward.

Problem Statement:

OSS users who have modified their bootstrap.cfg file to support a multi-master setup (and thus have disabled the CA on their compile masters) are subject to a painful upgrade experience, because if the new package contains any changes to bootstrap.cfg, then the packaging system can't overlay the new config file due to the user's local changes. If the user accepts the packaged version of the config file, their CA configuration will be broken, and if they opt to keep their local changes then the server may not start due to missing or invalid services that changed in the new package.

The crux of the issue is that bootstrap.cfg currently contains some configuration that we don't expect users to ever change, and some configuration that we do expect them to change. We need a better mechanism for separating those two classes of configuration.

Acceptance Criteria:

  • Users who are not managing / modifying their bootstrap.cfg file have been able to upgrade seamlessly, and the changes that result from this ticket must continue to allow for a seamless upgrade for these users.
  • Users who have been experiencing this issue will have to deal with it one more time when we roll out the changes described by this epic, but in subsequent releases we should be able to deliver a seamless upgrade experience.
  • We should at least look into some modules such as jlambert121-puppet, and see if we can propose a change that will be compatible with both older and newer releases. It seems like we'd need a way to detect the puppetserver version number in order to make that work, so I'm not 100% sure how feasible it is, but we should look into it.

Proposed Solution:

  • Trapperkeeper will be updated to support loading and combining multiple bootstrap .cfg files, including from a conf.d-style directory containing multiple files
  • All of the base services that are required to run Puppet Server, which we do not expect a user to need to modify, will be moved to a puppetserver-bootstrap.cfg file, somewhere in the /opt directory, hidden from the user's normal interaction with configuration files.
  • We will remove the file /etc/puppetlabs/puppetserver/bootstrap.cfg, and in its place we will create a directory, /etc/puppetlabs/puppetserver/services.d, which Puppet Server will use (in addition to the file in /opt) when loading services at boot time. By default this directory will contain a single file, ca.cfg, that will contain the default CA service. If users wish to disable the CA (e.g. for compile masters), they would modify this file but the file in /opt would remain intact.

On the first upgrade after these changes are released, users who have disabled the CA will need to tweak how they are managing their CA settings to be aware of this new ca.cfg file instead of the old bootstrap.cfg file, but in subsequent releases, any packaged changes to the puppetserver-bootstrap.cfg file in /opt should apply cleanly without interfering with any CA customizations.

The TL;DR version: the monolithic bootstrap.cfg file makes Puppet server upgrades a flag-day nightmare to perform for sites that employ more than a single Puppet server.

So here's the problem:

Each Puppet server version has a particular set of services that must be declared in the bootstrap.cfg file, or else the Puppet server will fail to start.

For example, Puppet server 2.3.0 needs this new service:


If this service is not specified in the bootstrap.cfg file, Puppet server 2.3.0 will fail to start.

But Puppet server has no forward compatibility when it comes to services specified in the bootstrap.cfg file, because the Puppet server will also fail to start if a service is specified in the bootstrap.cfg file that it doesn't know how to load.

This places sysadmins in a "damned if you do, damned if you don't" situation when it comes to Puppet server upgrades: if the bootstrap.cfg file is being managed by Puppet, it is impossible to create a single bootstrap.cfg file template that will work across Puppet 2.3.0 and Puppet < 2.3.0 servers: the absence of the versioned-code service definition will cause Puppet server 2.3.0 to die, and the presence of the versioned-code service definition will cause Puppet server < 2.3.0 to die.

The fundamental problem here is that the bootstrap.cfg is erroneously commingling two completely different types of data:

  1. required service definitions that depend on the specific version of the Puppet server installed on the system, and must not be removed or added by the sysadmin
  2. optional service definitions that depend on the intended use of the Puppet server, and are intended to be removed or added by the sysadmin

The versioned-code service definition is an example of #1, but the certificate-authority and certificate-authority-disabled services are examples of #2. And as long as both types of service definitions are commingled in the same file, it is only possible to upgrade that file cleanly (e.g., when a new version of Puppet server changes the required set of service definitions) if the file has never been modified by the sysadmin.

Except, oops: sites that have multiple Puppet servers have no choice but to modify the bootstrap.cfg file on all Puppet servers except the CA server.

Fortunately, the way out of this mess is simple: split the bootstrap.cfg into a directory of individual files, a la the conf.d directory. All *.service files in the subdirectory will be loaded, in lexicographical order. Required, version-specific service definitions can be placed into a single file, while service definitions that a sysadmin may wish to modify should be placed into separate files. E.g.:

  • /etc/puppetlabs/puppetserver/services.d/core.service
  • /etc/puppetlabs/puppetserver/services.d/ca.service

For sites that have never modified the bootstrap.cfg file, this change should be transparent. For sites that have modified the bootstrap.cfg file, every Puppet server upgrade is already a hands-on PITA anyway, so it's not going to be any worse than what we already have to deal with. And after the transition, future upgrades should no longer have this issue.


Comment by Chris Price [ 2016/03/17 ]

James Ralston thanks a ton for the thoughtful write-up.

We were just discussing this issue as we were finalizing the release yesterday, and we all had a sad about the fact that we hadn't come up with a solution for it yet. This ticket will help us light a fire to get it fixed.

Your solution sounds pretty reasonable at first glance. We'll discuss this in one of our triage meetings in the next week or so, and maybe bat around other ideas, but one way or the other I'll try to get some kind of fix scheduled. I'm optimistic we can get something done before our next Y release.

Comment by Andy Fraley [ 2016/03/17 ]

I just ran into this issue when I upgraded some masters. While the main master worked fine, the compile masters suffered from this as I had previously edited bootstrap.cfg. Luckily #puppet was quick to help. Google was of no help as this ticket and the 2.3.0 release notes are not indexed yet.

Comment by Rob Nelson [ 2016/03/17 ]

Glad I'm not the only one caught out by the lack of search indexing!

Same problem here, and the module I use to manage the puppetserver, jlambert121/puppet, does not yet offer a parameter for the bootstrap.cfg template, so I'm stuck on 2.2.1 for now. This looks like a good way forward.

One possible addition: for backup compatibility, look for bootstrap.cfg first, or in addition to the `services.d` files, as this would smooth the upgrade process, similar to the moving `r10k.yaml` and such. Remove that backup location with the release of v3.0.0.

Comment by Chris Price [ 2016/03/18 ]

I"m not sure that looking for the old file would help, would it? Because if there are any changes to the default version of the file in the next release, and you have a modified version of the file, then you'll have the same issue that you had during this upgrade, where we leave your file intact but it's invalid and the server won't start. Correct?

Comment by Chris Price [ 2016/03/18 ]

If we do go a `services.d` route for this, it'd be nice to try to minimize the number of required files in the .d directory. For starters we could maybe get away with just two; a "main" or "base" file that would contain things that we don't expect users to ever change, and a "ca" file which would contain the CA services. AFAIK, CA is the only thing that users are actually changing in the bootstrap.cfg file today. And we don't have any plans to change any names of the CA-related services any time soon; if we ever did we'd probably do that on a major version boundary.


Comment by Jeremy Barlow [ 2016/03/18 ]

I think your approach of splitting into two files where one represents the stack of services that we don't expect to be customized vs. ones that more frequently are seems fine.

Assuming we're going to be making changes around the bootstrap.cfg, I think this might be an opportunity to do what SERVER-16 called for by HOCONifying the bootstrap entries for consistency with the rest of our configuration. Might be nice to have a format where each service entry could have a map rather than just a namespace - e.g., for future metadata that we might want to add in on a per-service basis.

Like we have for other Trapperkeeper config files in general, it might be nice to have the flexibility to combine all of the service entries into one file (like we do today) or split them into as many discrete files in a directory as desired (1 per service, 2 total with 1 having a set of standard infrastructure services vs. 1 having a set of more frequently customizable services).

Comment by Chris Price [ 2016/03/18 ]

Jeremy Barlow - Kevin Corcoran mentioned that same idea and ticket this morning. IMO that is significantly more work and will need some much more thoughtful design consideration, and will need to happen on a major version boundary. I think the path described in this ticket is a tractable shorter-term strategy that won't require nearly as much work, hopefully won't require a major version bump, and will eliminate a lot of the pain that users are experiencing on upgrades today. So in my mind these are two separate features / changes to pursue.

Comment by James Ralston [ 2016/03/18 ]

Rob Nelson: what Chris Price said. There's no point in keeping the bootstrap.cfg file: if the sysadmin never modified it, then it can be replaced transparently with the services.d directory structure at package upgrade. But if the bootstrap.cfg file has been modified, then the sysadmin is well aware that upgrades have to be performed manually, so migrating the changes to bootstrap.cfg to the new services.d structure isn't really an additional burden.

Chris Price: I agree that it would be simpler to place service definitions that shouldn't be modified in the same file. Going with a one-service-per-file approach would just clutter the services.d directory with a bunch of files that shouldn't be modified.

Also, this is probably a dumb question, but I'll go ahead and ask it: why is it even necessary to specify mandatory service definitions?

Metaphorically speaking, this seems like buying a car that has a toggle switch on the dashboard to enable/disable the spark plugs, and the owner's manual says something like, "This switch controls whether the spark plugs are enabled. The spark plugs must always be enabled. Always leave this switch in the On position."

My reaction would be, "If the spark plugs must always be enabled, then why is the user given the option to disable them?"

If the response is something along the lines of, "Because of the way Java/Trapperkeeper/Clojure work, the service definitions have to be placed someplace," then my retort is, "Fine. Stuff the required service definitions in a file under /opt/puppetlabs/server someplace, so it'll be clear that they're not files that should ever be edited. Then reduce the bootstrap.cfg file to only service definitions that need to be customized. And furthermore, don't treat it as a fatal error if an unknown service is seen in the bootstrap.cfg file; just emit a warning, and continue the startup procedure."

But this seems too simple, so I'm guessing there's a good reason why this approach wasn't taken…

Comment by Chris Price [ 2016/03/18 ]

James Ralston w/rt to hiding the settings that we don't intend for the user to modify off in /opt somewhere, we totally agree, and we have a ticket for this. It's just a feature that needs to be added to Trapperkeeper and our packaging system, and we haven't had the bandwidth to prioritize that over feature work so far. I'll find the ticket and link it.

We want to take a holistic approach to doing that, and do it for the `conf.d` files as well. It needs some careful thought and design work and will probably have to land in a major version boundary, but it is definitely where we'd like to head eventually.

w/rt ignoring unknown services, that is worth considering, but wouldn't have solved the issue for the 2.3.0 upgrade since the problem was a missing service rather than an unknown one. Will still keep that suggestion in mind, though.

Comment by Chris Price [ 2016/03/18 ]

James Ralston see UXD-782 and/or UXD-781.

Comment by James Ralston [ 2016/03/18 ]

Chris Price, I don't have permission to access either UDX-782 or UXD-781, but I'll take your word for it.

Also, ignoring unknown services would still be a benefit. Here's why: we run a beta/test Puppet server that always automatically updates to the latest available puppetserver package. When it blows up, that's typically how we know puppetserver has rolled to a new minor revision, and we go inspect it to see what changes are required to get it working again.

Therefore, before we update our production Puppet servers, we already know in advance what new services will need to be defined. But as long as Puppet blows up on services it doesn't know about, even though we know exactly what needs to be added to make the new Puppet server version happy, we can't actually add the new service definitions in advance.

If Puppet server simply ignored services it didn't recognize, then we wouldn't have to hand-nurse every single Puppet server upgrade: once we used the beta server to figure out what new services are needed, we could add the new service definitions, and then allow the Puppet server package updates to stage and apply automatically on the production Puppet servers.

So, in our book, only having to update the beta server manually would be a significant improvement from where we are now, which is having to update every Puppet server manually, and disable management of the bootstrap.cfg file until every single Puppet server is running the same version again.

Comment by Chris Price [ 2016/03/18 ]

Well, crud. Lori Landesman it might be useful for those tickets to be publicly visible; I didn't realize that the UX project was private. Any ideas what the best way to handle that would be? We could move the tickets to TK?

James Ralston thanks for the additional context, let me mull that over and I'll add some more thoughts next week.

Comment by Chris Price [ 2016/03/21 ]

James Ralston just now had time to read through your last comments. That makes a lot of sense. We are going to have a design meeting in the next day or so to discuss next steps on this, I'll make sure that we discuss your suggestion of ignoring unrecognized services.

Comment by Chris Price [ 2016/03/22 ]

James Ralston we had a meeting to scope some work related to this today. I'm going to convert this ticket into an epic that we can use to break the work into smaller tickets.

The tl;dr is that we are leaning toward doing a services.d-style thing, as you've suggested, but the 'main' or 'base' .cfg file that we don't expect users to edit will probably be packaged in /opt somewhere instead of in the new services.d directory under `/etc`. Assuming this path, we talked through the workflow that you described about ignoring invalid service names, and it seems like your particular problem would be solved without us needing to tolerate invalid service names. If a new service lands and shows up on your canary server, as long as this new service name only exists in the 'main' or 'base' cfg file, and assuming that you haven't modified that file, then the production masters will get the new file with all of the necessary services as part of their package upgrade - correct? Am I missing anything?

Comment by Alex Dreyer [ 2016/03/22 ]

I'm concerned about adding more complexity to bootstraps without fixing TK-211.

Comment by Ruth Linehan [ 2016/03/22 ]

Alex Dreyer I don't think all the jira-ing has been completed yet, but when we talked about this earlier today that was one of the tickets that we thought we probably needed to do along with this work. I definitely agree with you that TK-211 seems pretty necessary to fix if we're going to add more complexity.

Comment by Chris Price [ 2016/03/22 ]

Agree, I will pull TK-211 into the epic right now.

Comment by James Ralston [ 2016/03/23 ]

Chris Price, while I agree that isolating services that users shouldn't edit into a separate file is a good idea, dying on unknown service definitions still breaks forward compatibility.

Here's an example. Let's say another service along the lines of the ca service is added, where there are two different service definition options, and one (and only one) must be defined. We test this on our beta Puppet server, and determine how we want to deploy this setting to the rest of our Puppet servers.

But now we are trapped in the exact same situation as before: if we push out the new service.d file to Puppet servers before we upgrade the puppetserver package, we will break them, because they will choke and die on the unrecognized service (whichever definition we chose for that particular server). And if we upgrade the puppetserver packages before we push out the new service.d file, we also break them, because without the new service.d file, the updated Puppet servers will fail to start the default service.d file that the new puppetserver package will drop may (from our perspective) misconfigure the server, which is often worse than just outright breaking it.

I can't help but think there's a certain irony here: Puppet is an amazing tool to make system administrators' lives easier, but the Puppet server itself is—at least to a degree—special snowflake software that demands hand-holding for updates. And the Puppet server's stubborn refusal to admit that its future self might know how to support services that it doesn't recognize is the #1 reason why.

Throw a huge warning if you want. Print all sorts of nasty messages in the log. Hell, cause the Puppet agents emit a warning on every catalog compile. But please, pretty please, don't die if the Puppet server encounters an unknown service at startup. The only situation related to services in which the Puppet server should outright die on startup is when a critical service doesn't exist—meaning, there is no way that the server can function without the missing service. For any other condition, obey the robustness principle and carry on.

Otherwise, sooner or later, you're going to force Puppet admins into yet another round of kabuki-dance Puppet server upgrades. Because the paradox of forward compatibility is by the time you realize you need it—it's too late.

Comment by Chris Price [ 2016/03/28 ]

James Ralston thanks for the additional example. That helps. From a technical perspective the change you're suggesting is not difficult, so we'll talk through it again and we can probably go ahead and add it unless anyone comes up with an example of a problem it might cause.

Comment by Kevin Corcoran [ 2016/03/31 ]

a problem it might cause

... would be a user who typos a service name. If that service is not required to satisfy a dependency, Trapperkeeper will just go ahead and start up, and I can imagine it being quite difficult in certain cases to figure out why something wasn't working (like, say, an HTTP endpoint being unavailable) and tracing that all the way back to the fact that the service name had a typo in it.

Unfortunately, I don't have an alternative solution to offer at this point in time, but considering Trapperkeeper as a standalone library, the proposed change seems like a bad one to me.

Comment by Ruth Linehan [ 2016/03/31 ]

... would be a user who typos a service name. If that service is not required to satisfy a dependency, Trapperkeeper will just go ahead and start up, and I can imagine it being quite difficult in certain cases to figure out why something wasn't working (like, say, an HTTP endpoint being unavailable) and tracing that all the way back to the fact that the service name had a typo in it.

Unfortunately, I don't have an alternative solution to offer at this point in time, but considering Trapperkeeper as a standalone library, the proposed change seems like a bad one to me.

I think this could be remediated though by a warning being logged, and I think that's a better scenario than what we currently have. If we think about how Trapperkeeper could potentially be used as a framework, there's several use cases for when someone might be editing a bootstrap.cfg:

1. developing an application
a. developing an application, want to add a new trapperkeeper service that is not a dependency and typo it. This is the scenario that I think fits with what you're referring to. In this case, not having TK error out might make things a bit more difficult, but I think logging a warning would make this less of a problem. The user is most likely either running this from a repl or with `lein run` and the logs are just output to STDOUT so they would actually see this warning.

b. developing an application, adding a trapperkeeper service that is a dependency and typoing it. In this case, TK will error out anyway, because there is a dependency not being fulfilled.

2. deploying a trapperkeeper application - I think in most cases where a trapperkeeper application is deployed, it's going to have pre-written bootstrap.cfgs specifying all the dependencies. The one case where this might not be true is multiple implementations of a trapperkeeper service that the user can decide on, and they have already done so before upgrading. This is the scenario that James Ralston is talking about. Other tickets in this epic are dedicated to helping this situation some, but I do think that allowing users to specify a service that didn't exist before without having it error provides an important fix in this workflow.

Some other scenarios: If a user is choosing between two service implementations and typos one of them but also leaves the other in (which isn't the one they wanted... and yet is for some reason left in), then they won't get an error, but they would get a warning in the log. Otherwise, if they typo one of them and comment the other out, they will get an error because the dependency isn't fulfilled. If they keep both in, the service will error once we have fixed TK-211.

Are there other scenarios I'm not thinking of?

Comment by Chris Price [ 2016/03/31 ]

a. developing an application, want to add a new trapperkeeper service that is not a dependency and typo it. This is the scenario that I think fits with what you're referring to. In this case, not having TK error out might make things a bit more difficult, but I think logging a warning would make this less of a problem. The user is most likely either running this from a repl or with `lein run` and the logs are just output to STDOUT so they would actually see this warning.

This is the only one of these that really still gives me pause, but after James Ralston's most recent comments I've come around to agreeing that this scenario (with error logged, at least) is a justifiable tradeoff to achieve the forward-compatibility story that he describes.

Meta-note: if we want to continue the discussion on whether or not to do TK-349, we should probably do so on TK-349. For now, I'm operating on the assumption that we will do that work as part of this epic.

Comment by Karen Van der Veer [ 2016/06/20 ]

QA is waiting for the next release so they can view the new behavior.

Comment by James Ralston [ 2016/08/12 ]

I just kicked the tires on this in 2.5.0 While the update to 2.5.0 will still require manual intervention for sites that have modified the CA services, future upgrades should go a lot smoother, thanks to these changes.

Thanks, all!

Comment by Chris Price [ 2016/08/12 ]

James Ralston np - thanks for the original suggestion and for helping us sort through options for a long-term solution!

Generated at Mon Dec 09 14:04:41 PST 2019 using JIRA 7.7.1#77002-sha1:e75ca93d5574d9409c0630b81c894d9065296414.