[TK-149] Runtime refresh of Jetty CRL (jetty 1.8) Created: 2014/05/06  Updated: 2018/03/30  Resolved: 2017/08/23

Status: Resolved
Project: Trapperkeeper
Component/s: None
Affects Version/s: None
Fix Version/s: TK-JETTY9 1.8.1

Type: New Feature Priority: Major
Reporter: Jeremy Barlow Assignee: Unassigned
Resolution: Fixed Votes: 2
Labels: AWS1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Blocks
blocks ENTERPRISE-9 Puppet Cert Revoke requires pe-http Ready for Engineering
blocks SERVER-373 Quick method needed to reload the CRL. Closed
blocks SERVER-1866 Copy cacrl to hostcrl file immediatel... Resolved
is blocked by SERVER-975 Look into using OCSP for cert revocat... Accepted
is blocked by SERVER-1916 In 2.x only do CRL reload if FS watch... Resolved
Relates
relates to TK-451 Support runtime refresh of Jetty CRL ... Closed
relates to TK-448 Support non-recursive directory watch... Resolved
relates to SERVER-15 The puppet server should not die when... Closed
Template:
Epic Link: No downtime decommissioning of nodes
Team: Froyo
Story Points: 3
Sprint: Server 2017-07-25, Platform Core 2017-08-08, Platform Core 2017-08-22
Release Notes: New Feature
QA Contact: Erik Dasher
QA Risk Assessment: Manual
QA Risk Assessment Reason: Fairly large change, but well covered with Clojure integration tests, as documented. Test once manually with multiple agents.

 Description   

The CRL option which can be set on the SslContextFactory in Jetty (see PE-3914) by default only allows the CRL to be established at Jetty server start time. For backward compatibility with the CRL behavior in Puppet Server, loading the CRL only at service start time may be sufficient. However, we may want to improve upon this by allowing updates to the CRL to take effect in the Jetty server right away – without needing to restart Puppet Server and its underlying Jetty. Runtime reloading of the CRL could be desirable for other components built on the trapperkeeper-webserver-jetty9 service as well, including PuppetDB and PE Console Services.

Ticket scope

Here is a proposal for the scope of this ticket (a rough wiring sketch follows the list):

1) Investigate approaches for safely and concurrently swapping out the SslContextFactory in the Jetty webserver without needing to restart a currently running Jetty ServerConnector.

2) Use the trapperkeeper-filesystem-watcher service to register a watcher for the CRL file (and, optionally, the keystore, truststore, and associated parameters which map to files on disk).

3) On receipt of a filesystem-watcher notification for a changed file, invoke some functionality to refresh Jetty's SslContextFactory instance(s) for the appropriate Server / connectors (without the need for a server restart).
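
To make steps 2 and 3 concrete, here is a rough wiring sketch. It assumes a trapperkeeper service that depends on the filesystem watch service; the protocol functions (create-watcher, add-watch-dir!, add-callback!) and the event-map shape are assumptions about trapperkeeper-filesystem-watcher's API rather than details taken from this ticket, and reload-ssl-context-factory!, crl-dir, and crl-path are hypothetical placeholders:

(require '[puppetlabs.trapperkeeper.core :refer [defservice]]
         ;; protocol namespace assumed; verify against trapperkeeper-filesystem-watcher
         '[puppetlabs.trapperkeeper.services.protocols.filesystem-watch-service
           :as watch])

(defservice crl-watcher-service
  [[:FilesystemWatchService create-watcher]]
  (start [this context]
    ;; The watch service operates on directories, so watch the directory
    ;; containing the CRL and filter events down to the CRL file itself.
    (let [watcher (create-watcher)]
      (watch/add-watch-dir! watcher crl-dir {:recursive false})
      (watch/add-callback! watcher
        (fn [events]
          ;; :changed-path is a guess at the event-map key
          (when (some #(= (str (:changed-path %)) crl-path) events)
            (reload-ssl-context-factory!))))
      context)))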

Additional Jetty background

The entries in the CRL file are only processed by Jetty when doStart() is called on the SslContextFactory. From that point on, the factory references the CRL entries from memory.

I was able to force Jetty to reload the CRL entries from the file on disk by having the Jetty webserver service create a custom, derived SslContextFactory which overrides newSSLEngine, a method that Jetty calls when handling a new client connection. Here is the implementation I used:

(import '(java.net InetSocketAddress)
        '(org.eclipse.jetty.util.ssl SslContextFactory))

(defn- ssl-context-factory-with-crl-refresh []
  (proxy [SslContextFactory] []
    (newSSLEngine
      ([] (proxy-super newSSLEngine))
      ([^InetSocketAddress address]
        ;; stop/start re-runs doStart(), which re-reads the CRL from disk
        (.stop this)
        (.start this)
        (proxy-super newSSLEngine address)))))


The derived class is used in place of the stock Java SslContextFactory by changing the following binding in jetty9-core's ssl-context-factory function:

  [context (ssl-context-factory-with-crl-refresh)]

For the overload of newSSLEngine() which takes an address parameter, the underlying SslContextFactory is stopped and restarted. This causes the doStart() method on the class to run again, loading entries from the CRL file on disk back into memory and attaching them to a new SslContext object.

This is a naive implementation which would recreate the SslContext for every connection the server services; its performance at scale would likely be unacceptable. To optimize it, it may make sense to create a file watcher around the CRL file and only regenerate the SslContext when the content of the file has actually changed (a sketch of that approach follows). If this were done asynchronously from request handling on the server side, the impact to clients could be significantly reduced.
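
A minimal sketch of such a watcher using plain java.nio interop (the name watch-crl-file! is introduced here for illustration, not taken from the ticket). It takes the actual reload as a function argument; a matching reload function is sketched after the next paragraph:

(import '(java.nio.file FileSystems Paths
                        StandardWatchEventKinds WatchEvent$Kind))

(defn watch-crl-file!
  "Block in a loop, invoking reload-fn whenever the CRL file is modified.
   Intended to run on its own thread, asynchronously from request handling."
  [crl-path reload-fn]
  (let [path    (Paths/get crl-path (into-array String []))
        dir     (.getParent path)
        service (.newWatchService (FileSystems/getDefault))]
    ;; WatchService watches directories, so register the CRL's parent dir
    (.register dir service
               (into-array WatchEvent$Kind
                           [StandardWatchEventKinds/ENTRY_MODIFY]))
    (loop []
      (let [wkey (.take service)] ; blocks until an event arrives
        (doseq [event (.pollEvents wkey)]
          ;; only react when the changed file is the CRL itself
          (when (= (.context event) (.getFileName path))
            (reload-fn)))
        (.reset wkey))
      (recur))))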

Additionally, any solution in this area would need to handle concurrency, which the example above does not. The Jetty server may be servicing several requests simultaneously, so any changes made to the SslContext would need to be protected for concurrent access to avoid transient problems in connection handling – e.g., a connection failing or (even worse) an unauthorized connection succeeding because the CRL hasn't been fully reattached to the context before a client request is validated against it. A sketch of a serialized reload follows.
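
For example (a sketch only; reload-crl! and reload-lock are names introduced here, not from the ticket), reloads could at least be serialized so two triggers can never interleave stop/start cycles. Note that this alone does not protect handshakes already in flight on the factory; later Jetty releases (9.4+) ship an SslContextFactory.reload method that swaps the underlying SSLContext atomically, which would be worth evaluating where available:

(import '(org.eclipse.jetty.util.ssl SslContextFactory))

(def ^:private reload-lock (Object.))

(defn reload-crl!
  "Reload the CRL by restarting the factory, serialized by reload-lock."
  [^SslContextFactory factory]
  (locking reload-lock
    (.stop factory)     ; discards the in-memory SslContext
    (.start factory)))  ; doStart() re-reads the CRL file from disk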



 Comments   
Comment by Lindsey Smith [ 2015/02/17 ]

Additional commentary from Anna:

For me, the length of time the master takes to restart isn't the primary issue - it's that I'm trying to coordinate node life cycles from a remote orchestrator tool that won't be granted the privileges needed to perform such a restart.

Comment by Lindsey Smith [ 2015/02/17 ]

Given the comment above, would it make sense to put the refresh on a configurable timer? Would that sidestep concurrency issues?

Comment by Jeremy Barlow [ 2015/02/18 ]

Lindsey Smith The concurrency issues I was referring to above were around how the SSLContext that Jetty uses to enforce the CRL is managed. We'd need to be able to update it at runtime without adversely affecting Jetty's use of it during the processing of incoming connections. This concurrency issue would exist regardless of what trigger we'd use to update it. It should be relatively straightforward to handle the concurrency; I just hadn't put any time into it when looking at this many months back.

I think the goal should be for Jetty to start using the new CRL to validate incoming connections as soon as possible after the CRL is updated - e.g., with a newly revoked serial number. One possibility would be to just register a watcher on the CRL file and, any time it changes, reload the CRL with Jetty. The watcher approach makes me a little nervous from a timing perspective - e.g., we could start trying to reload the CRL while the file is still only partially written. I'd prefer to update Jetty after the user modifies the CRL through known actions - e.g., after a revoke is performed by the console through the certificate_status API. It would be harder to implement such a hook behind the Ruby-based cert revoke CLI tool, but this could get easier once we get traction on converting these tools over to Clojure. To make the update more generically re-usable by other components built on Jetty, like PuppetDB and PE Console Services, it may make sense to just go with the file watcher approach: build in some robustness for retrying the CRL load in the event of temporary failures around partially written files (a retry sketch follows), and supplement with a separate webservice protocol function which would allow dependent services to explicitly request a reload as needed.
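
The retry idea could look something like this minimal sketch (reload-fn is whatever performs the actual CRL reload; the names and the 500 ms backoff are illustrative, not from the ticket):

(defn reload-crl-with-retries!
  "Call reload-fn, retrying on failure on the assumption that a parse
   error may mean the CRL file was only partially written."
  [reload-fn max-attempts]
  (loop [attempt 1]
    (let [result (try
                   (reload-fn)
                   ::ok
                   (catch Exception e
                     (if (< attempt max-attempts)
                       ::retry
                       (throw e))))]
      (when (= result ::retry)
        (Thread/sleep 500) ; give the writer time to finish the file
        (recur (inc attempt))))))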

Comment by Lindsey Smith [ 2015/02/18 ]

Jeremy Barlow thanks for the additional context.

No problem with this being public.

w/r/t knowing when to reload the CRL, I suspect that in an environment with lots of ephemeral nodes you would get bursts of revocations, arriving one at a time as delete actions are performed serially, which could make watching the CRL file problematic.

Comment by Karen Van der Veer [ 2017/07/14 ]

Past Haus Can you review these?

Comment by Past Haus [ 2017/07/14 ]

Karen Van der Veer Yeah, I can definitely run with these. Should I also be updating the PRs as needed, or is there someone else who will be on that?

Comment by Erik Dasher [ 2017/07/18 ]

Joel Allen IMHO, the right way for QA to test this (should QA conclude this is a high risk ticket) is to create a beaker test that adds an agent's cert to the CRL, then tries an agent run. Expected result = agent run fails with a code and message that indicate the certificate is no longer valid.

I kind of thought we'd already have a test that does this, but with a puppet server restart after the CRL is incremented. Maybe we are doing this as a clojure integration test; if so, there's no need for QA to get involved with a beaker test.

Comment by Past Haus [ 2017/07/18 ]

Erik Dasher There is already a clojure test in both the puppetserver PR and the tk jetty PR that does exactly that. I'd prefer to not duplicate this test in beaker.

The puppetserver PR also adjusts one of the existing beaker tests to expect the CRL refresh to not require a service restart/reload.

Comment by Erik Dasher [ 2017/07/18 ]

Found clojure integration test that covers this modification:
https://github.com/puppetlabs/trapperkeeper-webserver-jetty9/pull/177/files#diff-a1d74a2a9e21cffe9067894637a1fbe0R379

Discussed briefly with @haus on HipChat in the "Enterprise in the Cloud etc" room. We think this is medium risk and that QA should have a manual test, once, to validate behavior with multiple agents.

Comment by Moses Mendoza [ 2017/07/18 ]

merged to trapperkeeper-webserver-jetty9/1.x at https://github.com/puppetlabs/trapperkeeper-webserver-jetty9/commit/a9aaf57122cfdc905a9c7023a0af30b9c980f249

Comment by Moses Mendoza [ 2017/07/26 ]

The unit test added with the puppetserver 2.x PR https://github.com/puppetlabs/puppetserver/pull/1461/files fails on JDK 7 on Ubuntu 14.04. Need to assess impact / support for JDK 7.

The specific failure is:

     ERROR in (crl-reloaded-without-server-restart) (FileDispatcherImpl.java:-2)
node request after revocation fails
expected: (loop [times 30] (println "Errored at least once") (cond (ssl-exception-for-request?) true (zero? times) false :else (do (Thread/sleep 500) (recur (dec times)))))
  actual: java.io.IOException: Connection reset by peer
 at sun.nio.ch.FileDispatcherImpl.read0 (FileDispatcherImpl.java:-2)
    sun.nio.ch.SocketDispatcher.read (SocketDispatcher.java:39)
    sun.nio.ch.IOUtil.readIntoNativeBuffer (IOUtil.java:223)
    sun.nio.ch.IOUtil.read (IOUtil.java:197)
    sun.nio.ch.SocketChannelImpl.read (SocketChannelImpl.java:384)
    org.apache.http.nio.reactor.ssl.SSLIOSession.receiveEncryptedData (SSLIOSession.java:449)
    org.apache.http.nio.reactor.ssl.SSLIOSession.isAppInputReady (SSLIOSession.java:503)
    org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady (AbstractIODispatch.java:122)
    org.apache.http.impl.nio.reactor.BaseIOReactor.readable (BaseIOReactor.java:164)
    org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent (AbstractIOReactor.java:339)
    org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents (AbstractIOReactor.java:317)
    org.apache.http.impl.nio.reactor.AbstractIOReactor.execute (AbstractIOReactor.java:278)
    org.apache.http.impl.nio.reactor.BaseIOReactor.execute (BaseIOReactor.java:106)
    org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run (AbstractMultiworkerIOReactor.java:590)
    java.lang.Thread.run (Thread.java:745)

Comment by Moses Mendoza [ 2017/07/26 ]

I'm wondering if this might be something about the libraries or implementation that varies between JDK 7 and 8, or between the platforms...

Note that we expect an exception to occur in the test - specifically, a ConnectionClosedException: https://github.com/puppetlabs/puppetserver/blob/2.x/test/integration/puppetlabs/services/certificate_authority/certificate_authority_int_test.clj#L330

The full trace of that exception, when we get it, is:

caught exception org.apache.http.ConnectionClosedException: Connection closed
org.apache.http.ConnectionClosedException: Connection closed
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.endOfInput(HttpAsyncRequestExecutor.java:341)
	at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:263)
	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
	at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:123)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:164)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:339)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:317)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:278)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:106)
	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:590)
	at java.lang.Thread.run(Thread.java:745)

Note that the variation in the traces begins at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:123).

Could our expected exception be too strict given the circumstances? Past Haus Justin Stoller

Comment by Moses Mendoza [ 2017/08/07 ]

The outcome from discussing the failures associated with Java 7:

  • The library used by the file system watch service is known to be buggy on Java 7
  • Open source Puppet Server 2.x supports Java 7 and 8, while PE ships with Java 8 only, so this is less of an issue there
  • To avoid possibly destabilizing the puppet server 2.x series so late in its life, we will make the file system watch service optional and not include it by default
  • In PE, where we ship Java 8, we will enable the watch service by default
  • Users of FOSS 2.x who want the CRL reload behavior and are on Java 8 can enable it by uncommenting a line in their bootstrap.cfg (see the sketch below)
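
For illustration, enabling it means making sure the watch service's entry is present (uncommented) in bootstrap.cfg. The service namespace below is an assumption about where the filesystem watch service lives, not copied from this ticket, so verify it against the shipped bootstrap.cfg:

# in bootstrap.cfg (service namespace assumed; verify against the shipped file)
puppetlabs.trapperkeeper.services.watcher.filesystem-watch-service/filesystem-watch-service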

A PR has been raised to puppetserver/2.x to make the reload behavior optional and disabled by default:
https://github.com/puppetlabs/puppetserver/pull/1475

A separate ticket in PE has been filed to enable the watch service by default in PE: PE-21894

Comment by Moses Mendoza [ 2017/08/07 ]

Separately from the test failure noted above, we also had additional failures in the pe-puppet-server-extensions acceptance suite, wherein during the lifecycle of those tests the puppetserver service would unexpectedly fall over.

We ultimately determined that this was due to the creation of an unreadable root-owned directory in the ssl dir that we were watching, because the watch service was attempting to traverse the new directory to recursively watch new files.

This means TK-448, which was originally thought to be a "nice-to-have", is actually a blocker for this functionality as well. A separate PR has been raised to support non-recursive watching, here:
https://github.com/puppetlabs/trapperkeeper-filesystem-watcher/pull/17

An additional PR is open against trapperkeeper-webserver-jetty9 to use the non-recursive form of the watch service, here: https://github.com/puppetlabs/trapperkeeper-webserver-jetty9/pull/182.

An additional PR is open against puppetserver/2.x to use the non-recursive form of the watch service, here: https://github.com/puppetlabs/puppetserver/pull/1476

Comment by Moses Mendoza [ 2017/08/18 ]

We again encountered unexpected issues in CI: the pe-pse file-sync acceptance suite is failing. It appears that the way we set up compile masters in those tests results in an ssl dir owned by root/root and not traversable by other users. Somewhere in all the implicit management of the ssl dir, we end up trying to start pe-puppetserver without the dir being properly owned by pe-puppet, which results in a failure to start. We addressed this issue previously in the pe_install module by explicitly managing the ownership of the ssl dir before the pe-puppetserver service started.

Link: https://cinext-jenkinsmaster-enterprise-prod-1.delivery.puppetlabs.net/view/pe-puppet-server-extensions/job/enterprise_pe-puppet-server-extensions_integration-system_file-sync-glisan/28/LAYOUT=redhat7-64mdcla-64compile_master.fa-64compile_master.fa-64a-64a,LDAP_TYPE=default,SAUCE=default,UNEEDED=UNEEDED,label=beaker/consoleFull

The actual failures appear to be caused by the fact that after installing the test version of pe-puppetserver, the pe-puppetserver service fails to start: the same access denied error as before with the $ssldir/ca/private dir, except this time it's actually $ssldir itself that's raising the error.

Possible options:

  • manage the ssl dir in the master profile of the puppet_enterprise module in the same way we do during pe_install
  • just chown the directory in the pe-puppetserver post-install scriptlet
  • manipulate the test setup somehow to ensure that puppet has reset the directory permissions before we try to start the service (assuming this is safe because we're in a very contrived case here that a real user wouldn't hit)

Comment by Moses Mendoza [ 2017/08/19 ]

A new PR is up that causes us to chown the directory in the post of the pe-puppetserver install.

https://github.com/puppetlabs/pe-puppetserver/pull/215

Comment by Moses Mendoza [ 2017/08/23 ]

released in tk-jetty9 1.8.1
