[TK-426] Investigate apparent breakage of "scan" for logback configuration Created: 2016/12/14  Updated: 2018/09/17  Resolved: 2018/09/17

Status: Closed
Project: Trapperkeeper
Component/s: None
Affects Version/s: None
Fix Version/s: SERVER 6.0.0

Type: Bug Priority: Normal
Reporter: Jeremy Barlow Assignee: Charlie Sharpsteen
Resolution: Fixed Votes: 0
Labels: cspapercuts, usability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
relates to SERVER-2206 Improve default logging configuration... Closed
relates to TK-424 Reinitialize logback during reload Accepted
relates to DOCUMENT-642 Docs on logging say I don't need to r... Closed
relates to PDB-3884 Improve default logging configuration... Closed
Template:
Story Points: 5
CS Priority: Normal
CS Frequency: 4 - 50-90% of Customers
CS Severity: 2 - Annoyance
CS Business Value: 4 - $$$$$
CS Impact: Support interactions would be much smoother if we just set this to scan every 60 seconds. It would remove the need to ask customers to restart puppetserver during troubleshooting which can be lengthy at many customers.
QA Risk Assessment: Needs Assessment

 Description   

In the logback.xml configuration files for various projects, we've been setting the "scan" attribute to "true" in hopes of having logback automatically notice / reload configuration file changes on the fly. We had observed that it was working for Trapperkeeper-based applications at one point in time but it no longer appears to be working, most recently tested with logback-classic version 1.1.7. We should investigate this further to see if there's something we need to change in Trapperkeeper to continue supporting the feature and/or if there's a bug in logback itself.



 Comments   
Comment by Jeremy Barlow [ 2016/12/20 ]

I also tested this with the most recent logback version, 1.1.8, built into puppetserver package for CentOS 7. I saw the same problem where changes I made to the logback.xml file did not take effect until after I restarted the full puppetserver process. I had triggered dozens of actions which log messages over several minutes time without seeing the logback scan reconfiguration kick in.

Comment by Justin Stoller [ 2017/01/30 ]

One thing I noticed when reviewing logback's changelog is that 1.1.9 increased the threadpool size from 2 to 8 after they saw some long running processes could exhaust their threadpool of 2 easily and cause other services to never run. I'm not familiar with how we use logback in tk or how slf4j uses it, but it's possible that we're hitting this issue and that upgrading to 1.1.9 will help.

I have a PR to bump to logback to 1.1.9 here: https://github.com/puppetlabs/clj-parent/pull/30
Logback changelog: https://logback.qos.ch/news.html
Logback ticket in question: https://jira.qos.ch/browse/LOGBACK-1238

Comment by Jeremy Barlow [ 2017/03/01 ]

I played around with a mergely build off of the puppetserver master branch, 2.7.2.master.SNAPSHOT.2017.03.01T1110, which had been bumped up to logback 1.1.9. On CentOS 7 at least, it still doesn't appear that the "scan" element in the logback configuration is allowing for configuration changes to be dynamically reloaded. I started the puppetserver service, bumped log levels in the logback configuration up to debug, and did a number of puppet agent runs. Even after several hundred agent runs - producing a half-dozen log messages each - and several minutes, no debug-level log messages were written to the log. Only after a full restart of the puppetserver service did debug-level log messages start being written to the log.

There have been 3 more releases of logback since the 1.1.9 bump - 1.1.10, 1.2.0, and 1.2.1. It would be good to try with the later versions to see if scan started working again but I don't see anything obvious in the logback release notes which highlights a fix related to this issue.

Comment by Adam Bottchen [ 2017/03/02 ]

Joe Pinsonault Excellent find. If we could get that scanPeriod option added to logback.xml for the next LTS z, that would work for me. I don't know what the default used to be, but 60 seconds seems a bit long to me. Is there a lot of overhead with having that file scanned frequently?

Comment by Garrett Guillotte [ 2017/03/02 ]

Joe Pinsonault Should I add this workaround to the live docs and close out DOCUMENT-642? Seems like a near-term workaround would unblock it.

Comment by Joe Pinsonault [ 2017/03/03 ]

Adam Bottchen The default is 60 seconds according to the documentation. As for the performance, it's a little complicated/confusing. Those docs mention that there could be some performance impact to frequent scanning, and they talk about some hoops they have to jump through to keep the scan feature from bogging down applications.

I would hesitate to recommend lowering it down very much if there's not a compelling reason, but I also have very little idea of what the performance impact would actually be

Garrett Guillotte Yeah, this workaround should work for that ticket

Comment by Charlie Sharpsteen [ 2017/11/07 ]

I think we should just put "60 seconds" in the configuration. It fixes this issue and makes the existance of the setting visible to users. Explicit is better than implicit in this case.

Comment by Justin Stoller [ 2017/11/07 ]

We've added the documentation to puppetserver. We should validate that this works as expected and if so update the default values in our logback.xml and resolve this, I think.

Comment by Charlie Sharpsteen [ 2017/11/07 ]

It'd need to be a mass update to logback.xml in all PE and Open Source projects — do we have a centralized list of where all these config files are? Last time I looked, some were in resources/ directories, others in ezbake directories and still others in pe-specific packaging.

Comment by Justin Stoller [ 2017/11/07 ]

I don't know of a canonical list.

Comment by Maggie Dreyer [ 2018/02/13 ]

This seems like this just needs a little follow-up on which spots need to be updated. Charlie Sharpsteen do you have any update on that, or should we look into it?

Comment by Charlie Sharpsteen [ 2018/02/13 ]

This should be a trivial change to the logback.xml files — main blocker is finding an authoritative list of where all of those live and making sure that pe- variants of things like Puppet Server and PuppetDB are updated along with PE services like orchestration and console.

Another one that would be good to knock out at the same time is PE-13490, which asks for time zones to be added to our Java logs. Currently everything is timestamped with server local time but know time zone — which makes logs hard to interpret without supporting information that provides the offset from UTC.

Comment by Lizzi Lindboe [ 2018/03/29 ]

Charlie Sharpsteen made PRs to fix this issue, marking as resolved.

Comment by Charlie Sharpsteen [ 2018/03/29 ]

Re-opening for the time being since bits have landed in pe-puppetserver, but I haven't gotten around to filing PRs for puppetserver yet.

Comment by Charlie Sharpsteen [ 2018/05/14 ]

PRs up for Puppet Server 6 and PuppetDB 6.

Comment by Charlie Sharpsteen [ 2018/09/17 ]

Closing as fixed since PRs for Puppet Server and PuppetDB are merged.

Comment by Charlie Sharpsteen [ 2018/09/17 ]

Also, for posterity and the benefit of future generations, the "authoritative list" of where these files live in the various and sundry repositories that combine into PE and Open Source packages boils down to:

  • Read the EZBake docs and source code. Figure out how it's config merging works.
Generated at Thu Nov 21 02:39:36 PST 2019 using JIRA 7.7.1#77002-sha1:e75ca93d5574d9409c0630b81c894d9065296414.