[PCP-833] pxp-agent init script will start more than one process if pidfile is missing Created: 2018/03/01  Updated: 2019/12/05  Resolved: 2019/01/24

Status: Closed
Project: Puppet Communications Protocol
Component/s: None
Affects Version/s: None
Fix Version/s: pxp-agent 1.9.7, pxp-agent 1.10.5, pxp-agent 1.11.0

Type: Bug Priority: Normal
Reporter: Erik Hansen Assignee: Lucy Wyman
Resolution: Fixed Votes: 0
Labels: resolved-issue-added
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
relates to PCP-862 pxp-agent connection flapping exhaust... Resolved
relates to MODULES-8533 Windows upgrades are failing due to p... Closed
Template:
Team: Bolt
Sprint: Bolt Kanban
Method Found: Customer Feedback
CS Priority: Normal
CS Frequency: 1 - 1-5% of Customers
CS Severity: 2 - Annoyance
CS Business Value: 4 - $$$$$
CS Impact: Unclear the impact of this beyond two running processes. More information needed on the impact of this. Does the broker flap between them? Do commands get executed twice? How does it end up without the PIDfile in the first place?
Zendesk Ticket IDs: 34307
Zendesk Ticket Count: 1
Release Notes: Bug Fix
Release Notes Summary: This modifies the pxp-agent service to kill all pxp-agent processes when the service is restarted, rather than just the current pxp-agent process.

 Description   

OS / Version: Confirmed in Centos 6.7 and SLES 11

Reproduction:

/opt/puppetlabs/puppet/bin/pxp-agent --version
 1.8.1
 
#pxp-agent process running normally:
 
service pxp-agent status
 pxp-agent (pid 16099) is running...
 
rm /var/run/puppetlabs/pxp-agent.pid
 
service pxp-agent start
 Starting PXP agent: [ OK ]
 
ps aux | grep pxp-agent | grep -v grep
root 16099 1.6 0 1 379536 9776 ? Sl 21:21 0:02 /opt/puppetlabs/puppet/bin/pxp-agent
root 16297 9.9 0.0 379536 7744 ? Sl 21:23 0:02 /opt/puppetlabs/puppet/bin/pxp-agent

Expected behavior:

The init script should only allow one running pxp-agent process

Actual behavior:

With a missing pidfile the init script will start another pxp-agent process



 Comments   
Comment by Michael Smith [ 2018/03/01 ]

Yes... why is the pid file missing? I believe that's how it's checking for a previous instance.

Comment by Michael Smith [ 2018/03/01 ]

I'm open to suggestions on how to improve this, but I thought what we were doing - using a pid file to check for a prior process - was fairly standard.

Comment by Erik Hansen [ 2018/03/01 ]

Michael Smith - We're not sure why the pidfile is missing, but this doesn't happen with some of our other services.  This doesn't happen with pe-puppetserver, for instance, because there's a separate mechanism to check for running processes beyond the pidfile and its contents:

https://github.com/puppetlabs/ezbake/blob/db74b11e18f6f6de72297f0ed526585dc210b62b/resources/puppetlabs/lein-ezbake/template/global/ext/redhat/init.erb#L49-L53

Similarly, pe-activemq will recreate a pidfile if it finds that there's already a running process without a pidfile:

service pe-activemq start
INFO: Loading '/etc/sysconfig/pe-activemq'
INFO: Using java '/opt/puppetlabs/server/apps/java/lib/jvm/java/jre/bin/java'
INFO: a new pidfile created for a running instance : '/var/run/puppetlabs/activemq/activemq.pid' (pid '31472')
INFO: Process with pid '31472' is already running

Comment by Rob Braden [ 2018/07/06 ]

ok Erik Hansen's comment mentioned Ubuntu 16.04 - might be a different issue?

Comment by Michael Smith [ 2018/07/06 ]

The only suggestion I can come up with is use systemd, or avoid remounting /var/run without restarting the service. I'm not sure what other options make sense. There is a pxp-agent --pidfile option if you need to use a different path that's not a tmpfs.

Comment by Lucy Wyman [ 2019/01/14 ]

Kim Oehmichen This still needs tests, and verification on platforms other than Centos 6, but I was able to verify a fix for this issue in https://github.com/puppetlabs/pxp-agent/pull/738. Signing off for today, will add tests and manually verify on more platforms tomorrow.

Comment by Lucy Wyman [ 2019/01/15 ]

Kim Oehmichen Yep - we're currently testing https://github.com/puppetlabs/pxp-agent/pull/738, mostly working out the acceptance test syntax. Once that's in it will go through the puppet agent promotion pipeline, and be included in the next puppet agent release! We plan to tackle PCP-862 next and set a limit to the number of file descriptors, and start failing agents if it's getting too large. We have some prior art here where another tool sets the cap at 1200, so we'll start there and test it locally to see how it handles.

PS I don't think this comment needs to be limited in visibility, but figured better safe than sorry. Let me know if you need me to change the permissions though

Comment by Martin Jackson [ 2019/01/15 ]

In some environments, 1200 may be too low a number - for us, we're running > 1200 filedescriptors in use on each master for all the puppetserver functions exclusive of the pcpbroker.  Would it be possible for the process to use a percentage of its ulimit as a ceiling (just as an idea)?  That way it wouldn't have to be another tunable to manage.

Comment by Lucy Wyman [ 2019/01/16 ]

Martin Jackson I'm so sorry - I had the number wrong - in the pe-puppetserver service, we set a default of 12000 file descriptors not 1200, so that's where we'd start. We think PCP-862 might be an issue with pxp-agent waiting for a response it will never get when it closes a connection though, so are going to try `disconnect` instead of `close when a connection is superseded before limiting the descriptors.

Comment by Martin Jackson [ 2019/01/16 ]

Thanks!  That makes sense.  

Comment by Kenn Hussey [ 2019/02/14 ]

Lucy Wyman please provide release notes for this issue if needed, thanks!

Generated at Fri Jan 24 23:46:26 PST 2020 using JIRA 7.7.1#77002-sha1:e75ca93d5574d9409c0630b81c894d9065296414.