[PCP-862] pxp-agent connection flapping exhausts resources on broker Created: 2019/01/13  Updated: 2020/03/27  Resolved: 2019/03/01

Status: Resolved
Project: Puppet Communications Protocol
Component/s: None
Affects Version/s: None
Fix Version/s: pcp-broker 1.5.3

Type: Bug Priority: Critical
Reporter: Erik Hansen Assignee: Lucy Wyman
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
relates to PCP-833 pxp-agent init script will start more... Closed
Team: Bolt
Sprint: Bolt Kanban
Method Found: Customer Feedback
Release Notes: Bug Fix
Release Notes Summary: This fixes the PCP broker to close superseded connections with `disconnect` in place of `close!`, which frees resources (chiefly file descriptors) much more quickly and avoids unnecessary resource consumption when a connection is flapping or multiple pxp-agents are competing for connections.
QA Risk Assessment: Needs Assessment



If a condition exists where pxp-agent connections are flapping (see PCP-833), this can exhaust resources on the broker.


SLES11 agents are susceptible to PCP-833. To reproduce, remove the pxp-agent pidfile and start a second pxp-agent process:


rm /var/run/puppetlabs/pxp-agent.pid
service pxp-agent start
# confirm two PIDs:
pgrep pxp-agent


On the broker, note the number of connections in TIME_WAIT:


netstat -tunap | grep 8142 | grep TIME_WAIT
tcp6 0 0 TIME_WAIT -
tcp6 0 0 TIME_WAIT -
tcp6 0 0 TIME_WAIT -
tcp6 0 0 TIME_WAIT -
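To quantify the flapping rather than eyeballing the list, the same netstat pipeline can be turned into a counter. A minimal sketch; the heredoc sample below stands in for live `netstat -tunap` output (addresses are made up for illustration):

```shell
# Count broker connections stuck in TIME_WAIT on port 8142.
# Live use would be: netstat -tunap | grep 8142 | count_time_wait
count_time_wait() {
  grep -c 'TIME_WAIT'
}

# Illustrative sample standing in for real netstat output
netstat_sample='tcp6 0 0 host:8142 peer:53322 TIME_WAIT -
tcp6 0 0 host:8142 peer:53324 TIME_WAIT -
tcp6 0 0 host:8142 peer:53326 TIME_WAIT -'

printf '%s\n' "$netstat_sample" | count_time_wait
# → 3
```

Sampling this count over time (e.g. under `watch`) shows whether the TIME_WAIT population is growing, which is the symptom described above.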

Also note the number of file descriptors for the broker service increasing:

watch 'ls /proc/$(systemctl show -p MainPID pe-orchestration-services|cut -d= -f2)/fd | wc -l'

This continues until the service either runs out of file descriptors or exhausts its heap.
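To see how close the broker is to that failure point, the open-fd count can be compared against the process's soft limit. A sketch, assuming a Linux /proc filesystem; the `systemctl` invocation in the comment shows how to resolve the real broker PID:

```shell
# Report open fds and the soft "Max open files" limit for a given pid
fd_usage() {
  pid=$1
  open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
  limit=$(awk '/Max open files/ {print $4}' "/proc/$pid/limits")
  echo "open=$open soft_limit=$limit"
}

# For the broker (hypothetical, matching the watch command above):
#   fd_usage "$(systemctl show -p MainPID pe-orchestration-services | cut -d= -f2)"
# Demonstrated here against the current shell:
fd_usage $$
```

When `open` approaches `soft_limit`, the broker is about to start failing accepts, which is the exhaustion this ticket describes.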

If the broker runs on a compile master, this interrupts the puppetserver service as well.

Note: As part of this work, update Kim Oehmichen when this ticket is in merging and once it's merged.

Comment by Michael Smith [ 2019/01/14 ]

We should investigate whether there's anything we can do to ensure resources are freed when a connection is superseded. We could also add an exponential backoff to pxp-agent when the server closes the connection, to limit flapping.
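The backoff suggested above could follow a schedule like this sketch (names are hypothetical; pxp-agent's actual reconnect logic lives in its C++ source, this only illustrates the delay curve):

```shell
# Hypothetical reconnect-delay schedule: doubles per consecutive failure,
# capped at a maximum, and reset to attempt 0 once a connection sticks
backoff_delay() {
  attempt=$1 max=${2:-64}
  d=$(( 1 << attempt ))
  [ "$d" -gt "$max" ] && d=$max
  echo "$d"
}

for a in 0 1 2 3 6 7; do backoff_delay "$a"; done
# → 1 2 4 8 64 64 (one value per line)
```

Capping the delay keeps a legitimately disconnected agent from waiting arbitrarily long, while still throttling a flapping one enough to stop it exhausting broker resources.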

Comment by Michael Smith [ 2019/01/14 ]

http://www.eclipse.org/jetty/javadoc/9.4.5.v20170502/org/eclipse/jetty/websocket/api/Session.html#close-org.eclipse.jetty.websocket.api.CloseStatus- is what we use now to close a Superseded connection. That can take a while to resolve, especially under load. We may want to consider following up with a disconnect; that may help release resources more quickly.

Comment by Michael Smith [ 2019/01/23 ]

For testing, I’d suggest setting up master and agent hosts in vmpooler with https://github.com/puppetlabs/pe_acceptance_tests/tree/2018.1.x/setup/scale. Once set up, the agent hosts can be scaled to run thousands of pxp-agent instances, and we can modify LimitNOFILE in /etc/systemd/system/multi-user.target.wants/pe-puppetserver.service to reduce the limit.
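One way to lower that limit without editing the unit file in place (which a package upgrade can overwrite) is a systemd drop-in. A sketch, assuming the pe-puppetserver unit name from the comment above; the temp directory here stands in for the real drop-in path:

```shell
# Hypothetical drop-in lowering the fd limit for testing. The real path
# would be /etc/systemd/system/pe-puppetserver.service.d/limits.conf;
# a temp dir is used here purely to demonstrate the file's contents.
dropin_dir=$(mktemp -d)
cat > "$dropin_dir/limits.conf" <<'EOF'
[Service]
LimitNOFILE=1024
EOF
cat "$dropin_dir/limits.conf"
# afterwards: systemctl daemon-reload && systemctl restart pe-puppetserver
```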

Comment by Justin Stoller [ 2020/01/10 ]

In the description you say to note the number of connections in TIME_WAIT. Is there a normal absolute number, or a number relative to the non-TIME_WAIT connections open to the broker, or is any number of TIME_WAIT connections a symptom of this?

You also mention Superseded connections. Is that a one-to-one correspondence with log lines such as:

INFO  [qtp1779066441-88928] [p.p.b.core] pcp://<hostname>/agent disconnected 4000 Superseded.

Would there be a relation between TIME_WAIT connections and Superseded log lines?

Comment by Michael Smith [ 2020/01/10 ]

Most of the time there shouldn't be any; TIME_WAIT is usually related to waiting on a response to connect/disconnect (which should be fast).

"Superseded" is the message we'd get in pcp-broker logs. There shouldn't really be a correlation; lots of "Superseded" messages imply there are multiple pxp-agents using the same certificate (which shouldn't happen most of the time).
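To check the multiple-agents-per-certificate hypothesis directly, the Superseded disconnects can be counted per pcp:// identity. A sketch assuming the log line format quoted earlier in this thread; the sample lines and hostnames below are illustrative:

```shell
# Count Superseded disconnects per pcp:// identity in a broker log file.
# Many hits for one identity suggest multiple pxp-agents sharing a cert.
superseded_by_identity() {
  grep 'Superseded' "$1" | grep -o 'pcp://[^ ]*' | sort | uniq -c | sort -rn
}

# Illustrative sample log (format taken from the comment above)
log=$(mktemp)
cat > "$log" <<'EOF'
INFO  [qtp1-1] [p.p.b.core] pcp://host-a/agent disconnected 4000 Superseded.
INFO  [qtp1-2] [p.p.b.core] pcp://host-a/agent disconnected 4000 Superseded.
INFO  [qtp1-3] [p.p.b.core] pcp://host-b/agent disconnected 4000 Superseded.
EOF
superseded_by_identity "$log"
```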

Generated at Sat Sep 26 15:37:03 PDT 2020 using Jira 8.5.2#805002-sha1:a66f9354b9e12ac788984e5d84669c903a370049.