[PDB-1721] Possible Memory leak with PuppetDB? Created: 2015/07/02  Updated: 2016/01/11  Resolved: 2015/08/04

Status: Closed
Project: PuppetDB
Component/s: None
Affects Version/s: PDB 2.3.4
Fix Version/s: PDB 3.0.2

Type: Bug Priority: Normal
Reporter: Simon Oxwell Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

CentOS7, Postgres 9.2.7

Attachments: PNG File puppetdb-metrics.png     PNG File puppetdb-metrics2.png    
Issue Links:
Story Points: 5
Sprint: PuppetDB 2015-07-29, PuppetDB 2015-08-12


We're still seeing PuppetDB regularly run out of memory every 3-4 weeks.

I did report something similar back in PDB-1484, but that was when we had troubles with 'replace facts' failing (PDB-1448).

We're getting this in our logs, as a first sign:

2015-07-02 08:05:05,553 ERROR [c.p.p.command] [08271ea0-ad7c-42c7-a0e5-3810fa668118] [replace catalog] Fatal error on attempt 1
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at clojure.lang.PersistentArrayMap.create(PersistentArrayMap.java:57) ~[puppetdb.jar:na]
	at clojure.lang.PersistentArrayMap.assoc(PersistentArrayMap.java:208) ~[puppetdb.jar:na]
	at clojure.lang.APersistentMap.cons(APersistentMap.java:36) ~[puppetdb.jar:na]
	at clojure.lang.RT.conj(RT.java:562) ~[puppetdb.jar:na]
	at clojure.core$conj.invoke(core.clj:83) ~[puppetdb.jar:na]
	at schema.utils$result_builder$conjer__1981.invoke(utils.clj:139) ~[na:na]
	at schema.core$map_walker$fn__2760$fn__2763$fn__2764.invoke(core.clj:731) ~[na:na]
	at clojure.core.protocols$fn__6086.invoke(protocols.clj:143) ~[puppetdb.jar:na]
	at clojure.core.protocols$fn__6057$G__6052__6066.invoke(protocols.clj:19) ~[puppetdb.jar:na]
	at clojure.core.protocols$seq_reduce.invoke(protocols.clj:31) ~[puppetdb.jar:na]
	at clojure.core.protocols$fn__6078.invoke(protocols.clj:54) ~[puppetdb.jar:na]
	at clojure.core.protocols$fn__6031$G__6026__6044.invoke(protocols.clj:13) ~[puppetdb.jar:na]
	at clojure.core$reduce.invoke(core.clj:6289) ~[puppetdb.jar:na]
	at schema.core$map_walker$fn__2760$fn__2763.invoke(core.clj:735) ~[na:na]
	at schema.core$map_walker$fn__2760.invoke(core.clj:726) ~[na:na]
	at schema.core.MapEntry$fn__2721.invoke(core.clj:675) ~[na:na]
	at schema.core$map_walker$fn__2760$fn__2763$fn__2764.invoke(core.clj:731) ~[na:na]
	at clojure.core.protocols$fn__6086.invoke(protocols.clj:143) ~[puppetdb.jar:na]
	at clojure.core.protocols$fn__6057$G__6052__6066.invoke(protocols.clj:19) ~[puppetdb.jar:na]
	at clojure.core.protocols$seq_reduce.invoke(protocols.clj:31) ~[puppetdb.jar:na]
	at clojure.core.protocols$fn__6078.invoke(protocols.clj:54) ~[puppetdb.jar:na]
	at clojure.core.protocols$fn__6031$G__6026__6044.invoke(protocols.clj:13) ~[puppetdb.jar:na]
	at clojure.core$reduce.invoke(core.clj:6289) ~[puppetdb.jar:na]
	at schema.core$map_walker$fn__2760$fn__2763.invoke(core.clj:735) ~[na:na]
	at schema.core$map_walker$fn__2760.invoke(core.clj:726) ~[na:na]
	at schema.core.Either$fn__2541.invoke(core.clj:469) ~[na:na]
	at schema.core.MapEntry$fn__2715.invoke(core.clj:665) ~[na:na]
	at schema.core$map_walker$fn__2760$fn__2763.invoke(core.clj:739) ~[na:na]
	at schema.core$map_walker$fn__2760.invoke(core.clj:726) ~[na:na]
	at clojure.core$comp$fn__4192.invoke(core.clj:2403) ~[puppetdb.jar:na]
	at schema.core$check.invoke(core.clj:155) ~[na:na]
	at schema.core$validate.invoke(core.clj:160) ~[na:na]

and Java dumps a hprof file. We've had a look at that and seen:

5,169 instances of "org.postgresql.jdbc4.Jdbc4PreparedStatement", loaded by "sun.misc.Launcher$AppClassLoader @ 0xd0288580" occupy 584,559,496 (76.68%) bytes. 
3,174 instances of "org.postgresql.core.v3.SimpleQuery", loaded by "sun.misc.Launcher$AppClassLoader @ 0xd0288580" occupy 82,194,760 (10.78%) bytes.

Our heap size is 750M, and we have 108 puppet agents, so our heap size should be plenty (the rule of thumb suggests a 256MB heap). The dashboard says we have 88,142 resources, and our agents check in every 15 minutes.

Last time, Kenneth suggested we tune statements-cache-size, which we have not done, as the default of 1000 seemed fine - but we seem to have five times that number of cached statements.

Are we seeing a possible leak around the prepared statements?

Comment by Richard Chatterton [ 2015/07/13 ]

We're seeing a similar issue. After upgrading to 2.3.5-1, PuppetDB is crashing roughly every 12 to 24 hours with "java.lang.OutOfMemoryError: GC overhead limit exceeded". Analyzing the heap dumps from these crashes with Java VisualVM, we see that there are ~11,000 instances of org.postgresql.jdbc4.Jdbc4PreparedStatement. The "shallow size" of these objects is roughly 2 MB; however, using an OQL query that returns the sum of the retained sizes of each of these objects, we see that together they are using about 3 GB of heap. This PuppetDB installation currently has -Xmx3072m.

Query: select sum(heap.objects('org.postgresql.jdbc4.Jdbc4PreparedStatement'), 'rsizeof(it)') 
Result: 3.390594731E9

In an attempt to solve this, we are setting the statements-cache-size configuration parameter to 500, half of the documented default of 1000. If this helps, we'll update this ticket with that information.

Some additional information: This PuppetDB environment contains 912 nodes, with ~142k resources in the population and has the maximum heap set to 3 GB. We are making use of a few relatively large structured facts, which necessitated our upgrade from 2.2.x to 2.3.5-1. We started with a fresh database in 2.3.5-1 due to schema issues in the upgrade. We can provide the hprof I referenced, if that would be helpful.

Also, version information since our installation differs from the original reporter's:

PuppetDB 2.3.5-1
CentOS 6.6
PostgreSQL 9.4.4 with pg_trgm

Comment by Richard Chatterton [ 2015/07/13 ]

I've also attached a screenshot showing the change in metrics over time of a typical case. The data source for all of these metrics is the /v3/metrics/mbean API endpoint of PuppetDB. Represented here, the blue line is the average statement execution time, the purple line is the StatementsCached metric, the green line is the committed heap, and the red line is the used heap. Note that the values have been scaled so that they can be clearly seen on the same graph.

We can see that, from the time when PuppetDB was started at ~1230, the committed heap roughly follows the increases in the StatementsCached metric. We run Puppet once an hour at the start of every hour across all nodes in the population, and most of the increases in StatementsCached occur at this time, at a rate of about 1k per hour. This begins to fall off as the committed heap begins to approach the maximum heap of 3 GB. The execution time becomes very long as performance begins to degrade, and garbage collections are not freeing up much space as time passes. Gaps in the metrics are likely due to PuppetDB's metric endpoints becoming intermittently unresponsive as the situation becomes worse, until it eventually crashes at ~0400.
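For anyone wanting to collect the same metrics, polling described above can be sketched roughly as follows. This is an illustrative sketch only: the endpoint path follows the /v3/metrics/mbean form mentioned above, and the payload field names are assumed to follow the standard JMX MemoryUsage shape (used/committed/max); adjust host, port, and mbean names for your installation.

```python
import json
import urllib.request

# Hypothetical base URL; host, port, and path may differ per install.
BASE = "http://localhost:8080/v3/metrics/mbean"

def heap_usage(mbean_payload):
    """Extract (used, committed) heap bytes from a java.lang:type=Memory
    mbean payload, assuming the standard JMX MemoryUsage field names."""
    usage = mbean_payload["HeapMemoryUsage"]
    return usage["used"], usage["committed"]

def poll_heap(base=BASE):
    """Fetch the memory mbean from PuppetDB and return (used, committed)."""
    with urllib.request.urlopen(base + "/java.lang:type=Memory") as resp:
        return heap_usage(json.load(resp))
```

Graphing used vs. committed heap over time, alongside the StatementsCached metric, is what produces the pattern described in this comment.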

Comment by Wyatt Alt [ 2015/07/13 ]

Simon Oxwell Richard Chatterton thanks for reporting this. It's pretty clear that there's an issue here, but we've been wrapped up in release-related work and haven't had a chance to delve in just yet.

I tried to replicate this about a week ago and found that the number of instances of org.postgresql.jdbc4.Jdbc4PreparedStatement seemed to grow unaffected by reductions in statements-cache-size and forced garbage collection, but when I left my simulation running overnight the number of instances seemed to top out around 2500 and the retained size never grew beyond 5% of my 750MB heap. Both of your retained sizes are massive in comparison.

Richard Chatterton thanks for your note about large structured facts. I remember from PDB-1448 that Simon Oxwell also uses large structured facts, and I'm sure that larger facts imply a larger retained size. If that were the issue, it would explain why we're not seeing this from everyone. Any information you can give us on the size and structure of those facts (e.g. array of maps, map of arrays, etc.) would be helpful. I'll repeat my experiment with some large facts in a little while, but in the meantime, if you have that hprof dump handy it would definitely be nice to have.

Richard, given that you're seeing this frequently it would also be interesting to know if dramatically reducing that cache size has any effect for you. Maybe even something like 100.


Comment by Richard Chatterton [ 2015/07/13 ]

Wyatt, thanks for your follow-up. We're collecting a 'packages' fact, which is an array of the full package strings for all packages on the system. On an EL system, it's essentially the output of rpm -qa as an array. For a typical example, this array is 716 elements long and ~30K on disk, serialized as JSON. Initially, this was a nested map with more information, but (particularly on 2.2.x) we found that it was not performant enough and switched to a simpler format. Each of our 912 nodes is reporting this fact on every run. At this time, I'm not certain whether PuppetDB has a concept of ordered arrays in structured facts, or whether rpm -qa returns the list of packages in the same order every time – if the fact value is an array with the same members in a different order, would that cause an update?
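One possible way to sidestep the ordering question is to normalize the package list before it becomes the fact value, so two runs with the same installed set always submit an identical array. A minimal sketch of the idea in Python (a real Facter fact would be written in Ruby; normalize_packages is a hypothetical helper, not part of any PuppetDB or Facter API):

```python
def normalize_packages(raw_output):
    """Sort and de-duplicate the raw rpm -qa output so that the fact
    value is deterministic: the same installed set always yields the
    same array, avoiding updates triggered purely by ordering."""
    return sorted(set(raw_output.splitlines()))

# Example: the same packages in a different order normalize identically.
a = normalize_packages("bash-4.2.46\nzlib-1.2.7\n")
b = normalize_packages("zlib-1.2.7\nbash-4.2.46\n")
assert a == b == ["bash-4.2.46", "zlib-1.2.7"]
```

Whether this actually reduces churn depends on whether PuppetDB treats a reordered array as a changed fact, which is exactly the open question above.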

We're currently running our PuppetDB with statements-cache-size = 500, and from the metrics we can see at this time, it looks like it will likely die within the next few hours. In the event it does die, we'll try running it with statements-cache-size = 100, as you suggest.

The hprof I referenced is 3.5 GB in size, so I imagine you probably don't want me attaching it directly to this issue. Is there somewhere to which we can upload it? EDIT: Though, gzipped it is 795MB.

Comment by Richard Chatterton [ 2015/07/13 ]

On further review, we probably won't be able to supply the hprof for security reasons, unless there's a simple way to "anonymize" any data that might be in it (node names, facts, and so on).

We're going to try disabling the 'packages' fact I mentioned in my previous message and see if that alleviates the issue. I'll follow up after that test.

Comment by Wyatt Alt [ 2015/07/13 ]

Richard Chatterton I think leaving off the hprof should be fine. The more I look into this the more it seems like we've got it narrowed down enough already. When I run with a massive structured fact in the mix I can get the prepared statements to take a much larger share of heap. I haven't seen the OOM yet but I have no trouble believing that result.

To your question about updates, PDB currently isn't smart enough to avoid doing a big update when the ordering of an array-valued fact changes. I went into some detail on the mechanics here in PDB-1448, but the upshot is that big array-valued structured facts (and particularly arrays of maps) are our current worst-case scenario from the perspective of large, one-off prepared statements.

It's also worth noting that there is likely little benefit in caching those statements. If the nodes don't all have the same number of packages, I doubt the statements are being reused across nodes, and the statement generated by an addition or deletion of elements from a list of packages also seems like a relatively one-off thing, compared to, for instance, a statement that is called by the dashboard every five seconds, or something internal that we do on each command submission. I need to look into it more, but I'm curious how much benefit we actually get from caching. The statements that are good to cache are the ones that get used all the time, and my guess is that those also tend to be the small and inexpensive ones. It's certainly hard for me to believe that there's value in caching 1000 by default.

I've noticed that setting it to 0 or 1 (or low numbers generally) seems to cause me to hold more instances of org.postgresql.jdbc4.Jdbc4PreparedStatement than when it's set to 500 or 1000. I haven't determined whether it's affecting my retained size yet, but in my current trial I have about 20k objects with a retained size < 1% of heap (which is very different from when I had it set to 500). I don't know what the performance drawbacks of setting it that low would be, but it does point to a lower cache size being a possible workaround.

I still need to better understand what's going on, but we'll start working on this in earnest in the next few days and keep the ticket up to date.


Comment by Simon Oxwell [ 2015/07/13 ]

Wyatt Alt Thanks for the update. Not entirely sure how large our biggest structured facts are, but probably not quite as large as Richard Chatterton's package lists.

Richard Chatterton - thanks for the additions to the ticket. Glad to know it's not just my site.

Comment by Richard Chatterton [ 2015/07/14 ]

To follow up on the test removing the 'packages' fact: with it removed, we're seeing a much more sane pattern of statement caching and heap utilization. PuppetDB has now been running for over twelve hours and doesn't look to be crashing any time soon. A graph demonstrating this, similar to the one in my previous comment, is attached.

Comment by Wyatt Alt [ 2015/08/03 ]

Simon Oxwell Richard Chatterton To address this we've made some changes that reduce the size of cached prepared statements in the case of large structured facts, and we've also changed the default statements-cache-size from 1000 to 0, having concluded that the cache doesn't have much of an operational impact. We're aiming to release these changes tomorrow – looking forward to hearing whether the issue is resolved.


Comment by Simon Oxwell [ 2015/08/05 ]

Wyatt Alt I see that 3.0.2 is out and this issue is referenced in the release notes. Hurrah!

But, the upgrade path to PuppetDB 3.x is pretty involved. I don't suppose there's any chance of a backport to the 2.3.x branch?

Comment by Wyatt Alt [ 2015/08/05 ]

Simon Oxwell Yes, sorry about that. I'll see what we can do about a backport. In the meantime, any chance you'd mind checking whether setting statements-cache-size (https://docs.puppetlabs.com/puppetdb/2.3/configure.html#statements-cache-size) to 0 alleviates the issue? In testing we've seen that result in more prepared statement objects on the heap, but with a much smaller retained size (and more responsive to GC). Obviously helpful to get confirmation in the wild, though.
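For anyone following along, per the linked docs the setting lives in the [database] section of the PuppetDB configuration. A minimal sketch (the file path below is a typical packaged layout, not necessarily yours):

```ini
# /etc/puppetdb/conf.d/database.ini -- path varies by install
[database]
# Disable the prepared-statement cache; the 2.3.x default is 1000.
statements-cache-size = 0
```

PuppetDB must be restarted for the change to take effect.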


Comment by Simon Oxwell [ 2015/09/09 ]

Hi Wyatt Alt - sorry for the lack of updates. We set the cache size to 0 and restarted puppetdb on the 18th of August. Looking at the dashboard, JVM usage is peaking around 320MB, with GC bringing it back down to 160MB.

We've taken a memory dump to look at, but I think we're not going to find any lingering prepared statements this time.

Comment by Simon Oxwell [ 2015/09/13 ]

Wyatt Alt - the memory dump had 381 instances of org.postgresql.jdbc4.Jdbc4PreparedStatement, which is a definite improvement. It seems a little high if we're not caching them, though?

Generated at Sat Aug 08 22:03:22 PDT 2020 using Jira 8.5.2#805002-sha1:a66f9354b9e12ac788984e5d84669c903a370049.