Details
Bug
Status: Resolved
Normal
Resolution: Fixed
None
HA
HA 2021-11-03
Needs Assessment
46288
1
Bug Fix
Reports were gc'ed by report-ttl - 1 day, now they are not.
Needs Assessment
Description
There is a bug in our reports table partition garbage collection: any `report-ttl` shorter than 24 hours and 5 minutes (assuming garbage collection runs around 00:05) causes the previous day's table partition to be garbage collected.
This is due to how the report-ttl expiration DateTime and the reports table partition DateTime are constructed and compared. A table partition's DateTime is always 00:00 of its day, which means the partition always contains reports that are later than that DateTime but still on the same day. See puppetlabs.puppetdb.scf.storage/prune-daily-partitions.
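The comparison described above can be illustrated with a minimal Python sketch. This is not the PuppetDB code (which is Clojure); the function names `partition_start` and `is_pruned_buggy` are hypothetical, and the 00:05 sweep time is the assumption used in this description.

```python
from datetime import datetime, timedelta

def partition_start(day: datetime) -> datetime:
    """A daily partition is identified by 00:00 of the day it covers."""
    return day.replace(hour=0, minute=0, second=0, microsecond=0)

def is_pruned_buggy(partition: datetime, now: datetime, ttl: timedelta) -> bool:
    """Prune when the partition's start DateTime is before the TTL expiration."""
    expiration = now - ttl
    return partition < expiration

now = datetime(2021, 11, 3, 0, 5)                      # sweep at 00:05 (assumed)
yesterday = partition_start(now - timedelta(days=1))   # 2021-11-02 00:00

# With a TTL of 1d the expiration is 2021-11-02 00:05, which is after the
# partition's 00:00 DateTime, so all of yesterday's reports are dropped.
pruned_1d = is_pruned_buggy(yesterday, now, timedelta(days=1))

# A TTL of 25h moves the expiration to 2021-11-01 23:05, so the partition survives.
pruned_25h = is_pruned_buggy(yesterday, now, timedelta(hours=25))
```

In this sketch `pruned_1d` is true and `pruned_25h` is false, which matches the threshold of 24 hours and 5 minutes for a sweep at 00:05.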
Solution A:
We could change the table partitions' DateTime to be a day later to more accurately reflect an "expiration" date for the partition.
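A minimal Python sketch of Solution A, assuming the prune check compares against the end of the partition's day instead of its start. The names `partition_end` and `is_pruned_solution_a` are hypothetical and do not exist in PuppetDB.

```python
from datetime import datetime, timedelta

def partition_end(start: datetime) -> datetime:
    """Treat the partition's DateTime as the end of its day (start + 1 day)."""
    return start + timedelta(days=1)

def is_pruned_solution_a(start: datetime, now: datetime, ttl: timedelta) -> bool:
    """Prune only when even the newest possible report in the partition has expired."""
    return partition_end(start) < now - ttl

now = datetime(2021, 11, 3, 0, 5)        # sweep at 00:05 (assumed)
ttl = timedelta(days=1)

yesterday = datetime(2021, 11, 2)        # partition covering 2021-11-02
two_days_ago = datetime(2021, 11, 1)     # partition covering 2021-11-01

keeps_yesterday = not is_pruned_solution_a(yesterday, now, ttl)
drops_two_days_ago = is_pruned_solution_a(two_days_ago, now, ttl)
```

Under these assumptions yesterday's partition is kept, while the partition from two days ago, whose newest report is already older than the TTL, is dropped.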
Solution B:
An alternative fix would be to "floor" the expiration DateTime derived from the `report-ttl` value to the start of its day. This way, when yesterday's partition is considered for garbage collection with a `report-ttl` of "1d", both DateTimes are identical. This avoids garbage collection because a partition is only garbage collected if the partition DateTime is before the `report-ttl` expiration DateTime. See puppetlabs.puppetdb.cli.services/sweep-reports!
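A minimal Python sketch of Solution B, assuming the expiration is floored to 00:00 before the comparison. The names `floor_to_day` and `is_pruned_solution_b` are hypothetical; the actual change would live in the Clojure code referenced above.

```python
from datetime import datetime, timedelta

def floor_to_day(dt: datetime) -> datetime:
    """Round a DateTime down to 00:00 of the same day."""
    return dt.replace(hour=0, minute=0, second=0, microsecond=0)

def is_pruned_solution_b(partition: datetime, now: datetime, ttl: timedelta) -> bool:
    """Compare the partition's 00:00 DateTime against the floored expiration."""
    expiration = floor_to_day(now - ttl)
    return partition < expiration

now = datetime(2021, 11, 3, 0, 5)        # sweep at 00:05 (assumed)
ttl = timedelta(days=1)

yesterday = datetime(2021, 11, 2)        # partition covering 2021-11-02
two_days_ago = datetime(2021, 11, 1)     # partition covering 2021-11-01

# The floored expiration is 2021-11-02 00:00, identical to yesterday's
# partition DateTime, so the strict "before" comparison keeps the partition.
keeps_yesterday = not is_pruned_solution_b(yesterday, now, ttl)
drops_two_days_ago = is_pruned_solution_b(two_days_ago, now, ttl)
```

One consequence visible in this sketch is that reports can outlive the configured TTL by up to almost a day, since whole-day partitions are only dropped once the floored expiration has passed them.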
Zendesk: https://puppetlabs.zendesk.com/agent/tickets/46288
Initial Zendesk Support Message:
Hi,
I believe we've identified a bug in the handling of the puppetdb report_ttl setting.
We have been using a report_ttl of `1d` for quite some time, and noticed, especially after upgrading to 2019 (but this possibly existed before), that virtually all reports would be deleted during the first sweep after midnight. Often we found the Puppet Console reporting tens of thousands of nodes with no reports.
What seems to be happening is that, if the GC runs at 00:10, only reports that came in between 00:00 and 00:10 will be retained. This means any agents that have NOT run Puppet in the past 10 minutes will have 'no reports' and will show on the status page as not having checked in.
Nodes that HAVE checked in between 00:00 and 00:10 will show as having checked in, but only have the 1 report for the day (for example).
I'm guessing that the code is rounding `1d` unexpectedly; it seems to be rounding down to the most recent calendar day, which is just "today".
If I change `1d` to `24h` it converts back to `1d` for the purposes of the sweep.
I am now experimenting with `25h` to see if reports are retained for 1 full day, and will report back, but I wanted to get the ball rolling on this ticket.
This is uniquely visible in our environment as we retain reports for 1d, and have a standard runInterval of 1h. This means if GC runs at 00:05 and we check the console immediately thereafter, almost no agents have checked in. If we check the console at 00:30, statistically speaking it is likely that 50% of our nodes have 'no reports'.
https://github.com/puppetlabs/puppetdb/blob/ad13f09bed2f9462ec97b3dc738a055bb8716c4e/src/puppetlabs/puppetdb/cli/services.clj#L206-L242