[PUP-9561] puppet language functions are 11,350x slower than ruby functions Created: 2019/03/14  Updated: 2019/05/09  Resolved: 2019/05/09

Status: Closed
Project: Puppet
Component/s: Language
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Normal
Reporter: Sean Millichamp Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
relates to MODULES-8760 The merge() function should be able t... Ready for CI
relates to PUP-9568 Migrate merge() from stdlib and make ... Closed
Template:
Sub-team: Language
Team: Server
QA Risk Assessment: Needs Assessment

 Description   

From Slack: https://puppetcommunity.slack.com/archives/CFD8Z9A4T/p1552586549522900

I wrote a Puppet language function that, given an array of hashes, built a hash counting how many times each value of one of the keys appeared, and then returned all of the values where the count was > 1.

This was the code:

# Find count of records with duplicates:
$id_count = $system_info.reduce({}) |$results, $rec| {
    $id = $rec['system_id']
    $current = $results[$id].lest || { 0 }
    $results + {$id => $current + 1}
}
 
# Find IDs with count > 1:
$duplicate_ids = $id_count.filter |$id, $count| { $count > 1 }.keys

On my set of records, which was around 1,800 or so I think, it took more than two minutes to run, per the profile data.

I converted it to Ruby and it was, of course, super fast.

The Ruby I ended up with was:

def find_dupes_system_info(system_info)
    res = Hash.new(0)
    system_info.each { |sys| res[sys['system_id']] += 1 }
    res.reject { |_id, count| count == 1 }.keys
end

155.4891 seconds (Puppet) vs 0.0137 seconds (Ruby)



 Comments   
Comment by Ben Ford [ 2019/03/14 ]

I've had similar conversations with two other people in recent weeks.

Comment by Ben Ford [ 2019/03/14 ]

After Puppet 6.3, this could be rewritten as something like (eyeball-compiled only):

$system_info.group_by |$rec| { $rec['system_id'] }
            .filter |$k, $v| { $v.size > 1 }
            .keys

Comment by Henrik Lindberg [ 2019/03/15 ]

The original has quadratic running time in the length of the input data, while the rewritten function is linear. For a smaller number of entries (I tested with 200) Ben's version is 7x faster; for 500 entries it is 48x faster, and so on.

Comment by Henrik Lindberg [ 2019/03/15 ]

If you are on a version where group_by does not exist, the function can be written like this:

$keys = $system_info.map |$x| { $x['system_id'] }.sort
$duplicate_ids = $keys.map |$i, $v| { if $v == $keys[$i+1] { $v } }.unique.filter |$x| { $x != undef }

This algorithm sorts the keys; for any duplicated key there will be at least one entry in the mapped result holding that key (the rest are undef and filtered out).

This variant is about 2x slower than the group_by version, but it is at least close to linear (the sort is O(n log n)) rather than quadratic.

A small snag in this version is that it does not deal consistently with case. See the functions sort() and compare(): the sort is case sensitive by default, while the == equality operator is case insensitive.
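A quick illustration of the difference, assuming stock Puppet string semantics:

notice('ABC' == 'abc')           # => true, == ignores case for strings
notice(sort(['b', 'A', 'a']))    # => ['A', 'a', 'b'], sort compares case sensitively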

To make the algorithm consistently case sensitive:

$keys = $system_info.map |$x| { $x['system_id'] }.sort
$duplicate_ids = $keys.map |$i, $v| { if $i + 1 < $keys.size and compare($v, $keys[$i+1], false) == 0 { $v } }.unique.filter |$x| { $x != undef }

And vice versa, the consistently case insensitive version:

$keys = $system_info.map |$x| { $x['system_id'] }.sort |$a, $b| { compare($a, $b, true) }
$duplicate_ids = $keys.map |$i, $v| { if $i + 1 < $keys.size and compare($v, $keys[$i+1], true) == 0 { $v } }.unique.filter |$x| { $x != undef }

An alternative is to downcase all of the keys once up front, which avoids the repeated case folding that every case insensitive compare() call would otherwise do. (Left as an exercise to optimize.)
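A minimal sketch of that downcase-first variant, assuming the ids are strings and a downcase() function (built in, or from stdlib on older versions) is available:

# Downcase once; after that the default sort and the case insensitive ==
# operator agree, so no compare() calls are needed.
$keys = $system_info.map |$x| { $x['system_id'].downcase }.sort
$duplicate_ids = $keys.map |$i, $v| { if $v == $keys[$i+1] { $v } }.unique.filter |$x| { $x != undef }

Note that the reported duplicate ids come back downcased.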

Comment by Sean Millichamp [ 2019/03/18 ]

Thanks for the suggestions. Given that I can't use group_by yet (I'm on 5.5), I'll probably stick with the Ruby I'm using, but Henrik's suggestions were clever and interesting to read.

Would you mind clarifying what part of the original is responsible for the quadratic running time? Is it the last line in the reduce that merges the results? Is that because each time through the loop a new hash has to be allocated and all of the elements of the previous hash copied before doing the merge? That is a fairly common pattern I use in reduce, often to build up a data structure I can't otherwise modify (such as inside an each construct, for example) because of immutable variables.

Comment by Henrik Lindberg [ 2019/03/18 ]

Sean Millichamp You are correct: copying the hash on every pass through the loop, with the hash a little bigger each time, is what kills performance. For n entries that is roughly 1 + 2 + ... + n ≈ n²/2 element copies.

It is a common use case, and for small hashes it is not too bad. For bigger hashes, one of the techniques I showed works much faster.

 

Comment by Henrik Lindberg [ 2019/03/19 ]

I added PUP-9568 as an idea for how we can add a hash builder function (by revamping the existing `merge` function) in stdlib. Maybe do the work there so it can be used with older Puppet versions.

Comment by Henrik Lindberg [ 2019/03/20 ]

And PUP-9568 was changed to become MODULES-8760 (for which there is now a PR to stdlib).

Comment by Henrik Lindberg [ 2019/05/09 ]

Sean Millichamp Since the merge feature has now been added to stdlib, maybe you can use that.
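A rough sketch of what that could look like, assuming the block form of stdlib's merge() from MODULES-8760 takes the accumulated hash and the current element (check the stdlib reference for the exact signature):

# Build the count hash in a single pass; merge() folds the small hash returned
# by the block into one accumulated result instead of copying it every iteration.
$id_count = $system_info.merge |$memo, $rec| {
  { $rec['system_id'] => ($memo[$rec['system_id']].lest || { 0 }) + 1 }
}
$duplicate_ids = $id_count.filter |$id, $count| { $count > 1 }.keys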

And on that note, I am closing this ticket - it shows enough alternatives for how to avoid quadratic running time when building a hash.

Generated at Mon Dec 09 13:58:33 PST 2019 using JIRA 7.7.1#77002-sha1:e75ca93d5574d9409c0630b81c894d9065296414.