Performance of Group-Object -AsHash

This topic contains 7 replies, has 4 voices, and was last updated by Garrett Mohammadioun 9 months, 4 weeks ago.

  • #35063
    gmohammadioun
    Participant

    Hi PowerShell Community,

    I've got a script that frequently creates hashtables from collections. I used to do this "by hand" until I realized that Group-Object already provides this functionality through its -AsHash parameter. I've replaced some of the "by hand" code with Group-Object calls and I've realized that it's no longer as fast.

    My question is, am I using the cmdlet wrong and causing this performance hit? If I'm not, I'm also wondering how the call to Group-Object (a cmdlet presumably written in C#) could be slower than the "by hand" PowerShell code.

    I've already done a bit of investigating myself by writing a script that creates hashtables using both methods ("by hand" and Group-Object) and timing each. I've found that Group-Object is only slightly slower when the number of keys in the hash table is ~5000 or lower. However, once you get to something like 10,000 keys the difference in performance is staggering and Group-Object takes much longer.

    The lowdown on the Gist script:
    Just dot source it and run Compare-HashCreation with the required params

    EXAMPLE:
    Compare-HashCreation -NumValues 50000 -NumKeys 1000

    This will create a list of 50,000 tuples of the form (Num, "foobar"), where Num is a random number from 0-999. It then creates two hashtables, one with each method, using the Num property of each tuple as the hashtable key.
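
    For readers without the Gist, here is a minimal sketch of the two approaches being compared. It is not the Gist script itself: the variable names, the Text property, the fixed random seed, and the smaller data size are all assumptions chosen for illustration.

```powershell
# Hypothetical stand-in for the Gist's test data (all names here are assumptions):
# objects whose Num property is drawn at random from 0..($numKeys - 1).
$numValues = 5000
$numKeys   = 100
$random    = New-Object System.Random -ArgumentList 42   # fixed seed for repeatable runs

$items = foreach ($i in 1..$numValues) {
    [pscustomobject]@{ Num = $random.Next(0, $numKeys); Text = 'foobar' }
}

# "By hand": one hashtable lookup per item, creating a list the first time a key is seen.
$timer  = [System.Diagnostics.Stopwatch]::StartNew()
$byHand = @{}
foreach ($item in $items) {
    if (-not $byHand.ContainsKey($item.Num)) {
        $byHand[$item.Num] = New-Object System.Collections.ArrayList
    }
    [void]$byHand[$item.Num].Add($item)
}
$timer.Stop()
$byHandMs = $timer.Elapsed.TotalMilliseconds

# Group-Object: the same grouping done by the cmdlet.
$timer    = [System.Diagnostics.Stopwatch]::StartNew()
$byCmdlet = $items | Group-Object -Property Num -AsHashTable
$timer.Stop()
$byCmdletMs = $timer.Elapsed.TotalMilliseconds

'{0} keys by hand in {1:N0} ms; {2} keys via Group-Object in {3:N0} ms' -f `
    $byHand.Count, $byHandMs, $byCmdlet.Count, $byCmdletMs
```

    Raising $numKeys toward 10,000 (with $numValues scaled up to match) should be the way to look for the slowdown described above; the exact timings will depend on the machine and PowerShell version.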

    Gist:

    Thanks, Garrett

  • #35066
    Don Jones
    Keymaster

    I apologize; your account is flagged in the global WordPress system as a spam originator, and so your many posts on this topic have all been held. I've released this one.

  • #35071
    Don Jones
    Keymaster

    At a guess, I'd attribute this to the way .NET itself handles hash tables and arrays generally, meaning when you add an element to one, it more or less has to re-create the entire array. As the array grows progressively larger, that process obviously takes longer and longer.

  • #35072

    Thank you so much! Sorry for spamming but I spent like 2 hrs crafting this post so I didn't want it to go un-posted.

  • #35074

    Thanks for your thoughts, Don. I thought that at first too, but I found that the hashtable values returned by Group-Object are actually of Collection type:

    [16:53:49] PS> (dir | Group-Object Mode -AsHashTable -AsString).Values | % { $_.GetType() }

    IsPublic IsSerial Name                                     BaseType
    -------- -------- ----                                     --------
    True     True     Collection`1                             System.Object
    True     True     Collection`1                             System.Object

    [16:54:01] PS>

    I'm not 100% positive about this, but I don't think the Collection objects should run into any expensive array-copying problems when they grow.
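
    As a rough illustration of the difference, the sketch below contrasts growing a PowerShell array with += against calling Add on a Collection. The element type psobject and the variable names are assumptions made for the example. By default, Collection<T> wraps a List<T>, which grows its backing storage in larger steps rather than copying on every Add.

```powershell
# A PowerShell array has a fixed size, so += builds a new, larger array each time.
$array    = @(1, 2, 3)
$original = $array                  # second reference to the same array
$array   += 4                       # allocates a 4-element array and copies the old items
$originalCount = $original.Count    # still 3: the old array was left untouched
$arrayCount    = $array.Count       # 4: $array now points at the new copy

# Collection[T].Add appends to the existing object instead of replacing it.
$collection     = New-Object -TypeName 'System.Collections.ObjectModel.Collection[psobject]'
$sameCollection = $collection       # second reference to the same collection
$collection.Add('a')
$collection.Add('b')
$sameCollectionCount = $sameCollection.Count   # 2: both variables see the added items
```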

  • #35076
    Dave Wyatt
    Moderator

    Interesting... I've never looked at the Group-Object cmdlet's code before, and it behaves a bit oddly in this method (decompiled with ILSpy):

    The first bit of code (based on the result of TryGetValue) is what you'd expect of code that adds to a dictionary. What's interesting is the "for" loop in the else block, which iterates over all of the groups instead of using a dictionary-based lookup. I'm not sure why that code needs to be there, but that is definitely the sort of thing that could make it take a long time to execute if you're dealing with a large data set.

    // Microsoft.PowerShell.Commands.GroupObjectCommand
    internal static void DoGrouping(OrderByPropertyEntry currentObjectEntry, bool noElement, List<GroupInfo> groups, Dictionary<object, GroupInfo> groupInfoDictionary, OrderByPropertyComparer orderByPropertyComparer)
    {
    	if (currentObjectEntry != null && currentObjectEntry.orderValues != null && currentObjectEntry.orderValues.Count > 0)
    	{
    		object key = PSTuple.ArrayToTuple(currentObjectEntry.orderValues.ToArray());
    		GroupInfo groupInfo = null;
    		if (groupInfoDictionary.TryGetValue(key, out groupInfo))
    		{
    			if (groupInfo != null)
    			{
    				groupInfo.Add(currentObjectEntry.inputObject);
    				return;
    			}
    		}
    		else
    		{
    			bool flag = false;
    			for (int i = 0; i < groups.Count; i++)
    			{
    				if (orderByPropertyComparer.Compare(groups[i].GroupValue, currentObjectEntry) == 0)
    				{
    					groups[i].Add(currentObjectEntry.inputObject);
    					flag = true;
    					break;
    				}
    			}
    			if (!flag)
    			{
    				GroupObjectCommand.tracer.WriteLine(string.Format(CultureInfo.InvariantCulture, "Create a new group: {0}", new object[]
    				{
    					currentObjectEntry.orderValues
    				}), new object[0]);
    				GroupInfo groupInfo2 = noElement ? new GroupInfoNoElement(currentObjectEntry) : new GroupInfo(currentObjectEntry);
    				groups.Add(groupInfo2);
    				groupInfoDictionary.Add(key, groupInfo2);
    			}
    		}
    	}
    }
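
    To get a feel for how the "for" loop scales, here is a back-of-the-envelope sketch. It assumes the dictionary lookup succeeds for every key after its first appearance, so the loop only runs once per distinct key and scans every group created so far. Get-ScanComparisonCount is a hypothetical helper written for this post, not part of PowerShell or of the decompiled code.

```powershell
# Hypothetical helper: counts the comparisons made by the linear scan, assuming
# it runs once per new key and walks all of the groups that already exist.
function Get-ScanComparisonCount {
    param([int]$NumKeys)

    [long]$total = 0
    for ($existingGroups = 0; $existingGroups -lt $NumKeys; $existingGroups++) {
        # The Nth new key is compared against the N-1 groups created before it.
        $total += $existingGroups
    }
    $total
}

Get-ScanComparisonCount -NumKeys 1000     # 499500
Get-ScanComparisonCount -NumKeys 10000    # 49995000
```

    Under that assumption, ten times as many keys means roughly a hundred times as many comparisons, which would be consistent with the slowdown showing up around 10,000 keys.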
    

  • #35078
    gmohammadioun
    Participant

    Man, I tried to edit bc I forgot the Gist link and it got marked as spam again 🙁

  • #35092

    Hey Dave, thanks for your reply. I kinda see what you're talking about but I'll need a bit more time to digest exactly what's going on in the code you pasted.

    Somewhat unrelated, I'd never heard of ILSpy but I just downloaded it bc it seems pretty useful. However, I'm not sure how you navigated to the Group-Object code. Could you explain how you did that?
