Efficiency: Joining multiple PSObjects

This topic contains 4 replies, has 3 voices, and was last updated by Dave Wyatt 4 years, 1 month ago.

  • Author
    Posts
  • #11214

    Rob Simmers
    Participant

    Currently, I pull data from AD (approx. 25,000 records), DFS Namespace 1 (approx. 13,000 records), and DFS Namespace 2. I have a function that loops through each AD user, does a lookup in DFS Namespaces 1 and 2, and builds a result PSObject. As requirements have continually changed, I've added more for loops and if blocks for each user to be processed, so a script that used to run in 45 minutes is now taking more like 5 hours.

    I'm looking for ways to make the script more efficient. For instance, if I build each data object with a common attribute (e.g. SamAccountName), has anyone attempted to do a join on multiple datasets by that attribute? I've seen blogs like http://blogs.msdn.com/b/powershell/archive/2012/07/13/join-object.aspx, but those approaches still take a while to join. Is it possible to convert the collections to DataTables and do a SQL-style join (like LINQ) to get a final result set? Just curious whether any of the experts have worked on something like this and what you found to be the most efficient way to build a final result with all of the data joined together.

    I can provide code, but I'm just curious whether someone has had a similar scenario and, having measured 15 different options, come to a consensus on the best approach. Thanks

  • #11217

    Dave Wyatt
    Moderator

    When you're working with a small enough data set that everything can be stored in memory, I build hashtables for this purpose. Doing a lookup on the key of a hashtable is quite fast. For example:

    $adUsers = @{}
    
    Get-ADUser -Filter * |
    ForEach-Object {
        $adUsers[$_.SamAccountName] = $_
    }
    
    # Later:
    
    $adUser = $adUsers[$someSamAccountName]
    if ($null -ne $adUser)
    {
        # do stuff
    }
    

    If the data's too large for that, I'd probably just use PowerShell to stick all that data into a SQL database, which can handle joins and full-text indexing, rather than trying to reinvent that particular wheel myself.
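    If you do go the SQL route, the loading step can be sketched roughly like this with SqlBulkCopy — note the connection string, database, table name, and columns below are placeholders for illustration, not anything specific from this thread:

    ```powershell
    # Sketch: bulk-load user records into a SQL staging table, then let SQL do the joins.
    # "Server=.;Database=Staging;..." and dbo.AdUsers are hypothetical names.
    Add-Type -AssemblyName System.Data

    # Build an in-memory DataTable whose columns mirror the staging table
    $table = New-Object System.Data.DataTable
    [void]$table.Columns.Add('SamAccountName', [string])
    [void]$table.Columns.Add('DisplayName', [string])

    foreach ($user in (Get-ADUser -Filter * -Properties DisplayName)) {
        $row = $table.NewRow()
        $row.SamAccountName = $user.SamAccountName
        $row.DisplayName    = $user.DisplayName
        $table.Rows.Add($row)
    }

    # SqlBulkCopy streams the rows to the server far faster than row-by-row INSERTs
    $bulk = New-Object System.Data.SqlClient.SqlBulkCopy('Server=.;Database=Staging;Integrated Security=True')
    $bulk.DestinationTableName = 'dbo.AdUsers'
    $bulk.WriteToServer($table)
    $bulk.Close()
    ```

    Repeat for the DFS namespace data, and the final result is a single JOIN query against indexed tables.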

  • #11219

    Rob Simmers
    Participant

    I was actually writing the code to try option 2, as I agree with not reinventing the wheel: write everything to SQL and do a query with joins. I've never used a hash table for lookups, so I may also rewrite the code using a hashtable versus a | Where-Object search on a PSObject, just to do some measuring. I didn't want to spend time rewriting code if there was something blatantly simple I was missing. Thanks for the prompt response.
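    For the measuring part, something like this is a quick way to see the gap — the sample records below are made up stand-ins for the real AD/DFS objects, and Measure-Command ships with PowerShell:

    ```powershell
    # Build 25,000 fake records (stand-ins for the real AD objects)
    $records = 1..25000 | ForEach-Object {
        [pscustomobject]@{ SamAccountName = "user$_"; Path = "\\dfs\share$_" }
    }

    # Index the collection once by SamAccountName
    $index = @{}
    foreach ($r in $records) { $index[$r.SamAccountName] = $r }

    $target = 'user24999'

    # Linear scan: Where-Object touches every record per lookup
    (Measure-Command {
        $records | Where-Object { $_.SamAccountName -eq $target }
    }).TotalMilliseconds

    # Hashtable: single keyed lookup, independent of collection size
    (Measure-Command {
        $index[$target]
    }).TotalMilliseconds
    ```

    The difference compounds quickly, since the script does one lookup per AD user.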

  • #11228

    David Zemdegs
    Participant

    I don't know if this will help, but I had an issue where I had very large CSV files (tens of thousands of records) that I needed to "join". Key lookup methods were too time-consuming, so I wrote this:
    http://gallery.technet.microsoft.com/Join-CSV-A-simple-SQL-c2f92b60

    Cheers

    David Z

  • #11287

    Dave Wyatt
    Moderator

    [quote=11228]I dont know if this will help but I had an issue where I had very large csv files (tens of thousands of records) that I needed to “join”. Key lookup methods were too time consuming so I wrote this:
    http://gallery.technet.microsoft.com/Join-CSV-A-simple-SQL-c2f92b60
    Cheers
    David Z
    [/quote]

    Your script is very similar to the algorithm used by Compare-Object (sorting the two collections and enumerating them together). It still holds everything in memory while that's being done, though. For really large datasets (ones that would exceed PowerShell's maximum memory limits), it's more of a pain.
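    For reference, that sort-then-enumerate (merge-join) idea can be sketched with two small sample sets — the property names here are invented for illustration, and both inputs must already be sorted by the join key:

    ```powershell
    # Two sample collections, each already sorted by SamAccountName (the join key)
    $left  = 'amy','bob','carl','dana' | ForEach-Object { [pscustomobject]@{ SamAccountName = $_; Dept = 'IT' } }
    $right = 'bob','dana','earl'       | ForEach-Object { [pscustomobject]@{ SamAccountName = $_; Path = "\\dfs\$_" } }

    $i = 0; $j = 0
    $joined = while ($i -lt $left.Count -and $j -lt $right.Count) {
        $cmp = [string]::Compare($left[$i].SamAccountName, $right[$j].SamAccountName)
        if     ($cmp -lt 0) { $i++ }   # left key is behind; advance left
        elseif ($cmp -gt 0) { $j++ }   # right key is behind; advance right
        else {
            # Keys match: emit a merged record, advance both sides
            [pscustomobject]@{
                SamAccountName = $left[$i].SamAccountName
                Dept           = $left[$i].Dept
                Path           = $right[$j].Path
            }
            $i++; $j++
        }
    }
    $joined
    ```

    Each input is walked exactly once after sorting, which is why it beats repeated Where-Object scans — but as noted, both collections still have to fit in memory.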

You must be logged in to reply to this topic.