Executing LINQ Queries in PowerShell – Part 2

And we're back!
Ok, so in the last blog we began a conversation about delegates and using LINQ in PowerShell. In today's post, I'm going to give an example of how it can be incredibly useful. Let's talk about Joins.


In my line of work, I'm constantly running into the need to combine datasets from multiple sources that relate to each other and pull out some specific properties. Say you have two internal services, one which is used to track production status and another which is used to monitor whether machines are online. To demonstrate this, let's initialize some mock data once again.

#Create empty arrays
$DatasetA = @()
$DatasetB = @()
#Initialize "status" arrays to pull random values from
$ProductionStatusArray = @('In Production','Retired')
$PowerStatusArray = @('Online','Offline')
#Loop 1000 times to populate our separate datasets
1..1000 | Foreach-Object {
    #Create one object with the current iteration attached to the name property
    #and a random power status
    $PropA = @{
        Name = "Server$_"
        PowerStatus = $PowerStatusArray[(Get-Random -Minimum 0 -Maximum 2)]
    $DatasetA += New-Object -Type PSObject -Property $PropA
    #Create a second object with the same name and a random production status
    $PropB = @{
        Name = "Server$_"
        ProductionStatus = $ProductionStatusArray[(Get-Random -Minimum 0 -Maximum 2)]
    $DatasetB += New-Object -Type PSObject -Property $PropB

Now we have two datasets with the same server names, one showing production status and the other showing power status. Our goal is to join that data together. In traditional PowerShell, we would likely iterate through one of the sets while doing a filter on the second set and then either add property members to the first set or create all new objects with a combination of properties from both sets. Something like this:

$JoinedData = @()
foreach($ServerA in $DatasetA) {
    $ServerB = $DatasetB | Where-Object Name -eq $ServerA.Name
    $Props = @{
        Name = $ServerA.Name
        PowerStatus = $ServerA.PowerStatus
        ProductionStatus = $ServerB.ProductionStatus
    $JoinedData += New-Object -Type PSObject -Property $Props

This works fine. If I wrap it in a Measure-Command it takes right around 8.82 seconds to complete. Not awful, but at enterprise level where you're dealing with ten times that amount of data, you can see how that run time could get out of control. Now let's do the same with LINQ:

$LinqJoinedData = [System.Linq.Enumerable]::Join(
    [System.Func[Object,string]] {param ($x);$x.Name},
    [System.Func[Object,string]]{param ($y);$y.Name},
        param ($x,$y);
        New-Object -TypeName PSObject -Property @{
        Name = $x.Name;
        PowerStatus = $x.PowerStatus;
        ProductionStatus = $y.ProductionStatus}
$OutputArray = [System.Linq.Enumerable]::ToArray($LinqJoinedData)

This completed for me in just over 0.4 seconds! Hopefully after last week this syntax doesn't look too daunting, but let's walk through what we just did. We're calling the Join method on System.Linq.Enumerable and then passing it five parameters.

  1. The first dataset we're going to join
  2. The second dataset to join
  3. The delegate which defines the key to compare against on the first dataset
  4. The delegate which defines the key to compare against on the second dataset
  5. Finally, we pass in the delegate which defines what the output should look like

So it looks complicated, but once you use it a few times, it's really not too bad. Now you're probably wondering why I added that final line where I called "[System.Linq.Enumerable]::ToArray($LinqJoinedData)." For that we need to talk about "Deferred Execution vs. Immediate Execution." When you call the Join method, it's not actually joining the data at that time, rather it's building an expression tree which defines the relational algebra needed to perform the join. This defers the execution point to when the data is actually operated against. So in the above example, I called "ToArray()" merely to provide an accurate timespan for how long the join actually takes as opposed to the more traditional PowerShell approach we used before it. If this were production code and I wanted to see  machines with an offline status that are listed as in production, rather than that "ToArray()" line I could simply run this:

$LinqJoinedData.Where({($_.PowerStatus -eq "Offline") -and ($_.ProductionStatus -eq "In Production")})

The Join query would execute at that time and then "Where()" would filter down to just the objects I requested.
And there you have it! If you found this interesting, I encourage you to check out these modules:

Feel free to reach out to me on Twitter or check out my personal site from time to time for other content. If you've seen my recent talk at PowerShell Summit, I'll be posting the blog I referenced there soon about turning my dog into a tea kettle.  (it's not PowerShell related, thus it will be landing somewhere other than here)
Happy tinkering!

7 Responses to " Executing LINQ Queries in PowerShell – Part 2 "

  1. js says:

    An alternative way to join is to make a hash of one of the objects. This took me 61 ms.
    foreach ($ServerB in $DatasetB){
    $hashB[$ServerB.name] = $ServerB
    $JoinedData = $DatasetA | foreach {
    $ServerA = $_
    $MyJoin = $hashB[$ServerA.name]
    Name = $ServerA.Name
    PowerStatus = $ServerA.PowerStatus
    ProductionStatus = $MyJoin.ProductionStatus

    • Eli Hess says:

      Absolutely! If you’re the one in control of how the object is created, then an indexed hash is totally the way to go. My example was meant to be more in line with the scenario where you’re retrieving objects from a REST interface in which you don’t have that kind of control. But great point all the same.

  2. js says:

    These blog posts were interesting, thanks.

  3. Ian Blew says:

    How do you do other types of joins? What if the datasets don’t match?
    Example: Trying to track changes to the file system to see what files were added, changed, or deleted by an install. You may end up with entries that are not in one or the other list, how can we include those in the final join?

    • Eli Hess says:

      So Outer Joins aren’t quite so simple in PowerShell. If you look at the Join-Object module I linked above, you can see how ili went about it using the GroupJoin method. I’ll try to setup a gist that shows the syntax a bit more succinctly when I have a moment.

    • Eli Hess says:

      Morning Ian! Sorry for the delay on this answer, but I finally had a few minutes to boil down the bare minimum needed to perform outer joins: