Search
Generic filters
Exact matches only
Filter by Custom Post Type

Executing LINQ Queries in PowerShell - Part 2

And we're back!

Ok, so in the last blog we began a conversation about delegates and using LINQ in PowerShell. In today's post, I'm going to give an example of how it can be incredibly useful. Let's talk about Joins.

Joins

In my line of work, I'm constantly running into the need to combine datasets from multiple sources that relate to each other and pull out some specific properties. Say you have two internal services, one which is used to track production status and another which is used to monitor whether machines are online. To demonstrate this, let's initialize some mock data once again.

#Create empty arrays
$DatasetA = @()
$DatasetB = @()
#Initialize "status" arrays to pull random values from
$ProductionStatusArray = @('In Production','Retired')
$PowerStatusArray = @('Online','Offline')
#Loop 1000 times to populate our separate datasets
1..1000 | Foreach-Object {
    #Create one object with the current iteration attached to the name property
    #and a random power status
    $PropA = @{
        Name = "Server$_"
        PowerStatus = $PowerStatusArray[(Get-Random -Minimum 0 -Maximum 2)]
    }
    $DatasetA += New-Object -Type PSObject -Property $PropA
    #Create a second object with the same name and a random production status
    $PropB = @{
        Name = "Server$_"
        ProductionStatus = $ProductionStatusArray[(Get-Random -Minimum 0 -Maximum 2)]
    }
    $DatasetB += New-Object -Type PSObject -Property $PropB
}

Now we have two datasets with the same server names, one showing production status and the other showing power status. Our goal is to join that data together. In traditional PowerShell, we would likely iterate through one of the sets while doing a filter on the second set and then either add property members to the first set or create all new objects with a combination of properties from both sets. Something like this:

$JoinedData = @()
foreach($ServerA in $DatasetA) {
    $ServerB = $DatasetB | Where-Object Name -eq $ServerA.Name
    $Props = @{
        Name = $ServerA.Name
        PowerStatus = $ServerA.PowerStatus
        ProductionStatus = $ServerB.ProductionStatus
    }
    $JoinedData += New-Object -Type PSObject -Property $Props
}

This works fine. If I wrap it in a Measure-Command it takes right around 8.82 seconds to complete. Not awful, but at enterprise level where you're dealing with ten times that amount of data, you can see how that run time could get out of control. Now let's do the same with LINQ:

$LinqJoinedData = [System.Linq.Enumerable]::Join(
    $DatasetA, 
    $DatasetB, 
    [System.Func[Object,string]] {param ($x);$x.Name},
    [System.Func[Object,string]]{param ($y);$y.Name},
    [System.Func[Object,Object,Object]]{
        param ($x,$y); 
        New-Object -TypeName PSObject -Property @{
        Name = $x.Name; 
        PowerStatus = $x.PowerStatus; 
        ProductionStatus = $y.ProductionStatus}
    }
)
$OutputArray = [System.Linq.Enumerable]::ToArray($LinqJoinedData)

This completed for me in just over 0.4 seconds! Hopefully after last week this syntax doesn't look too daunting, but let's walk through what we just did. We're calling the Join method on System.Linq.Enumerable and then passing it five parameters.

  1. The first dataset we're going to join
  2. The second dataset to join
  3. The delegate which defines the key to compare against on the first dataset
  4. The delegate which defines the key to compare against on the second dataset
  5. Finally, we pass in the delegate which defines what the output should look like

So it looks complicated, but once you use it a few times, it's really not too bad. Now you're probably wondering why I added that final line where I called "[System.Linq.Enumerable]::ToArray($LinqJoinedData)." For that we need to talk about "Deferred Execution vs. Immediate Execution." When you call the Join method, it's not actually joining the data at that time, rather it's building an expression tree which defines the relational algebra needed to perform the join. This defers the execution point to when the data is actually operated against. So in the above example, I called "ToArray()" merely to provide an accurate timespan for how long the join actually takes as opposed to the more traditional PowerShell approach we used before it. If this were production code and I wanted to see  machines with an offline status that are listed as in production, rather than that "ToArray()" line I could simply run this:

$LinqJoinedData.Where({($_.PowerStatus -eq "Offline") -and ($_.ProductionStatus -eq "In Production")})

The Join query would execute at that time and then "Where()" would filter down to just the objects I requested.

And there you have it! If you found this interesting, I encourage you to check out these modules:

Feel free to reach out to me on Twitter or check out my personal site from time to time for other content. If you've seen my recent talk at PowerShell Summit, I'll be posting the blog I referenced there soon about turning my dog into a tea kettle.  (it's not PowerShell related, thus it will be landing somewhere other than here)

Happy tinkering!

-Eli

Executing LINQ Queries in PowerShell - Part 1

Greetings PowerShellers!

Lately, I've been itching to write something up on Microsoft’s Language-Integrated Query (LINQ). You've likely encountered it if you've done any development in C#. LINQ is an incredibly powerful querying tool for performing look-ups, joins, ordering, and other common tasks on large data sets. We have a few similar cmdlets built into PowerShell, but other than the '.Where()' method on collection objects nothing that comes close to the speed at which LINQ operates.

To dig into this topic, we're going to have to do a quick high level overview of a couple of other .NET staples often encountered in the C# world. You see, unlike most .NET methods which accept object types like integers, strings, and the like, LINQ uses static extension methods which only accept delegate object types.

What are delegates? In application development, there is an occasional need for objects within memory to communicate with each other for things such as "button click events." To address this, the Windows API uses function pointers to create callback functions which then report back to other functions in your applications. Within the .Net Framework, these are called delegates.

Delegates are objects that point to another method, or possibly many methods, by storing three key pieces of information: the address of the method on which it makes calls, the parameters (if any) of this method, and the return type (if any) of this method. With this information, a delegate object is able to invoke these methods dynamically at runtime, either synchronously or asynchronously. With this information, a delegate object is able to invoke these methods dynamically at runtime, either synchronously or asynchronously.

A simple example of this in C# looks like this:

using System;
namespace SimpleDelegate
{
    //Delegate declaration
    public delegate void PrintMessage(string msg);
    // Create a class with the method to bind to the delegate
    public class MessagePrinter
    {
        public static void PrintLine(string msg)
        { 
            Console.WriteLine(msg); 
        }
    }
    class Program
    {
        static void Main(string[] args)
        {
            // Create a PrintMessage delegate object that
            // "points to" MessagePrinter.PrintLine().
            PrintMessage p = new PrintMessage(MessagePrinter.PrintLine);
            p("Hi Animatronio!");
            Console.ReadLine();
        }
    }
}

Clearly, in this example the use of delegates is not necessary. I'm just trying to frame up how they would be declared and subsequently called. To simplify all of the above, Microsoft has created two generic delegate definitions. For delegates with no output, we can use Action<> and for delegates with output, we can use Func<>. These two beauties are what give us PowerShellers access to LINQ. Today we're going to use Func<> because we want output. The syntax for doing so looks like this:

[Func[int,int]]$Delegate = { param($i); return $i + 1 }

Let's break this down left to right:

  1. Declare Func<>
  2. Tell it the type of parameter(s) to expect. In this case we're passing a single integer parameter.
  3. Tell it the type of output to produce, again an integer will be returned.
  4. Name the delegate variable.
  5. Define the delegate with a scriptblock. We're just doing a very simple addition step on the parameter and returning the output.

And now we've finally arrived at the meat of this article. Let's initialize a mock dataset with ~2 million objects to play with:

$Dataset = @()
0..1000 | Foreach-Object { $Dataset += (Get-Verb)[(Get-Random -Maximum 98)] }
0..10 | ForEach-Object {$Dataset += $Dataset}

Next we'll measure how long it takes to filter down to only the objects which equal "Get" using Where-Object on three different Windows Server OS's running on the same Azure compute instances:

Measure-Command { ($Dataset | Where-Object Verb -eq "Use") }
# 2008 R2: TotalSeconds : 23.3399981
# 2012 R2: TotalSeconds : 61.7634027
# 2016   : TotalSeconds : 18.0190367

Now let's do the same query using LINQ:

[Func[object,bool]] $Delegate = { param($v); return $v.verb -eq "Use" }
Measure-Command { [Linq.Enumerable]::Where($Dataset,$Delegate) }
# 2008 R2: TotalSeconds : 11.3967464
# 2012 R2: TotalSeconds : 25.6511816
# 2016   : TotalSeconds : 12.8999417

As you can see, in the older operating systems, LINQ is over twice as fast (also, what's the deal with 2012 R2??). In 2016, it's only about 50% faster. But of course, calling '.Where()' directly on the object is still by far the fastest way to filter on a dataset:

Measure-Command { $Dataset.Where( {$_.Verb -eq "Use"}) }
# 2008 R2: TotalSeconds : 5.5102392
# 2012 R2: TotalSeconds : 17.5893828
# 2016   : TotalSeconds : 6.1834444

Initially I had suspected it was translating the scriptblock as an anonymous function and tapping into the LINQ extension method behind the scenes, but Bruce Payette set me straight. According to Bruce, It's using a very low level API to invoke the scriptblock. Source code is here.

So if '.Where()' is so much faster, why did I bother writing this? I wanted to open with a familiar concept. The true power in LINQ comes from its SQL-like ability to aggregate and manipulate data. In the next blog, we'll take a look at grouping data, using joins, and why that's awesome.

Until then, happy tinkering!

-Eli

Skip to toolbar