How would you make this faster?

This topic contains 17 replies, has 4 voices, and was last updated by Profile photo of Adam Bertram Adam Bertram 2 years, 1 month ago.

  • Author
    Posts
  • #19963
    Profile photo of Adam Bertram
    Adam Bertram
    Participant

    I have a big list CSV file formatted as follows:

    Firstname,MiddleInitial,LastName

    I want to find each AD user object that matches the each field, if possible. For example, a best case scenario would be an AD user object match that matches Firstname,Lastname and MiddleInitial exactly. The next best scenario would be the first name and last name match exactly. Based on these rules I'm still seeing hundreds of user accounts that don't have an exact match but might have a character or two that's off.

    For example, in the CSV file the first name could be "Joe" but in AD his first name is "Joseph".

    I've added in a fuzzy search using the Levenshtein distance (http://www.codeproject.com/Tips/102192/Levenshtein-Distance-in-Windows-PowerShell). This function finds the number of edits that must happen on string1 before it looks like string2. It's the best way I've found to perform these kind of matches.

    Here's how I'm currently doing this:

    $AllAdUsers = Get-ADUser -Filter { [givenname -like '*'] -and [surname -like '*'] }
    $UserCount = $AllAdUsers.Count
    foreach ($Row in $Csv)
        for [$i=0; $i -lt $UserCount; $i++] {
    			$FnameDistance = Get-LevenshteinDistance -first $Row.FirstName -second $AllAdusers[$i].Givenname -ignoreCase
    			$LnameDistance = Get-LevenshteinDistance -first $Row.LastName -second $AllAdusers[$i].surname -ignoreCase
    			$TotalDistance = $FnameDistance + $LnameDistance
    			if [$i -eq 0] {
    				$LowestDistance = $TotalDistance
    			} elseif [$TotalDistance -lt $LowestDistance] {
    				$LowestDistance = $TotalDistance
    			}
    		}
                   $LowestDistance
    }

    I'm just getting the distance between my CSV field and the AD field, adding them up for each AD object and getting the one with the lowest number. It's terribly slow but I can't think of a better way to do it.

    Any input?

  • #19970
    Profile photo of Dave Wyatt
    Dave Wyatt
    Moderator

    The first thing I would do is move most of this code into C# (at least, if not C++). PowerShell code takes quite a bit longer to execute than its compiled equivalent, and you've got a very slow algorithm here to begin with ( O(n^2) just in the code you've shown, and another O(n^2) inside the Get-LevenshteinDistance function on top of that.)

    Beyond that, the best performance improvements would come from either allowing faster comparisons (calculate levenshtein distance in a faster way), or cutting down on the number of iterations in your loops somehow. You could look into spell checker algorithms, as they have a very similar job to what you're doing here.

  • #19971
    Profile photo of Adam Bertram
    Adam Bertram
    Participant

    I'm no developer, Mr. C++. 🙂 A weeeee bit of C# is what I'm comfortable with. I hadn't looked at the function code yet probably because I just didn't want to look under the covers and get started down that rabbit hole.

    Spell checker algorithms...that's an idea. I'll look into that or, if I'm brave enough crack open the LevenshteinDistance function and see if there's a better way.

  • #19972
    Profile photo of Matt McNabb
    Matt McNabb
    Participant

    Have you timed how long each of the Levenshtein operations takes? There isn't a whole lot outside of that going on in your script that would take all that much time. If you have 10,000 users in AD and your CSV file has 100 users, then the way you are doing it will loop through all 10,000 AD users 100 times for a total of 1 million comparisons. Maybe you could try reversing the loop logic like below but I think the total number of comparisons will be the same:

    $AllAdUsers = Get-ADUser -Filter *
    
    foreach ($User in $AllAdusers)
    {
        for ($i=0; $i -lt $Csv.Count; $i++)
        {
                $FnameDistance = Get-LevenshteinDistance -First $User.Givenname -Second $Csv[$i].Firstname -ignoreCase
                $LnameDistance = Get-LevenshteinDistance -First $User.Surname -Second $Csv[$i].Lastname -ignoreCase
                $TotalDistance = $FnameDistance + $LnameDistance
                
                if ($i -eq 0)
                {
                    $LowestDistance = $TotalDistance
                }
                elseif ($TotalDistance -lt $LowestDistance)
                {
                    $LowestDistance = $TotalDistance
                }
            }
            
            $LowestDistance
    }

    Also, you are doing two Levs in each for loop iteration, maybe you could compare the surnames first, check if that is within a certain distance, and then only run the givenname comparison if it meets that criterion. This could cut down drastically on the total number of Lev comparisons.

  • #19973
    Profile photo of Adam Bertram
    Adam Bertram
    Participant

    I have not timed the Levenshtein functions. However, I might try that if I decide to try to improve upon that function.

    I am already comparing the given name and surname in another instance of this script. The ones that it can't find exact matches on get passed to this fuzzy search method. I do believe if I reversed them it'd still be the same.

    I'm able to get output after 20 minutes or so. It's not the end of the world but more of an academic exercise in Powershell. 🙂

  • #19976
    Profile photo of Simon Wåhlin
    Simon Wåhlin
    Participant

    How about concatenating the FirstName and LastName to FirstNameLastName? That would cut the amount on Levs in half wouldn't it?

    $Name = '{0}{1}' -f $Row.FirstName, $Row.LastName
    $ADname = '{0}{1}' -f $AllAdusers[$i].Givenname, $AllAdusers[$i].surname
    $CurrDistance = Get-LevenshteinDistance -first $Name -second $ADname -ignoreCase
    $LowestDistance = [Math]::Min($CurrDistance,$LowestDistance)
    
  • #19977
    Profile photo of Adam Bertram
    Adam Bertram
    Participant

    Hmm..that's a great idea! I don't see why that wouldn't work.

  • #19978
    Profile photo of Simon Wåhlin
    Simon Wåhlin
    Participant

    Actually I think it was a really bad idea...

    Since the lev-algorithm is a nested loop, the amount of iterations is $first.length * $second.length

    Two comparisons with a length of 5 each will make (5*5)+(5*5) which is 50 iterations.

    Concatenating the strings will get us one compare with 10*10 which is 100 iterations.

    If you instead could implement a limit, say if the cost is greater than X, stop comparing that would speed up the process.

  • #19981
    Profile photo of Adam Bertram
    Adam Bertram
    Participant

    Honestly I haven't even looked at the code of the lev function. I was doing something bad and assuming it was efficient. 🙂 I tried to run that and it did take FOREVER.

  • #19982
    Profile photo of Dave Wyatt
    Dave Wyatt
    Moderator

    If you look up Levenshtein Distance on Wikipedia, you'll find the algorithm that was used in your original sample, but there's also a different one which only uses two rows of the array. In my tests, this seemed to be around 5x faster (though this may depend on string length and other factors.) Here's that other algorithm, converted into PowerShell:

    function Get-LevenshteinDistance
    {
        param([string] $first, [string] $second, [switch] $ignoreCase)
     
        $len1 = $first.length
        $len2 = $second.length
     
        if($len1 -eq 0)
        { return $len2 }
     
        if($len2 -eq 0)
        { return $len1 }
     
        if($ignoreCase -eq $true)
        {
            $first = $first.tolowerinvariant()
            $second = $second.tolowerinvariant()
        }
        
        if ($first -ceq $second) { return 0 }
        
        $v0 = 0..$len1
        $v1 = 0..$len1
    
        for ($i = 0; $i -lt $len2; $i++)
        {
            $v1[0] = $i + 1
    
            for ($j = 0; $j -lt $len1; $j++)
            {
                $cost = if ($first[$j] -ceq $second[$i]) { 0 } else { 1 }
                $v1[$j + 1] = [math]::Min($v1[$j] + 1, [math]::Min($v0[$j + 1] + 1, $v0[$j] + $cost))
            }
    
            $v0,$v1 = $v1,$v0
        }
        
        return $v1[-1]
    }
    
  • #19991
    Profile photo of Adam Bertram
    Adam Bertram
    Participant

    Thanks, Dave! Talk about service! 🙂 I'll run it and let you know what the time difference is.

  • #19996
    Profile photo of Adam Bertram
    Adam Bertram
    Participant

    I just ran it again and got a 60% time reduction. Thanks!

  • #19999
    Profile photo of Dave Wyatt
    Dave Wyatt
    Moderator

    Just for giggles, here it is in C#. See how the speed compares; I'm curious. 🙂

  • #20000
    Profile photo of Dave Wyatt
    Dave Wyatt
    Moderator

    Here are the results I got (total milliseconds) of one thousand iterations comparing 'Kittens' to 'Sitting' with the three implementations:

    325.8029
    144.292
    52.5131

  • #20002
    Profile photo of Dave Wyatt
    Dave Wyatt
    Moderator

    The forum is doing something weird to the closing quotation mark of my here-string; it's supposed to be a single quote, not double.

    Bah, it also removed some code from inside the method. Hang on, I'll repost this as a gist.

  • #20007
    Profile photo of Adam Bertram
    Adam Bertram
    Participant

    Are you bored or something today? 🙂

  • #20017
    Profile photo of Dave Wyatt
    Dave Wyatt
    Moderator

    Not really, it's just an interesting problem. 🙂

  • #20020
    Profile photo of Adam Bertram
    Adam Bertram
    Participant

    I hear you and now I know who to call on for help if I ever come up with this kind of problem in the future. This kind of problem wasn't interesting to me. That's why I admittingly threw it to you. However, I can definitely relate to getting so zoned in to a particular problem I lose track of time. 🙂

You must be logged in to reply to this topic.