'String in Files Search' - More Efficient?

This topic contains 4 replies, has 3 voices, and was last updated by Kevin Osborn 2 years, 1 month ago.

  • Author
    Posts
  • #24547
    Kevin Osborn
    Participant

    I'm kind of new to PowerShell, and as practice with what I've learned I wrote the following script/utility: it searches through all files in a given path whose file names match one or more given partial names, and finds those that contain one or more given strings.

    Here is the script.
    ___________________________________

    Clear-Host
    Clear-Variable -name [a..z]
    
    $inpath = read-host "Enter Search Path"
    $filepart = @{}
    $searchtext = @{}
    
    $another = "Y"
    $i = 0
    while ($another -eq "Y") {
        $filepart[$i] = read-host "Enter Partial File Name"
        if ($filepart[$i] -eq 'XXXX') {$another = 'N'}
        $filepart[$i] = "*" + $filepart[$i] + "*"
        $i++
    }
    $i--
    
    $another = "Y"
    $j = 0
    while ($another -eq "Y") {
        $searchtext[$j] = read-host "Enter Search String"
        if ($searchtext[$j] -eq 'XXXX') {$another = 'N'}
        $j++
    }
    $j--
    
    
    $outlist = .{ 
    for ($x = 0; $x -le $i; $x++) {
        for ($y = 0; $y -le $j; $y++) {
            Get-ChildItem -path $inpath | Where-Object { $_.name -like $filepart[$x]} | Select-String -pattern $searchtext[$y] | group path | select name
        }    
    }
    }
    
    
    $outlist
    

    ___________________________________________________________________________

    The script works fine as is, but the question I have has to do with efficiency.

    The directory I am searching has 7,720 items in it, about 34 MB of data. With only two $filepart entries and two $searchtext entries, it takes about 6-7 minutes to complete. Is there a more efficient way to accomplish the same goal? I was wondering about using two loops: one to get a list of the applicable files in the path, then a second to search only the objects in that list for the search text. But I could not figure out how to get the output of one to pipe into the search of the other and still give a meaningful list.
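    Something along these lines is the two-pass approach I was picturing (rough sketch only, assuming the partial names and search strings were collected into plain arrays instead of the hashtables above; this isn't what I actually ran):

    # Pass 1: gather the files whose names match any of the partial names
    $candidates = foreach ($part in $filepart) {
        Get-ChildItem -Path $inpath | Where-Object { $_.Name -like $part }
    }

    # Pass 2: search only those files; -Pattern accepts an array of strings
    $candidates |
        Select-String -Pattern $searchtext |
        Group-Object Path |
        Select-Object Name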

    Or maybe, 6-7 mins to search that many files with that much data is pretty efficient?

    Any suggestions would be appreciated.

  • #24554
    Don Jones
    Keymaster

    First, don't do a Clear-Variable up-front. There's no need; at the start of the script, you've got nothing to clear, so it's just wasted time.
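    If you want a quick way to see why (just an illustration, not something your script needs): a variable created in a script's own scope is discarded when the script ends, so the next run starts with nothing left over to clear.

    # Illustration only: a variable created in a child scope is gone once
    # that scope ends, so a fresh run has nothing to clear.
    & {
        $leftover = 'only exists inside this scope'
    }
    Get-Variable leftover -ErrorAction SilentlyContinue   # returns nothing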

    I'm not sure there's a hugely more efficient way to go after this. You're searching 68MB of data, and .NET's file I/O isn't the fastest thing in the world. Enumerating files yourself is always slower than letting a cmdlet do the enumeration; here, you're letting Get-ChildItem enumerate, which is probably your fastest option.

  • #24571
    Craig Duff
    Participant

    Things that I can see:

    1) You are actually getting the files with Get-ChildItem once for each file filter and once for each search term. So with 2 file filters and 2 search terms you are going out to the file system 4 times; with 100 file filters and 20 search terms you'd be going out to the file system 2,000 times. Everything else aside, you only need to query the file system once per filter. To do that, place Get-ChildItem before the y loop and store the results in a variable.

    2) You are searching the whole of each file for the search term, but since you are grouping, I assume you don't need to know how many matches there are. One match is sufficient to select the file, so you only need to search until you find the first match. You can do that with Select -First 1. In PowerShell version 2 this wouldn't afford you much of an improvement, but in PowerShell 3+ it will stop searching once the first match is returned. The performance improvement varies with the size of the files and how early in each file the match is found: larger files mean bigger savings, smaller files mean smaller savings, and obviously a match found on line 1 saves a lot more than a match found on line 1000. At any rate, you should see some performance gain, the amount depending on the data at hand.

    3) You can use Select-String to search for more than one term at once. Searching for all the terms at once will help performance. Also, your $outlist may contain duplicate files, and using Select-String to search for all the terms at once will avoid the duplication issue.

    4) You are using Where-Object to apply the file name filter; however, you can use the -Filter parameter of Get-ChildItem instead, which will improve performance.

    Here is an illustration of the techniques I used to test and demonstrate the methods mentioned above. In the last illustration I changed the looping method to a more powershellish way to loop. I've commented the typical response times I got over my test data. Keep in mind that a lot of this performance has to do with your data, and you may get different results. In particular, given a situation with a whole lot of very short files, the extra ForEach-Object loop I started using may cause the performance to go the other way. Measure-Command is your friend.

    $path = 'C:\Users\cduff\Downloads\test\29\src'
    
    $VerbosePreference = 'Continue'
    
    
    Write-Verbose "Search File, Group Method"
    Measure-Command {gi "$path\1.txt" | select-string Lorem | group Path | select name}
    #~100ms
    Write-Verbose "Search File, Find First Method"
    Measure-Command {gi "$path\1.txt" | select-string Lorem | select -First 1 | select path}
    #~2ms
    
    Write-Verbose "Filter Right Method"
    Measure-Command {Get-ChildItem -Path $path | Where-Object {$_.name -like '*search1*'} }
    #~1900ms
    Write-Verbose "Filter Left Method"
    Measure-Command {Get-ChildItem -Path $path -Filter "*search1*" }
    #~19ms
    
    $files = @(
     "*search1*"
     "*search2*"
    )
    
    $terms = @(
     "John"
     "Bob"
    )
    
    Write-Verbose "Loop original"
    Measure-Command {
        for ($x = 0; $x -lt 2; $x++) {
            for ($y = 0; $y -lt 2; $y++) {
                Get-ChildItem -path $path -Filter $files[$x] | 
                Select-String -pattern $terms[$y] | 
                Group path | 
                Select Name
            }    
        }
    }
    #~6900ms
    
    Write-Verbose "Loop modified to only search for files once"
    Measure-Command {
        for ($x = 0; $x -lt 2; $x++) {
            $children = Get-ChildItem -path $path -Filter $files[$x]
            for ($y = 0; $y -lt 2; $y++) {
                $children | 
                Select-String -pattern $terms[$y] | 
                Group path | 
                Select Name
            }    
        }
    }
    #~6500ms
    
    Write-Verbose "Loop modified to only search for terms once"
    Measure-Command {
        for ($x = 0; $x -lt 2; $x++) {
            Get-ChildItem -path $path -Filter $files[$x] |
            Select-String -pattern $terms | 
            Group path | 
            Select Name 
        }
    }
    #~4200ms
    
    Write-Verbose "Loop modified with select first"
    Measure-Command {
        for ($x = 0; $x -lt 2; $x++) {
            Get-ChildItem -path $path -Filter $files[$x] |
            ForEach-Object {
                $_ |
                Select-String -pattern $terms | 
                Select-Object -First 1 |
                Select-Object path
            }
        }
    }
    #~285ms
    
    Write-Verbose "Loop modified change loop construct"
    Measure-Command {
        ForEach ($filter in $files) {
            Get-ChildItem -path $path -Filter $filter |
            ForEach-Object {
                $_ |
                Select-String -pattern $terms | 
                Select-Object -First 1 |
                Select-Object path
            }
        }
    }
    #~285ms
  • #24586
    Kevin Osborn
    Participant

    Thank you both for your help. I'll let you know how it goes.

  • #24600
    Kevin Osborn
    Participant

    So I made the changes as suggested... and it made a whole lot of difference. Running just two fileparts with two search strings took 6 minutes 25 seconds the old way, and 54 seconds the new way. Thanks again for all the help, especially with the "more powershellish" way of coding. Getting to that mindset consistently is, I think, the biggest challenge for me at this time.

    Here is the completed code, so far.

    I hope to add date ranges (there's a rough sketch of that after the code below), and then maybe a GUI interface as well.

    $inpath = read-host "Enter Search Path"
    $filepart = @{}
    $searchtext = @{}

    $i = 0
    for (;;) {
        $filepart[$i] = read-host "Enter Partial File Name"
        if ($filepart[$i] -eq 'XXXX') {break}
        $filepart[$i] = "*" + $filepart[$i] + "*"
        $i++
    }
    $i--
    $filepart = $filepart[0..$i]

    $j = 0
    for (;;) {
        $searchtext[$j] = read-host "Enter Search String"
        if ($searchtext[$j] -eq 'XXXX') {break}
        $j++
    }
    $j--
    $searchtext = $searchtext[0..$j]

    $outlist = .{
        ForEach ($filter in $filepart) {
            Get-ChildItem -path $inpath -Filter $filter |
            ForEach-Object {
                $_ |
                Select-String -pattern $searchtext |
                Select-Object -First 1 |
                Select-Object path
            }
        }
    }

    $outlist
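    For the date ranges, the rough idea I have is to add a LastWriteTime filter ahead of the string search, something like this (untested sketch; the date prompts and variable names are just placeholders I haven't built yet):

    $fromdate = [datetime](read-host "Enter Start Date (e.g. 2015-01-01)")
    $todate   = [datetime](read-host "Enter End Date (e.g. 2015-12-31)")

    $outlist = .{
        ForEach ($filter in $filepart) {
            Get-ChildItem -path $inpath -Filter $filter |
            # keep only files last written inside the requested range
            Where-Object { $_.LastWriteTime -ge $fromdate -and $_.LastWriteTime -le $todate } |
            ForEach-Object {
                $_ |
                Select-String -pattern $searchtext |
                Select-Object -First 1 |
                Select-Object path
            }
        }
    }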
