Author Posts

January 20, 2018 at 7:26 pm

I need to evaluate the lines in a file. Each line is made up of loan data like price, credit score, etc etc . I need to filter on these values and if selected write the line to an output file.

The problem is the file has 400,000 + lines.
Each line has about 20+ values to filter on,
And I cannot take more than 60 seconds for this part of the script.
Any ideas how i can make the code faster?

Here is the code:

$File_NotesDownload = "C:\temp\NotesDownload.csv"
$File_NotesGood = "C:\temp\NotesGood.txt"
if (Test-Path $File_NotesGood) { remove-item $File_NotesGood }
$Start = Get-Date

$Notes = Get-Content $File_NotesDownload
foreach($Note in $Notes) 
   {
    $Note = $Note -replace '"',''
    $Loan_Id = ($Note -split ',')[0].trim()
    $Loan_Id = $Loan_Id -as [int]
    if (($Loan_Id -is [int])  -eq $false) {Continue}

    $Loan_Status     = ($Note -split ',')[5].trim()
    $Ask_price       = ($Note -split ',')[6].trim()
    $Markup_Discount = ($Note -split ',')[7].trim()
    $YTM             = ($Note -split ',')[8].trim()
    $DaysSinceLastPayment  = ($Note -split ',')[9].trim()
    $CreditScoreTrend  = ($Note -split ',')[10].trim()
    $FicoEndRange      = ($Note -split ',')[11].trim()
    $NeverLate         = ($Note -split ',')[13].trim()
    $Loan_Class        = ($Note -split ',')[14].trim()
    $Loan_Maturity     = ($Note -split ',')[15].trim()
    $Interest_Rate     = ($Note -split ',')[17].trim()
    $RemainingPayments = ($Note -split ',')[18].trim()

    if ($Loan_Maturity -ne 36) {Continue}
    if (($Interest_Rate -lt 4) -Or ($Interest_Rate -gt 25)) {Continue}
    if ($Ask_price -gt 20) {Continue}
    if ($Markup_Discount -gt -0.01) {Continue}
    if ($YTM -lt 10) {Continue}
    if ($CredScoretrend -ne "UP") {Continue}

    $NotesList += $Note + "`r`n"
    }
$NotesList | out-file $File_NotesGood

$End = Get-Date
$TotalTime = New-Timespan -Start $Start -End $End
$TotalTimeSecs = $TotalTime.seconds
$TotalTimeSecs

January 20, 2018 at 11:30 pm

Notes on this topic —

There are things (commands, functions, API's, modules, constructs, etc.) in any language that are just slow, and then the are things you introduce into your code which make it slow.

The former, means, things like native cmdlet/functions/operations. Loops are just slow by design, especially if you are reading in large data. Some cmdlets are just slow by design, for example the Web cmdlets.
Improvements in the core language occur over time. Take a look at PowerShell Core (PowerShell v6) as an example. PSCore is not a replacement for PowerShell. There are lots of things missing to make it cross platform capable.

The later is allow about you and what you are doing. Large files are going to be slow, period. Loops on large files are going to be slow.
Get-Content loads the entire file into memory (once loaded, then it acts on it) and if you don't have enough, you will run into issues, performance and failures.

If you have large files, don't use Get-Content, use the .Net libraries (StreamReader) instead. There are lots of articles on this topic on the web about how to use the .Net libraries and or Get content to read large files in chunks in order to not tax memory and improve performance.

So, by example...

The Get-Content cmdlet does not perform as well as a StreamReader when dealing with very large files. You can read a file line by line or in a batch using Get-Content or StreamReader. The .Net StreamReader is still faster.

System.IO.File.ReadLines()
It returns all the lines of a file, but lets you begin iterating over the lines immediately which means it does not have to store the entire contents in memory.

https://msdn.microsoft.com/en-us/library/dd383503.aspx

$path = 'C:\A-Very-Large-File.txt'
$r = [IO.File]::OpenText($path)
while ($r.Peek() -ge 0) {
$line = $r.ReadLine()
# Process $line here...
}
$r.Dispose()

# Some performance comparisons:

Measure-Command {Get-Content .\512MB.txt > $null}

Total Seconds: 49.4742533

Measure-Command {
$r = [IO.File]::OpenText('512MB.txt')
while ($r.Peek() -ge 0) {
$r.ReadLine() > $null
}
$r.Dispose()
}

Total Seconds: 27.666803

Using Get-Content leveraging batching

$path = "E:\Documents\Projects\ESPS\Dev\DataFiles\DimProductionOrderOperation"
$infile = "14SEP11_ProdOrderOperations.txt"
$outfile = "PROCESSED_14SEP11_ProdOrderOperations.txt"

$batch = 1000

Here is one of the better articles on the topic.
http://www.happysysadm.com/2014/10/reading-large-text-files-with-powershell.html

January 21, 2018 at 11:42 am

Have a read of some Chrissy LeMaire articles; she has blogged on techniques for increasing performance when working with CSV files.

Specifically, I think the Streaming a 32-bit OleDbconnection section in the second article, which describes performing a SELECT on CSV file using a WHERE clause, will be of use to you. You'll just need to modify it to write to a file instead of a SQL database.

Quickly Find Duplicates in Large CSV Files using PowerShell

High-Performance Techniques for Importing CSV to SQL Server using PowerShell

January 21, 2018 at 11:44 am

Please post a dozen or so lines as representative sample of what a line in this file looks like. Make sure to remove/replace any PII

January 21, 2018 at 1:27 pm

Thanks Postanote,
Your script worked great. Originally I believe I received about several minutes to maybe an hour.
With your script I managed to get it down to 9- 12 seconds !!!!

January 21, 2018 at 1:30 pm

Thanks Matt. Great tip

January 21, 2018 at 8:17 pm

No worries, glad it helped. I have to deal with large log files daily, so I get your pain on this one. Especially when trying to get results to folks on such large files who have low to no patience.

January 28, 2018 at 12:35 am

I have another problem with the speed.
As I add more lines to the $NotesList the script takes longer and longer. I believe .NET is duplicating the entire array in memory, adding the new item, and deleting the old copy in memory https://blogs.technet.microsoft.com/ashleymcglone/2017/07/12/slow-code-top-5-ways-to-make-your-powershell-scripts-run-faster/. Is there a way around it?

$File_Path = "C:\Temp\"
$File_NotesDownload = $File_Path  + "SecondaryMarketAllNotes.csv"
$File_NotesEval     = $File_Path  + "SecNotesEval.txt"

$r = [IO.File]::OpenText($File_NotesDownload)
while ($r.Peek() -ge 0) 
    {
     $Note = $r.ReadLine()
    $NotesList += $Note
    }
$NotesList | out-file $File_NotesEval
$r.Dispose()