Why Get-Content Ain't Yer Friend

Well, it isn't your enemy, of course, but it's definitely a tricky little beast.

Get-Content is quickly becoming my nemesis, because it's sucking a lot of PowerShell newcomers into its insidious little trap. Actually, the real problem is that most newcomers don't really understand that PowerShell is an object-oriented shell, rather than a text-oriented one; they're trying to treat Get-Content like the old Type command (and why not? type is an alias for Get-Content in PowerShell, isn't it?), and failing.

Worse, PowerShell has just enough under-the-hood smarts to make some things work, but not everything. 

For example, this works to replace all instances of "t" with "x" in the file test.txt, outputting the result to new.txt:

$x = Get-Content test.txt
$x -replace "t","x" | Out-File new.txt

Sadly, this reinforces - for newcomers - the notion that Get-Content is just reading in the text file as a big chunk o' text.

Nope.

You see, in reality, Get-Content reads each line of the file individually, and returns a collection of System.String objects. It "loses" the carriage returns from the file at the same time. But you'd never know that, because when PowerShell displays a collection of strings, it displays them one object per line and inserts carriage returns. So if you do this, it'll look like you're dealing with a big hunk o' text:

$x = Get-Content test.txt
$x

But you're not. $x, in that example, is a collection of objects, not a single string.
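
You can prove it to yourself with a quick check (this assumes test.txt has more than one line, so you get an array back rather than a single string):

$x = Get-Content test.txt
$x.GetType().FullName      # System.Object[] - an array, not one big string
$x.Count                   # how many lines the file contained
$x[0].GetType().FullName   # System.String - each element is a single line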

Never fear - you can make sense of this. First, if you use the -Raw parameter of Get-Content (available in v3+), it does in fact read the entire file as a big ol' string, preserving carriage returns instead of using them to separate the file into single-line string objects. In v2, you can achieve something similar by using Out-String:

$x = Get-Content test.txt | Out-String
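
And the v3-and-later version, using -Raw:

$x = Get-Content test.txt -Raw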

So if you just need to work with a big ol' string, you can. Alternatively, you might find that some operations are quicker when you actually do work line-by-line. For example, asking PowerShell to do a regex replace on a huge string can consume a ton of memory; working with one line at a time is often quicker. Just use a foreach:

ForEach ($line in (Get-Content test.txt)) {
  $line -replace "\d","x" | Out-File new.txt -Append
}

Of course, don't assume it'll be quicker - Measure-Command lets you time different approaches, so you can see which one actually comes out ahead.
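
For example, here's a quick sketch comparing the two approaches above (the output file names here are arbitrary):

Measure-Command {
  (Get-Content test.txt) -replace "\d","x" | Out-File new1.txt
}
Measure-Command {
  ForEach ($line in (Get-Content test.txt)) {
    $line -replace "\d","x" | Out-File new2.txt -Append
  }
}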

You should also consider not using Get-Content at all, especially with very large files. That's because it wants to read the entire file into memory at once, and that can take a lot of memory - not to mention a bit more processor power, swap file space, or whatever else.

Instead, read your file from disk one line at a time, work with each line, and then (if that's your intent) write each line back out to disk. That way, rather than caching the entire file in RAM, you're only ever holding one line at a time.

$file = New-Object System.IO.StreamReader -Arg "test.txt"
while ($null -ne ($line = $file.ReadLine())) {
  # $line has your line; comparing to $null keeps a blank line from ending the loop early
}
$file.Close()

Or at least something like that. Yeah, welcome to the .NET Framework. Other options available in the Framework include reading a text file in chunks - again, to help conserve memory and improve processing speed, without necessarily making you read line-by-line.
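
Just as a rough sketch of that idea (the 4,096-character buffer size here is an arbitrary choice), chunked reading with a StreamReader might look something like this:

$reader = New-Object System.IO.StreamReader -Arg "test.txt"
$buffer = New-Object 'char[]' 4096
while (($count = $reader.ReadBlock($buffer, 0, $buffer.Length)) -gt 0) {
  $chunk = -join $buffer[0..($count - 1)]   # just the characters actually read this pass
  # work with $chunk here
}
$reader.Close()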

Whatever approach you choose, just remember that, by default, Get-Content isn't just reading a stream of text all at once. You'll be getting, and need to be prepared to deal with, a collection of objects. Those will often require that you enumerate them (line by line, in other words) using a foreach construct, and with large files the act of reading the entire file might negatively impact performance and system resources.

Knowing is half the battle!

About the Author

Don Jones

Don Jones is a Windows PowerShell MVP, author of several Windows PowerShell books (and other IT books), Co-founder and President/CEO of PowerShell.org, PowerShell columnist for Microsoft TechNet Magazine, PowerShell educator, and designer/author of several Windows PowerShell courses (including Microsoft's). Power to the shell!

6 Comments

  1. I think that the ability to not cache an entire file in memory at once is the reason Get-Content outputs one line at a time, by default; you can use the pipeline to stream one line at a time, very similar to your example using a StreamReader:

    Get-Content .\SomeFile.txt |
    ForEach-Object { $_ -replace "\d", "x" } |
    Out-File .\SomeNewFile.txt

  2. I know Don has an aversion to filters, but you can replace a simple foreach-object loop with a filter and get much better performance.

    filter num2x { $_ -replace "\d","x" }
    Get-Content test.txt | num2x | add-content new.txt

    The streamreader solution still puts you in a position of doing many trivial disk I/O operations which is a performance killer. For raw performance working in batches (Get-Content with -ReadCount) is still the best way to handle large files because it has the potential to greatly reduce the number of disk I/O operations that will be required to complete the task.
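
    Something like this, as a rough sketch (the 1,000-line batch size is arbitrary):

    Get-Content test.txt -ReadCount 1000 | ForEach-Object {
      $_ -replace "\d","x"    # $_ is an array of up to 1000 lines; -replace runs on each element
    } | Add-Content new.txt
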
    IMHO

  3. One problem I ran into with the streamreader is that, if you hit an empty line before the end of your file, it stops reading.
    However, I found that by checking that the value of (StreamReader).Peek() is greater than -1 (it returns -1 once there's nothing left to read), I could successfully read to the end of the file.
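
    In other words, something like this (just a sketch):

    $file = New-Object System.IO.StreamReader -Arg "test.txt"
    while ($file.Peek() -gt -1) {
      $line = $file.ReadLine()
      # work with $line, blank lines included
    }
    $file.Close()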

  4. I was looking for a way to speed up a process I had to run on a 53,000+ line text file and I came across this article. Once I changed to using StreamReader, my processing went down to 4 minutes. To give you a comparison, a 5000 line test file was taking about 10 minutes to only process one of the regular expression replaces I needed to make. Thank you for this article - you have saved me a tremendous amount of time and processing!

    Below is the original code before optimization:

    $elapsed = [System.Diagnostics.Stopwatch]::StartNew()
    Write-Host "Started at $(get-date)"

    $PriceLine = Get-Content '.\Price2016.txt'

    foreach ($line in $PriceLine)
    {
        $PriceLine = $PriceLine -replace '(\d)$','$1"'
        $PriceLine = $PriceLine -replace '\",\$','","'
        $PriceLine = $PriceLine -replace '\,\$','","'
        $PriceLine = $PriceLine -replace '0\.00','0'
    }

    $PriceLine| Out-File '.\GD2016.txt'

    Write-Host "Ended at $(Get-Date)"
    Write-Host "Total Elapsed Time $($elapsed.Elapsed.ToString())"

    Here is the new code after optimization:

    $elapsed = [System.Diagnostics.Stopwatch]::StartNew()
    Write-Host "Pass #1 - Started at $(get-date)"

    $file = New-Object System.IO.StreamReader -Arg "Price2016.txt"
    while ($PriceLine = $file.ReadLine())
    {
        $PriceLine = $PriceLine -replace '(\d)$','$1"'
        $PriceLine = $PriceLine -replace '\",\$','","'
        $PriceLine = $PriceLine -replace '\,\$','","'
        $PriceLine = $PriceLine -replace '0\.00','0'
        $PriceLine | Out-File '.\GD2016.txt' -Append
    }

    $file.Close()

    Write-Host "Ended at $(Get-Date)"
    Write-Host "Total Elapsed Time $($elapsed.Elapsed.ToString())"