October 8, 2015 at 1:42 pm

I have a file (WARC data) that contains a mix of string and binary data. I have to modify some of the strings without affecting the binary parts. I tried this using Get-Content, looping through each line, searching for the string I needed to edit, and using Add-Content to write the result to a file. This appears to corrupt the file (I'm not particularly surprised by this).
Going deeper, the edits I need to make are like this:
– Locate the pattern
– Find the "." on that line (before the next CRLF)
– Replace the "." and all characters up to the next CRLF with "Z"

Is there a way to approach this in PowerShell? Note that some of the files will be multi-GB in size, should it matter.

Here's my current attempt (it includes a borrowed converter; my code starts after "#######"):

filter ConvertTo-String
{
    [OutputType([String])]
    Param (
        [Parameter( Mandatory = $True,
                    Position = 0,
                    ValueFromPipeline = $True )]
        [ValidateScript( { -not (Test-Path $_ -PathType Container) } )]
        [String]
        $Path
    )

    $Stream = New-Object IO.FileStream -ArgumentList (Resolve-Path $Path), 'Open', 'Read'

    # Note: Codepage 28591 returns a 1-to-1 char to byte mapping
    $Encoding = [Text.Encoding]::GetEncoding(28591)
    
    $StreamReader = New-Object IO.StreamReader -ArgumentList $Stream, $Encoding

    $BinaryText = $StreamReader.ReadToEnd()

    $StreamReader.Close()
    $Stream.Close()

    Write-Output $BinaryText
}
############### my code starts here

$BinaryString = ConvertTo-String D:\Acc\hanzo2\WARCfiles\warca.warc
$BinaryString
#$DateRegex = [Regex] '\x57\x41\x52\x43\x2d\x44\x61\x74\x65\x3a.*\.' 
$DateRegex = [Regex] 'WARC-Date:.*\.' #matches up to first dot

$DateRegex.Matches($BinaryString)| foreach{
    $curindex = $_.index
    $curlen = $_.length 
    $curdatestr = $BinaryString.Substring($curindex,$curlen-1) 
    $curdot = $BinaryString.IndexOf(".",$curindex)
    $cureol = $BinaryString.IndexOf("`r`n",$curindex)
    $lengthdif = $curEOL - $curdot -1 
    $newdatestr = $curdatestr + "Z" + (" " * $lengthdif)
    
    $Binarystring.Replace($BinaryString,$newdatestr,1,$curindex)
        }
    

Thanks for looking

\\Greg

October 8, 2015 at 3:31 pm

How big is the file? Get-Content reads everything into memory. If the file is rather big, you're probably better off using an approach other than Get-Content. Add-Content is the wrong cmdlet for your use case; you were probably looking for Set-Content 🙂
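
For a file that does fit in memory, one possible shape for a byte-preserving edit is sketched below. It is only a sketch under assumptions: the function name Edit-WarcDate and its parameters are made up for illustration, the regex assumes the pattern is a line beginning with "WARC-Date:", and both paths should be full paths because the .NET file methods do not follow the PowerShell location. The idea is to read raw bytes, decode them with code page 28591 so each byte becomes exactly one character, run a single regex replace, and encode back with the same code page.

```powershell
# Hypothetical helper (not from the thread): edits WARC-Date lines in memory.
function Edit-WarcDate {
    param(
        [Parameter(Mandatory = $true)] [string] $Path,         # full path to the source file
        [Parameter(Mandatory = $true)] [string] $Destination   # full path for the edited copy
    )

    # Code page 28591 (ISO-8859-1) maps every byte 0-255 to exactly one char,
    # so decoding and re-encoding leaves the binary parts byte-for-byte intact.
    $latin1 = [Text.Encoding]::GetEncoding(28591)

    # Read raw bytes and decode explicitly; nothing is split into lines.
    $text = $latin1.GetString([IO.File]::ReadAllBytes($Path))

    # Assumed pattern: on a line starting with "WARC-Date:", replace the first "."
    # and everything after it, up to (not including) the CRLF, with "Z".
    $pattern = '(?m)^(WARC-Date:[^.\r\n]*)\.[^\r\n]*(?=\r\n)'
    $edited  = [regex]::Replace($text, $pattern, '${1}Z')

    # Encode with the same code page and write raw bytes (no BOM, no added newline).
    [IO.File]::WriteAllBytes($Destination, $latin1.GetBytes($edited))
}
```

Usage would look something like `Edit-WarcDate -Path D:\Acc\hanzo2\WARCfiles\warca.warc -Destination D:\Acc\hanzo2\WARCfiles\warca-edited.warc` (the destination name is invented). Note that this does not pad the line with spaces, so everything after each edit shifts to an earlier byte offset, and it will not cope with multi-GB files because the whole file is held as one byte array and one string.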

October 8, 2015 at 3:36 pm

Thanks, Sebastian. I was using Add-Content since I was writing it out line by line. Is that wrong?
I added some code above with my newer attempt.
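
On the line-by-line question: Get-Content decodes the bytes as text and drops the line terminators, and Add-Content writes its own terminators back, so binary sections are unlikely to survive the round trip. For the multi-GB files, a chunked variant might look like the sketch below. This is an assumption-laden illustration, not a tested solution for real WARC data: Edit-WarcDateStreamed, its parameters, and the 4MB default are invented, and the regex assumes the pattern is a line beginning with "WARC-Date:". Each chunk is decoded with code page 28591, only the text up to the last line feed is edited and written, and the remainder is carried into the next chunk so a date line is never split across chunks.

```powershell
# Hypothetical helper (not from the thread): chunked, byte-preserving edit.
function Edit-WarcDateStreamed {
    param(
        [Parameter(Mandatory = $true)] [string] $Path,         # full path to the source file
        [Parameter(Mandatory = $true)] [string] $Destination,  # full path for the edited copy
        [int] $ChunkSize = 4MB                                 # bytes read per iteration
    )

    # Code page 28591 gives a 1-to-1 byte-to-char mapping, so binary data survives.
    $latin1  = [Text.Encoding]::GetEncoding(28591)
    # Assumed pattern: first "." on a WARC-Date line through to just before the CRLF.
    $pattern = [regex] '(?m)^(WARC-Date:[^.\r\n]*)\.[^\r\n]*(?=\r\n)'

    $in = $null
    $out = $null
    try {
        $in     = [IO.File]::OpenRead($Path)
        $out    = [IO.File]::Create($Destination)
        $buffer = New-Object byte[] $ChunkSize
        $carry  = ''   # text after the last line feed, held back for the next chunk

        while (($read = $in.Read($buffer, 0, $buffer.Length)) -gt 0) {
            $text  = $carry + $latin1.GetString($buffer, 0, $read)
            $cut   = $text.LastIndexOf("`n") + 1   # 0 when the text has no line feed
            $carry = $text.Substring($cut)

            if ($cut -gt 0) {
                # Everything up to $cut ends on a line feed, so whole lines are edited.
                $edited = $pattern.Replace($text.Substring(0, $cut), '${1}Z')
                $bytes  = $latin1.GetBytes($edited)
                $out.Write($bytes, 0, $bytes.Length)
            }
        }

        # Whatever follows the final line feed is written back unchanged.
        if ($carry.Length -gt 0) {
            $bytes = $latin1.GetBytes($carry)
            $out.Write($bytes, 0, $bytes.Length)
        }
    }
    finally {
        if ($in)  { $in.Dispose() }
        if ($out) { $out.Dispose() }
    }
}
```

One caveat with this design: a long binary stretch that contains no line feed byte accumulates in the carry-over string until one appears, so memory use depends on the data as well as on the chunk size.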