Parsing a file with string and binary data

This topic contains 2 replies, has 2 voices, and was last updated by  uSlackr 1 year, 10 months ago.

  • #30573

    uSlackr
    Participant

    I have a file (WARC data) that contains a mix of string and binary data. I need to modify some of the strings without affecting the binary parts. I tried using Get-Content to loop through each line, searching for the string I needed to edit, and using Add-Content to write the result to a file. This appears to corrupt the file (I'm not particularly surprised by this).
    Going deeper, the edits I need to make are like this:
    – Locate the pattern
    – Find the "." on that line (before next CRLF)
    – Replace the "." and all characters to the next CRLF with "Z"

    Is there a way to approach this in PowerShell? Note that some of the files will be multiple GB in size, in case that matters.

    Here's my current attempt. It includes a borrowed converter; my code starts after the "#######" line.

    filter ConvertTo-String
    {
        [OutputType([String])]
        Param (
            [Parameter( Mandatory = $True,
                        Position = 0,
                        ValueFromPipeline = $True )]
            [ValidateScript( { -not (Test-Path $_ -PathType Container) } )]
            [String]
            $Path
        )
    
        $Stream = New-Object IO.FileStream -ArgumentList (Resolve-Path $Path), 'Open', 'Read'
    
        # Note: Codepage 28591 returns a 1-to-1 char to byte mapping
        $Encoding = [Text.Encoding]::GetEncoding(28591)
        
        $StreamReader = New-Object IO.StreamReader -ArgumentList $Stream, $Encoding
    
        $BinaryText = $StreamReader.ReadToEnd()
    
        $StreamReader.Close()
        $Stream.Close()
    
        Write-Output $BinaryText
    }
    ############### my code starts here
    
    $BinaryString = ConvertTo-String D:\Acc\hanzo2\WARCfiles\warca.warc
    $BinaryString
    # Group 1: the date up to (not including) the first dot on the line.
    # Group 2: everything after that dot, up to (not including) the CRLF.
    $DateRegex = [Regex] '(WARC-Date:[^.\r\n]*)\.([^\r\n]*)'
    
    # .NET strings are immutable, so build a new string rather than editing in place.
    # The dot and everything after it become "Z" padded with spaces, so each line
    # keeps its original length.
    $NewString = $DateRegex.Replace($BinaryString, [Text.RegularExpressions.MatchEvaluator] {
        param($Match)
        $Match.Groups[1].Value + 'Z' + (' ' * $Match.Groups[2].Length)
    })
    # $NewString now holds the edited content. Write it back using the same
    # codepage 28591 encoding so the binary parts map back to the same bytes.
        
    

    Thanks for looking

    \\Greg

  • #30577

    Sebastian Neumann
    Participant

    How big is the file? Get-Content holds everything in memory once you collect its output. If the file is rather big, you're probably better off using a different approach than Get-Content. Add-Content is also the wrong cmdlet for your use case; you were probably looking for Set-Content 🙂
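
    One such approach, as a rough sketch rather than a tested solution: because the edit described in the question keeps every line the same length ("Z" plus padding spaces), the file can be patched in place through a FileStream, one chunk at a time, without ever holding the whole file in memory. The function name Update-WarcDateInPlace, its parameters, and the chunk and overlap sizes are made up for illustration. The sketch assumes that no WARC-Date line is longer than the overlap and that padding with spaces is acceptable to whatever reads the file afterwards.

```powershell
function Update-WarcDateInPlace {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [int] $ChunkSize = 1MB,   # bytes read per pass (illustrative default)
        [int] $Overlap   = 4KB    # must exceed the longest WARC-Date line
    )

    # Codepage 28591 maps each byte to exactly one char, so string indexes
    # within a chunk are also byte offsets within that chunk.
    $Encoding = [Text.Encoding]::GetEncoding(28591)

    # Only complete lines match (the trailing CRLF is required), so a line cut
    # off at the end of a chunk is left for the next, overlapping chunk.
    $Regex = [Regex] '(WARC-Date:[^.\r\n]*)\.([^\r\n]*)\r\n'

    $FullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $Stream   = [IO.File]::Open($FullPath, 'Open', 'ReadWrite')
    try {
        $Buffer = New-Object byte[] $ChunkSize
        $Offset = [long] 0
        while ($Offset -lt $Stream.Length) {
            $Stream.Position = $Offset
            $Read = $Stream.Read($Buffer, 0, $ChunkSize)
            if ($Read -le 0) { break }

            $Text = $Encoding.GetString($Buffer, 0, $Read)
            foreach ($m in $Regex.Matches($Text)) {
                # Overwrite from the dot to the end of the line, same length.
                $DotIndex = $m.Groups[1].Index + $m.Groups[1].Length
                $Patch    = $Encoding.GetBytes('Z' + (' ' * $m.Groups[2].Length))
                $Stream.Position = $Offset + $DotIndex
                $Stream.Write($Patch, 0, $Patch.Length)
            }

            if ($Offset + $Read -ge $Stream.Length) { break }
            # Step forward but re-read the tail of this chunk. Lines that were
            # already patched no longer contain a dot, so they do not match again.
            $Offset += [Math]::Max(1, $Read - $Overlap)
        }
    }
    finally {
        $Stream.Dispose()
    }
}
```

    Since this overwrites the file it is given, it would be safer to try it on a copy first. Like the regex in the question, it will also match the text "WARC-Date:" if it happens to occur inside a binary payload.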

  • #30579

    uSlackr
    Participant

    Thanks, Sebastian. I was using Add-Content since I was writing the file out line by line. Is that wrong?
    I added some code above with my newer attempt.
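
    On why a line-by-line text round trip can damage a file like this: Get-Content decodes the bytes as text and splits them on line breaks, and Add-Content encodes each line again and appends its own line terminator, so bytes in the binary parts are not guaranteed to come back unchanged. The small sketch below illustrates only the encoding half of that, on a made-up sample (the bytes are hypothetical, not from a real WARC file). UTF-8 is used purely as an example of an encoding that is not one-to-one with bytes; which encoding the cmdlets actually use depends on the PowerShell version and parameters.

```powershell
# Hypothetical sample: a WARC-style header line followed by bytes that are not valid text.
$Latin1 = [Text.Encoding]::GetEncoding(28591)
[byte[]] $Sample = $Latin1.GetBytes("WARC-Date: 2016-01-01T00:00:00.123`r`n") + [byte[]] (0x00, 0x80, 0xFF, 0x0D, 0xFE)

# Round trip through codepage 28591: every byte maps to one char and back again.
[byte[]] $ViaLatin1 = $Latin1.GetBytes($Latin1.GetString($Sample))

# Round trip through UTF-8: byte sequences that are not valid UTF-8 are replaced
# with U+FFFD while decoding, so the re-encoded bytes differ from the input.
$Utf8 = [Text.Encoding]::UTF8
[byte[]] $ViaUtf8 = $Utf8.GetBytes($Utf8.GetString($Sample))

# Compare the byte sequences by joining them into strings.
$Latin1Intact = ($Sample -join ',') -eq ($ViaLatin1 -join ',')
$Utf8Intact   = ($Sample -join ',') -eq ($ViaUtf8 -join ',')

"28591 round trip intact: $Latin1Intact"
"UTF-8 round trip intact: $Utf8Intact"
```

    This is the property the borrowed converter relies on: as long as the same codepage 28591 encoding is used for both reading and writing, edits to the text parts leave the binary parts byte-for-byte identical.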
