Replacing null characters in bulk

This topic contains 0 replies, has 1 voice, and was last updated by Forums Archives 5 years, 5 months ago.

    by willsteele at 2012-08-30 16:53:28

    I have a set of files I need to add headers to. This portion of my script is fine. But, I need to find a way to:

    1) read an input file (the file without the header)
    2) add the header
    3) replace a 28-character long string of null characters (0x01) with 1 null character
    4) add the "footer" (the contents of the input file)
    5) write the modified file to disk

    I was going to use the Get-Content | # do some work | Set-Content pattern, but I cannot quite figure out how to both add the header and change the 28-char string to a 1-char string in flight. Any suggestions would be great. Also, note I am doing this inside a foreach loop that is fed by a Get-ChildItem cmdlet. So, it is essentially a directory read and file change, one by one.

    by DonJ at 2012-08-30 17:14:37

    Well...


    Get-Content filename.txt | select -skip 1

    Will skip the header on read. You can then pipe to ForEach-Object and run a -replace...


    ... | foreach { $_ -replace "null","whatever" }

    to get rid of the null. I'd output the header first, on a prior command, to a new file. Then...


    ... | out-file newfile.txt -append

    to get the contents to disk. Or something like that. You can't really get the header crammed into that one-liner though. Not elegantly. Just create it as a first step.


    "my,new,header" | out-file newfile.txt ; get-content filename.txt | select...

    Like that? The pipeline isn't really explicitly designed for a one-off exception like a header row... it wants to just deal with homogeneous objects.
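The two-step approach described above (write the header first, then append the processed lines) can be sketched as follows. This is only a minimal sketch: the temp paths, the sample input, and the header text are hypothetical, and it assumes the run to collapse is 28 NUL (0x00) characters and that ASCII encoding is adequate for the data.

```powershell
# Hypothetical working folder and file names, for illustration only.
$dir    = Join-Path ([IO.Path]::GetTempPath()) 'null-demo'
$null   = New-Item -ItemType Directory -Path $dir -Force
$source = Join-Path $dir 'input.txt'
$target = Join-Path $dir 'output.txt'

# Build a small sample input: one line contains a run of 28 NUL characters.
$run   = "`0" * 28
$lines = 'first line', "before${run}after", 'last line'
Set-Content -Path $source -Value $lines -Encoding Ascii

# Step 1: create the new file with the header only.
Set-Content -Path $target -Value 'my,new,header' -Encoding Ascii

# Step 2: stream the original lines, collapse the 28-NUL run to a single NUL,
# and append the result below the header.
Get-Content -Path $source -Encoding Ascii |
    ForEach-Object { $_ -replace '\x00{28}', "`0" } |
    Add-Content -Path $target -Encoding Ascii
```

Stating the encoding explicitly on every read and write keeps the result from depending on the default encoding, which differs between PowerShell versions.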

    by willsteele at 2012-08-30 17:30:34

    I am worried about losing characters with Get-Content. There are lots of control characters (I use that to mean ASCII characters below 0x20). Does Get-Content/Out-File pose any issues if I don't specify the encoding? The encoding thing always gets me.

    by DonJ at 2012-08-30 17:39:01

    You'll have to try it, frankly. Without knowing exactly what you're looking at, it's kind of impossible to predict. If it won't work, you'll be into low-level .NET stuff to read instead.

    by willsteele at 2012-08-30 17:45:38

    Here's the before (I show it in hex and Textpad to demo what's going on)

    http://www.2shared.com/photo/EgKdzj_v/before.html

    And the after:

    http://www.2shared.com/photo/d40VLn6y/after.html

    It may not be super-clear, but, basically, I tacked on a header (just dummy text), and left 1 of the 28 null characters (0x00).

    by mjolinor at 2012-08-30 19:01:36

    I'd use set-content instead of out-file. Out-* cmdlets seem to add formatting.

    If it's not a really big file,

    (Get-Content filename.txt | select -skip 1) -replace "null","whatever"

    will save you the foreach.

    If it is a really big file, you can split the difference and use -readcount, and work with big chunks of lines at a time.

    FWIW
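A rough sketch of the -ReadCount idea: Get-Content then emits arrays of lines rather than single lines, and -replace is applied to every element of each array. The file names, the chunk size of 4, and the 28-NUL pattern are assumptions chosen to keep the demo small; a real run would likely use a much larger chunk size.

```powershell
# Hypothetical temp files, for illustration only.
$source = Join-Path ([IO.Path]::GetTempPath()) 'readcount-input.txt'
$target = Join-Path ([IO.Path]::GetTempPath()) 'readcount-output.txt'

# Sample data: ten lines, each ending in a run of 28 NUL characters.
$lines = 1..10 | ForEach-Object { "line $_ " + ("`0" * 28) }
Set-Content -Path $source -Value $lines -Encoding Ascii

# -ReadCount 4 sends the lines down the pipeline in chunks of up to 4.
# Inside ForEach-Object, $_ is the whole chunk; -replace runs on each element.
Get-Content -Path $source -Encoding Ascii -ReadCount 4 |
    ForEach-Object { $_ -replace '\x00{28}', "`0" } |
    Set-Content -Path $target -Encoding Ascii
```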

    by willsteele at 2012-08-30 19:03:41

    In this one case I have a sequential string of 60 null characters. And, null characters serve as page breaks in these documents. So, I need to replace only on this one line with 60 null characters, and drop it down to 59 characters. All other nulls need to be left alone.

    by mjolinor at 2012-08-30 19:16:42

    If that's the only line with 60 nulls in it, it should only do a replace on that line. The result of doing the replace on the entire array shouldn't be any different than doing the same replace one line at a time with foreach.

    by poshoholic at 2012-08-30 19:26:02

    When you're really concerned about the content of the file, you might want to use the ReadAll* and WriteAll* static methods on [System.IO.File] instead of Get-/Set-Content.

    Also, if you use ForEach-Object, don't forget that it has a -Begin script block which would be great for returning your new header and a -End script block which would be great for returning your footer to the new file. Those parameters may make working this into a one-liner much easier.
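The -Begin / -End suggestion might look something like the sketch below. The paths, the header, and the trailer text are hypothetical, and the 28-NUL pattern is an assumption carried over from the original question.

```powershell
# Hypothetical temp files, for illustration only.
$source = Join-Path ([IO.Path]::GetTempPath()) 'begin-end-input.txt'
$target = Join-Path ([IO.Path]::GetTempPath()) 'begin-end-output.txt'

# Sample data: the second line contains a run of 28 NUL characters.
$lines = 'alpha', ('x' + ("`0" * 28) + 'y')
Set-Content -Path $source -Value $lines -Encoding Ascii

# -Begin runs once before the first line and emits the header,
# -Process runs per line, and -End runs once after the last line.
Get-Content -Path $source -Encoding Ascii |
    ForEach-Object -Begin { 'my,new,header' } -Process { $_ -replace '\x00{28}', "`0" } -End { 'my,new,trailer' } |
    Set-Content -Path $target -Encoding Ascii
```

Because the header and trailer are emitted into the same pipeline, everything reaches the file through a single Set-Content call with one encoding.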

    by mjolinor at 2012-08-30 19:44:47

    getting the tags to work.....

    [script=powershell]$header = @'
    Header added after file was modified
    '@

    $header | set-content c:\somedir\newfile.txt

    (get-content c:\someotherdir\inputfile.txt) -replace '0x01{27}(0x01)','$1' |
    add-content c]

    by poshoholic at 2012-08-30 19:50:04

    Add powershell (lowercase, no quotes) to the inside of the script tag right after the equals. Like this:

    [script=powershell][/script]

    Also, make sure you use just the script tag or the code tag, not both.

    by mjolinor at 2012-08-30 19:53:46

    Got it. Been a long day......

    by willsteele at 2012-08-30 21:17:13

    [quote="poshoholic"]When you're really concerned about the content of the file, you might want to use the ReadAll* and WriteAll* static methods on [System.IO.File] instead of Get-/Set-Content.

    Also, if you use ForEach-Object, don't forget that it has a -Begin script block which would be great for returning your new header and a -End script block which would be great for returning your footer to the new file. Those parameters may make working this into a one-liner much easier.[/quote]

    The real script is MUCH more complex than what I put here. I wanted to generalize the question to key in more on the $null-character string -replace issue. This thing has custom functions parsing files and processing data. One of the hard parts of working on proprietary stuff... you never get to talk about the work you really do; it's usually sanitized a few times over. This is the only seemingly safe way to talk about it.

    by willsteele at 2012-08-30 21:41:32

    [quote="mjolinor"]getting the tags to work.....

    [script=powershell]$header = @'
    Header added after file was modified
    '@

    $header | set-content c:\somedir\newfile.txt

    (get-content c:\someotherdir\inputfile.txt) -replace '0x01{27}(0x01)','$1' |
    add-content c][/quote]

    I must have not been alert earlier. Let me try this in the AM and see if it works. I figure it will. Thanks.

    by willsteele at 2012-08-31 11:59:19

    Finally figured it out. It wouldn't allow '0x00{27}', only, '`0{27}'. Weirdness. Glad I recalled the escape character! Thanks all.
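A likely explanation for the behavior reported above, as a small sketch: the regex engine reads '0x00{27}' as the literal text "0x0" followed by 27 "0" digits, not as a hex escape. The regex escape for a NUL is \x00, and the PowerShell escape `0 is only expanded inside double quotes, so this assumes the working pattern was written in a double-quoted string. The sample string is made up for the demo.

```powershell
# Sample text: 'A', then 28 NUL characters, then 'B'.
$text = 'A' + ("`0" * 28) + 'B'

# Literal characters "0x0" plus 27 zeros: there is no such text here.
$literalHex   = $text -match '0x00{27}'   # False

# Regex hex escape for NUL, repeated 27 times.
$regexEscape  = $text -match '\x00{27}'   # True

# PowerShell expands `0 to a NUL before the regex engine sees the pattern.
$doubleQuoted = $text -match "`0{27}"     # True

# In single quotes the backtick stays a literal backtick, so nothing matches.
$singleQuoted = $text -match '`0{27}'     # False

# Collapse the 28-NUL run to one NUL by keeping only the captured last one.
$fixed = $text -replace '\x00{27}(\x00)', '$1'
$fixed.Length                              # 3
```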

    by MattG at 2012-09-01 08:42:46

    Hey Will,

    My recommendation would be to read your file in as a byte array and then convert it to a string using an encoder that can represent all 256 byte values. The default .NET encoders (ASCII, UTF7, UTF8, etc.) cannot handle such a request, unfortunately. The Western-European Latin encoder (codepage 1252) will do the trick, though. You can confirm for yourself that each byte encodes and decodes to the same value with the following code snippet:

    [script=powershell][Byte[]] $ByteArray = foreach ($Byte in 0..255) { $Byte }
    $LatinEncoder = [Text.Encoding]::GetEncoding(1252)
    $LatinEncoder.GetBytes($LatinEncoder.GetString($ByteArray))[/script]
    You'll see that for each byte in $ByteArray, the latin decoder decodes each byte back into its original byte. What's the significance of this? Regular expressions only operate on strings. By using the latin encoder you can perform binary regular expressions on your data without it being modified during the string encoding process. Consider your requirement of needing to find/replace 28 \x01 bytes. Let's say you read in a file using `Get-Content -Encoding Byte` into the following byte array:

    [script=powershell][Byte[]] $BinaryByteArray = @(0x30,0x31,0x32,0x34,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x01,0x41,0x42,0x43,0x44)
    # Here's what the string looks like
    $LatinEncoder.GetString($BinaryByteArray)
    # Search for a run of 28 \x01 characters and replace it with 'REPLACED'.
    # ('\x01.{28,28}?' would match one \x01 plus any 28 characters, consuming the 'A' that follows the run.)
    $LatinEncoder.GetString($BinaryByteArray) -replace '\x01{28}', 'REPLACED'[/script]
    What's nice about this technique is that it allows you to perform regular expressions on any binary data. I've used this technique in the past to search for particular assembly language instructions in a compiled binary. It's kind of a hack but since you can't perform regular expressions on byte arrays, this is probably the next best thing.

    I hope this helps!
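Putting the byte-array technique together end to end could look roughly like this. The file names, the header text, and the sample bytes are hypothetical. The sketch also swaps in ISO-8859-1 (Latin-1) in place of codepage 1252, as an assumption on my part, because Latin-1 maps bytes 0 through 255 one-to-one onto the first 256 Unicode code points. The [System.IO.File] static methods are used for the reads and writes so that no line-based processing touches the data.

```powershell
# Hypothetical temp files, for illustration only.
$inPath  = Join-Path ([IO.Path]::GetTempPath()) 'binary-demo-in.bin'
$outPath = Join-Path ([IO.Path]::GetTempPath()) 'binary-demo-out.bin'

# Sample data: "01", a run of 28 0x01 bytes, then "AB".
[byte[]] $bytes = @(0x30, 0x31) + @(0x01) * 28 + @(0x41, 0x42)
[IO.File]::WriteAllBytes($inPath, $bytes)

# Decode the raw bytes to a string with a one-byte-per-character encoding.
$latin1 = [Text.Encoding]::GetEncoding('iso-8859-1')
$text   = $latin1.GetString([IO.File]::ReadAllBytes($inPath))

# Collapse the 28-byte run to a single 0x01 character.
$one  = [string][char]0x01
$text = $text -replace '\x01{28}', $one

# Prepend the header, encode back to bytes, and write the new file.
$header = "HEADER`r`n"
[IO.File]::WriteAllBytes($outPath, $latin1.GetBytes($header + $text))
```

Working on the whole file as one string also means the pattern can span line breaks, which a line-by-line Get-Content pipeline cannot do.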
