
September 6, 2018 at 8:40 pm

Hi,

I was just playing with exporting directory listings to CSV and noticed something strange. I hope someone can enlighten me as to why this happens.

Here are two commands which seem to do the same thing:

[pre]Get-ChildItem | Select-Object fullname,length | ConvertTo-Csv | Out-File -FilePath dir-list.csv[/pre]

[pre]Get-ChildItem | Select-Object fullname,length | Export-Csv -Path dir-list2.csv[/pre]

The resulting files look the same in Notepad++ and when 'cat'ed. But when the file sizes are checked the first command always creates a file about twice the size of the second command. I have opened both files in a hex editor and the larger file shows NULL (hex code 00) characters separating every character which accounts for the size difference. Why is this happening?
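The byte-level difference described above can also be checked from PowerShell itself, without a hex editor. This is a minimal sketch: `Get-LeadingBytes` is a hypothetical helper, and the demo file names are placeholders written to the temp folder with explicit encodings so the result does not depend on the PowerShell version.

```powershell
# Hypothetical helper: return the first few bytes of a file as a hex string.
function Get-LeadingBytes {
    param(
        [Parameter(Mandatory = $true)] [string] $Path,
        [int] $Count = 8
    )
    # .NET methods ignore PowerShell's current location, so resolve to a full path first.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $bytes = [System.IO.File]::ReadAllBytes($fullPath)
    $take = [Math]::Min($Count, $bytes.Length)
    if ($take -eq 0) { return '' }
    ($bytes[0..($take - 1)] | ForEach-Object { $_.ToString('X2') }) -join ' '
}

# Demo files (placeholder names) written with explicit encodings.
$utf16File = Join-Path ([System.IO.Path]::GetTempPath()) 'demo-utf16.csv'
$utf8File  = Join-Path ([System.IO.Path]::GetTempPath()) 'demo-utf8.csv'
'"FullName","Length"' | Out-File -FilePath $utf16File -Encoding unicode
'"FullName","Length"' | Out-File -FilePath $utf8File  -Encoding utf8

Get-LeadingBytes -Path $utf16File   # starts with FF FE, then 00 after each ASCII character
Get-LeadingBytes -Path $utf8File
(Get-Item $utf16File).Length, (Get-Item $utf8File).Length
```

The same helper can be pointed at dir-list.csv and dir-list2.csv to compare the two files from the original commands.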

September 6, 2018 at 8:56 pm

Ran a few tests myself, and I can say that it's not the CSV cmdlets causing the difference. The difference comes from the Out-File cmdlet, and it only shows up in Windows PowerShell 5.1, not PS Core (6.1.0 RC1). Unsure about earlier versions, but it looks like long-standing behavior that was changed for PS Core at some point.

Instead, I'd suggest using the Set-Content or Add-Content cmdlet.
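A rough sketch of that suggestion, assuming an explicit `-Encoding UTF8` is wanted (without it, Set-Content in Windows PowerShell 5.1 falls back to the system ANSI code page). The output path and the use of `$PSHOME` as the folder to list are placeholders chosen only so the example runs anywhere.

```powershell
# Placeholder output path; any writable location works.
$outFile = Join-Path ([System.IO.Path]::GetTempPath()) 'dir-list3.csv'

# List a folder, convert to CSV text, and write it with an explicit encoding.
Get-ChildItem -Path $PSHOME -File |
    Select-Object FullName, Length |
    ConvertTo-Csv -NoTypeInformation |
    Set-Content -Path $outFile -Encoding UTF8
```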

September 6, 2018 at 8:57 pm

Use Notepad++ to check the encoding of the files. There you will see the difference. If you'd like both files to use the same encoding, use this:

[pre]Get-ChildItem -Exclude 'dir-list*.csv' | Select-Object fullname,length | ConvertTo-Csv -NoTypeInformation | Out-File -FilePath dir-list.csv -Encoding utf8[/pre]

[pre]Get-ChildItem -Exclude 'dir-list*.csv' | Select-Object fullname,length | Export-Csv -Path dir-list2.csv -Encoding utf8 -NoTypeInformation[/pre]

September 7, 2018 at 10:14 am

I found out what is happening, but not why. I went back to double-check the files in Notepad++ as @olaf-soyk suggested but couldn't see any differences in the content. I did notice that Notepad++ had decided that the files had different encodings:
The smaller file was UTF-8
The larger file was UCS-2 LE BOM

I hadn't come across UCS-2 LE BOM encoding before, but a quick web search showed it to be a 16-bit encoding, as opposed to UTF-8, which uses 8 bits for these characters. I suppose it should have been obvious when I saw the extra empty chars in the hex editor!

Using Out-File with -Encoding utf8 gives files of equivalent size. There are still a few BOM bytes at the beginning of the file, though. Hope this helps someone.
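If the leftover BOM matters, one possible workaround in Windows PowerShell 5.1 is to write the lines through .NET with a BOM-less UTF-8 encoding object. This is a sketch under assumptions: the output path and the listed folder (`$PSHOME`) are placeholders. On PowerShell 6 and later, `-Encoding utf8NoBOM` on the cmdlets should give the same result.

```powershell
# Placeholder output path; any writable location works.
$outFile = Join-Path ([System.IO.Path]::GetTempPath()) 'dir-list-nobom.csv'

# Build the CSV lines first.
$lines = Get-ChildItem -Path $PSHOME -File |
    Select-Object FullName, Length |
    ConvertTo-Csv -NoTypeInformation

# $false tells UTF8Encoding not to emit a byte order mark.
$utf8NoBom = New-Object System.Text.UTF8Encoding $false
[System.IO.File]::WriteAllLines($outFile, [string[]]$lines, $utf8NoBom)
```

A full path is used because .NET file methods resolve relative paths against the process working directory, not PowerShell's current location.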

September 7, 2018 at 11:09 am

BTW: That does not affect the functionality of the files. It takes a little more space and you could save some more "exotic" characters from the Unicode table. But it will work the same as UTF-8 encoded files in common environments. 😉

September 7, 2018 at 12:53 pm

Yep. I had a thread about this a little while ago. At first I thought it was Unix text. PS 5's Out-File (or ">") encodes in what Notepad calls "Unicode", and most other commands output in what Notepad calls "ANSI". Some applications won't like it, like Infoblox (for CSV import).
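For applications that choke on the 16-bit output, one option is to change the session default for Out-File through `$PSDefaultParameterValues` instead of adding `-Encoding` to every call. A minimal sketch follows, with a placeholder demo file name. As far as I know, in Windows PowerShell 5.1 the ">" operator goes through Out-File and so picks up this default as well, but treat that as an assumption and check the result.

```powershell
# Make Out-File default to UTF-8 for the current session.
$PSDefaultParameterValues['Out-File:Encoding'] = 'utf8'

# Placeholder demo file, written without an explicit -Encoding.
$outFile = Join-Path ([System.IO.Path]::GetTempPath()) 'default-encoding-demo.txt'
'hello' | Out-File -FilePath $outFile

# Count NUL bytes; zero means the text was not written as UTF-16.
@([System.IO.File]::ReadAllBytes($outFile) | Where-Object { $_ -eq 0 }).Count
```

Putting the first line in a profile script would make the setting persist across sessions.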