How to set encoding for TextFieldParser (csv import to SQL)

This topic contains 2 replies, has 2 voices, and was last updated by  Mr Will 1 week, 2 days ago.

  • #81496

    Mr Will
    Participant

    Hi guys

    I had a well-functioning script using Import-Csv, but as the files grew I ran into poor performance and out-of-memory errors. The ~15 different CSV files I'm now constantly importing into SQL usually range from 2 to 300,000 lines, with up to 105 columns each. The files have two rows of header information, AND they use double quotes, but only around fields with embedded delimiters. Why two rows of column names? The first row has the real column name from the originating Pervasive database, and the second row has a language-specific, and far more descriptive, name for that column.

    So what I'm trying to achieve is to get these nightmare-inducing CSV files into a staging table in SQL, where I can work with the data further.

    So now I'm trying to adapt a script by Chrissy LeMaire, but I'm having difficulties with encoding. Import-Csv reads the files correctly, just like Get-Content, but the TextFieldParser seems to read them as UTF-8, when my best guess for the right encoding is Latin-1 (ISO 8859-1). Otherwise it's working quite nicely: performance is at an acceptable level, and the bottleneck now seems to be my Azure SQL database with its DTU limits.
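
    For what it's worth, here's the quick check I used to convince myself the files are Latin-1 rather than UTF-8. This is just a diagnostic sketch, not part of the import script:

    ```powershell
    # Read the raw bytes and decode the first chunk both ways.
    # The Latin-1 version shows the expected accented characters;
    # the UTF-8 version shows replacement characters instead.
    $bytes = [System.IO.File]::ReadAllBytes("C:\temp\million.csv")
    [System.Text.Encoding]::UTF8.GetString($bytes, 0, 200)
    [System.Text.Encoding]::GetEncoding("ISO-8859-1").GetString($bytes, 0, 200)
    ```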

    tl;dr: I would like a way to change the encoding without converting the files. And no, I don't have access to the system producing these CSV files.

    So if you can help me out with the script below (original code by Chrissy), or if you have a better idea, please let me know. Keep in mind, I'm a total n00b when it comes to PowerShell!

    # Database variables
    $sqlserver = "sqlserver"
    $database = "locations"
    $table = "allcountries"
     
    # CSV variables
    $parserfile = "C:\temp\million.csv"
    $parserdelimiter = ","
    $firstRowColumnNames = $false
    $fieldsEnclosedInQuotes = $true
     
    ################### No need to modify anything below ###################
    Write-Host "Script started..."
    $elapsed = [System.Diagnostics.Stopwatch]::StartNew()
    Add-Type -AssemblyName System.Data
    Add-Type -AssemblyName Microsoft.VisualBasic
     
    # 50k worked fastest and kept memory usage to a minimum
    $batchsize = 50000
     
    # Build the sqlbulkcopy connection, and set the timeout to infinite
    $connectionstring = "Data Source=$sqlserver;Integrated Security=true;Initial Catalog=$database;"
    $bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($connectionstring, [System.Data.SqlClient.SqlBulkCopyOptions]::TableLock)
    $bulkcopy.DestinationTableName = $table
    $bulkcopy.BulkCopyTimeout = 0
    $bulkcopy.BatchSize = $batchsize
     
    # Open text parser for the column names
    $columnparser = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser($parserfile)
    $columnparser.TextFieldType = "Delimited"
    $columnparser.HasFieldsEnclosedInQuotes = $fieldsEnclosedInQuotes
    $columnparser.SetDelimiters($parserdelimiter)
     
    Write-Warning "Creating datatable"
    $datatable = New-Object System.Data.DataTable
    foreach ($column in $columnparser.ReadFields()) {[void]$datatable.Columns.Add()} 
    $columnparser.Close(); $columnparser.Dispose()
     
    # Open text parser again from start (there's no reset)
    $parser = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser($parserfile)
    $parser.TextFieldType = "Delimited"
    $parser.HasFieldsEnclosedInQuotes = $fieldsEnclosedInQuotes
    $parser.SetDelimiters($parserdelimiter)
    if ($firstRowColumnNames -eq $true) {$null = $parser.ReadFields()}
     
    Write-Warning "Parsing CSV"
    $i = 0
    while (!$parser.EndOfData) {
        $i++
        try { $null = $datatable.Rows.Add($parser.ReadFields()) }
        catch { Write-Warning "Row $i could not be parsed. Skipped." }

        if (($i % $batchsize) -eq 0) {
            $bulkcopy.WriteToServer($datatable)
            Write-Host "$i rows have been inserted in $($elapsed.Elapsed.ToString())."
            $datatable.Clear()
        }
    }
     
    # Add in all the remaining rows since the last clear
    if($datatable.Rows.Count -gt 0) {
     $bulkcopy.WriteToServer($datatable)
     $datatable.Clear()
    }
     
    Write-Host "Script complete. $i rows have been inserted into the database."
    $totaltime = [math]::Round($elapsed.Elapsed.TotalSeconds,2)
    Write-Host "Total Elapsed Time: $totaltime seconds." -ForegroundColor Green

    # Clean up
    $parser.Close(); $parser.Dispose()
    $bulkcopy.Close(); $bulkcopy.Dispose()
    $datatable.Dispose()

    # Sometimes the garbage collector takes too long to clear the huge datatable.
    [System.GC]::Collect()
    

    Thanks!

    -Will

  • #81512

    Curtis Smith
    Participant

    To change the encoding, it looks like you can just specify it in the TextFieldParser constructor:

    TextFieldParser(String, Encoding) — Initializes a new instance of the TextFieldParser class to parse the specified file with the specified encoding.
    Ref: https://msdn.microsoft.com/en-us/library/microsoft.visualbasic.fileio.textfieldparser(v=vs.110).aspx

    So maybe something like:

    $columnparser = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser($parserfile, [System.Text.Encoding]::GetEncoding(858))

    I've never done this, so this is all theory for me.

    • #81517

      Mr Will
      Participant

      Thank you, Curtis!

      I had been to the page you referenced, but I just couldn't wrap my head around how to use it. With your help I finally got it working. The code page I had to use in the end was 28591 (ISO 8859-1).
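
      In case it helps someone else, the change boiled down to passing the encoding to both parser constructors (same variable names as in my script above):

      ```powershell
      # Code page 28591 = ISO 8859-1 (Latin-1)
      $encoding = [System.Text.Encoding]::GetEncoding(28591)

      # Both parsers now read the file with the correct encoding
      $columnparser = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser($parserfile, $encoding)
      $parser = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser($parserfile, $encoding)
      ```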

      Again, thank you.

      P.S. Blade Runner 2049 was awesome!
