Remove Duplicate Rows in a CSV

Welcome Forums General PowerShell Q&A Remove Duplicate Rows in a CSV

Viewing 7 reply threads
  • Author
    Posts
    • #212214
      Von
      Participant
      Topics: 3
      Replies: 12
      Points: 25
      Rank: Member

      I’d like to use Powershell to remove some rows and save a delimited text file without changing the file formatting. The file is a list of associates exported from workday. It recently started showing up with some duplicate rows (same employee ID on two separate rows). Until they can fix the problem and send me a file without duplicates, I’m hoping to automate a ‘fix’ on my end by having Powershell:

      1. Find rows with a duplicate employee ID (each new line starts with the employee ID)
      2. For the duplicate employee ID, remove the row where Status = Terminated
      3. For the duplicate employee ID, keep the row where Status = Active
      4. Save the file with a new name (do not change the original file formatting)
      Employee_ID Name Status Action Needed
      11111111 Joe Smith Terminated Remove this row
      11111111 Joe Smith Active Keep this row
      22222222 Jane Smith Terminated Remove this row
      22222222 Jane Smith Active Keep this row
      33333333 Bob Smith Terminated Remove this row
      33333333 Bob Smith Active Keep this row

      Or, if it isn’t possible to do the above without changing the file formatting, then something like:

      1. Find rows with a duplicate employee ID
      2. Remove all rows containing the duplicate employee ID
      3. Save the file with new name (do not change the original file formatting)

       

    • #212217
      Participant
      Topics: 12
      Replies: 1610
      Points: 2,508
      Helping Hand
      Rank: Community Hero

      Take a look at Group-Object:

      $csv = @"
      Employee_ID,Name,Status,Action Needed
      11111111,Joe Smith,Terminated,Remove this row
      11111111,Joe Smith,Active,Keep this row
      22222222,Jane Smith,Terminated,Remove this row
      22222222,Jane Smith,Active,Keep this row
      33333333,Bob Smith,Terminated,Remove this row
      33333333,Bob Smith,Active,Keep this row
      44444444,Frank Adams,Terminated,Keep this row
      55555555,Ralph Dickens,Active,Keep this row
      66666666,Alice White,Active,Keep this row
      "@ | ConvertFrom-Csv
      
      $csvGroup = $csv | Group-Object -Property Employee_id
      
      $newCSV = @()
      $newCSV += ($csvGroup | Where{$_.Count -gt 1}).Group | Where{$_.Status -ne 'Terminated'}
      $newCSV += ($csvGroup | Where{$_.Count -eq 1}).Group
      
      
      $newCSV
      

      Output:

      Employee_ID Name          Status     Action Needed
      ----------- ----          ------     -------------
      11111111    Joe Smith     Active     Keep this row
      22222222    Jane Smith    Active     Keep this row
      33333333    Bob Smith     Active     Keep this row
      44444444    Frank Adams   Terminated Keep this row
      55555555    Ralph Dickens Active     Keep this row
      66666666    Alice White   Active     Keep this row
      
    • #212223
      Von
      Participant
      Topics: 3
      Replies: 12
      Points: 25
      Rank: Member

      Rob, your example works nicely, but when I try to import and export the file, the file formatting ends up totally different. I’m importing a file with around 130k rows using an SSIS job, so Powershell can’t disrupt the original file formatting. Any recommendations with the file import/export? This is what I tried:

      $csv = Import-Csv -Path "C:\PathToFile\file.csv"| ConvertFrom-Csv
      
      $csvGroup = $csv | Group-Object -Property Employee_id
      
      $newCSV = @()
      $newCSV += ($csvGroup | Where{$_.Count -gt 1}).Group | Where{$_.Status -ne 'Terminated'}
      $newCSV += ($csvGroup | Where{$_.Count -eq 1}).Group
      
      
      $newCSV | Out-File "C:\PathToFile\file_modified.csv"
    • #212232
      Participant
      Topics: 12
      Replies: 1610
      Points: 2,508
      Helping Hand
      Rank: Community Hero

      Export it as a CSV:

      $newCSV | Export-CSV -Path "C:\PathToFile\file_modified.csv" -NoTypeinformation
      

      There are methods to remove duplicate in SSIS:

      https://www.mssqltips.com/sqlservertip/3036/removing-duplicates-rows-with-ssis-sort-transformation/

    • #212286
      Von
      Participant
      Topics: 3
      Replies: 12
      Points: 25
      Rank: Member

      Since this is a temporary fix, I didn’t want to modify the SSIS job in production. I was hoping to just insert a Powershell step to clean the file before SSIS runs. When they fix the file and remove the duplicates, I’ll stop using this. Unfortunately when I try Export-CSV, it

      1. gives an error
      2. creates a file, but with messed up formatting

      I tried running this on my local machine and on a separate server, with the same results. Using Posh version 5.1.14409.1018. Error message:

      Export-Csv : Cannot bind argument to parameter 'InputObject' because it is null.
      At line:11 char:11
      + $newCSV | Export-Csv -Path "C:\PathToFile\file ...
      + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      + CategoryInfo : InvalidData: (:) [Export-Csv], ParameterBindingValidationException
      + FullyQualifiedErrorId : ParameterArgumentValidationErrorNullNotAllowed,Microsoft.PowerShell.Commands.ExportCsvCommand

      The original file looks something like:

      Employee_ID,Name,Status
      11111111,Joe Smith,Active
      22222222,Jane Smith,Active
      33333333,Bob Smith,Active
      
      

      The exported file looks something like:

      "@{Employee_ID=11111111; Name=Joe Smith; Status=Active}"
      "@{Employee_ID=22222222; Name=Jane Smith; Status=Active}"
      "@{Employee_ID=33333333; Name=Bob Smith; Status=Active}"
    • #216360
      Von
      Participant
      Topics: 3
      Replies: 12
      Points: 25
      Rank: Member

      I figured out the original file is formatted UNIX UTF-8. Looks like Posh changes the formatting to DOS UTF-8. Is Posh able to modify and save the file without changing the original formatting / encoding?

    • #216363
      Participant
      Topics: 12
      Replies: 1610
      Points: 2,508
      Helping Hand
      Rank: Community Hero

      Encoding can be specified for Import-CSV and Export-CSV:

      -Encoding
      Specifies the encoding for the exported CSV file. The default value is UTF8NoBOM.

      The acceptable values for this parameter are as follows:

      ASCII: Uses the encoding for the ASCII (7-bit) character set.
      BigEndianUnicode: Encodes in UTF-16 format using the big-endian byte order.
      OEM: Uses the default encoding for MS-DOS and console programs.
      Unicode: Encodes in UTF-16 format using the little-endian byte order.
      UTF7: Encodes in UTF-7 format.
      UTF8: Encodes in UTF-8 format.
      UTF8BOM: Encodes in UTF-8 format with Byte Order Mark (BOM)
      UTF8NoBOM: Encodes in UTF-8 format without Byte Order Mark (BOM)
      UTF32: Encodes in UTF-32 format.

    • #216426
      Von
      Participant
      Topics: 3
      Replies: 12
      Points: 25
      Rank: Member

      Thanks for the ideas, Rob. I went through and tried all the encoding options. Looks like I can’t keep the original UNIX file formatting, it always converts the file format to DOS.

Viewing 7 reply threads
  • You must be logged in to reply to this topic.