remove duplicates in csv

This topic contains 12 replies, has 3 voices, and was last updated by Profile photo of Michael Glenn Michael Glenn 1 year, 5 months ago.

  • Author
    Posts
  • #32501
    Profile photo of Michael Glenn
    Michael Glenn
    Participant

    stuid lastname firstname gradelevel status entrydate
    28372 ACEVEDO KEYLIANIS 0 E 8/31/2015
    28166 CAHILL ALLANNA 0 N 8/31/2015
    28166 CAHILL ALLANNA 0 E 9/24/2015

    ive got data in a csv as above. alot more lines of course with a few more duplicates. im trying to drop the duplicates, but keep the one of the two with the most recent "entrydate", and send it to a new csv along with the otheres that arent duplicates.
    having a difficult time figuring out how to go about this.

  • #32502
    Profile photo of Don Jones
    Don Jones
    Keymaster

    Well... hmm. Tough thing here is that, if the dates are different, then they're not actually duplicates as far as the computer is concerned. Do I have that correct?

  • #32503
    Profile photo of Michael Glenn
    Michael Glenn
    Participant

    my bad... the "stuid" would be determining that its a duplicate.... and keeping the stuid(row) with the most recent "entrydate" is what im trying to accomplish

  • #32506
    Profile photo of Don Jones
    Don Jones
    Keymaster

    I might start by passing them to Group-Object, grouping on the "stuid" field. That'll give you a group object for each student. I'd probably then pass them to a ForEach.

    For each one whose group contained one item, I'd just output the item since there was no duplicate. For each one with more than one, sort them on entry date in descending order and output only the first object from the collection. That'll give you the most recent.

    Import-Csv |
    Group-Object -prop stuid |
    ForEach {
    if ($_.Count -eq 1) { $_ }
    else { $_[0] }
    } |
    Export-Csv

    That's completely off the top of my head and will doubtless need some mangling to make it work, but that's the general approach I'd probably start with. Like, I need to see if there's a Count property on the collections output by Group-Object, but I bet there is. I'd need to see if that $_[0] syntax worked, but I bet it would. Or something similar.

  • #32512
    Profile photo of Michael Glenn
    Michael Glenn
    Participant

    ok... ive gotten the following code to ( i think.. lol ) give me the proper results on screen.
    how do i export this to a csv?

    $data = import-csv C:\ELEM-Student.csv
    
    $data | Foreach-Object {$_.entrydate = [DateTime]$_.entrydate; $_} | 
    Group-Object stuid | 
    Foreach-Object {$_.Group | Sort-Object entrydate | Select-Object -Last 1}
    
  • #32513
    Profile photo of Michael Glenn
    Michael Glenn
    Participant

    never mind... i thought it would be a little more difficult then just adding

    "| export-csv c:\filename.csv"

  • #32514
    Profile photo of Michael Glenn
    Michael Glenn
    Participant
    $data = import-csv C:\SapphireAutoTool\ELEM-Student.csv
    
    $data | Foreach-Object {$_.entrydate = [DateTime]$_.entrydate; $_} | 
    Group-Object stuid | 
    Foreach-Object {$_.Group | Sort-Object entrydate | Select-Object -Last 1} | export-csv c:\test1.csv -NoTypeInformation
    

    ok, so this is definitely working correctly for me. by that i mean its giving me the results i would expect, and it seems to work fairly quicky on almost 1800 rows. would this be the best way?

  • #32516
    Profile photo of Curtis Smith
    Curtis Smith
    Participant

    Does this have to be powershell? Seams it might be just a simple to open the CSV in Excel and use the remove duplicates tool.

  • #32517
    Profile photo of Michael Glenn
    Michael Glenn
    Participant

    Curtis,
    this is actually another step i have to add to a much larger powershell script im working on. so yes... it has to be powershell due to the way the data is coming down initially. i have no control over how its coming to me. everything is scripted so far, so im looking to script the "cleanup" of the original csv.

  • #32521
    Profile photo of Curtis Smith
    Curtis Smith
    Participant

    Then I would say what you have is probably the best way, although you might be able to improve it a little by moving your sort-object. Put it before the Group-Object so that you sort the whole list once rather than having to sort multiple times on the individual groups. You can also typecast your property for sort rather than converting it in a foreach loop first.

    $data = import-csv C:\SapphireAutoTool\ELEM-Student.csv
    
    $data | Sort-Object [datetime]entrydate |
    Group-Object stuid | 
    Foreach-Object {$_.Group | Select-Object -Last 1} |
    export-csv c:\test1.csv -NoTypeInformation
    
  • #32522
    Profile photo of Michael Glenn
    Michael Glenn
    Participant

    so by your suggestion, im sorting by the "entrydate" column first, then by using "group-object stuid" im running a foreach against it to get rid of the duplicates?

  • #32524
    Profile photo of Curtis Smith
    Curtis Smith
    Participant

    Correct. You are:
    1) Getting an unsorted list
    2) Sorting the list by entrydate with the latest entries at the bottom
    3) Grouping the sorted list by stuid
    4) Getting the last entry of each group
    5) Exporting the last entry of each group to CSV

  • #32535
    Profile photo of Michael Glenn
    Michael Glenn
    Participant

    thank you curtis, and don. i appreciate the help.

You must be logged in to reply to this topic.