Remove duplicates from csv taking too long


    • #218391
      Participant
      Topics: 1
      Replies: 7
      Points: 26
      Rank: Member

      Hi

      I am trying to remove duplicates from a CSV file. The input file is about 5 MB and contains 1.5 lakh records (rows). I am using:

      $data = Import-Csv -Path C:\Users\biswajeet.kumar\Documents\test\a.csv
      $data | Group-Object SourceIPAddress | ForEach-Object { $_.Group | Select-Object -First 1 } |
          Export-Csv "C:\Users\biswajeet.kumar\Documents\test\report2.csv" -NoTypeInformation -Append

      But this is taking approx. 5 minutes to complete, and I have to do this for hundreds of CSVs. I cannot sort the data. Is there any other way to speed up this process?

      • This topic was modified 1 month, 1 week ago by biswajeet06.
    • #218406
      Participant
      Topics: 4
      Replies: 2231
      Points: 5,414
      Helping Hand
      Rank: Community MVP

      Biswajeet, welcome to PowerShell.org. Please take a moment to read the very first post at the top of this forum: Read Me Before Posting! You’ll be Glad You Did!.

      When you post code, error messages, sample data, or console output, please format it as code using the “PRE” code tags. Thanks in advance.

      When you crosspost the same question to different forums at the same time, you should at least post links to the other forums along with your question, so that people willing to help you do not do the same work twice or more.

      https://social.technet.microsoft.com/Forums/en-US/e731fbe1-0005-4f44-aece-d411387e2f56/powershell-script-taking-too-much-time?forum=ITCG

      The code you posted here looks different from the code you posted there!?

      As file system operations are usually among the most “expensive” things you can do, you should avoid writing to a file again and again. It’s better to collect the results in a variable and write them in one big chunk … like this:

      $data = Import-Csv -Path C:\Users\biswajeet.kumar\Documents\test\a.csv
      $Result = $data | 
          Group-Object SourceIPAddress | 
              Foreach-Object {
                  $_.Group | 
                      Select-Object -First 1
                  }
      Export-Csv -InputObject $Result -Path 'C:\Users\biswajeet.kumar\Documents\test\report2.csv' -NoTypeInformation -Append
    • #218415
      Participant
      Topics: 1
      Replies: 7
      Points: 26
      Rank: Member

      Thanks Olaf,
      that’s great advice. I will use it in my later scripts as well.
      Coming back to the script: it does not reduce the time at all, and reducing the time is what I am looking for.
      It also gave me output like Count, Length, LongLength, something like that…

      • This reply was modified 1 month, 1 week ago by biswajeet06.
    • #218421
      Participant
      Topics: 4
      Replies: 2231
      Points: 5,414
      Helping Hand
      Rank: Community MVP

      OK. Could you please post a few sanitized but still representative lines of your CSV file? Please format it as code as well.

    • #218427
      Participant
      Topics: 1
      Replies: 7
      Points: 26
      Rank: Member
      SourceIPAddress	DestinationProtocol	DestinationPort	TimeStamp
      172.28.2.2	tcp	88	Sun Aug 11 22:39:08 2019
      10.28.25.23	tcp	88	Sun Aug 11 22:40:08 2019
      172.28.2.2	tcp	88	Sun Mar 11 22:44:08 2020
      10.28.25.18	tcp	88	Sun Mar 11 22:46:08 2020
      10.28.25.23	tcp	88	Sun Mar 11 22:49:08 2020
      

      This is what the input CSV looks like.
      I am sorry if I am making mistakes while posting; I just joined the forum.

      • This reply was modified 1 month, 1 week ago by biswajeet06.
    • #218433
      Participant
      Topics: 4
      Replies: 2231
      Points: 5,414
      Helping Hand
      Rank: Community MVP

      There are no duplicate source IP addresses in your sample data.

    • #218442
      Participant
      Topics: 1
      Replies: 7
      Points: 26
      Rank: Member

      Just edited it. I cannot post the original data (confidential). The original CSV contains more than 1 lakh records, though.

    • #218448
      Participant
      Topics: 4
      Replies: 2231
      Points: 5,414
      Helping Hand
      Rank: Community MVP

      Just edited it. I cannot post the original data (confidential). The original CSV contains more than 1 lakh records, though.

      I did not say to post the original data, and I did not say to post all the data. BTW: what is “1 lakh”???

      Never mind … try this

      $Result = $Data  | 
          Group-Object SourceIPAddress | 
              Foreach-Object {
                  $_.Group | 
                      Select-Object -First 1
                  } 
      $Result.GetEnumerator() | Export-Csv -Path 'D:\sample\ResultOutput.csv' -NoTypeInformation
    • #218451
      Participant
      Topics: 1
      Replies: 7
      Points: 26
      Rank: Member

      I can say it takes 30 seconds less now. Earlier it was 5 minutes; now it is taking 4:30 minutes.

    • #218457
      Participant
      Topics: 4
      Replies: 2231
      Points: 5,414
      Helping Hand
      Rank: Community MVP

      OK, I ask again: what is “1 lakh”???

    • #218463
      Participant
      Topics: 4
      Replies: 2231
      Points: 5,414
      Helping Hand
      Rank: Community MVP

      I can say it takes 30 seconds less now. Earlier it was 5 minutes; now it is taking 4:30 minutes.

      OK. Let’s try to measure where you lose the time …

      $Measure = Measure-Command -Expression {
          $data = Import-Csv -Path C:\Users\biswajeet.kumar\Documents\test\a.csv
      }
      "import data '$($Measure.TotalSeconds)'"
      
      $Measure = Measure-Command -Expression {
          $groupedData = $Data  | Group-Object SourceIPAddress 
      }
      "group Data  '$($Measure.TotalSeconds)'"
      
      $Measure = Measure-Command -Expression {
          $result = $groupedData | 
              Foreach-Object {
                  $_.Group | 
                      Select-Object -First 1
                  } 
      }
      "pick first / create result  '$($Measure.TotalSeconds)'"
      
      $Measure = Measure-Command -Expression {
          $Result.GetEnumerator() | Export-Csv -Path 'D:\sample\ResultOutput.csv' -NoTypeInformation -Force
      }
      "write result  '$($Measure.TotalSeconds)'"

      Please post the output of this code. Thanks

    • #218466
      Participant
      Topics: 1
      Replies: 7
      Points: 26
      Rank: Member

      There are 5 rows in the sample I pasted above. The actual CSV file has more than 1 lakh rows like this, almost 1.3 lakh. And I need to run this script for 100 CSV files like this,
      so I need to reduce the execution time.

    • #218469
      Participant
      Topics: 1
      Replies: 7
      Points: 26
      Rank: Member
      import data '3.4784568'
      group Data  '240.3222332'
      pick first / create result  '13.3179946'
      write result  '0.7016121'

      Here is the output

    • #218472
      Participant
      Topics: 4
      Replies: 2231
      Points: 5,414
      Helping Hand
      Rank: Community MVP

      … the actual CSV file has more than 1 lakh rows like this, almost 1.3 lakh.

      I will ask this only once more: What is “1 lakh”?

      Here is the output

      As you can see, the Group-Object command obviously consumes most of the time. Either you find another, faster solution for this, or you have to live with it.

      You could try to speed up your general task by running some steps in parallel for your 100 files. 😉
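
      For example – a minimal sketch of that idea, assuming PowerShell 7+ (where ForEach-Object supports -Parallel); the input and output folders here are hypothetical:

      # Run the per-file deduplication for several files at once (PowerShell 7+)
      Get-ChildItem -Path 'C:\input' -Filter '*.csv' |
          ForEach-Object -Parallel {
              $outFile = Join-Path 'C:\output' $_.Name
              Import-Csv -Path $_.FullName |
                  Group-Object SourceIPAddress |
                  ForEach-Object { $_.Group | Select-Object -First 1 } |
                  Export-Csv -Path $outFile -NoTypeInformation
          } -ThrottleLimit 4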

    • #218475
      Participant
      Topics: 1
      Replies: 7
      Points: 26
      Rank: Member

      Yes, I will keep searching; if I find nothing, then this is it.
      Thanks for your support, I really appreciate it.

    • #218619
      Participant
      Topics: 4
      Replies: 2231
      Points: 5,414
      Helping Hand
      Rank: Community MVP

      I created some test data myself and tested a little bit. The fastest I got without any tricks was with this code:

      $Measure = Measure-Command -Expression {
          $data = Import-Csv -Path 'C:\Users\biswajeet.kumar\Documents\test\a.csv' |
              Sort-Object -Property SourceIPAddress -Unique
          $data | Export-Csv -Path 'C:\Users\biswajeet.kumar\Documents\test\report2.csv' -NoTypeInformation -Force
      }
      "total seconds '$($Measure.TotalSeconds)'"
    • #219951
      Participant
      Topics: 11
      Replies: 23
      Points: 162
      Rank: Participant

      This is a very interesting post, and I currently have the same but MUCH larger issue, which is why I came searching. I found three ways to get unique values.

      Get-Unique
      Select-Object -Unique
      Sort-Object -Unique

      I have been using Select-Object -Unique, and it is taking me 55 minutes on that single command:

      3321833.7962 milliseconds.

      I’m going to try Sort-Object -Unique, but I was wondering, Olaf, if you tested Get-Unique. I know there are issues with the top two ways of doing this with case sensitivity and such, but this should not be an issue in my data because it is all collected by getting AD objects, which will be 100% the same. Basically, I’m collecting the members of a bunch of groups, adding them together, and running the -Unique command. I will report back on the sort, but I was curious if you tested Get-Unique.

      Also, when using Sort-Object, is it best to define a property to sort on as you are doing? I really do not know how the command works if one is not defined. I currently do not define a property when using Select-Object either.

      EDIT START:
      Switching to

      Sort-Object -Unique

      from

      Select-Object -Unique

      took
      21552.1296 milliseconds! (less than half a minute, down from over 55 minutes)
      I will now test whether it is even faster with a property defined.
      EDIT STOP

      • This reply was modified 1 month, 1 week ago by lawson2312.
    • #219978
      Participant
      Topics: 4
      Replies: 2231
      Points: 5,414
      Helping Hand
      Rank: Community MVP

      Jamie,

      I think that depends pretty much on your source data. For the data we have here – an array of objects with more than one property – “Select-Object -Unique” and “Get-Unique” do not work.
      As I wrote above, I created test data with about 1.5 million elements. When I run the code I posted above, it takes about 55 seconds to read the CSV file, sort the data, remove the duplicates, and write the result to a new CSV file. The number of elements returned is 260 less than in the source data.

      When I change “Sort-Object” to “Select-Object” or “Get-Unique”, the number of elements returned is 1 – more precisely, I get only the first element back.
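
      To illustrate the difference – a minimal sketch with made-up rows:

      $rows = @(
          [pscustomobject]@{ SourceIPAddress = '10.0.0.1'; DestinationPort = 88 }
          [pscustomobject]@{ SourceIPAddress = '10.0.0.2'; DestinationPort = 88 }
          [pscustomobject]@{ SourceIPAddress = '10.0.0.1'; DestinationPort = 443 }
      )

      # Compares only the named property and returns one row per SourceIPAddress.
      # Note: which of the duplicate rows survives is not guaranteed.
      $rows | Sort-Object -Property SourceIPAddress -Unique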

    • #220173
      Participant
      Topics: 3
      Replies: 310
      Points: 1,016
      Helping Hand
      Rank: Community Hero

      You could simply keep track of it yourself.

      Given this source CSV

      SourceIPAddress,DestinationProtocol,DestinationPort,TimeStamp
      10.28.25.18,tcp,88,"Sun Mar 11 22:46:08 2020"
      10.28.25.18,tcp,88,"Sun Mar 11 22:46:08 2020"
      172.28.2.2,tcp,88,"Sun Aug 11 22:39:08 2019"
      10.28.25.23,tcp,88,"Sun Aug 11 22:40:08 2019"
      172.28.2.2,tcp,88,"Sun Aug 11 22:39:08 2019"
      172.28.2.2,tcp,88,"Sun Mar 11 22:44:08 2020"
      10.28.25.18,tcp,88,"Sun Mar 11 22:46:08 2020"
      
      

      Keep track of the property you want to deduplicate on – in this case the source IP:

      $tracker = @()
      $data = Import-Csv C:\temp\source.csv
      $data | ForEach-Object {
          if($_.sourceipaddress -notin $tracker.sourceIP){
              $tracker += @{SourceIP=$_.SourceIPAddress}
              $_
          }
      } | export-csv c:\temp\noduplicates.csv -NoTypeInformation
      
      

      But for speed on large datasets, replace the array @() with an ArrayList.

      $tracker = New-Object System.Collections.ArrayList
      $data = Import-Csv C:\temp\source.csv
      $data | ForEach-Object {
          if($_.sourceipaddress -notin $tracker.sourceIP){
              [void]$tracker.add(@{SourceIP=$_.SourceIPAddress})
              $_
          }
      } | export-csv c:\temp\noduplicates.csv -NoTypeInformation
      
      

      Both create this output

      
      "SourceIPAddress","DestinationProtocol","DestinationPort","TimeStamp"
      "10.28.25.18","tcp","88","Sun Mar 11 22:46:08 2020"
      "172.28.2.2","tcp","88","Sun Aug 11 22:39:08 2019"
      "10.28.25.23","tcp","88","Sun Aug 11 22:40:08 2019"
      
      

      I hope this helps.
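
      If the lookup itself ever becomes the bottleneck – the -notin check still scans every collected key on each row – a HashSet gives constant-time membership tests instead. A minimal sketch, keeping the same hypothetical paths and assuming PowerShell 5+ for the ::new() syntax:

      # A HashSet gives O(1) membership checks instead of scanning an array
      $seen = [System.Collections.Generic.HashSet[string]]::new()
      Import-Csv C:\temp\source.csv | ForEach-Object {
          # Add() returns $true only the first time a key is seen
          if ($seen.Add($_.SourceIPAddress)) {
              $_
          }
      } | Export-Csv c:\temp\noduplicates.csv -NoTypeInformation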

    • #220176
      Participant
      Topics: 3
      Replies: 310
      Points: 1,016
      Helping Hand
      Rank: Community Hero

      I had some fun testing. Given this 10 line source CSV

      SourceIPAddress,DestinationProtocol,DestinationPort,TimeStamp
      10.28.25.14,tcp,88,"Sun Mar 21 12:16:02 2020"
      10.28.25.13,tcp,88,"Sun Mar 11 21:41:03 2020"
      10.28.25.18,tcp,88,"Sun Mar 11 22:46:08 2020"
      10.28.25.18,tcp,88,"Sun Mar 11 22:46:08 2020"
      172.28.2.2,tcp,88,"Sun Aug 11 22:39:08 2019"
      10.28.25.23,tcp,88,"Sun Aug 11 22:40:08 2019"
      172.28.2.2,tcp,88,"Sun Aug 11 22:39:08 2019"
      172.28.2.2,tcp,88,"Sun Mar 11 22:44:08 2020"
      10.28.25.18,tcp,88,"Sun Mar 11 22:46:08 2020"
      10.28.25.23,tcp,88,"Sun Mar 11 22:49:08 2020"
      
      

      I created a 150,000 line CSV

      1..15000 | ForEach-Object { $data | ForEach-Object { $_ } } | Export-Csv c:\temp\largecsv.csv -NoTypeInformation
      
      

      Confirmed it (of course I opened it too):

      import-csv C:\Temp\largecsv.csv | Measure-Object | select @{n='Lines';e={$_.count}}
      
      Lines
      -----
      150000
      
      

      Then ran it through the arraylist tracker

      $Measure = Measure-Command -Expression {
          $tracker = New-Object System.Collections.ArrayList
          Import-Csv C:\temp\largecsv.csv | ForEach-Object {
              if($_.sourceipaddress -notin $tracker.sourceIP){
                  [void]$tracker.add(@{SourceIP=$_.SourceIPAddress})
                  $_
              }
          } | export-csv c:\temp\largenoduplicates.csv -NoTypeInformation
      }
      "total seconds '$($Measure.TotalSeconds)'"
      
      total seconds '8.6225727'
      
      

      8.6 seconds, nice. And the output

      Import-Csv C:\Temp\largenoduplicates.csv
      
      SourceIPAddress DestinationProtocol DestinationPort TimeStamp
      --------------- ------------------- --------------- ---------
      10.28.25.14     tcp                 88              Sun Mar 21 12:16:02 2020
      10.28.25.13     tcp                 88              Sun Mar 11 21:41:03 2020
      10.28.25.18     tcp                 88              Sun Mar 11 22:46:08 2020
      172.28.2.2      tcp                 88              Sun Aug 11 22:39:08 2019
      10.28.25.23     tcp                 88              Sun Aug 11 22:40:08 2019
      
      

      Just to see, ran it up to 1 million. That’s a 50MB CSV!

      total seconds '57.1582381'
      
      Import-Csv C:\Temp\hugenoduplicates.csv
      
      SourceIPAddress DestinationProtocol DestinationPort TimeStamp
      --------------- ------------------- --------------- ---------
      10.28.25.14     tcp                 88              Sun Mar 21 12:16:02 2020
      10.28.25.13     tcp                 88              Sun Mar 11 21:41:03 2020
      10.28.25.18     tcp                 88              Sun Mar 11 22:46:08 2020
      172.28.2.2      tcp                 88              Sun Aug 11 22:39:08 2019
      10.28.25.23     tcp                 88              Sun Aug 11 22:40:08 2019
      
      

      57 seconds for a million, output is still the same.

       

