Help importing overlapping date CSVs - comparing whole csv lines

This topic contains 4 replies, has 2 voices, and was last updated by  Pete 3 weeks, 4 days ago.

  • Author
    Posts
  • #80627

    Pete
    Participant

    Hello.

    I was hoping someone my be able to help me. I am trying to import and append csv logs using powershell. These csv logs overlap each other, so there are duplicate lines which i need to remove.

    The problem i have is that when I import csvs into a variable using import-csv i can't seem to evaluate matches based on the whole line and not just a specific property. I need to be able to export in csv after removing the duplicates, i would like to do this without specifying each header title I wish to compare, as i need to do this for many different logs and each have different headers.

    I have tried 'select-object * | sort-object' then 'select-object * -unique' but it doesn't seem to work.

    If I use Get-Content to import the csvs as txt (not import-csv) and then the 'select-object * | sort-object' | select-object -unique' switch after sorting, i can remove line duplicates. However, I don't seem to be able to export this data as csv after using export-csv and the header title of each csv (always on the first line of the each csv) is sorted into the data which is not ideal. With import-csv I can use the -append switch to stop this happening.

    For example (this is a simplified version but illustrates the point):

    CSV1
    Date;Colour;Number
    19/Aug/2016 10:01:01;Yellow;1
    19/Aug/2016 10:01:01;Red;1
    19/Aug/2016 10:01:01;Yellow;2
    19/Aug/2016 10:00:57;Blue;1

    CSV2 (older)
    19/Aug/2016 10:01:01;Yellow;2
    19/Aug/2016 10:00:57;Blue;1
    19/Aug/2016 10:00:21;Blue;1
    19/Aug/2016 10:00:20;Blue;1

    Desired Output (duplicate lines removed – note if I sorted and removed dups on any other of the 3 properties i would not get this result)
    19/Aug/2016 10:01:01;Yellow;1
    19/Aug/2016 10:01:01;Red;1
    19/Aug/2016 10:01:01;Yellow;2
    19/Aug/2016 10:00:57;Blue;1
    19/Aug/2016 10:00:21;Blue;1
    19/Aug/2016 10:00:20;Blue;1

    Can anyone help me do this?

  • #80690

    Jeremy Brun
    Participant

    You were SO close on your own! Try this..

    $CSVContent = Import-Csv .\CSV1.csv
    $CSVContent += Import-Csv .\CSV2.csv
    
    $CSVContent | Sort-Object * -Unique
    

    The asterisk (*) in Sort-Object tells the Cmdlet to use the entire row for sorting then returning unique items.

  • #80716

    Pete
    Participant

    Thanks! Can't believe I spent ages on that!

    As it happened I needed another solution anyway as i found the sort for -unique was messing up the item order of log items that were logged in the same second. This wasn't desired.

    Therefore I used a hashtable which does preserve the order as well – this is the full code in case anyone finds it useful.

    $FilePath = "C:\Users\peter_mitchell\OneDrive\WorkSync\MSA Insurance\Projects\AD Eventlog Auditing\AD EL Forwarding\Event LF 2016\Eventlog PS\ZIP Script\CSV Col PS"
    $filefilter = "RES-WinE-GPO-OBJ-Version-Number-Change-Only*.csv"
    $txtexport = $FilePath + "\csvMerged-GC-Hash.csv"
    
    #THIS SCRIPT LOADS CSV LINES IN A FOLDER SPECIFIED WHICH MATCH A FILTER
    #THEN COLLATES TOGETHER AND REMOVES DUPLICATES
    #NOTE: LINES MUST BE UNIQUE TO BE RETAINED - INCLUDE SECONDS IN EVENT LOGS FOR EXAMPLE
    
    #Sets the current time-date 'Culture' locale for the Powershell session only - useful for export-csv dates in the UK and then importing to Excel. Or for csv exports where you need the seconds.
    #Local to the powershell Window length only
    #$(get-date).ToShortDateString()
    $(Get-Culture).DateTimeFormat.ShortDatePattern = 'dd-MMM-yyyy'
    $(Get-Culture).DateTimeFormat.ShortTimePattern = 'HH:mm:ss'
    #$(get-date).ToShortDateString()
    #can also try ShortTimePattern or LongTimePattern or LongDatePattern. Expect other locale settings in Get-culture may be useful with international servers.  
    
    #Get file list in folder specified matching filter
    $FileListbyLastWrite = Get-ChildItem $FilePath -Filter $filefilter | Where-Object { !$_.PSIsContainer} | Sort-Object -Descending lastwritetime
    
    #Note you could use the following lines instead of the above if the files are formatted in a YYYY-MM-DD format within the csv filename
    #$FileListbyname = Get-ChildItem $FilePath -Filter $filefilter | Where-Object { !$_.PSIsContainer} | Sort-Object -Descending name
    #$FileListbyLastWrite = $FileListbyname
    
    #blanks the varibles we are going to use later - just in case of ISE running
    $imports = $null
    $importsuni = $null
    $headers = $null
    $hashtable = $null
    
    #For each file matching filter, load but miss off the headers.
    Foreach ($File in $FileListbyLastWrite) {
    $imports += get-content $File.fullname | Select-Object -Skip 1}
    
    $hashtable = @{}                 # Define an empty hashtable
    
    #Magic line that only adds the key value to the hash table if it isn't there already
    #Warning: Lines must match exactly, even opening and re-saving csv in Excel can mess formatting up so lines are not exact.
    #If exact lines and the csv last write dates are in the correct order (i.e oldest lastwritedate is oldest csv) then line order is not changed. 
    $importsuni = $imports | %{if($hashtable.$_ -eq $null){$_};$hashtable.$_=1}
    
    #Troubleshooting
    #$imports.count
    #$importsuni.count
    #$hashTable.Count
    
    #This line was put in to sort all the events in date order - it does not work as the text is not reconised as a date
    #Is not uncessary as long as the csv has a lastwritetime in order (i.e the oldest file has the oldest lastwritetime and so on
    #$importsuni = $importsuni | Sort-Object -Descending
    
    #Get only the headers from the last file processed
    $headers = get-content $File.fullname | Select-Object -first 1 
    
    #|The -join command keeps the results on different lines - as opposed to adding the strings together which messes up the formatting
    #https://blogs.technet.microsoft.com/heyscriptingguy/2014/07/15/keep-your-hands-clean-use-powershell-to-glue-strings-together/
    $Export = -join $headers,$importsuni 
    
    #Exports the file using Out-file to the location specified. Encoding in ASCII so it can be opened as a CSV in Excel
    #https://stackoverflow.com/questions/23862028/powershell-out-file-doesnt-work-for-csv
    $Export | Out-File $txtexport -Encoding ascii
    
    • #80861

      Pete
      Participant

      You can also do this for multipul csv locations by subbing in the following lines:

      #Get file list in folder specified matching filter
      $FileListbyLastWrite1 = Get-ChildItem $FilePath -Filter $filefilter | Where-Object { !$_.PSIsContainer} | Sort-Object -Descending lastwritetime
      $FileListbyLastWrite2 = Get-ChildItem $FilePath2 -Filter $filefilter | Where-Object { !$_.PSIsContainer} | Sort-Object -Descending lastwritetime
      
      #Creates an array to deal with multipul results in system object format. -join works if only one file in each folder but messes up the formatting if there is more then one. 
      $FileListbyLastWrite = @($FileListbyLastWrite1.FullName,$FileListbyLastWrite2.FullName)
      
  • #80740

    Jeremy Brun
    Participant

    Great to hear you got a solution working! As a quick aside I've found the .NET HashSet (MSDN Documentation) to work splendid (and fast) for large amounts of data that you want to grab unique values from. Tended to work quicker than Sort-Object -Unique with large data in my cases. Obviously depends on the specifics of what you need to be able to do with the data though. Well done!

You must be logged in to reply to this topic.