LARGE csv issue (7+ GB - ~25 million lines)


    • #26403
      Participant

      I'm trying to quantify corruption and data loss in my ECM. I ran a query in SQL to get the number of items that are SUPPOSED to be there; the query returned 25 million rows, and saved as a csv it becomes almost 8 GB. I think there are multiple rows per file.

      I want to run PowerShell against the csv and get a new csv with only unique values. I then plan to use that csv with PowerShell (Test-Path) to get a list of what is/isn't there so I can try to remediate.

      I have tried gc large.csv | Get-Unique > new.csv, but I ended up with an 11 GB csv.

      I am trying:

      Import-Csv .\large.csv | Sort-Object filepath -Unique | Export-Csv .\new.csv -NoTypeInformation

      As you can imagine, it is using huge amounts of time and huge amounts of RAM. What am I doing wrong? How can I do this more efficiently?

      Thank you for your help!

      Bruce Olschewski

    • #26405
      Member

      Anything you do with a CSV that large is going to be ridiculously slow, assuming it even works at all (since you may run into out-of-memory exceptions at some point).

      It might be easier to just count the number of lines in the file. The fastest way to do that in PowerShell 3.0 or later is probably this:

      [Linq.Enumerable]::Count([IO.File]::ReadLines("$pwd\large.csv"))

      This way, the .NET framework (compiled C# code) is doing all of the heavy lifting, and you're also not having to store the whole file in memory, since this is all being done with IEnumerable.

    • #26408
      Moderator

      Bruce, I think that to increase efficiency you should use SELECT DISTINCT in SQL or SSIS instead of doing the heavy work outside the database. The SQL engine is much more efficient at this.

      Just my 2 cents.

      Best,
      Daniel

    • #26441
      Participant

      Thank you Daniel! I used that and reduced it to about 17 million lines. 🙂

      Is there a way I can read records out of the csv, do a Test-Path on each record, and write a $false (if missing) to another csv without having to load 5 GB of data into memory?
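
      A minimal sketch of that kind of streaming check, building on the [IO.File]::ReadLines approach above; the output file name is a placeholder, and it assumes a header row, that filepath is the last column, and that the paths contain no embedded commas:

      # Stream the csv one line at a time; only the current record is held in memory.
      $reader = [IO.File]::ReadLines("$pwd\large.csv")
      $writer = New-Object System.IO.StreamWriter "$pwd\missing.csv"
      $writer.WriteLine('filepath,exists')

      $first = $true
      foreach ($line in $reader) {
          if ($first) { $first = $false; continue }    # skip the header row
          $path = ($line -split ',')[-1].Trim('"')     # naive split; breaks on quoted commas
          if ($path -and -not (Test-Path -LiteralPath $path)) {
              $writer.WriteLine('"' + $path + '",False')
          }
      }
      $writer.Dispose()

      Writing through a single StreamWriter avoids reopening the output file for every row, which matters at this scale.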

    • #26443
      Participant

      Two suggestions:

      1) Use SQL to get unique rows, not PowerShell. The functionality is built into the T-SQL syntax.
      2) Use PowerShell to query the database directly, without storing the data in an interim CSV file. That way PowerShell can process the results row by row without wasting disk space or memory; a rough sketch of this approach follows below.
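
      A rough sketch of suggestion 2, using System.Data.SqlClient and a SqlDataReader so that rows stream through one at a time; the connection string, the query, the column name, and the output file name are all placeholders:

      # Stream rows straight from SQL Server; the data reader holds only one row at a time.
      $connectionString = 'Server=YourSqlServer;Database=YourEcmDb;Integrated Security=True'
      $query = 'SELECT DISTINCT filepath FROM dbo.YourTable'

      $connection = New-Object System.Data.SqlClient.SqlConnection $connectionString
      $command = New-Object System.Data.SqlClient.SqlCommand($query, $connection)
      $command.CommandTimeout = 0                      # the query may run for a long time

      $writer = New-Object System.IO.StreamWriter "$pwd\missing.csv"
      $writer.WriteLine('filepath,exists')

      $connection.Open()
      $reader = $command.ExecuteReader()
      try {
          while ($reader.Read()) {
              $path = [string]$reader['filepath']
              if ($path -and -not (Test-Path -LiteralPath $path)) {
                  $writer.WriteLine('"' + $path + '",False')
              }
          }
      }
      finally {
          $reader.Close()
          $connection.Close()
          $writer.Dispose()
      }

      Because the reader streams, memory use stays flat no matter how many rows the query returns.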

    • #26445
      Participant

      Bruce, have a look at the Netnerds blog. Chrissy LeMaire has written some great articles on working with large CSV files:

      Quickly Find Duplicates in Large CSV Files using PowerShell

    • #26447
      Participant

      Can you explain the whole scenario and what you are trying to do? You have a lot of data in a database. SQL has an engine built to work with millions of records and perform advanced queries; PowerShell isn't the best place for that kind of workload. If you explain what you are trying to do with the data and the end goal, we can suggest options to make it work as efficiently as possible.

    • #26460
      Participant

      Here is what I am trying to do. My document management system experienced some severe problems. The NAS that copy 2 was on died. That was when we found data corruption in copy 1. I need to figure out what is missing as I try to recover. The database for the system is MS SQL 2005. I queried the db to get a list of what it THINKS is there (getting doc handle and filepath as part of the query). That is the source of the huge csv.

      What I want to do is take the list of what the db thinks exists, use PowerShell to Test-Path each item, and write the missing ones to another csv so I can find what is missing and work to get my system trustworthy again.

      This is the query I used to get the large csv:
      SELECT DISTINCT
          hsi.itemdata.itemnum, hsi.itemdata.itemname, hsi.itemdata.datestored, hsi.itemdatapage.filepath
      FROM
          hsi.itemdata, hsi.itemdatapage
      WHERE
          hsi.itemdatapage.diskgroupnum = 104
          AND hsi.itemdata.status 16
          AND hsi.itemdata.itemnum = hsi.itemdatapage.itemnum

      Now I want to read the csv (if possible) without loading it entirely into memory and do a "for each" to test each file.

      Does that clarify the situation? Is there a better way?

      Thanks,

      Bruce

    • #26463
      Participant

      I'd argue that it may be better to adjust your SQL query to write the results into a new table and keep the data in SQL. Have it add another (blank) column, though.

      Two options then might be to either

      (1) Skip PowerShell completely: create a CLR stored procedure/function (written in VB or C#) that carries out the check and returns true/false, then write a new SQL query that uses that SP/function. It's a fairly straightforward process, to be honest.

      or

      (2) Create a PowerShell script that retrieves a specified number of rows at a time (via SQL's ROW_NUMBER) and checks for each file's existence, then updates the table column with true or false to indicate whether the file exists. You could run this as a job in SQL so you can leave it churning away. You might also be able to write your code as a workflow and use parallel execution to speed it up. (A rough sketch of this batched approach is at the end of this post.)

      I'm not a SQL guru by any means, but I'd go for option 1 if speed is of the essence.
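
      A rough sketch of option (2); instead of ROW_NUMBER it simply grabs batches of unchecked rows with TOP. It assumes the query results have already been loaded into a working table (here called dbo.FileCheck, with itemnum, filepath and a nullable FileExists column); the table name, column names, and connection string are all placeholders:

      # Work through the table in batches and write the result back to SQL Server.
      $connectionString = 'Server=YourSqlServer;Database=YourEcmDb;Integrated Security=True'
      $batchSize = 5000

      $connection = New-Object System.Data.SqlClient.SqlConnection $connectionString
      $connection.Open()

      do {
          # Fetch the next batch of rows that have not been checked yet.
          $sql = "SELECT TOP ($batchSize) itemnum, filepath FROM dbo.FileCheck WHERE FileExists IS NULL"
          $select = New-Object System.Data.SqlClient.SqlCommand($sql, $connection)
          $table = New-Object System.Data.DataTable
          (New-Object System.Data.SqlClient.SqlDataAdapter $select).Fill($table) | Out-Null

          foreach ($row in $table.Rows) {
              $path = [string]$row.filepath
              $exists = [int]($path -and (Test-Path -LiteralPath $path))
              $sql = "UPDATE dbo.FileCheck SET FileExists = $exists WHERE itemnum = $($row.itemnum)"
              (New-Object System.Data.SqlClient.SqlCommand($sql, $connection)).ExecuteNonQuery() | Out-Null
          }
      } while ($table.Rows.Count -gt 0)

      $connection.Close()

      Because progress is recorded in the table itself, the script can be stopped and restarted without redoing work, and it never holds more than one batch in memory.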

    • #26477
      Participant

      Why not just use xcopy? If you have two directories and you simply want to compare one with the other and copy the missing files, I don't see the need to involve SQL at all just to compare file structures.
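
      For example (the UNC paths are placeholders, and the /L switch only lists what would be copied, so nothing changes until you remove it):

      # List everything under copy 1 that is missing from, or newer than, copy 2 without actually copying anything.
      xcopy \\nas1\ecm\copy1 \\nas2\ecm\copy2 /D /E /I /Y /L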
