LARGE csv issue (7+ GB - ~25 million lines)

This topic contains 9 replies, has 7 voices, and was last updated by a Participant 4 years, 1 month ago.

  • #26403

    Participant
    Topics: 1
    Replies: 2
    Points: 1
    Rank: Member

    I'm trying to quantify corruption and data loss in my ECM. I ran a query in SQL to get a list of the items that are SUPPOSED to be there. The query returned 25 million rows, and saved as a csv it comes to almost 8 GB. I think there are multiple rows per file.

    I want to run PowerShell against the csv and get a new csv with only unique values. The plan is to then use that csv with PowerShell (Test-Path) to get a list of what is and isn't there so I can try to remediate.

    I have tried gc large.csv | get-unique > new.csv, but I ended up with an 11 GB csv.

    I am trying:

    Import-Csv .\large.csv | Sort-Object filepath -Unique | Export-Csv .\new.csv -NoTypeInformation

    As you can imagine, it is using huge amounts of time and huge amounts of RAM. What am I doing wrong? How can I do this more efficiently?

    Thank you for your help!

    Bruce Olschewski

  • #26405

    Member
    Topics: 9
    Replies: 2322
    Points: 0
    Rank: Member

    Anything you do with a CSV that large is going to be ridiculously slow, assuming it even works at all (you may run into out-of-memory exceptions at some point).

    It might be easier to just count the number of lines in the file. The fastest way to do that in PowerShell 3.0 or later is probably this:

    [Linq.Enumerable]::Count([IO.File]::ReadLines("$pwd\large.csv"))

    This way, the .NET Framework (compiled C# code) does all of the heavy lifting, and you don't have to hold the whole file in memory, because everything is streamed through an IEnumerable.

  • #26408

    Moderator
    Topics: 2
    Replies: 525
    Points: 24
    Team Member
    Rank: Member

    Bruce, I think that to increase efficiency you should use SELECT DISTINCT in SQL or SSIS instead of doing the heavy work outside the database. The SQL engine is much more efficient at this.

    Just my 2 cents.

    Best,
    Daniel

  • #26441

    Participant
    Topics: 1
    Replies: 2
    Points: 1
    Rank: Member

    Thank you Daniel! I used that and reduced it to about 17 million lines. 🙂

    Is there a way I can read records out of the csv, run Test-Path against each record, and write a $false (if missing) to another csv without having to load 5 GB of data into memory?

  • #26443

    Participant
    Topics: 3
    Replies: 79
    Points: 0
    Rank: Member

    Two suggestions:

    1) Use SQL to get unique rows, not PowerShell. The functionality is built into the T-SQL syntax.
    2) Use PowerShell to query the database directly, without storing the data in an interim CSV file. That way PowerShell can process the results row by row without wasting either disk space or memory (rough sketch below).
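
    For example, something along these lines (a rough sketch only; the connection string, table name, and output file name are placeholders to adapt to your environment):

    # Rough sketch: stream rows straight from SQL Server with a SqlDataReader,
    # Test-Path each file as it arrives, and log the missing ones. No interim
    # CSV is created and only one row is in memory at a time.
    # The connection string and table name are placeholders.
    $connString = "Server=MYSQLSERVER;Database=MyEcmDb;Integrated Security=True"
    $query      = "SELECT DISTINCT filepath FROM dbo.YourItemTable"

    $conn = New-Object System.Data.SqlClient.SqlConnection $connString
    $conn.Open()
    try {
        $cmd = $conn.CreateCommand()
        $cmd.CommandText    = $query
        $cmd.CommandTimeout = 0              # the query will run for a while
        $reader = $cmd.ExecuteReader()
        'filepath' | Set-Content .\missing.csv
        while ($reader.Read()) {
            $path = $reader.GetString(0)
            if (-not (Test-Path -LiteralPath $path)) {
                Add-Content -Path .\missing.csv -Value $path
            }
        }
        $reader.Close()
    }
    finally {
        $conn.Close()
    }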

  • #26445

    Participant
    Topics: 1
    Replies: 302
    Points: 141
    Helping Hand
    Rank: Participant

    Bruce, have a look at the Netnerds blog. Chrissy LeMaire has written some great articles on working with large CSV files:

    Quickly Find Duplicates in Large CSV Files using PowerShell

  • #26447

    Participant
    Topics: 8
    Replies: 1191
    Points: 646
    Helping Hand
    Rank: Major Contributor

    Can you explain the whole scenario and what you are trying to do? You have what sounds like a lot of data in a database. SQL has an engine meant to work with millions of records and perform advanced queries; PowerShell isn't the best place to work with millions of records. If you explain what you are trying to do with the data, and the end goal, we'll be able to tell you the options to make it work as efficiently as possible.

  • #26460

    Participant
    Topics: 1
    Replies: 2
    Points: 1
    Rank: Member

    Here is what I am trying to do. My document management system experienced some severe problems. The NAS that copy 2 was on died. That was when we found data corruption in copy 1. I need to figure out what is missing as I try to recover. The database for the system is MS SQL 2005. I queried the db to get a list of what it THINKS is there (getting doc handle and filepath as part of the query). That is the source of the huge csv.

    What I want to do is take the list of what the db thinks exists, use PowerShell to Test-Path each item, and write any file that is missing to another csv so I can find what is missing and work to get my system trustworthy again.

    This is the query I used to get the large csv:

    SELECT DISTINCT
        hsi.itemdata.itemnum, hsi.itemdata.itemname, hsi.itemdata.datestored, hsi.itemdatapage.filepath
    FROM
        hsi.itemdata, hsi.itemdatapage
    WHERE
        hsi.itemdatapage.diskgroupnum = 104
        AND hsi.itemdata.status 16
        AND hsi.itemdata.itemnum = hsi.itemdatapage.itemnum

    Now I want to read the csv (if possible) without loading it entirely into memory and do a "for each" to test each file, something like the sketch below.
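
    Roughly this kind of thing is what I have in mind (just a sketch; the filepath column comes from the query above and the output file name is only an example):

    # Sketch: Import-Csv streams records down the pipeline one at a time, so
    # the whole file never has to sit in memory as long as nothing collects
    # the output (unlike Sort-Object -Unique, which has to buffer everything
    # before it can sort). Output file name is only an example.
    Import-Csv -Path .\large.csv |
        Where-Object { -not (Test-Path -LiteralPath $_.filepath) } |
        Select-Object itemnum, filepath |
        Export-Csv -Path .\missing.csv -NoTypeInformation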

    Does that clarify the situation? Is there a better way?

    Thanks,

    Bruce

  • #26463

    Participant
    Topics: 4
    Replies: 262
    Points: 60
    Rank: Member

    I'd argue that it may be better to adjust your SQL query to create a new database and keep the data in SQL. Have it add another (blank) column, though.

    Two options then might be to either

    (1) Skip PowerShell completely: create a stored procedure/function in VB/C# that carries out the checks and returns true/false based on the result, then write a new SQL query that uses the SP/function. It's a fairly straightforward process, to be honest.

    or

    (2) Create a PowerShell script which retrieves a specified number of rows at a time (via SQL's ROW_NUMBER), checks for each file's existence, and updates the new column with true or false to record whether the file exists (rough sketch below). You could run this as a job in SQL so you can leave it churning away. You might also be able to write your code as a workflow and use parallel execution to speed it up.

    I'm not a SQL guru by any means, but I'd go for option 1 if speed is of the essence.
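
    For what it's worth, option 2 might look roughly like this (a sketch only; the connection string, table name, and the new FileExists column are placeholders, and the paging assumes itemnum is unique):

    # Rough sketch of option (2): page through the table with ROW_NUMBER in
    # batches, Test-Path each file, and write the result back into a new
    # FileExists column. All names here are placeholders.
    $connString = "Server=MYSQLSERVER;Database=MyEcmDb;Integrated Security=True"
    $batchSize  = 10000
    $conn = New-Object System.Data.SqlClient.SqlConnection $connString
    $conn.Open()
    try {
        $start = 1
        while ($true) {
            $select = $conn.CreateCommand()
            $select.CommandText = "SELECT itemnum, filepath FROM " +
                "(SELECT itemnum, filepath, ROW_NUMBER() OVER (ORDER BY itemnum) AS rn " +
                "FROM dbo.DocCheck) AS t " +
                "WHERE rn BETWEEN $start AND $($start + $batchSize - 1)"
            $table   = New-Object System.Data.DataTable
            $adapter = New-Object System.Data.SqlClient.SqlDataAdapter $select
            [void]$adapter.Fill($table)
            if ($table.Rows.Count -eq 0) { break }    # no more rows to check

            foreach ($row in $table.Rows) {
                $exists = Test-Path -LiteralPath $row.filepath
                $update = $conn.CreateCommand()
                $update.CommandText = "UPDATE dbo.DocCheck SET FileExists = @exists WHERE itemnum = @itemnum"
                [void]$update.Parameters.AddWithValue('@exists', [int]$exists)
                [void]$update.Parameters.AddWithValue('@itemnum', $row.itemnum)
                [void]$update.ExecuteNonQuery()
            }
            $start += $batchSize
        }
    }
    finally {
        $conn.Close()
    }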

  • #26477

    Participant
    Topics: 8
    Replies: 1191
    Points: 646
    Helping Hand
    Rank: Major Contributor

    Why not just use xcopy? If you have two directories and you just want to compare one with the other and copy the missing files, I don't see the need to even involve SQL for comparing file structures.
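
    For instance (the UNC paths are placeholders):

    # Hypothetical example: copy anything that is missing from (or older in)
    # the destination, including subdirectories.
    #   /D  copy only files newer than, or missing from, the destination
    #   /E  include subdirectories, even empty ones
    #   /C  continue on errors
    #   /I  treat the destination as a directory
    #   /Y  suppress overwrite prompts
    xcopy \\oldnas\docstore \\newnas\docstore /D /E /C /I /Y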

The topic ‘LARGE csv issue (7+ GB - ~25 million lines)’ is closed to new replies.