Parsing Large CSV

This topic contains 7 replies, has 4 voices, and was last updated by Profile photo of Simon Wåhlin Simon Wåhlin 2 years, 3 months ago.

  • Author
    Posts
  • #18134
    Profile photo of Mark Noble
    Mark Noble
    Participant

    I found some data I want to work with as a tab-delimited text file and wrote this line of PowerShell to put it in an object (replace URL with address of tab-delimited file).

    $obj = ConvertFrom-Csv ((Invoke-WebRequest URL).Content) -Delimiter "`t"

    The file is rather large though (~150 MB) and it takes a while to download – but according to the ISE status bar, that part seems to be going fine.

    It's the next step where things start going bad I think.

    Instead of parsing the data into an object, it works for a while, then stops. Only working it's way through about 35,000 of the records (as opposed to the 700,000 or so that exist in the file).

    Earlier I just downloaded the file from the web manually, then used Import-Csv to put the file into an object... that seemed to work fine.

    Any idea why this seems to be timing out? It's kind of frustrating to pull down a data set and find that it's only a small portion of the total data needed. 🙁

  • #18142
    Profile photo of Dave Wyatt
    Dave Wyatt
    Moderator

    I created a ~64MB CSV file as a test, and if I try to import it and save the results to a variable (eg, $obj = Import-Csv .\huge.csv), I get an OutOfMemoryException. (I wasn't able to reproduce the behavior you observed of having a partial import, though.)

    I'm not sure what you're hoping to do with this $obj variable after the import, but you may have better luck switching over to a streaming approach: Use Invoke-WebRequest with the -OutFile parameter to download the CSV file, then when you import it with Import-Csv, don't save the results to a variable. Pipe it to a function (or ForEach-Object, etc) to process the objects one at a time without having to store them all in memory.

  • #18143
    Profile photo of Mark Noble
    Mark Noble
    Participant

    I tried it on a different machine and the exact same script executed without incident – pulled in 799,153 records... So I guess it's not the script.

    Having it in memory makes my ISE instance waver back and forth between (Not Responding) territory though... I think I'll look at your ForEach-Object option.

    The thing is, I was hoping to be able to process the large file once a day and keep it around so I could quickly search it for an individual.

    Importing the huge file took 20 minutes when I timed it once... but once you have it, it only takes about 2 seconds to filter it down and get the data I need.

    Also, maybe I have multiple copies of it open... is there any way to see a list of objects I've created and their sizes? Then I could clear up any I don't need anymore.

    And one more thing – is there a good way to get the object to persist – like if I need to close PS or reboot or something? I thought these objects only stuck around during the PS Session.

  • #18145
    Profile photo of Rob Simmers
    Rob Simmers
    Participant

    It's difficult to test without an example. You can attempt to actually download the file and then just do Import-CSV like you did manually, try something like this: http://gallery.technet.microsoft.com/scriptcenter/files-from-websites-4a181ff3

    If there is an example dataset I can look at, I can do some testing, but something small like this works fine:

    $test = ConvertFrom-CSV (Invoke-WebRequest -Uri "ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-countries.txt").RawContent -Delimiter " " -Header Code, Location
    
  • #18148
    Profile photo of Rob Simmers
    Rob Simmers
    Participant

    If you want the data to be that accessible, why not just have a background task send it to SQL or a database instance nightly and use that versus trying to keep it in your session memory? You could create a function to connect to the instance in your powershell profile and even assign it an alias for quick queries

  • #18149
    Profile photo of Dave Wyatt
    Dave Wyatt
    Moderator

    Rob Simmers wrote:If you want the data to be that accessible, why not just have a background task send it to SQL or a database instance nightly and use that versus trying to keep it in your session memory? You could create a function to connect to the instance in your powershell profile and even assign it an alias for quick queries

    That was my thought as well. You're well into territory where simple scripts and CSV files won't scale well, and leveraging a real DBMS will give you much better performance.

  • #18154
    Profile photo of Mark Noble
    Mark Noble
    Participant

    I've found that SQL Server with SSIS packages and the like were overkill for what I was doing.

    The main goal is to take this rather large and unwieldy CSV – then narrow it down to smaller CSV's and distribute them to individuals on a regular basis.

    The people it's going to know how to use Excel to open and work with CSV's. They aren't likely to learn SQL.

    Compounding the issue is that the data source format changes more often than is handy for using SSIS packages (which I was having to modify as often as I was having to run them).

    PS also enables me to create scripts for people to allow them to generate their own CSV subsets of the larger data set as they need it.

    I think Dave was on the right track when discussing piping the large CSV through filters to arrive at the smaller CSV's I need without ever storing the huge CSV as an object... the only down-side with that is that (as I mentioned earlier), it takes me about 20 minutes to process the large CSV – but 2 seconds to filter the object once it's created. So I think if I used Dave's approach I'd be looking at 20 minutes any time I wanted to create the smaller CSV's. That'd be ok if I ran it as a scheduled job (20 minutes is nothing if I'm already sleeping!) but less than ideal if someone is running it locally, on-demand, with not-so fancy hardware.

    I'd share the data set I'm working with (it's public data) but since it includes personal information I'd rather not immortalize scripts for accessing it in a public forum. 🙂

  • #18157
    Profile photo of Simon Wåhlin
    Simon Wåhlin
    Participant

    How about writing a function called Out-SQL that takes a list of PSCustomObjects and stores each NoteProperty in a database table.

    If the Begin-block opens a connection to the database, drops your table and recreates it, the Process-block inserts each object to the table and the End-block closes the connection, piping input from Import-CSV would probably go smooth.

    Schedule a script to run this on regular intervals and you'll always have the data available (except while the import-script it running of course)

    If the source format changes very often, you could use a parameter on the Out-SQL command to define the table design (maybe an array of hashtables to define column names and types).

    If the import process still is too slow, try to split the input-file in chunks and process them in parallel.

You must be logged in to reply to this topic.