Small Script – Massive Memory consumption, low CPU but takes hours to complete??


  • Author
    Posts
    • #270397
      Participant
      Topics: 1
      Replies: 11
      Points: 29
      Rank: Member

      Hello – I’m very new to PS. I have written a small script which reads a huge CSV (up to about 350MB and up to 8 million lines), looks at the 3rd field in each line, and compares that field with every entry in a reference CSV (which currently only has about 20 lines in it). If a reference entry matches the 3rd field, the script replaces the field with whatever is in that reference entry – a way of truncating the 3rd field based on entries in a ref file. It then stops looking through the ref file and moves on to the next line of the original CSV. Finally, it writes out a new CSV with a different extension.

      Running this with small test files, it works great. However, I’ve just tried it on a 280MB CSV and my PC is currently going crazy. Memory usage is up to 10.6GB and CPU is nice and low at 4.2%, but it is taking hours to process this file. After running for about 40 minutes, it has processed about 2MB of the 280MB file. I have another 27 similarly large files to get through, so I might be done by Christmas.

      Is there something very wrong with the script causing it to run so slowly, and why does it take 10.6GB of memory to read in a 280MB file?

      Any help most appreciated.

      • This topic was modified 3 weeks ago by grokkit. Reason: code formatting - please read the guide
    • #270412
      Participant
      Topics: 5
      Replies: 170
      Points: 656
      Helping Hand
      Rank: Major Contributor

      Are these actually csv files?  If so, why not use Import-Csv and avoid manipulating strings?  I’m not sure, but I bet the string manipulation is taking up some time.
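      For example, something like this (just a sketch – the paths and column names are placeholders, and you’d add -Delimiter to Import-Csv / Export-Csv if the files aren’t comma-separated):

      # Build a lookup from the small reference file: key = value to find, value = replacement
      # 'OldValue' and 'NewValue' are placeholder header names
      $lookup = @{}
      Import-Csv -Path 'C:\data\ref.csv' | ForEach-Object {
          $lookup[$_.OldValue] = $_.NewValue
      }

      # Load the big file, swap the 3rd column where it matches a reference entry, write it back out
      # 'Column3' is a placeholder for whatever the third column is actually called
      $rows = Import-Csv -Path 'C:\data\big.csv'
      foreach ($row in $rows) {
          if ($lookup.ContainsKey($row.Column3)) {
              $row.Column3 = $lookup[$row.Column3]
          }
      }
      $rows | Export-Csv -Path 'C:\data\big.new' -NoTypeInformation

      It may still be heavy on memory with files that size, but it avoids the manual string handling.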

      • #270436
        Participant
        Topics: 1
        Replies: 11
        Points: 29
        Rank: Member

        Hi, yes they are CSV files, although pipe-delimited. I can easily convert them to comma-delimited with F.A.R.T.

    • #270421
      Participant
      Topics: 16
      Replies: 1855
      Points: 3,568
      Helping Hand
      Rank: Community Hero

      This is somewhat of a common question. Here is a breakdown of the methods to open files, but you’re still loading a lot into memory with files that large:

      http://www.happysysadm.com/2014/10/reading-large-text-files-with-powershell.html

      There are also 3 for loops. At first glance, you could use a Where clause or a hashtable lookup rather than nested for loops with if statements to get a match.
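      For example, a hashtable built once from the reference file turns the inner loop into a single lookup per line. A rough sketch – it assumes pipe-delimited lines, the 3rd field at index 2, and a find|replace pair on each line of the ref file:

      # Build the lookup table once from the ~20-line reference file
      $lookup = @{}
      foreach ($refLine in Get-Content -Path 'C:\data\ref.csv') {
          $parts = $refLine.Split('|')
          $lookup[$parts[0]] = $parts[1]    # key = value to match, value = replacement
      }

      # One pass over the big file: a single hashtable lookup per line instead of a nested loop
      Get-Content -Path 'C:\data\big.csv' | ForEach-Object {
          $fields = $_.Split('|')
          if ($lookup.ContainsKey($fields[2])) {
              $fields[2] = $lookup[$fields[2]]
          }
          $fields -join '|'
      } | Set-Content -Path 'C:\data\big.new'

      The hashtable is built once, so each line of the big file costs one lookup instead of a pass through the reference file.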

      • #270454
        Participant
        Topics: 1
        Replies: 11
        Points: 29
        Rank: Member

        Wow! I just added -Raw to the Get-Content line and it is ploughing through at an incredible rate! Memory consumption is down to 12% and it has already processed 10MB in about 4 minutes. Looks like that might be all that is needed.

        • This reply was modified 3 weeks ago by kevvo6902.
    • #270442
      Participant
      Topics: 1
      Replies: 11
      Points: 29
      Rank: Member

      Thanks for the reply. I will look at that article. As I say, I’m new to all of this, so I’m just feeling my way.

      I have plenty of CPU power and memory available, but that doesn’t help if it still takes weeks to process the data.

    • #270445
      Participant
      Topics: 0
      Replies: 4
      Points: 19
      Rank: Member

      See if this is quicker

      • This reply was modified 3 weeks ago by prazoninstrum.
      • This reply was modified 3 weeks ago by grokkit. Reason: code formatting - please read the guide
    • #270475
      Participant
      Topics: 1
      Replies: 11
      Points: 29
      Rank: Member

      Hi, many thanks for looking at this, but it doesn’t seem to have made anything any quicker. Adding -Raw has made the biggest difference so far, but it is still slower than I’d like.

    • #270487
      Participant
      Topics: 8
      Replies: 168
      Points: 818
      Helping Hand
      Rank: Major Contributor

      Try removing these lines:

      $fileSplit = $file.split("`n")
      $file = $null

      And changing this line:

      foreach ($line in $fileSplit)

       

      To this:

      foreach ($line in $file)

      In my experience, Get-Content already returns the file as an array with one element per line, so you should not need to use Split at all.
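      In other words, roughly this difference (a quick illustration, not your exact script):

      # Default behaviour: Get-Content returns an array of strings, one element per line
      $file = Get-Content -Path 'C:\data\big.csv'
      $file.Count    # number of lines
      $file[2]       # third line, no splitting needed

      # With -Raw the whole file comes back as a single string,
      # which is the only case where you would need to split on newlines yourself
      $raw   = Get-Content -Path 'C:\data\big.csv' -Raw
      $lines = $raw -split "`r?`n"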

    • #270490
      Participant
      Topics: 0
      Replies: 4
      Points: 19
      Rank: Member

      Is there any chance you can share, anonymised if possible, some samples of the files?

      A OneDrive link with both the ref file and the real file.

      Maybe for the real file, just strip everything out and keep the first 10 lines.

      Just to give an idea of the formats.

      Please make sure there is no sensitive or private data.

    • #270493
      Participant
      Topics: 16
      Replies: 1855
      Points: 3,568
      Helping Hand
      Rank: Community Hero

      Personally, I’d scale things back. Just concentrate on loading the file and processing each line, to see how fast you can get that to work. Then you can determine the impact the additional parsing has.

      Splitting the line and then processing the columns in another loop seems expensive; a faster approach might be using regex to match what you are looking for.
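      For example, something along these lines (only a sketch – it assumes pipe-delimited data with no quoted fields, and that the ref file is simply a list of the values the 3rd field should be truncated to):

      # Build one alternation pattern from the reference values, longest first,
      # so the most specific value wins when one is a prefix of another
      $refValues = Get-Content -Path 'C:\data\ref.csv' |
          Sort-Object -Property Length -Descending |
          ForEach-Object { [regex]::Escape($_) }

      # Group 1 = the first two fields (kept as-is), group 2 = the matched reference value
      $pattern = '^((?:[^|]*\|){2})(' + ($refValues -join '|') + ')[^|]*'

      Get-Content -Path 'C:\data\big.csv' | ForEach-Object {
          $_ -replace $pattern, '$1$2'    # cut the 3rd field down to the reference value
      } | Set-Content -Path 'C:\data\big.new'

      Lines whose 3rd field doesn’t start with one of the reference values are left unchanged.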

    • #270496
      Participant
      Topics: 0
      Replies: 4
      Points: 19
      Rank: Member

      Agree on scaling things down. I had a similar scenario two or three years ago, with around 70–80GB of data across files ranging from 450 to 600MB (around 1M lines each).

      I remember that on the first trial, after more than 8 hours, processing was still nowhere near done. With the final version I was able to process everything in 20–30 minutes.

      The issue was not loading the files, but the memory cost of juggling the arrays in memory (even a VM in Azure with 56GB wasn’t enough).

      That’s why, as a first step, I suggested changing the approach to dump the array to file every 2000 records.
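      Something in this direction, for example (just a sketch – the 2000-record batch size and the paths are arbitrary):

      $batch   = [System.Collections.Generic.List[string]]::new()
      $outPath = 'C:\data\big.new'

      Get-Content -Path 'C:\data\big.csv' | ForEach-Object {
          # ... transform the line here ...
          $batch.Add($_)

          # Flush to disk every 2000 records instead of keeping the whole result in memory
          if ($batch.Count -ge 2000) {
              Add-Content -Path $outPath -Value $batch
              $batch.Clear()
          }
      }

      # Write out whatever is left in the final partial batch
      if ($batch.Count -gt 0) {
          Add-Content -Path $outPath -Value $batch
      }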

       

       

    • #270511
      Participant
      Topics: 1
      Replies: 11
      Points: 29
      Rank: Member

      Try removing these lines:

      $fileSplit = $file.split("`n")

      $file = $null

      And changing this line:

      foreach ($line in $fileSplit)

      To this:

      foreach ($line in $file)

      In my experience, Get-Content already returns the file as an array with one element per line, so you should not need to use Split at all.

      Hi, I must be doing something wrong – I can’t see these exact lines in my script. I think I know what you are driving at, but when I try it, it seems to break the script.

      Thanks for looking, though.

    • #270514
      Participant
      Topics: 1
      Replies: 11
      Points: 29
      Rank: Member

      Hi, yes – I’ll see if I can organise a sample file to send across. Many thanks

    • #270520
      Participant
      Topics: 1
      Replies: 11
      Points: 29
      Rank: Member

      As I’m only interested in modifying the 3rd column in these files, is there a way I can only read in this column, do what I need to do and then merge it back into the original?

      I looked into regex but, for a novice, it is beyond my comprehension. It’s taken me days to get this match/replace working.

      Many thanks all for looking. I’ll stick at it!

      Kev

    • #270559
      Participant
      Topics: 1
      Replies: 11
      Points: 29
      Rank: Member

      See if this is quicker

      Hi, just did some more testing on a large file and this is considerably slower than my first script.

      But simply using the -Raw has enabled me to process a 280MB file in around 80 minutes.

      • This reply was modified 3 weeks ago by kevvo6902.
    • #270571
      Participant
      Topics: 0
      Replies: 4
      Points: 19
      Rank: Member

      That still seems like too much time, anyway.

      If you don’t want to post the samples publicly, you can send them to me here:

      $mail = 'cAByAGEAegBvAG4AaQBuAHMAdAByAHUAbQBAAGMAbABvAHUAZABtAGUAdABhAHYAZQByAHMAZQAuAGMAbwBtAA=='
      [System.Text.Encoding]::Unicode.GetString([System.Convert]::FromBase64String($mail))

      I need the ref file and a sample of the file to process.

    • #270577
      Participant
      Topics: 8
      Replies: 168
      Points: 818
      Helping Hand
      Rank: Major Contributor

      Sorry for not being clear. My comments were in reference to the script prazoninstrum posted.

      To summarize, I don’t see why you need to split on the newline character:

      $file.split("`n")

      I’m not sure whether removing it will save you any time. When I parse CSV files I simply use:

      $file = Get-Content -Path '\\…….whatever'

      foreach ($line in $file) …

      Again, sorry for not being clear.

    • #270580
      Participant
      Topics: 1
      Replies: 11
      Points: 29
      Rank: Member

      Sorry for not being clear. My comments were in reference to the script prazoninstrum posted.

      To summarize, I don’t see why you need to split on the newline character:

      $file.split("`n")

      I’m not sure whether removing it will save you any time. When I parse CSV files I simply use:

      $file = Get-Content -Path '\\…….whatever'

      foreach ($line in $file) …

      Again, sorry for not being clear.

      Hi, no problem. I took out the $file.split("`n") and then used foreach ($line in $file) { …, referring to $line instead of the old $_. The script runs but does nothing at all – no errors, but nothing output to the console.

      Kev

    • #270583
      Participant
      Topics: 8
      Replies: 168
      Points: 818
      Helping Hand
      Rank: Major Contributor

      Hmm … I believe you have gone back to your original script which would be:

      $file | foreach { ….

      I was commenting based on the code provided by prazoninstrum

      Again, not sure removing the split will help performance.

    • #270586
      Participant
      Topics: 1
      Replies: 11
      Points: 29
      Rank: Member

      That still seems like too much time, anyway.

      If you don’t want to post the samples publicly, you can send them to me here:

      $mail = 'cAByAGEAegBvAG4AaQBuAHMAdAByAHUAbQBAAGMAbABvAHUAZABtAGUAdABhAHYAZQByAHMAZQAuAGMAbwBtAA=='
      [System.Text.Encoding]::Unicode.GetString([System.Convert]::FromBase64String($mail))

      I need the ref file and a sample of the file to process.

      Hi, I’ve sent over a couple of files via email. They are tiny, but they should give you an idea.

    • #270589
      Participant
      Topics: 1
      Replies: 11
      Points: 29
      Rank: Member

      Hmm … I believe you have gone back to your original script which would be:

      $file | foreach { ….

      I was commenting based on the code provided by prazoninstrum

      Again, not sure removing the split will help performance.

      Ah, sorry – totally crossed wires.

      I think I need to find a solution where only the 3rd column is brought in to be looked at, which will cut down on the file size.

      Kev

    • #270682
      Participant
      Topics: 2
      Replies: 380
      Points: 513
      Helping Hand
      Rank: Major Contributor

      Chrissy LeMaire has written some great articles on processing large CSV files.

      https://blog.netnerds.net/2015/01/powershell-high-performance-techniques-for-importing-csv-to-sql-server/

      You can modify her script to do other things with the data; you don’t have to upload to a SQL server.

      My modified example below just outputs matches in column 3 when comparing to a list of data in a text file – similar to what you want to do.

      I used the All Countries text file (1.2GB and 10 million+ rows) linked in her article with a random list of 23 countries in my lookup file. It took just under 11 minutes to process 12 million rows.

      You can modify the data in the data table directly if you want and then export it to CSV.

      There are some examples here although unfortunately some of the example images are missing:

      Dancing on the Table with PowerShell
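      In rough outline, the matching part looks something like this (a simplified sketch of the streaming idea using a StreamReader and a HashSet rather than the full data table / bulk copy pipeline from the article – the paths, delimiter and column index are placeholders):

      # Load the small lookup list into a HashSet for fast membership tests
      $lookup = [System.Collections.Generic.HashSet[string]]::new(
          [string[]](Get-Content -Path 'C:\data\lookup.txt')
      )

      # Stream the big file one line at a time instead of loading it all into memory
      $reader = [System.IO.StreamReader]::new('C:\data\allCountries.txt')
      $writer = [System.IO.StreamWriter]::new('C:\data\matches.txt')
      try {
          while ($null -ne ($line = $reader.ReadLine())) {
              $col3 = $line.Split("`t")[2]    # third column; adjust the delimiter for your files
              if ($lookup.Contains($col3)) {
                  $writer.WriteLine($line)
              }
          }
      }
      finally {
          $reader.Dispose()
          $writer.Dispose()
      }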

       
