This is my first post of “The Fastest Series” on Powershell.org. The goal is to measure different techniques for performing a specific task and to identify the fastest one. All measurements are performed on the same task, under the same conditions, in the same environment, and with the same criteria: it is all about pure SPEED.
Like a Formula 1 race, we just want to know which one crosses the line first. If the winner also used less fuel than the others, so much the better, but that is not the main goal; that is why I called it “The Fastest” and not “The Cheapest”. For example, here are some of the tests I will be publishing (the list is far from exhaustive; a lot more is coming):
- The fastest Powershell #1: Remove carriage return and replace with a comma
- The fastest Powershell #2: Count all users in an Active Directory domain
- The fastest Powershell #3: Count all files in a large network share
- The fastest Powershell #4: Count all files in a NTFS Hard Disk
- The fastest Powershell #5: Read a text file
It is really important to underline that the “Fastest Series” is not a “static” one: we work in an environment where there are always new updates, releases, and technologies. Some commands can be deprecated, others can be updated, or faster techniques can be found. For all these reasons, I invite you to come back regularly to check the latest results; I will keep this post updated (see the bottom of this article for the latest updates).
To make the posts easier to read, I will keep the same syntax and style throughout the series. I also always apply the same methodology to determine the fastest, which I call the “3E”:
The results returned by these commands have to be equal; otherwise, I consider the comparison irrelevant. For example, if I compare several commands, I have to double-check that the number of items is the same and that I have not forgotten a filter or parameter that could return fewer (or more) items.
Suppose I measure which of two queries is the fastest, and I forget to set the “PageSize” property on one of them. What happens? Query02 will definitely be faster than query01, but the result is not relevant, because query02 does not contain all the items: fewer items are returned. In this case, with PageSize=1000 there are 28281 items returned, and without it there are fewer (the server's record limit applies).
When the PageSize property is set to the maximum value of 1000, we get the first 1000 items, then a pause, then the next 1000 items, and so on.
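To make the PageSize pitfall concrete, here is a minimal sketch of two such queries (the filter is my assumption, not the exact code measured in this test):

```powershell
# Minimal sketch - assumed filter, not the post's exact queries.
Add-Type -AssemblyName System.DirectoryServices
$searcher = New-Object System.DirectoryServices.DirectorySearcher
$searcher.Filter = '(objectCategory=person)'

# query01: paged search - the server returns ALL matches, 1000 at a time
$searcher.PageSize = 1000
# $count01 = $searcher.FindAll().Count   # requires a domain to actually run

# query02: PageSize left at its default of 0 - the server stops at its size
# limit, so the query finishes sooner but returns fewer items (not comparable)
# $count02 = $searcher.FindAll().Count
```

The FindAll calls are commented out because they need a reachable Active Directory domain; the point is only that the single PageSize line decides whether the two result sets are comparable at all.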
To be sure, I always count the total number of items before starting the measurement; if one result differs from the others, I do not start the comparison until I have investigated the root cause. For one of my tests (I will post about it in the future), I compared 11 queries and verified that all of them returned the same number of items.
While the script runs, I do not perform any additional activities or tasks on the computer; that is very important. If resource-hungry applications were running at the same time, the computer would probably slow down, which could skew the measurement, especially when the gap between the contenders is small. That is why I check that no heavy task runs alongside the script being measured: I run, I wait, and I compare.
Usually, I keep Process Explorer open on its System Information (Physical Memory) view from start to finish. For your information, most of my tests are performed on my personal laptop (Windows 8.1, Intel i7 CPU, 16 GB DDR3 RAM).
The unit of measure is chosen according to the task. For a very short task I will probably use (milli)seconds; if the task needs more time to run, I will use minutes (or possibly hours). All the results are then added to a hashtable and displayed in ascending order; I think it is more readable to show only the unit we need from Measure-Command. The Key contains the method's name and the Value contains the execution time.
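As an illustration of that Key/Value layout (the file and the two methods below are placeholders of mine, not the actual contenders of this test):

```powershell
# Sketch: store each Measure-Command result (seconds only) under the method's name.
$file = Join-Path ([System.IO.Path]::GetTempPath()) 'fastest-demo.txt'
Set-Content -Path $file -Value @('line1', 'line2', 'line3')

$results = @{}
$results['Get-Content'] = (Measure-Command { Get-Content $file }).TotalSeconds
$results['ReadAllText'] = (Measure-Command { [System.IO.File]::ReadAllText($file) }).TotalSeconds

# Key = method name, Value = execution time, shown in ascending order
$results.GetEnumerator() | Sort-Object Value
```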
Static .NET methods tend to be faster than cmdlets, and for some heavy tasks using them can be very useful. However, cmdlets are also great: by using .NET methods we lose benefits such as pipelining, WhatIf/Confirm, native Get-Help documentation, etc.
That is why I really want to show all the possible techniques (cmdlets, .NET, executables), and not only the fastest. Besides, it is also interesting to know that such commands exist, and sometimes it takes less time to type “gc .\test.txt” than “[System.IO.File]::ReadAllText('C:\scripts\test.txt')” to read a small text file; the concept of speed is relative. Personally, I use both cmdlets and .NET methods, depending on the situation, the context, and the environment.
Question: What is the fastest solution to remove newline / carriage return and replace with a delimiter (comma) from a TXT file?
Answer: To answer this question, I will compare 4 different commands:
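The four commands themselves are not reproduced in this text. Purely as an illustration of the kind of candidates involved (these are my assumptions, not necessarily the exact four commands measured), typical approaches look like this:

```powershell
# Hypothetical candidates - NOT necessarily the post's exact four commands.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'wordlist-demo.txt'
[System.IO.File]::WriteAllText($path, "a`r`nb`r`nc")   # tiny stand-in for the 1 GB wordlist

$r1 = (Get-Content $path) -join ','                                # cmdlet + -join
$r2 = (Get-Content $path -Raw) -replace "`r`n", ','                # -Raw + regex replace
$r3 = [string]::Join(',', [System.IO.File]::ReadAllLines($path))   # .NET ReadAllLines
$r4 = [System.IO.File]::ReadAllText($path).Replace("`r`n", ',')    # .NET String.Replace
```

All four produce a,b,c here; note that the -Raw and ReadAllText variants would keep a trailing delimiter if the file ended with a newline, which is exactly the kind of difference the “Equal” check is meant to catch.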
I am using a TXT wordlist (1 GB, 120,947,450 lines).
First, I confirm that all these commands return the same output before measuring them. I created a temporary test file, assigned its content to the wordlist variable, then ran the 4 commands one by one.
Since the results all have the same value (a,b,c, delimited by commas), we can start measuring.
I added all the queries to a hashtable and rounded the measured times to 2 decimal places for easier reading.
I decided to use a hashtable because I prefer to sort the results programmatically (by enumerating the key/value pairs and sorting on the value) rather than manually. In a future post here, I will compare 11 commands, so it makes no sense to sort the results myself. Moreover, hashtables are “cleaner” and are designed exactly for this: key/value pairs.
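A minimal sketch of that rounding and sorting step (the method names and numbers are placeholders, not the measured results):

```powershell
# Placeholder timings in seconds - not the actual results of this test.
$measures = @{
    'Method-A' = 92.6414
    'Method-B' = 18.0237
}

$rounded = @{}
foreach ($entry in $measures.GetEnumerator()) {
    $rounded[$entry.Key] = [math]::Round($entry.Value, 2)   # 2 decimals for readability
}

# Hashtables are unordered, so the ascending sort happens at display time
$rounded.GetEnumerator() | Sort-Object Value | Format-Table -AutoSize
```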
This is the history graph of memory usage (from the start to the end of the script). We can see the memory increasing at the beginning, while the content of the wordlist is parsed and loaded into memory.
Conclusion: In this scenario, the fastest was:
The file size in question is 60 GB (don’t ask why). I need to replace the Windows carriage returns rather quickly.
How to replace windows carriage return with a space in a ginormous text file
Note: In my post I chose a comma as the delimiter, but it could be something else (a space, for example). The example above is rare and extreme, but using a faster method instead of a slower one would significantly reduce the execution time in such a case. Although the question is UNIX-related, the concept is the same: optimization increases productivity.
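Loading a 60 GB file into memory, as in the commands measured above, is not realistic; a stream-based pass is the usual workaround. Here is a hedged sketch (the paths are mine, and I use a space delimiter to match the question):

```powershell
# Sketch: replace line breaks with a space without loading the whole file.
$inFile  = Join-Path ([System.IO.Path]::GetTempPath()) 'huge-in.txt'
$outFile = Join-Path ([System.IO.Path]::GetTempPath()) 'huge-out.txt'
[System.IO.File]::WriteAllText($inFile, "a`r`nb`r`nc")   # tiny demo input

$reader = New-Object System.IO.StreamReader($inFile)
$writer = New-Object System.IO.StreamWriter($outFile)
try {
    $first = $true
    while ($null -ne ($line = $reader.ReadLine())) {
        if (-not $first) { $writer.Write(' ') }   # the delimiter goes between lines
        $writer.Write($line)
        $first = $false
    }
}
finally {
    $reader.Close()
    $writer.Close()
}
```

Memory stays flat regardless of file size, because only one line is held at a time; the trade-off is the extra disk space for the output file.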
Updated: March 04, 2015