How to optimize code that tries to remove rows from collection

    • #284284 (Honza)

      Hello.

      I have a collection with around 250,000 rows called $shortlog. It contains the fields date, time, csusername, and cip. I need to remove all lines where, for the same csusername, all cip values are the same. To say it differently, I want to keep only the records of users who connected from more than one IP address. This is how I do it:

      # build the list of unique usernames
      $users = $shortlog | Select-Object csusername | Sort-Object -Property csusername | Get-Unique -AsString
      foreach ($user in $users) {
           $name = $user.csusername
           # count the distinct IP addresses this user connected from
           $count = ($shortlog | Where-Object {$_.csusername -eq $name} | Select-Object cip | Sort-Object -Property cip | Get-Unique -AsString | Measure-Object).Count
           # only one IP address: drop all of this user's rows
           if ($count -eq 1) {
                $shortlog = $shortlog | Where-Object {$_.csusername -ne $name}
           }
      }
      Could this be optimized? With approximately 350 users I am going to search the 250,000-row collection 350 times.
      Thank you
      Best regards
      Honza
    • #284293 (Mike R.)

      I think it can. I don’t have a large data set to prove it, but I count 9 pipes in your script. You can think of each pipe as a loop iterating over the data sent to it, so there is a good deal of overhead. In contrast, my solution below has 3 pipes. To know for sure, you can wrap each solution in a script block and use the Measure-Command cmdlet to determine exactly which one is more efficient.
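      A 3-pipe approach along those lines might look like this (a reconstruction, since the original code block was stripped; it assumes a Group-Object on csusername and cip that keeps only combinations occurring more than once, which matches the output quoted in the next reply):

      $result = $shortlog |
          Group-Object -Property csusername, cip |
          Where-Object { $_.Count -gt 1 } |
          Select-Object -ExpandProperty Group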

       

      • #284305 (Honza)

        That is not it. The result of your script is:

        date   time  csusername cip
        ----   ----  ---------- ---
        1/5/21 10:44 me         192.168.1.1
        1/4/21 10:00 me         192.168.1.1
        1/5/21 10:22 you        192.168.1.2
        1/3/21 8:00  you        192.168.1.2

        but the result I need is:

        date   time  csusername cip
        ----   ----  ---------- ---
        1/5/21 10:44 me         192.168.1.1
        1/4/21 10:00 me         192.168.1.1
        1/4/21 9:00  me         192.168.1.2

        csusername “me” connected from more than 1 IP address, so I need all of his records.

        csusername “you” connected from a single IP address, so I do not need any of his records.

        Honza

    • #284317 (Mike R.)

      I get it now. Sorry about that. It’s a little more complicated, but I got it down to 5 pipes. Let me know if it is more efficient.
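      A sketch of what that 5-pipe version could have looked like (reconstructed; it assumes the rows are grouped per user/IP combination first and then regrouped per user):

      $result = $shortlog |
          Group-Object -Property csusername, cip |
          Group-Object { ($_.Name -split ',')[0] } |   # regroup the user/IP groups by user
          Where-Object { $_.Count -gt 1 } |            # keep users with more than one unique IP
          Select-Object -ExpandProperty Group |        # back to the user/IP groups
          Select-Object -ExpandProperty Group          # back to the original rows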

       

    • #284320 (Mike R.)

      Actually, if there is flexibility in the output format, you can save the final pipe by including the group. This will give you a list of each user and unique IP with a count of how many times they had that IP, but the details of each hit (date, time) would be in the Group property.
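      Roughly, that would be the same pipeline without the last expansion (a sketch under the same assumptions as the reconstruction above):

      $shortlog |
          Group-Object -Property csusername, cip |
          Group-Object { ($_.Name -split ',')[0] } |
          Where-Object { $_.Count -gt 1 } |
          Select-Object -ExpandProperty Group
      # each result: Count = hits from that IP, Name = 'user, ip', Group = the detail rows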

      Or this may be more efficient (since the data is already filtered) and includes the details

       

      • #284344 (Honza)

        What kind of witchcraft is this?

        My version ran for 2 hours 16 minutes 35 seconds. Your version ran in 13 seconds, and the results seem to be the same. I am shocked! I need to check the results again. I liked playing with PowerShell, but after today … I should not touch it again.

        Thank you

        Honza

         

    • #284359 (Mike R.)

      Well I’m no witchdoctor but glad I could help. 🙂  Let me know if you have questions or need me to break anything down for you.

    • #284401 (Honza)

      I have double-checked it and the results match. Impressive. I still do not understand your part, but obviously I cannot understand how to fly when I have just invented the wheel. How do I mark your reply as the answer? Is there such a feature here?

      I might have another part that would benefit from optimization, if you are willing to help. But it is too long, and as I am not a programmer, it might be difficult to understand. So I will try to create a simpler example:

      I am working with the same collection. Let us reuse the same example data (the me/you rows above) as part of $shortlog.

      I need to create a list of distinct users.

      And for each user I need to get the rows with his/her records, and I am comparing every two records in the collection. That is, I need to compare:

      A-B, A-C, A-D …

      B-C, B-D, B-E …

      C-D, C-E, C-F ….

      etc.

      I have created a copy of the collection and removed its first row; after each pass of the outer cycle I again remove the first row from the second collection. (I post that part of the script further down.)

      In the sample collection with two users it is no big deal. But if I have 350 users, the main cycle runs 350 times, and the inner cycle for a user with 500 records runs 124,750 times [500! / (2! * (500-2)!)]. So I am making 62,375,000 comparisons, and it takes too long.
      I am just hoping that it makes some sense. I will post the original part of my script if it helps. It is not a secret; I just keep my brain busy. 🙂
      Is there a trick for this?
      Thank you.
      Honza
    • #284491 (Mike R.)

      Honza, glad the results worked out for you. These forums do not have a mark-as-answer feature.

      I might be able to help with your second issue, but I am still not quite clear on your requirements. It looks like you are comparing unique IPs for each user. Is that right? In other words, for each unique IP the user has assigned, do some processing. To do that you can just use the -Unique switch parameter of Select-Object.

      While I’m waiting for clarification on that, the “trick” in the previous solution is to use the Group-Object command. When you pipe a collection to Group-Object and specify the properties to group on, it returns a collection of objects with 4 properties. I really only use 3 of those: Count, Name, and Group. Count is the number of objects in the group, Name is the properties it was grouped on (a comma-delimited string), and Group is a collection of all the objects in that group.
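      A small illustration of those properties:

      $sample = @(
          [pscustomobject]@{ csusername = 'me';  cip = '192.168.1.1' }
          [pscustomobject]@{ csusername = 'me';  cip = '192.168.1.2' }
          [pscustomobject]@{ csusername = 'you'; cip = '192.168.1.2' }
      )
      $groups = $sample | Group-Object -Property csusername
      $groups | Select-Object Count, Name   # 2 rows for 'me', 1 row for 'you'
      $groups[0].Group                      # the original objects in the 'me' group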

       

    • #285463 (Honza)

      Mike, I will try to explain. MS Exchange server is my hobby. Strange, I know. I have decided to create reporting/alerting for suspicious logons to an on-premises Exchange server. That means logons that occurred from places farther apart than could be traveled in the time between them.

      I know there are tools for that. But I like to make it on my own. It helps me learn about IIS logs, PowerShell and much more. And it actually works already. It is not a piece of art, but it works. I take the IIS logs at 9 AM from the previous and current day and process them. Thanks to your optimization it takes under 3 hours to process.

      What it does:

      Reads the log files and stores the lines containing *Microsoft-Server-ActiveSync* or *EWS* or *OWA* in the variable $iislog.

      Then I create $shortlog, as I need only a few columns. Some conditions are redundant, but they do not hurt.
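      A sketch of those two steps (the log path and the field positions are assumptions; they depend on the #Fields header of the W3C logs):

      $iislog = Get-ChildItem 'C:\inetpub\logs\LogFiles\W3SVC1\*.log' |
          Get-Content |
          Where-Object { $_ -like '*Microsoft-Server-ActiveSync*' -or $_ -like '*EWS*' -or $_ -like '*OWA*' }

      # keep only the columns that are needed
      $shortlog = foreach ($line in $iislog) {
          $f = $line -split ' '
          [pscustomobject]@{ date = $f[0]; time = $f[1]; csusername = $f[7]; cip = $f[8] }
      }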

       

      Then I unify usernames, because users use different account formats: firstname.lastname@domain, firstname.lastname, domain\firstname.lastname, etc. I keep just firstname.lastname in the variable, all in lowercase.
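      Something like this, as a sketch (the exact formats handled are assumptions):

      foreach ($row in $shortlog) {
          $name = $row.csusername.ToLower()
          $name = ($name -split '\\')[-1]   # drop a leading 'domain\'
          $name = ($name -split '@')[0]     # drop a trailing '@domain'
          $row.csusername = $name
      }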

      I also replace our internal private IP addresses with our external IP address, because I use a public IP geolocation database to acquire latitude and longitude, and a private address would not work.
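      For example (a sketch; the private ranges and the external address are placeholders):

      foreach ($row in $shortlog) {
          if ($row.cip -like '10.*' -or $row.cip -like '192.168.*') {
              $row.cip = '203.0.113.10'   # hypothetical public address of the site
          }
      }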

      I read the data for previously found IP addresses from a CSV. If an address in the log does not match an address in the CSV, I send a request to the online database and then store the result in my CSV for next time.
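      A sketch of that caching step (the cache path, the lookup URL, and the response fields are assumptions):

      $cachePath = 'C:\Scripts\ip-geo-cache.csv'
      $known = @{}
      if (Test-Path $cachePath) {
          Import-Csv $cachePath | ForEach-Object { $known[$_.ip] = $_ }
      }
      foreach ($ip in ($shortlog.cip | Sort-Object -Unique)) {
          if (-not $known.ContainsKey($ip)) {
              $geo = Invoke-RestMethod "https://geo.example.com/json/$ip"   # hypothetical lookup service
              $known[$ip] = [pscustomobject]@{ ip = $ip; lat = $geo.lat; lon = $geo.lon }
          }
      }
      $known.Values | Export-Csv $cachePath -NoTypeInformation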

      I remove all users with just 1 IP address in the log. That is the part you helped with.

      Then I remove redundant records from the log. If a user connected 5 times in a row from a single address, I just need the first and last occurrence.
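      One way to do that, as a sketch (assuming $shortlog is already sorted by user and time):

      $trimmed = [System.Collections.Generic.List[object]]::new()
      for ($i = 0; $i -lt $shortlog.Count; $i++) {
          $prev = if ($i -gt 0) { $shortlog[$i - 1] } else { $null }
          $next = if ($i -lt $shortlog.Count - 1) { $shortlog[$i + 1] } else { $null }
          $samePrev = $prev -and $prev.csusername -eq $shortlog[$i].csusername -and $prev.cip -eq $shortlog[$i].cip
          $sameNext = $next -and $next.csusername -eq $shortlog[$i].csusername -and $next.cip -eq $shortlog[$i].cip
          # keep the row unless it sits in the middle of a run of identical user/IP rows
          if (-not ($samePrev -and $sameNext)) { $trimmed.Add($shortlog[$i]) }
      }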

       

      And then the main part comes:

      I go user by user; for each user I compare every two of his records. Not unique records, but every pair of records. If the IP addresses are NOT the same, I calculate the distance between those two points. I have the date and time for each record, so I can compute the speed. Then I can use distance and speed to set a severity.
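      The distance/speed check for one pair of rows might look roughly like this (the haversine helper, the $geoA/$geoB lookups, and the 900 km/h threshold are all assumptions):

      function Get-DistanceKm ($lat1, $lon1, $lat2, $lon2) {
          $toRad = [math]::PI / 180
          $dLat = ($lat2 - $lat1) * $toRad
          $dLon = ($lon2 - $lon1) * $toRad
          $a = [math]::Sin($dLat / 2) * [math]::Sin($dLat / 2) +
               [math]::Cos($lat1 * $toRad) * [math]::Cos($lat2 * $toRad) *
               [math]::Sin($dLon / 2) * [math]::Sin($dLon / 2)
          6371 * 2 * [math]::Atan2([math]::Sqrt($a), [math]::Sqrt(1 - $a))
      }

      $hours = ([datetime]"$($rowB.date) $($rowB.time)" - [datetime]"$($rowA.date) $($rowA.time)").TotalHours
      $km    = Get-DistanceKm $geoA.lat $geoA.lon $geoB.lat $geoB.lon
      if ($hours -gt 0 -and ($km / $hours) -gt 900) { 'suspicious' }   # faster than an airliner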

      And I store the results for suspicious logons.

      Back to the main part. What do I do? Let us say this is the log for my account:

      This is the content of $collection1. I need to compare these rows, but I do not know how to do it, so I create $collection2, which is the same as $collection1. I have two cycles: an outer foreach for $collection1 and an inner foreach for $collection2. I do not want to compare the same rows, so in each pass of $collection1 I remove the first row from $collection2 with $collection2.RemoveAt(0). Like this:
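      Reconstructed, the pattern described above looks something like this (a sketch; it assumes $collection1 holds one user's rows and $collection2 is an ArrayList copy so that RemoveAt works):

      $collection2 = [System.Collections.ArrayList]::new($collection1)
      foreach ($rowA in $collection1) {
          $collection2.RemoveAt(0)              # never compare a row with itself or an earlier row
          foreach ($rowB in $collection2) {
              if ($rowA.cip -ne $rowB.cip) {
                  # compare $rowA and $rowB here: distance, speed, severity
              }
          }
      }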

      Does it make sense?

      Thank you

      Honza

    • #285478 (Mike R.)

      Wow, I admire the problem set.  Looks like you need to be able to compare objects within a collection.  Specifically, assuming your array is sorted in time order by user, calculate the time span from the current row to the next as long as the next row is the same user but different ip.  Does that sound right?  If so, the only way to compare objects within the same array is with a for loop instead of a foreach.  Below is a simplified example of what I mean.  In this I add a new property that gives you the time span to the next connection.  If there is not a subsequent row to compare it is simply set to zero.  What complicates it a little is that date and time are two string properties instead of a single [datetime] object, but I work around that by creating [datetime] objects when needed.
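      A sketch of that approach (reconstructed, since the original example was stripped; it assumes $log is sorted by csusername and then by date/time):

      for ($i = 0; $i -lt $log.Count; $i++) {
          $span = 0
          if ($i -lt $log.Count - 1 -and
              $log[$i + 1].csusername -eq $log[$i].csusername -and
              $log[$i + 1].cip -ne $log[$i].cip) {
              # build [datetime] objects from the separate date and time strings
              $current = [datetime]"$($log[$i].date) $($log[$i].time)"
              $next    = [datetime]"$($log[$i + 1].date) $($log[$i + 1].time)"
              $span    = ($next - $current).TotalMinutes
          }
          $log[$i] | Add-Member -NotePropertyName MinutesToNext -NotePropertyValue $span -Force
      }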


      • #285487 (Honza)

        Actually, the collection always contains records for only a single user. I do some processing beforehand, so I always process one user at a time. So it is really all data for me, then all data for you, and so on. Does it change anything?

         

      • #285493 (Mike R.)

        It wouldn’t change anything, just no need to check for those cases or sort on csusername.

    • #285490 (Mike R.)

      BTW, this method would make the last step in your process moot because it would account for single connections by a user.
