Finding similar keys in HashTable to sum values

This topic contains 9 replies, has 3 voices, and was last updated by  Mahin 2 months, 3 weeks ago.

  • Author
    Posts
  • #76214

    Mahin
    Participant

    Hi folks. I'm pulling certain flagged high volume email senders from our Transport mail servers in Exchange onPrem and adding them to an open HashTable. Adding and reading the key/value pairs is working perfectly. The example hash key/value output is below. The first part of the Name value is the Exchange server host name (aws1288), and second part after "_" is the flagged email sender (plumber@contso.com). In the Value column, i have the amount of emails they sent (50 in first example row). What i'm trying to do is find all names which are identical, in case below row1 and row3 have identical suffix of "_plumber@contso.com" and row2 and row4 have identical suffix of "_DBuser@contso.com". I'm trying to sum up the values of all matching suffix email addresses i.e, sum up all plumber@contso.com values (50+30=80) and DBuser@contso.com values (60+10=70) and then output the email address with the highest value. In this case it would be plumber@contso.com with a value of 80. I've researched online heavily but see no direction. I saw examples where you do, if($hash.keys -like "string") {statement block}. but in my case there is no string i'm searching for but just want to see a similar match search for the hash keys. I also tried foreach($item in $hash.getEnumerator()){$_.name -like $item} and get 4 "False" values for the sample data below. I also tried {$_.name -contains $item} and that too returned 4 False values. Thanks for taking a look.

    Name Value
    —- ——

    aws1288_plumber@contso.com 50
    yws3143_DBuser@contso.com 60
    din7412_plumber@contso.com 30
    tel9132_DBuser@contso.com 10

  • #76216

    Don Jones
    Keymaster

    This is going to be a little hard, because hash tables in .NET aren't databases, and they don't have a lot of "set" operators meant to operate against the entire collection. Even an RDBMS isn't perfectly-suited for this. What you'd wind up doing in, say, SQL Server, is calculating something like a SOUNDEX() function for each row, and storing that, and then sorting by the SOUNDEX() value to "group" similar-"sounding" values. Purely with a hash table, you're going to end up taking the first value, enumerating the entire set to find "like" values, and then doing that with the second value, etc. If it's a big set, it's gonna take a bit.

    Hash tables are designed to quickly look up a value when you know the key; they're not an all-purpose data structure suitable for all purposes. What you're looking for isn't really what hash tables were made for. What's your ability to switch to a more suitable data structure, like an RDBMS, in-memory NoSQL, or something else?

    The only other thing I can think of, if the "suffix" is truly what you're after, is to first go through the set one at a time and construct a new hash table of JUST the prefixes. As you hit each row in the original set, you can see if that "suffix" key exists in the new set, and if so, increment its value. If not, add it to the new set and add in the original value. But that's only if your "suffixes" are indeed identical – not "like."

    That ".Keys -like" business is kind of a bugaboo, because what it's doing is an implicit ForEach. That's lovely and convenient, but it's hiding a performance pain you might want to be aware of and explicitly code for, rather than letting it occur through the shell's underlying hidden magic.

  • #76219

    Curtis Smith
    Participant

    I would take a similar approach to Don in that I would build a second hash table from the first, except that in the second hash table, my key would be the suffix (email address) rather than the prefix (server name). Reading the question it sounds like you are not so much concerned about the server that is used to send the email, but rather the email address used to send the email.

    I would:
    1) Start a new hashtable
    2) For each key in the hash table, use regex to extract just the email address
    3) Use the email address as the key for the new hashtable and add the value of the current key in the loop to the value of the new email based key in the new hashtable.

    The result will be that the original hash table is looped through one time and all the values are combined on their corresponding email addresses in the new hash table.

    IE

    Name                           Value
    ----                           -----
    plumber@contso.com             80
    DBuser@contso.com              70

    4) Enumerate the new hashtable and sort it by value.

  • #76240

    Mahin
    Participant

    Thank you Don. To answer your question, using RDBMS or NoSQL would require me to get a SQL engineer involved. So would like to utilize Powershell instead.

  • #76243

    Don Jones
    Keymaster

    Well... not necessarily. There's SQL Server Express, which you can install and run quite happily on your own, on your own machine. You could also get a SQLite engine, which literally doesn't have a server and just runs in-memory on your own machine. If you have Access, you could also just create a quick MDB database and blow it away when you're done.

    Point is, what you're doing is a database thing, and it's going to be painful without a database. It's like trying to drive a car, but wanting to not use an engine. You'll need to push really really hard and sweat a lot ;).

  • #76244

    Mahin
    Participant

    Thank you Curtis for the recommended steps. Using regex to just select the suffix email address is working successfully. I'm having two issues with the Powershell code.

    The original HashTable is named $ListofSenders and has all the key/value pairs with data.
    The new second Hashtable you recommended me to create is called $emailHash.
    My code to create and copy the values into new hashtable is at the very bottom.

    First, is that i'm noticing the 1st value for the $ListofSenders key value in the loop is being inserted into all the new hashtable values for $emailHash. In this example '50' is being copied into all the Key Values for $emailHash. My looping sequence must be wrong 🙁

    Second issue, is that it's not able to enter multiple keys with the same name, i.e, 'plumber@contso.com' has already been added into new $emailHash. error right below

    "Item has already been added. Key in dictionary: 'plumber@contso.com' Key being added: 'plumber@contso.com'"

    ————————-
    My CODE written up below
    ————————-

    $emailHash = @{}
    foreach($key in $ListofSenders.Keys)
    {if($key -match '_([^_]+)$')
    {$email =$matches[1]; $emailHash+=@{$email=$ListofSenders.Values}}
    }

  • #76271

    Curtis Smith
    Participant

    Yah, you don't want to add another hash entry, you want to add the value of the existing entry to the value of the current entry. More like this.

    $hash = @{
        'aws1288_plumber@contso.com'='50';
        'yws3143_DBuser@contso.com'='60';
        'din7412_plumber@contso.com'='30';
        'tel9132_DBuser@contso.com'='10'
    }
    
    $hashset=@{}
    $hash.keys | ForEach-Object {
        $email = ($_ | Select-String "_(.*)").Matches.Groups[1].value
        [int]$hashset[$email] += [int]$hash[$_]
    }
    $hashset.GetEnumerator() | Sort-Object Value -Descending

    Let me know if any of that does not make sense and I will explain what's going on there.

  • #76282

    Mahin
    Participant

    First and foremost, i'd like to thank you very much for writing that difficult part for me Curtis!
    I understand what line 9 is doing, piping thru all the original $hash keys.

    I get half of line 10, but not all of it. I get the select-string "_(.*)") part, but don't understand how you combined anything after .Matches.Group[1].value. Would you please break it down? i'm interested in learning/understanding and not just copying your code and calling it a day 🙂

    For line # 11, i'm assuming since you declared the variable as [int], you are doing some math sum here. I executed the $hashset[$email] by itself and got a sum of two identical email addresses (total count of messages). I know you are piping the output of [int]$hash[$_] into the $hashset table, but not sure what the latter code set really does? Maybe piping each of the original $hash table values for each unique email address and summing them up for a total value? Would love to know how you wrote it all up in such a concise manner please.

    I understand line 13, thankfully 🙂

  • #76303

    Curtis Smith
    Participant

    Line 10:
    If you are already familiar with PowerShell, you should be aware that PowerShell is an object oriented language. Objects have properties and methods. In PowerShell you normally see selection of properties using the pipeline and the select-object cmdlet.

    get-item c:\windows | Select-Object LastWriteTime

    Results

    LastWriteTime
    -------------
    6/30/2017 3:29:16 PM

    What is happening here is the get-item cmdlet is sending an object to the pipeline that represents the Windows folder and we are then selecting to show only the LastWriteTime property from the object.

    If we wanted to show just the value of the LastWriteTime property, we could use the -ExpandProperty option

    get-item c:\windows | Select-Object -ExpandProperty LastWriteTime

    Results

    Friday, June 30, 2017 3:29:16 PM

    There is another way to access the value of the property however using the . syntax that is common to .NET.

    Frequently you will see something like this instead.

    $folder = get-item c:\windows
    $folder.LastWriteTime

    Results

    Friday, June 30, 2017 3:29:16 PM

    In this case, the object that results from the get-item cmdlet is stored in a variable. We are then using the . syntax to get the value from the LastWriteTime property of the object stored in the $folder variable.

    What I am doing here, ($_ | Select-String "_(.*)").Matches.Groups[1].value, is the same thing, except that I am not storing the value in a variable first. I am using the () to indicate to PowerShell that the code inside the () should be executed first, then using the . syntax to look at the properties of the resulting object.

    Keeping with the previous example

    (get-item c:\windows).LastWriteTime

    Results

    Friday, June 30, 2017 3:29:16 PM

    Inside the () get-item results in an object, we then use the . syntax to look at the properties of that object.

    In the same way ($_ | Select-String "_(.*)") results in a "MatchInfo" object

    "aws1288_plumber@contso.com" | Select-String "_(.*)" | get-member

    Results

       TypeName: Microsoft.PowerShell.Commands.MatchInfo
    
    Name         MemberType Definition                                                       
    ----         ---------- ----------                                                       
    Equals       Method     bool Equals(System.Object obj)                                   
    GetHashCode  Method     int GetHashCode()                                                
    GetType      Method     type GetType()                                                   
    RelativePath Method     string RelativePath(string directory)                            
    ToString     Method     string ToString(), string ToString(string directory)             
    Context      Property   Microsoft.PowerShell.Commands.MatchInfoContext Context {get;set;}
    Filename     Property   string Filename {get;}                                           
    IgnoreCase   Property   bool IgnoreCase {get;set;}                                       
    Line         Property   string Line {get;set;}                                           
    LineNumber   Property   int LineNumber {get;set;}                                        
    Matches      Property   System.Text.RegularExpressions.Match[] Matches {get;set;}        
    Path         Property   string Path {get;set;}                                           
    Pattern      Property   string Pattern {get;set;}

    The MatchInfo object has a property called Matches which contains RexEx Match objects

    ("aws1288_plumber@contso.com" | Select-String "_(.*)").Matches | get-member

    Results

       TypeName: System.Text.RegularExpressions.Match
    
    Name        MemberType Definition                                                      
    ----        ---------- ----------                                                      
    Equals      Method     bool Equals(System.Object obj)                                  
    GetHashCode Method     int GetHashCode()                                               
    GetType     Method     type GetType()                                                  
    NextMatch   Method     System.Text.RegularExpressions.Match NextMatch()                
    Result      Method     string Result(string replacement)                               
    ToString    Method     string ToString()                                               
    Captures    Property   System.Text.RegularExpressions.CaptureCollection Captures {get;}
    Groups      Property   System.Text.RegularExpressions.GroupCollection Groups {get;}    
    Index       Property   int Index {get;}                                                
    Length      Property   int Length {get;}                                               
    Success     Property   bool Success {get;}                                             
    Value       Property   string Value {get;}

    The Match object Contains a property called Groups which contains RegEx Group objects

    ("aws1288_plumber@contso.com" | Select-String "_(.*)").Matches.Groups | get-member

    Results

       TypeName: System.Text.RegularExpressions.Group
    
    Name        MemberType Definition                                                      
    ----        ---------- ----------                                                      
    Equals      Method     bool Equals(System.Object obj)                                  
    GetHashCode Method     int GetHashCode()                                               
    GetType     Method     type GetType()                                                  
    ToString    Method     string ToString()                                               
    Captures    Property   System.Text.RegularExpressions.CaptureCollection Captures {get;}
    Index       Property   int Index {get;}                                                
    Length      Property   int Length {get;}                                               
    Success     Property   bool Success {get;}                                             
    Value       Property   string Value {get;}

    In my case, because my regex pattern produced multiple groups, I need to specify which group I wanted. In this case group 1 holds the value of the email address. So I use the brackets to specify group 1 and .value to get the value property of the group.

    ("aws1288_plumber@contso.com" | Select-String "_(.*)").Matches.Groups[1].value

    Results

    plumber@contso.com

    Line 11:
    [int]$hashset[$email] += [int]$hash[$_]

    This is adding the value of $hashset[$email] to $hash[$_] the problem is that at the beginning of the script, the $hash hashtable was populated with values indicated to be strings, not integers.

    $hash = @{
        'aws1288_plumber@contso.com'='50';
        'yws3143_DBuser@contso.com'='60';
        'din7412_plumber@contso.com'='30';
        'tel9132_DBuser@contso.com'='10'
    }

    When two strings are +, they are contatenated, even if they are numbers.

    IE.

    '50' + '40'

    Results

    5040

    In order to ensure that the values are seen as numbers (integers) and not strings, I "type cast" them as integers.

    [int]'50' + [int]'40'

    Results

    90

    Now when you say "I know you are piping the output of [int]$hash[$_] into the $hashset table, but not sure what the latter code set really does?", I believe you are referring to the use of +=.

    += is a way to add something to the value you already have, rather than just overwriting it.

    IE

    $a = 5
    $a
    
    $a = 10
    $a
    
    $a += 5
    $a

    Results

    5
    10
    15

    In the above example, we first set $a to 5, as a result $a contains the value 5. We then set $a to 10, so $a contains the value of 10. Lastly we use += to add 5 to the value already stored in $a. As a result instead of $a having 5, it has 15 since it already had 10.

    That is the same thing we are doing here.

    [int]$hashset[$email] += [int]$hash[$_]

    When we get to the "aws1288_plumber@contso.com" key in the hash table, $hashset['plumber@contso.com'] equals 0, and we are adding $hash['aws1288_plumber@contso.com'] to it which equals 50. So after this executes, $hashset['plumber@contso.com'] will equal 50 since 0 + 50 = 50

    When we get to the "din7412_plumber@contso.com" key in the hash table, $hashset['plumber@contso.com'] equals 50, and we are adding $hash['din7412_plumber@contso.com'] to it which equals 30. So after this executes, $hashset['plumber@contso.com'] will equal 80 since 50 + 30 = 80

    I hope that helps,

    Curtis

  • #76337

    Mahin
    Participant

    Curtis, thank you so much for going into such depth about what each portion of your code accomplishes. I'll run your lines individually to see the affect first hand on ISE. Much appreciated!

You must be logged in to reply to this topic.