Substrings are wonderful thing

This topic contains 11 replies, has 3 voices, and was last updated by Profile photo of Bob McCoy Bob McCoy 1 year, 8 months ago.

  • Author
    Posts
  • #31893
    Profile photo of Mark Hammonds
    Mark Hammonds
    Participant

    First I would like to thank all of the guys who post here especially Curtis. I know you don't see me ask any questions here and that's because I read every thread on this board and most of the things I run into have been answered before and I just go back to a previous thread.

    So I had an interesting request from my coworkers The notorious phone number attribute in AD. we are in the process of switching over to a new system and the fields required are country code then the number like this +44 4323454645

    because these fields can be populated by end users you can imagine the filter required to clean this up. so I started writing the code I have found out that some people where kind enough to put the country code in there for me. well that's great and all but for the code to work I would need to a have uniform clean data so I was going to need to trim the country code off of the front of the string well how in the heck do you do this?

    he is how. I created a CSV file that had the country code in one column and the country abbreviation in another. boom now I have somthing to compare to because country codes range from 1-3 numbers I cannot trim by count I had to trim by value. so as each aduser account passes through my loop I am able to select the correct country code based off an Ad attribute we have that contains there country abbreviation.

    $telcode = $Codes | ?{$_.ABBR -like $user.c}

    Now the fun stuff so now that I have the number I can simply trimstart the country code out right? Nope for some crazy reason if a country code is say 44 and the number is 44449837456 it will trim all the 4's out damn you powershell.

    So in short remembering a post by Curtis a while back I was able to use a substring to trim out the number of characters based on the number of characters of the country code from the start and subtract the number of characters of the country code to mark the end of the string

    $Cleannumber = $Cleannumber.Substring(($TelCode.Ccode.Length + 1),($Cleannumber.Length – ($TelCode.Ccode.Length +1)))

    out of 15,000 records I only have to clean up about 400 manually that passed through the filter

    once again I would like to thank you guys for your help. Although you don't see me ask you guys are a big part of my day.

    If any one is interested I can post the entire code.

  • #31896
    Profile photo of Bob McCoy
    Bob McCoy
    Participant

    This is really better suited for RegEx than for substring.

    Please provide a sample of your files, to include header lines. Sanitize as necessary. Do NOT use made up data.

  • #31897
    Profile photo of Curtis Smith
    Curtis Smith
    Participant

    @mark-hammonds, thanks for you kind words. I believe that this is exactly what @donj and the other founders of powershell.org had in mind when they launched the site. I two have learned a great deal by participating in the forums here. PowerShell is a wonderful tool to be explored.

    On you phone number challenge. Here is another approach you might find interesting. As I understand from reading your post. Some, but not all, phone numbers have the country code. Unless your are using and If statement and checking the length of your phone number first using .length, your substring could drop the first 2 numbers of your 10 digit phone number. Additionally, if the + sign, you need to trim 3 characters off the front, and if there is a space in-between, you need to trim 4 characters. One alternative approach you could take is to use RegEx to return just the 10 consecutive digits from the end of the number.

    In the example below I have the same phone number 7 different ways, some with prefix, some without, some with leading or trailing spaces, some with no spaces, some with + sign, some with none, etc, etc, etc. I use -match operator and apply a RegEx expression against the phone number after first triming the whitespace from the beginning and end of the number. The RegEx pattern uses a labeled capture (a technique I learned from @bobmccoy in this very forum: https://powershell.org/forums/topic/text-file-pattern/), and matches 10 digits at the end of the string. -match automatically populates the $matches variable with the matches to the RegEx pattern. Since we label the pattern, we can easily get that match by name.

    "+44 4323454645", " 4323454645", "44 4323454645", "4323454645", "+44 4323454645 ", "+444323454645", "444323454645" |
    ForEach-Object {
        $_.trim()  -match "(?'BaseNum'\d\d\d\d\d\d\d\d\d\d$)"
        $matches['BaseNum']
    }
    

    Results:

    True
    4323454645
    True
    4323454645
    True
    4323454645
    True
    4323454645
    True
    4323454645
    True
    4323454645
    True
    4323454645
    
  • #31898
    Profile photo of Mark Hammonds
    Mark Hammonds
    Participant

    I am using regex as well there are so many things I have to fiter out that may or may not be there like trunk code (o) after the country code leading 0's if there is not country code a 0 with out the () that some people may have added. some people add x9832 for extension some use ext0984 some in other countries use al9348 it gets really strange. I will definitely try what you have posted. im always learning

    there are hundreds of patterns I have to scan for. This just cleans the data I have not added the code to update the attribute yet.

    Set-StrictMode -Version Latest
    $Error.Clear()
    $output = @()
    $SkippedAccounts = @()
    $TrashValue = @()
    $users = Get-ADUser -SearchBase "OU=_Employees,OU=XXX,DC=XXX,DC=XXX" -Filter * -Properties c,telephoneNumber,sAMAccountName | ?{$_.telephoneNumber -ne $null -and $_.c -ne $null}
    $Codes = Import-Csv -Path C:\temp\Country.csv # Document with country codes
    $Filter1 = "(0)","Mobex", "Tel","PBX", "Rijswijkoffice" # Filters out trash
    $Filter2 = "local","extension","Officedial","office","orext","Extn","EXT","ext","Ext","loc","EX","xt","ex","al","or","x" # Filters to replace with ;ext=
    Foreach($user in $users){
        $telcode = $Codes | ?{$_.ABBR -like $user.c}
            if(!($TelCode -eq $null)){
                $Cleannumber = $user.telephoneNumber.Replace(" ", "")
                if($Cleannumber -like "*/*"){$Cleannumber = $Cleannumber.Split("/")[0]}
                if($Cleannumber -like "*,*"){$Cleannumber = $Cleannumber.Split(",")[0]}
                $Cleannumber = $Cleannumber -replace "[^0-9a-zA-Z+()]"
                [string]$Clean1 = $filter1 | ?{$Cleannumber -clike "*$($_)*"}  
                if(!($Clean1 -eq "")){
                    $Clean1 =  $clean1.Split(" ")[0]
                    $Cleannumber = $Cleannumber.Replace($Clean1, "")
                }
                [string]$Clean2 = $filter2 | ?{$Cleannumber -clike "*$($_)*"}  
                if(!($Clean2 -eq "")){
                    $Clean2 =  $clean2.Split(" ")[0]
                    $Cleannumber = $Cleannumber.Replace($Clean2, ";ext=")
                }
                $Cleannumber = $Cleannumber -replace "[()]"
                if(!($Cleannumber -eq $null)){
                    if(!(($user.telephoneNumber -replace "[^0-9+]").Length -lt 8)){
                    while($Cleannumber[0] -eq "0"){$Cleannumber = $Cleannumber.TrimStart("0")} #This Trunk Code Must be dropped when dialing internationaly -- if 0's proceed number
                    if($Cleannumber.Substring(0,($TelCode.Ccode.Length)) -eq $TelCode.Ccode){$Cleannumber = "+$($Cleannumber)"}
                    if(!($Cleannumber -like "*+*")){$Cleannumber  = ("+$($TelCode.Ccode)" + "$($Cleannumber)")}
                    if(!($Cleannumber -like "*;ext=*" )){$Cleannumber = $Cleannumber -replace "[^0-9+]"} 
                            if(!($Cleannumber.Substring(0,$TelCode.Ccode.length) -eq $TelCode.Ccode)){
                                if(!($Cleannumber -eq "" -or $Cleannumber.Length -lt 8)){ 
                                    Write-Host "User $($user.SamAccountName) Updated $($user.telephoneNumber) to $($Cleannumber) $($user.c)"
                                    if($Cleannumber.Substring(0,$TelCode.Ccode.length + 1) -eq "+$($TelCode.Ccode)"){
                                        $Cleannumber = $Cleannumber.Substring(($TelCode.Ccode.Length + 1),($Cleannumber.Length - ($TelCode.Ccode.Length +1)))
                                        while($Cleannumber[0] -eq "0"){$Cleannumber = $Cleannumber.TrimStart("0")} #This Trunk Code Must be dropped when dialing internationaly -- if 0's come after country code
                                        $Cleannumber = ("New +$($TelCode.Ccode) " + $Cleannumber)
                                    }Else{
                                        $Cleannumber = ("Fix " + $Cleannumber)
                                    }
                                    $vcheck = 0
                                    $var2 = $Cleannumber.Replace(" ", "")
                                    if(!($user.telephoneNumber -like "*+*")){$var1 = $TelCode.Ccode + $user.telephoneNumber}
                                    Else{$var1 = $user.telephoneNumber}
                                    if(($var1 -Replace("[^0-9]")) -eq ($var2 -replace("[^0-9]"))){$vcheck = 1}
                                    $output += New-Object PSObject -Property([ordered]@{UserID = $user.SamAccountName;OldNumber = "Old $($user.telephoneNumber)";NewNumber = $($Cleannumber); CountryCode = $user.c;Vcheck = $vcheck})
                                }Else{
                                    #not enough characters Write code to clear feild
                                    $TrashValue += New-Object PSObject -Property([ordered]@{UserID = $user.SamAccountName;OldNumber = $user.telephoneNumber;NewNumber =  $Cleannumber; CountryCode = $user.c})}                           
                            }
                        
                    }Else{
                        #Trash in Number Field  Write code to clear feild
                        $Cleannumber = ("New +$($TelCode.Ccode) " +$Cleannumber.TrimStart("+$($TelCode.Ccode)"))
                        $TrashValue += New-Object PSObject -Property([ordered]@{UserID = $user.SamAccountName;OldNumber = "Old $($user.telephoneNumber)";NewNumber =  $Cleannumber; CountryCode = $user.c})}
                }Else{
                    # if new number field is blank
                    Write-Host "User $($user.SamAccountName) Skipped Country ID $($user.c) not found in file!"
                    $SkippedAccounts += New-Object PSObject -Property([ordered]@{UserID = $user.SamAccountName;OldNumber = "Old $($user.telephoneNumber)";CountryCode = $user.c})} 
            }Else{
                #country code not found
                Write-Host "User $($user.SamAccountName) Skipped Country ID $($user.c) not found in file!"
                $SkippedAccounts += New-Object PSObject -Property([ordered]@{UserID = $user.SamAccountName;OldNumber = "Old $($user.telephoneNumber)";CountryCode = $user.c})} 
    }
    
  • #31899
    Profile photo of Mark Hammonds
    Mark Hammonds
    Participant

    also the length of the number can be from 8 -17 numbers depending on country or if they have an extension so there is no absolute length

    +33 (0) 47 11 68 31 (or ext.1351)
    +31 70 35 7904 Rijswijk office
    +39 02 6196017 / Houston 8326197103
    +40 244.36.600 ext.173
    +44113276415 (internet link)
    +49 5141 806 787 or 348
    82-2-3210-1496 (Dir)
    +966 (0)1 3 86 1617 Ext. 286
    Nigeria: 0810 345 9633
    PBX – 55 12 227-4400
    Tel. 52.93.316.9219 al 22
    VOIP 713460826
    993 3 39 01 51 / 52
    382861175,93
    868-66-3663 / 868-66-6124
    780-463-868 x6823
    22tel34 45 2345
    +44 local 203049586 
    710-317-2117  EXT. 6
    58 282 424897 or  58 282 426615
    54 11 4314 520 / 540 ext 270
    403-570-227 local 557
    404-291-4814 extension 2245
    
    

    Sample of CSV file

    Ccode	ABBR	Country
    1	US	United States
    1	CA	Canada
    20	EG	EG Egypt
    212	MA	Morocco
    213	DZ	Algeria
    216	TN	Tunisia
    218	LY	Libya
    220	GM	Gambia
    221	SN	Senegal
    
    
  • #31900
    Profile photo of Curtis Smith
    Curtis Smith
    Participant

    Wow, that is truly ugly. (the Data, not the Code) 🙂 I'm actually very interested in what Bob comes up with. I've seen some pretty impressive stuff from him.

  • #31901
    Profile photo of Mark Hammonds
    Mark Hammonds
    Participant

    yeah I it is worse in some places like I said my script fix 14,600 numbers out of 15,000 i'm sure there are simpler ways to implement it but i'm not that familiar with regex. With some of the stuff in the attribute adding a filter will break other parts of the code. this is about as clean as I can get it.

    at first I had each word that needed to be filtered on a separate line. Then I though why not have all my filtered words in a variable and roll through the variable for matches if a match is found remove it.

    [string]$Clean1 = $filter1 | ?{$Cleannumber -clike "*$($_)*"}
    if(!($Clean1 -eq "")){
    $Clean1 = $clean1.Split(" ")[0]
    $Cleannumber = $Cleannumber.Replace($Clean1, "")
    }

    the spit is in case it triggers more then one match to use the first one. like extension, then ext, then ex, then x, so a number like 404-291-4814 extension 2245 will match x,ex and ext because it triggered a match with extension first that will be the one used.

  • #31902
    Profile photo of Bob McCoy
    Bob McCoy
    Participant

    OK, this is way more hairy than originally suggested. Here is my solution based on the original set of data you proposed above. And it did indeed solve all of those use cases. However, by the time I got ready to post it, your additional data made it way more complicated.

    Maybe this will get you to the 80% solution, the rest having to be hand-coded. At any rate take a look at it and see if you can figure out where I was going.

    $pattern1 = "^00(\d+)\s(\d+)$"; $replace1 = '+$1 $2'
    $pattern2 = "^\+(\d+)\s\(0\)(\d+)$"; $replace2 = '+$1 $2'
    $pattern3 = "^(\d+)$"; $replace3 = '+44 $1'
    $pattern4 = "^(\d+)\sOffice\s(\d+)$"; $replace4 = '+44 $1;ext=$2'
    $pattern5 = "^\+(\d+)\s0\s(\d+)\s(\d+)\s(\d+)$"; $replace5 = '+$1 $2$3$4'
    $pattern6 = "^(\d+)x(\d+)$"; $replace6 = '+44 $1;ext=$2'
    $pattern7 = "^(\d{3})(\d+)/.+$"; $replace7 = '+$1 $2'
    $pattern8 = "^(\d{2})(\d+);.+$"; $replace8 = '+$1 $2'
    $pattern9 = "^(\d{2})(\d+)ex(\d+)$"; $replace9 = '+$1 $2;ext=$3'
    $pattern10 = "^(\d{2})(\d+) al (\d+)$"; $replace10 = '+$1 $2;ext=$3'
    $pattern11 = "^my\soffice\s(\d+)\s(\d+)\s(\d+)$"; $replace11 = '+1 $1$2$3'
    
    Get-Content -Path C:\Ephemeral\numbers.txt | foreach {
        $test = $PSItem.ToString().Trim()
        Switch -Regex ($test)
        {
            $pattern1 { $update = $test -replace $pattern1, $replace1; break }
            $pattern2 { $update = $test -replace $pattern2, $replace2; break }
            $pattern3 { $update = $test -replace $pattern3, $replace3; break }
            $pattern4 { $update = $test -replace $pattern4, $replace4; break }
            $pattern5 { $update = $test -replace $pattern5, $replace5; break }
            $pattern6 { $update = $test -replace $pattern6, $replace6; break }
            $pattern7 { $update = $test -replace $pattern7, $replace7; break }
            $pattern8 { $update = $test -replace $pattern8, $replace8; break }
            $pattern9 { $update = $test -replace $pattern9, $replace9; break }
            $pattern10 { $update = $test -replace $pattern10, $replace10; break }
            $pattern11 { $update = $test -replace $pattern11, $replace11; break }
            default { $update = $test }
        }
        Write-Verbose "$test changed to $update"
        "$update"
    }
    

    The rules for translation are setup in the pattern/replace pairs at the start of the script. There is no magic to the current order in which the patterns are evaluated. But if you can, you probably do want to evaluate them in the most to least popular order.

    There are definitely more refined ways to do this, however, this seemed like a one-time effort to standardize.

  • #31903
    Profile photo of Mark Hammonds
    Mark Hammonds
    Participant

    man I like Regex can you suggest a book to learn it? I love how simple that is all tho the regex looks like gibberish to me lol

    I had to do this for my main filter

    $Filter2 = "local","extension","Officedial","office","orext","Extn","EXT","ext","Ext","loc","EX","xt","ex","al","or","x"
    
                [string]$Clean2 = $filter2 | ?{$Cleannumber -clike "*$($_)*"}  
                if(!($Clean2 -eq "")){
                    $Clean2 =  $clean2.Split(" ")[0]
                    $Cleannumber = $Cleannumber.Replace($Clean2, ";ext=")
                }
    
    
  • #31904
    Profile photo of Bob McCoy
    Bob McCoy
    Participant

    The tool I use most often is RegEx Buddy (http://www.regexbuddy.com/). It was well worth the $39 and you can run it on multiple machines if you are the exclusive user, or multiple users on a single macine.

    there are relatively few patterns I uses a lot. The others I have to do a little digging for.

    There are also several sites that will test regex for you.

    The problem is that you are asking the computer to recognize patterns. This is something that humans doe fairly well even in the event of inconsistent and varying patterns. Computers on the other had do not deal with ambiguity very well. For instance, from your original data (now gone from this post) ...

    239439658459/ 392432932923
    935495849394; 324934945944
    

    One yielded a 2-digit country code and the other a 3-digt CC. So I based the RegEx recognition of the CCs based on the delimiter (slash or semi-colon). You have to "stick it" to something.

    I have the following O'Reilly books:

    • Mastering Regular Expressions
    • Regular Expressions Cookbook
    • Regular Expressions Pocket Reference

    However, with the online references and RegEx Buddy I can usually figure out what I want.

  • #31905
    Profile photo of Mark Hammonds
    Mark Hammonds
    Participant

    I just bought

    Beginning Regular ExpressionsFeb 4, 2005
    by Andrew Watt
    Paperback

  • #31906
    Profile photo of Bob McCoy
    Bob McCoy
    Participant

    PowerShell and the attendant RegEx capability that bubbles up from the .NET implementation is an unbeatable combination for text manipulation.

You must be logged in to reply to this topic.