regex


This topic contains 17 replies, has 3 voices, and was last updated by  Tony Pagliaro 1 year, 11 months ago.

  • #30229

    Tony Pagliaro
    Participant

    I need to extract server name and KB numbers from a PDF document. I know regex can handle this, but I am illiterate. Was wondering if someone could lend a hand.

    The PDF is broken into sections, one for each server, and each section lists the missing KBs before moving on to the next server. I'd like to write a script that extracts this info as plain text so I can use PowerShell to test whether the KBs are installed/available and what options they have. That last part I can handle myself.

    If I copy/paste the PDF text into Notepad, it looks like this:


    Hostname:
    server-123.domain-123.com
    IP:
    158.39.128.12
    OS:
    Microsoft
    Windows
    Server
    2008
    R2
    Service
    Pack
    1
    Critical
    Microsoft
    XML
    Parser
    (MSXML)
    and
    XML
    Core
    Services
    Unsupported
    Critical
    MS15-­‐034:
    Vulnerability
    in
    HTTP.sys
    Remote
    Code
    Execution
    (3042553)
    Critical
    MS15-­‐034:
    Vulnerability
    in
    HTTP.sys
    Remote
    Code
    Execution
    (3042553)
    (uncredentialed
    check)
    High
    Microsoft
    Windows
    Unquoted
    Service
    Path
    Enumeration
    High
    MS
    KB2269637:
    Insecure
    Library
    Loading
    Could
    Allow
    ....

    You can see the paste action puts a CR after every word... annoying.
    Notice that sometimes only the number appears, enclosed in parentheses like (#####), and sometimes it is written as KB##### instead.

    And I want it to look like this (I can worry about duplicates later):

    server-123.domain-123.com
    KB3042553
    KB3042553
    KB2269637

    The "Hostname: xxxxxxxx.domain-123.com" will always be the same throughout the document, and can be used to mark a new server (section). I would eventually like the output into a PS object that has server name and an array of associated (unique) KB#s.

  • #30236

    Flynn Bundy
    Participant

    Unless someone has a cool new way of reading PDF files in PowerShell, your only real option is to look into iTextSharp.dll.

    It's briefly discussed here, and a simple Google search turns up a bit more.

    http://stackoverflow.com/questions/15684699/how-to-parse-pdf-content-to-database-with-powershell

    Unfortunately PowerShell cannot natively read PDF documents.

    That being said if you just port this over to a txt file you can do something like this:

    Get-Content C:\textfile.txt | Select-String '\w+\.domain-123\.com','KB\d+'
    From there you could pipe this to out-file accordingly.

  • #30237

    Tony Pagliaro
    Participant

    That's cool. I can work around by pasting it into notepad just like in my example.
    The real question is the regex problem.

  • #30241

    Tony Pagliaro
    Participant

    OK, to prove I'm not lazy I figured out this much of it.

    $t = gc .\cimc.txt
    $regex = '([^A-Z]\d{6,8})|KB\d{6,8}'
    $t -match $regex
    
    (3059317)
    KB2269637:
    (3057839)
    3009008:
    (3035132)
    

    How do I get rid of the parentheses and the colons? I just want the numbers, but the parentheses were used to qualify the numbers within.
    I thought another line of code and a simple '\d' would do it but I guess I was wrong.

  • #30242

    Flynn Bundy
    Participant

    Try this:

    (gc .\cimc.txt | Select-String '\d{6,8}|KB\d{6,8}:' -AllMatches).Matches.Value

  • #30243

    Curtis Smith
    Participant

    You could use Trim() or -replace:

    $t = gc .\cimc.txt
    $regex = '[^A-Z]\d{6,8}|KB\d{6,8}'
    $t -match $regex | ForEach-Object {$_.Trim("K","B","(",")",":")}
    

    or

    $t = gc .\cimc.txt
    $regex = '[^A-Z]\d{6,8}|KB\d{6,8}'
    ($t -match $regex) -replace '[^\d]'
    

    3042553
    3042553
    2269637

  • #30245

    Tony Pagliaro
    Participant

    I was wondering why you both removed my parentheses, then I did some research. Turns out I need that listed as "\(" and "\)" ... baby steps.

    Thanks Curtis. I used your Trim option, even though the replace looks more elegant. Seeing the characters helps me comprehend it when I read it later.

    $RawData = gc .\cimc.txt
    $regex = '\([^A-Z]\d{6,8}|KB\d{6,8}\)'
    $KBNums = $RawData -match $regex | ForEach-Object {'KB' + $_.Trim("K","B","(",")",":")}
    

    I added a few things and this code gets me a nice, consistent list of KB#s I can use for lookup.
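
One thing worth checking in the pattern above: `|` has the lowest precedence in a regex, so `\([^A-Z]\d{6,8}|KB\d{6,8}\)` reads as "an opening parenthesis followed by digits" OR "KB, digits, then a closing parenthesis". A line such as `KB2269637:` would then not match. Below is a minimal sketch of one way to group the alternatives, assuming input lines like those in the first post; the names `$lines`, `$pattern`, and `$kbs` are illustrative only.

```powershell
# Sample lines as they appear after pasting the PDF text (one word per line)
$lines = '(3042553)', 'KB2269637:', 'Execution', '(uncredentialed', '2008'

# Either a 6-8 digit number wrapped in parentheses, or KB followed by 6-8 digits.
# Each alternative captures only the digits.
$pattern = '\((\d{6,8})\)|KB(\d{6,8})'

$kbs = foreach ($line in $lines) {
    if ($line -match $pattern) {
        # Only one of the two capture groups takes part in any given match
        $number = if ($Matches[1]) { $Matches[1] } else { $Matches[2] }
        "KB$number"
    }
}

$kbs
```

Because the digits are captured directly, no Trim or -replace step is needed afterwards.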

    So that gets me halfway there. Now I need the server at the top, and remove duplicates. Every time I hit a new hostname I make a new server and unique list.

    The pattern can be as follows:

    Hostname:
    dyn-­-172-­-79-­-158-­-145.domain-123.com
    IP:
    
    or
    
    Hostname:
    mer03943.domain-123.com
    IP:
    

    ...but I want to ignore any section where there's nothing in the middle, i.e. where the line right after "Hostname:" is "IP:":

    Hostname:
    IP:
    

    I'll comment again with my first attempt. I think I need a hash table? Can a value in a hash table be an array? Time to play..

  • #30246

    Curtis Smith
    Participant

    Ha, I removed them just playing around with the regex, didn't mean to leave them off :).

    Yes, you can have an array as a value in a hash table.

    PS C:\> $hash = @{
        String1 = "Value1";
        String2 = "Value2";
        Array1 = @("Value3", "Value4", "Value5");
        String3 = "Value6"
    }
    
    $hash
    
    Name                           Value                                                                                                                                                 
    ----                           -----                                                                                                                                                 
    String3                        Value6                                                                                                                                                
    String2                        Value2                                                                                                                                                
    String1                        Value1                                                                                                                                                
    Array1                         {Value3, Value4, Value5}
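
Building on the hash table above, here is a small sketch of appending to an array stored under a key, skipping values that are already present. The host/KB pairs and the names `$pairs` and `$kbsByHost` are made up for illustration.

```powershell
# Hypothetical (hostname, KB) pairs, in the order they might be found in the report
$pairs = @(
    @('svrad01', 'KB3000483'),
    @('svrad01', 'KB3041836'),
    @('svrad01', 'KB3000483'),
    @('webmail', 'KB3042553')
)

$kbsByHost = @{}
foreach ($pair in $pairs) {
    $hostName, $kb = $pair

    # Create an empty array the first time a hostname is seen
    if (-not $kbsByHost.ContainsKey($hostName)) {
        $kbsByHost[$hostName] = @()
    }

    # Append only KBs that are not already listed for this host
    if ($kbsByHost[$hostName] -notcontains $kb) {
        $kbsByHost[$hostName] += $kb
    }
}

$kbsByHost
```

The variable is named `$hostName` rather than `$host` because `$host` is a reserved automatic variable in PowerShell.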
    
  • #30249

    Tony Pagliaro
    Participant

    Cool, that'll come in handy when I figure out this other thing.

    I got the hostname and IP lines, and even the URL, but all separate.

    $RegexKB = '\([^A-Z]\d{6,8}|KB\d{6,8}\)'
    $RegexHN = 'Hostname:'
    $RegexURL = '([\da-z\.-]+)\.([\da-z\.-]+)\.([a-z]{2,6})'
    $RegexIP = 'IP:'
    

    How do I find all three together, so that I can ignore when the middle one isn't there?

    Eventually I need to parse this txt file. Every time the script runs into a HN/URL/IP trio of lines, it should add that URL string as a key in a hash table, with the array of KB strings from the above solution as its value. It should stop collecting values for that key when it hits the next URL or a HN/IP pair of lines without a URL.

    and come out with a hash table like so..

    $RawData = gc .\cimc.txt
    $KBNums = $RawData -match $RegexKB | ForEach-Object {'KB' + $_.Trim("K","B","(",")",":")} | select -Unique
    
    $hash = @{
    $RegexURL[0] = $KBNums;
    $RegexURL[1] = $KBNums;
    $RegexURL[2] = $KBNums;
    $RegexURL[3] = $KBNums;
    $RegexURL[4] = $KBNums;
    }
    

    I know it doesn't make sense the way I wrote it here, but you get the idea: my end goal right?

  • #30250

    Curtis Smith
    Participant

    Well, are those values always at the very top of the file like in the example content? If so, you can just pull them directly by index, without any filtering, and test them.

    $t = Get-Content .\cimc.txt
    
    $t[0]
    $t[1]
    $t[2]
    $t[3]
    
    If ($t[1]) {
        "Hostname not blank"
    }
    

    Results:
    Hostname:
    server-123.domain-123.com
    IP:
    128.59.238.12
    Hostname not blank

  • #30251

    Tony Pagliaro
    Participant

    No, it's just one file with hundreds of KB#s and dozens of server URLs.
    Basically it's a security report formatted so that IT guys pull their hair out trying to review the data. We are starting to get a lot of them and this would increase productivity a billion times (in my head).
    This function should output KB numbers so that I can pipe them into a script that searches for them in WUAU and outputs patch status and other info. But I'm getting ahead of myself.

    I tried using \n and \r and \n\r and \r\n but when I try something like
    '(Hostname:)\n(long regex string for URL)\n(IP:)'
    I get no matches.

  • #30252

    Curtis Smith
    Participant

    Yes, -match applies the regex to one line at a time; it will not compare against all three lines together. I think you will have to loop through the lines one at a time so that you can find your "start of record", which is indicated by a "Hostname:" line, then check the next line to see if it has a value, and decide what to do based on that result.

    For confirmation, the first instance of

    Hostname:

    IP:

    that you find, you want to stop searching completely. No need to finish the rest of the file?
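
As noted above, `-match` on the output of Get-Content tests each line separately, which is why patterns containing `\n` find nothing. As an alternative to looping, the lines can be joined into a single string first so that one regex can span the Hostname/URL/IP trio. A minimal sketch, with made-up sample lines and illustrative variable names; joining the array by hand also works on PowerShell 2.0, where `Get-Content -Raw` is not available.

```powershell
# Two sections: one with a hostname, one where "IP:" directly follows "Hostname:"
$lines = 'Hostname:', 'server-123.domain-123.com', 'IP:', '158.39.128.12',
         'Hostname:', 'IP:', '10.0.0.1'

# Join the lines into one string so the regex can see the line breaks
$text = $lines -join "`n"

# "Hostname:", then one non-blank line (captured), then "IP:"
$pattern = 'Hostname:\n(\S+)\nIP:'

$hostNames = [regex]::Matches($text, $pattern) |
    ForEach-Object { $_.Groups[1].Value }

$hostNames
```

The second section is skipped because nothing sits between "Hostname:" and "IP:".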

  • #30253

    Tony Pagliaro
    Participant

    Well, sort of. I want to come out with multiple sections. If I have to use a split at the beginning to accomplish this, I guess that will have to do, but then I'm iterating a bunch of times and creating that many txt files or some other output I really don't want.

    The file contains hostnames as URLs, but some devices such as network switches do not have hostnames but do appear on the report. It's random. You could switch the logic to say find a URL, then start collecting KB#s until you hit the next 'hostname:' line, then search for the next URL. That actually sounds a lot simpler.

    I'm not sure how I would loop thru inside of a regex.

  • #30254

    Curtis Smith
    Participant

    I think this is kind of what you are asking for. It will at least give you the building blocks to expand upon. Of course I only have the one sample of data, so I repeated it three times in my input file and blanked out the line below "Hostname:" in the second instance.

    Update: I went back and added comments throughout the script, so hopefully it will make sense how the script functions and give a better understanding of how to parse seemingly haphazard text in the future.

    # Regex String for matching KB number in input text
    $RegexKB = '\([^A-Z]\d{6,8}|KB\d{6,8}\)'
    
    # Initialize record variable as null
    $record = $null
    
    #get content from input file
    $cimc = Get-Content .\cimc.txt
    
    # Use the variable $i in a for loop where $i starts at 0 and increments by 1
    # until it is no longer less than the number of lines in the input file
    For ($i=0; $i -lt ($cimc.count); $i++) {
    
        # For the current line in the loop switch code execution based on the value
        Switch ($cimc[$i])
        {
            # If the current line's value is "Hostname:" then we have found the
            # beginning of a new record
            "Hostname:"
            {
                # Check to see if we are already working on a record and if so,
                # send it to the Pipeline as long as the Hostname value is not blank
                If ($record)
                {
                    If ($record.Hostname) {
                        $record
                    }#if
    
                    # Start a new record setting Hostname as the value on the next
                    # line (current line + 1), and the IPaddress as the value on
                    # the next 3rd line (current line + 3)
                    $record = [PSCustomObject]@{
                                    Hostname = $cimc[$i+1];
                                    IPAddress = $cimc[$i+3];
                                    KBs = @()
                              }
                }#if
                Else
                {
                    # Start a new record setting Hostname as the value on the next
                    # line (current line + 1), and the IPaddress as the value on
                    # the next 3rd line (current line + 3)
                    $record = [PSCustomObject]@{
                        Hostname = $cimc[$i+1];
                        IPAddress = $cimc[$i+3];
                        KBs = @()
                    }
                }#else
            }#Hostname:
    
            # If the current line's value is anything else then this default action
            # will be taken
            default
            {
                # For the current line, check whether it matches a KB number
                # based on our regex
                If ($cimc[$i] -match $RegexKB)
                {
                    # Replace all characters that are not numerical with nothing,
                    # prefix it with KB, and add it to the KBs array in the record
                    $record.KBs += @("KB$($cimc[$i] -replace '[^\d]')")
                }#if
            }#default
        }#switch
    }#for
    
    # Check to see if the last record has a Hostname value and output it to the
    # pipeline if so
    If ($record.Hostname) {
        $record
    }#if
    

    Results Like:

    Hostname                         IPAddress                       KBs                            
    --------                         ---------                       ---                            
    server-123.domain-123.com        128.59.238.12                   {KB3042553, KB3042553}         
    server-123.domain-123.com        128.59.238.12                   {KB3042553, KB3042553}         
    
  • #30272

    Tony Pagliaro
    Participant

    Looks good now.. so much thanks

    $RegexKB = '\([^A-Z]\d{6,8}|KB\d{6,8}\)'
    $record = $null
    $RawData = Get-Content .\cimc.txt
    
    For ($i=0; $i -lt ($RawData.count); $i++) {
        Switch ($RawData[$i]) 
        {
            "Hostname:"
            {
                # A new section starts here, so emit the previous record (if it
                # has a hostname and at least one KB) before replacing it
                If ($record -and $record.Hostname -and $record.KBs) {
                    $record
                }
                If ($RawData[$i+1] -ne 'IP:')
                {
                    $record = [PSCustomObject]@{
                        Hostname = if ($RawData[$i+1] -match "^dyn[0-9-]") {
                                       (($RawData[$i+1].TrimStart('dyn-­-') -split '\.')[0]) -replace '-­-','.'
                                   } else {
                                       ($RawData[$i+1] -split '\.')[0]
                                   }
                        KBs = @()
                    }
                } Else {
                    # Nothing between "Hostname:" and "IP:", so skip this section
                    $record = $null
                }
            }#Hostname:
            default {
                # Ignore KB lines while $record is $null (section without a hostname)
                If ($record -and $RawData[$i] -match $RegexKB) {
                    #replace all characters that are not numerical with nothing, prefix it with KB, and add it to the array
                    $record.KBs += @("KB$($RawData[$i] -replace '[^\d]')")
                }
            }#default
        }#switch
    }#for
    If ($record.Hostname -and $record.KBs) {
        $record
    }
    

    gives

    Hostname               KBs                  
    --------               ---                  
    174.59.10.178         {KB3017349, KB3057181, KB3058985, KB3033857}                      
    174.59.10.189         {KB3042553, KB3042553, KB2722479, KB3011443...}                   
    svrad01               {KB3000483, KB3041836, KB3032323, KB3046306...}                   
    174.59.10.131         {KB3042553, KB3011443, KB3000483, KB3046306...}                   
    svrcitrix             {KB3042553, KB3042553, KB2500212, KB2962486...}                   
    174.59.10.153         {KB3042553, KB3011443, KB3011780, KB3017349...}                   
    174.59.10.159         {KB3042553, KB3011443, KB3000483, KB3046306...}                   
    174.59.10.165         {KB3042553, KB3011443, KB3000483, KB3046306...}                   
    174.59.10.190         {KB3042553, KB3011443, KB3000483, KB3046306...}                   
    174.59.10.139         {KB3042553, KB3011443, KB3000483, KB3046306...}                   
    174.59.10.149         {KB3042553, KB3000483, KB3046306, KB3049576...}                   
    174.59.10.111         {KB3042553, KB3011443, KB3000483, KB3046306...}                   
    174.59.10.174         {KB3042553, KB3042553, KB3000483, KB3046306...}                   
    174.59.10.129         {KB3042553, KB3000483, KB3046306, KB3049576...}                   
    174.59.10.129         {KB3042553, KB2500212, KB3011443, KB3000483...}                   
    174.59.10.125         {KB3042553, KB2500212, KB3011443, KB3000483...}                   
    174.59.10.172         {KB3042553, KB3000483, KB3046306, KB3049576...}                   
    174.59.10.137         {KB3042553, KB3011443, KB3000483, KB3046306...}                   
    svrfile02             {KB3042553, KB3042553, KB3000483, KB3041836...}                   
    svrarchive            {KB3000483, KB3041836, KB3032323, KB3046306...}                   
    174.59.10.107         {KB2500212, KB3000483}                                            
    webmail               {KB3042553, KB3042553, KB3000483, KB3041836...}                   
    174.59.10.150         {KB3042553, KB3000483, KB3046306, KB3049576...}                   
    174.59.10.101         {KB3042553, KB3000483, KB3046306, KB3049576...}                   
    174.59.10.198         {KB3042553, KB3011443, KB3000483, KB3046306...}                   
    printer4              {KB3000414, KB2992611, KB3042553, KB2500212...}                   
    174.59.10.145         {KB3042553, KB3011780, KB3017349, KB3021674...}                   
    174.59.10.195         {KB3042553, KB3000483, KB3046306, KB3049576...}                   
    174.59.10.155         {KB3042553, KB3000483, KB3046306, KB3049576...}                   
    174.59.10.186         {KB3042553, KB3011443, KB3000483, KB3046306...}                   
    tradedev              {KB3042553, KB2962486, KB3000483, KB3041836...}                   
    svrdev                {KB3042553, KB3000483, KB3041836, KB3032323...}                   
    invest-­-serv         {KB3042553, KB3042553, KB3000483, KB3041836...}                   
    174.59.10.192         {KB3042553, KB3042553, KB3000483, KB3041836...}                   
    svrproddb             {KB3042553, KB3042553, KB3000483, KB3041836...}                   
    connect4              {KB3042553, KB3000483, KB3041836, KB3032323...}                   
    svrprodws             {KB3042553, KB3000483, KB3041836, KB3032323...}                   
    174.59.10.191         {KB3042553, KB3000483, KB3041836, KB3032323...}                   
    printer6              {KB3042553, KB3000483, KB3041836, KB3032323...}  
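
Several rows above still list the same KB twice (KB3042553, for example). Here is a sketch of de-duplicating each record's KBs after the loop has finished, assuming the script's output was collected into a variable. The variable `$records` and its sample data are hypothetical.

```powershell
# Hypothetical records shaped like the script's output
$records = @(
    (New-Object -TypeName PSObject -Property @{
        Hostname = 'svrfile02'
        KBs      = @('KB3042553', 'KB3042553', 'KB3000483')
    }),
    (New-Object -TypeName PSObject -Property @{
        Hostname = 'svrdev'
        KBs      = @('KB3042553', 'KB3000483')
    })
)

foreach ($r in $records) {
    # Select-Object -Unique keeps the first occurrence of each value;
    # @() keeps the result an array even if only one KB remains
    $r.KBs = @($r.KBs | Select-Object -Unique)
}

$records | Select-Object Hostname, KBs
```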
    
  • #30289

    Tony Pagliaro
    Participant

    Question, having issues running this on PS v2.0. Anything in this script that would make the output all weird and wonky? I have to write things for the lowest common denominator.

    Name                           Value
    ----                           -----
    Hostname                       251.59.174.178
    KBs                            {KB3017349, KB3057181, KB3058985, KB3033857}
    Hostname                       251.59.174.189
    KBs                            {KB3042553, KB3042553, KB2722479, KB3011443...}
    Hostname                       svrad01
    KBs                            {KB3000483, KB3041836, KB3032323, KB3046306...}
    Hostname                       251.59.174.131
    KBs                            {KB3042553, KB3011443, KB3000483, KB3046306...}
    Hostname                       svrcitrix
    KBs                            {KB3042553, KB3042553, KB2500212, KB2962486...}
    

    edit:
    just found this

    and it seems like I may need a new thread.

  • #30292

    Curtis Smith
    Participant

    Yes, PowerShell 2.0 does not support [PSCustomObject]@{}. That was introduced in 3.0.

    If you have to stick to 2.0, you will have to revert to using:

    New-Object -TypeName PSObject -Property @{}
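
A minimal sketch of how a record from the script could be created this way, with illustrative values. A plain hashtable does not preserve key order, so Select-Object is used here to fix the column order on output.

```powershell
# PowerShell 2.0-compatible replacement for [PSCustomObject]@{ ... }
$record = New-Object -TypeName PSObject -Property @{
    Hostname = 'svrad01'
    KBs      = @()
}

# Adding to the KBs array works the same as in the original script
$record.KBs += 'KB3000483'

# Pin the property order explicitly, since the hashtable does not guarantee it
$out = $record | Select-Object Hostname, KBs
$out
```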

  • #30309

    Tony Pagliaro
    Participant

    Perfect! Easy!
