Regex help

This topic contains 18 replies, has 6 voices, and was last updated by Curtis Smith 6 months ago.

  • #41026
    ertuu85
    Participant

    I have a multi-line string that I'm trying to grab a portion of, such as:

    $body
    
    [html]
    whatever
    whatever
    whatever
    [table class="MsoNormalTable" border="1" cellspacing="0" cellpadding="0" width="900" style="width:675.0pt;border:solid black 1.0pt"]
    ...
    ...
    ...
    [/table]
    whatever
    [/html]
    

    I've tried

    $body -match '[table class="MsoNormalTable" border="1" cellspacing="0" cellpadding="0" width="900" style="width:675.0pt;border:solid black 1.0pt"].*[/table]'
    

    Which just returns false. I imagine it's only returning one line and not reading until EOF. How can I get it to read everything between [table...[/table]?

    Edited to remove < > and replace with [ ]

    • This topic was modified 6 months ago by ertuu85.
  • #41039
    Don Jones
    Keymaster

    -match is supposed to return True/False, but it also creates the $matches collection, which is what you'd look at to see what it matched. Whether it matches the first instance or continues to look for additional instances depends on whether your regular expression was written to do that. And honestly, for this purpose, you might find Select-String to be a bit more useful than -match.

    But to go further, -match is only designed to tell you _if it found a match or not_. If you want to _capture_ what it matched, you need to write a capturing (group) subexpression in your regex. That will populate $matches with what it captured. You can even give your capture group a name in your regex, and $matches will use that name, making it easier to reference what it found.
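
    A minimal sketch of the capture-group idea, assuming a made-up single-line sample ($line and the group name attrs are hypothetical, not taken from the real email). Note that $Matches[0] holds the whole matched text even when the pattern has no groups; named and numbered groups are added alongside it.

    ```powershell
    # Hypothetical single-line sample; not the real email body
    $line = '[table class="MsoNormalTable" border="1"]cell data[/table]'

    # (?<attrs>...) is a named group; the plain (...) is numbered group 1
    if ($line -match '\[table (?<attrs>[^\]]*)\](.*)\[/table\]') {
        $Matches[0]      # the entire matched text
        $Matches.attrs   # the attribute list inside the opening tag
        $Matches[1]      # the text between the opening tag and [/table]
    }
    ```

    Because the sample is a single line, this sketch does not have to deal with line breaks inside the table.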

  • #41052
    ertuu85
    Participant

    Not sure how to use select-string here to grab and return my match, this below returns false...

    Select-String -InputObject $body -simplematch "[table class=`"MsoNormalTable`" border=`"1`" cellspacing=`"0`" cellpadding=`"0`" width=`"900`" style=`"width:675.0pt;border:solid black 1.0pt`"]*[/table]"
    
  • #41057
    Don Jones
    Keymaster

    Well, a couple of things. -SimpleMatch isn't a regular expression match; it's a plain literal-text match, so characters like * aren't treated as wildcards. And, by default, letting you know you have a match is all the cmdlet is supposed to do.

    Also, if you delimit your pattern in single quotes, you can use double quotes within and not have to escape them ;).
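
    To illustrate how -SimpleMatch and -Pattern treat the same text, here is a small sketch using a made-up one-line string ($text and the variable names are hypothetical). With -SimpleMatch the pattern is compared as literal text; with -Pattern it is a regex, so the brackets have to be escaped. Single quotes are used so the embedded double quotes need no backticks.

    ```powershell
    $text = '[table class="x"]data[/table]'

    # Literal comparison: finds the substring exactly as typed
    $literalTag = [bool]($text | Select-String -SimpleMatch '[table class="x"]')

    # Literal comparison: the sequence ]*[ does not occur in $text, so nothing is found
    $literalStar = [bool]($text | Select-String -SimpleMatch '[table class="x"]*[/table]')

    # Regex comparison: brackets escaped, .* matches the text in between
    $regexHit = [bool]($text | Select-String -Pattern '\[table class="x"\].*\[/table\]')

    $literalTag    # True
    $literalStar   # False
    $regexHit      # True
    ```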

    You should also know a bit about how regular expressions and patterns work. They're fairly literal – meaning if the attributes in that TABLE tag are in a different order, it won't match them. I'm assuming you already thought of that, and that the HTML you're using is consistent. But a -SimpleMatch isn't intended to _capture_ anything. As I wrote earlier, you need a _capturing subexpression_ in a regex.

    That means using -Pattern to specify your pattern. And, instead of "*" to match the inside of the TABLE, you're probably going to want to use something like (.+). Keep in mind that . only matches a single character; .+ means match one or more. The (parentheses) create a capturing subexpression. However, that example is a _greedy_ subexpression. That means, if your HTML contains more than one TABLE, it'll match from the beginning of the first one to the end of the last one, and everything in between. I'm not sure what your HTML looks like, or what your goal is, but you may need to modify it to be a _non-greedy_ subexpression.
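
    A rough sketch of greedy versus non-greedy capture, assuming a made-up two-table sample ($sample, $greedy and $lazy are hypothetical names). One extra detail matters for multi-line text: by default . does not match a newline, so the (?s) option is added at the start of each pattern to let it span lines.

    ```powershell
    # Hypothetical two-table sample, joined into one string with newlines
    $sample = '[table]first[/table]', 'between', '[table]second[/table]' -join "`n"

    # Greedy: runs from the first [table] to the LAST [/table]
    $greedy = [regex]::Match($sample, '(?s)\[table\](.+)\[/table\]').Groups[1].Value

    # Non-greedy: the added ? makes the capture stop at the FIRST [/table]
    $lazy = [regex]::Match($sample, '(?s)\[table\](.+?)\[/table\]').Groups[1].Value

    $greedy   # spans both tables and the "between" line
    $lazy     # first
    ```

    Without (?s), neither pattern can match a table whose opening and closing tags sit on different lines.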

    You probably want to use the -AllMatches switch, also.
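
    As a sketch of what -AllMatches changes, assuming a made-up one-line sample with two tables ($sample, $one and $all are hypothetical names):

    ```powershell
    $sample = '[table]first[/table] and [table]second[/table]'

    # Without -AllMatches only the first match in each input string is reported
    $one = ($sample | Select-String -Pattern '\[table\](.+?)\[/table\]').Matches

    # With -AllMatches every match in the string is reported
    $all = ($sample | Select-String -Pattern '\[table\](.+?)\[/table\]' -AllMatches).Matches

    $one.Count                                      # 1
    $all.Count                                      # 2
    $all | ForEach-Object { $_.Groups[1].Value }    # first, second
    ```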

    What you're trying to do is certainly straightforward, I think, but regular expressions aren't as straightforward as I wish they were ;). It'd be worth some time to read up on capturing subexpressions and greedy vs. non-greedy subexpressions, so you can figure out what the right technique is to meet your goal.

  • #41060
    ertuu85
    Participant

    Here is an example of the HTML: http://pastebin.com/MtSa06ue

    Basically I just want to grab the pertinent table and analyze the data in it

    The table will always start

    [table class=`"MsoNormalTable`" border=`"1`" cellspacing=`"0`" cellpadding=`"0`" width=`"900`" style=`"width:675.0pt;border:solid black 1.0pt`"]
    

    I can get it to match on

    Select-String -InputObject $body -pattern "[table class=`"MsoNormalTable`" border=`"1`" cellspacing=`"0`" cellpadding=`"0`" width=`"900`" style=`"width:675.0pt;border:solid black 1.0pt`"].*"
    

    But I can't get it to return everything up to [/table].

    But I've wasted more than enough of your time and I'll do some more research on my own, I'm sure experienced users are saying 'HE TOLD YOU WHAT TO DO ALREADY!!' 😉

    Thanks Don!

  • #41064
    Dan Potter
    Participant

    What do you aim to do with that string? Would it be easier to work with objects?

    $web = Invoke-WebRequest -Uri 'http://www.w3schools.com/html/html_tables.asp'
    $Web.ParsedHtml.getElementsByTagName("TABLE") | select -First 1

  • #41071
    ertuu85
    Participant

    I just need to grab the table starting on line 747 and ending on 868.

    I thought I could just use regex since it will always start (and should be unique) with:

    [table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" width="900" style="width:675.0pt"]

    and all the text between it to include the [/table].

    So at the end I would have the complete [table]...[/table] which I could create reports/alerts for and send in email form

  • #41081
    Don Jones
    Keymaster

    You know, if it's consistently at those line numbers, it's easy:

    Get-Content filename.html | Select-Object -Skip 746 -First 122

    😉

  • #41083
    ertuu85
    Participant

    I wish it was that easy 😉

    The entire HTML will actually be the body of an email that was retrieved through powershell, never makes it to a file. And I'm not sure if it always starts on 747, but the table header should be unique.

    If $body is the entire HTML, and I do a

    $body -match '.*' it only matches the first line, how would I make it so it matches the entire string?

  • #41089
    Dan Potter
    Participant
    
    $body = @'
    [html]
    whatever
    whatever
    whatever
    [table class="MsoNormalTable" border="1" cellspacing="0" cellpadding="0" width="900" style="width:675.0pt;border:solid black 1.0pt"]
    ...
    ...
    ...
    [/table]
    whatever
    [/html]
    '@
    
    
    ($body -split 'table class' | ? {$_ -like "=*"}).trimstart('=')
    
    
  • #41091
    random commandline
    Participant
    $body = '
    [html]
    whatever
    whatever
    whatever
    [table class="MsoNormalTable" border="1" cellspacing="0" cellpadding="0" width="900" style="width:675.0pt;border:solid black 1.0pt"]
    random text 1
    [/table]
    whatever
    [/html]
    '
    $body -match "table(?'table'.*)\[/table" ; $Matches.table
    
    • #41097
      ertuu85
      Participant

      Random Commandline, when I run your example, it comes back false

  • #41095
    Dan Potter
    Participant

    I guess if you just want that string and not what follows it.

    $body -split "`n" | ? {$_ -match 'table class'}

  • #41099
    Dan Potter
    Participant
    
    Import-Module -Name "C:\Program Files\Microsoft\Exchange\Web Services\2.0\Microsoft.Exchange.WebServices.dll"
    
    $s = New-Object Microsoft.Exchange.WebServices.Data.ExchangeService([Microsoft.Exchange.WebServices.Data.ExchangeVersion]::Exchange2010_SP1)
    
    $s.Credentials = New-Object Microsoft.Exchange.WebServices.Data.WebCredentials('me', 'Password', 'domain')
    
    $s.AutodiscoverUrl('me@domain.com', { $true })
    
    $inbox = [Microsoft.Exchange.WebServices.Data.Folder]::Bind($s, [Microsoft.Exchange.WebServices.Data.WellKnownFolderName]::Inbox)
    
    $emails = $inbox.FindItems(1)
    
    $emails.load()
    
    $emails.body.text |ConvertTo-Html | Select-String -Pattern 'head' -Context 0,3
    
    
    • This reply was modified 6 months ago by Dan Potter.
  • #41116
    random commandline
    Participant

    Make sure you run it in the consolehost not ISE.

  • #41118
    ertuu85
    Participant

    From console

    >> $body -match "table(?'table'.*)\[/table" ; $Matches.table
    False
    PS C:\Users\user>
    
  • #41130
    random commandline
    Participant

    Ok, I took a different approach to this. Not sure why my previous post didn't work for you, but try this.

    $body = '
    [html]
    whatever
    whatever
    whatever
    [table class="MsoNormalTable" border="1" cellspacing="0" cellpadding="0" width="900" style="width:675.0pt;border:solid black 1.0pt"]
    random text 1
    [/table]
    whatever
    [/html]
    ' 
    $newbody = ($body -split "\[table")[1] 
    ($newbody -split "\[/table]")[0]
    
  • #41224
    Rob Campbell
    Participant

    See if this doesn't work for you:

    
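    # Escape the literal opening tag so regex metacharacters such as [ and . are matched literally
    $matchstring = '[table class="MsoNormalTable" border="1" cellspacing="0" cellpadding="0" width="900" style="width:675.0pt;border:solid black 1.0pt"]'
    $matchstring = [regex]::Escape($matchstring)
    
    # (?s) lets . match newlines; .+? is non-greedy, so the capture stops at the first [/table]
    $regex = '(?ms)\[html\].+?' + $matchstring + '(.+?)\[/table\]'
    
    # $Matches[1] holds the text between the opening tag and [/table]
    if ($body -match $regex)
    {$lines = $Matches[1].Split("`n")}
    
    $lines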
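
    For reference, a small sketch of why the [regex]::Escape step matters, using a shortened, hypothetical version of the tag ($tag and $escaped are made-up names). Left unescaped, the square brackets form a character class that matches any one of the listed characters, which is why a raw tag used as a pattern does not behave as literal text.

    ```powershell
    $tag = '[table class="MsoNormalTable" border="1"]'
    $escaped = [regex]::Escape($tag)

    # Unescaped, [...] is a character class: any ONE listed character matches
    $classHit = 'a' -match $tag                 # True, because "a" is in the class

    # Escaped, the pattern only matches the literal tag text
    $literalMiss = 'a' -match $escaped          # False
    $literalHit = "x $tag y" -match $escaped    # True

    $classHit; $literalMiss; $literalHit
    ```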
    
  • #41336
    Curtis Smith
    Participant

    Hey @aaron-miller, this is the deal. Based on the description of your results, it appears that $body is of type System.String[] rather than System.String. Meaning it is an Array of strings, not a single string. RegEx does not process against an array like it would a string. You have two options here.

    Note: Below is tested using provided sample input:

    $body = @'
    [html]
    whatever
    whatever
    whatever
    [table class="MsoNormalTable" border="1" cellspacing="0" cellpadding="0" width="900" style="width:675.0pt;border:solid black 1.0pt"]
    random text 1
    [/table]
    whatever
    [/html]
    '@ -split "`n"
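
    Before choosing an option, it may help to confirm which case applies. A quick sketch, assuming a shortened, made-up body ($asArray and $asString are hypothetical names), that shows how -match behaves against an array versus a single string:

    ```powershell
    # Hypothetical body as an array of lines, and the same content as one string
    $asArray  = '[html]', '[table class="x"]', 'row', '[/table]', '[/html]'
    $asString = $asArray -join "`n"

    $asArray -is [array]      # True
    $asString -is [string]    # True

    # Against an array, -match acts as a filter and returns the matching elements
    $filtered = $asArray -match 'table'
    $filtered                 # the two lines that contain "table"

    # Against a single string, -match returns a Boolean and fills $Matches
    $found = $asString -match '(?s)\[table.*?\[/table\]'
    $found                    # True
    $Matches[0]               # the table block, line breaks included
    ```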
    

    1) If you don't care about the content being on separate lines, make the body a single string using -join

    $body = $body -join ""
    
    ($body | Select-String "\[table class=.*\[/table\]").Matches.Value
    

    Results

    [table class="MsoNormalTable" border="1" cellspacing="0" cellpadding="0" width="900" style="width:675.0pt;border:solid black 1.0pt"]random text 1[/table]
    

    2) If you do need to maintain the line uniqueness, loop through the body to find the start and stop of your table, then pull that section.

    for ($i=0; $i -lt $body.count; $i++) {
        If ($body[$i] -match "\[table class=|\[/table]") {
            Switch ($body[$i].Substring(0,8)) {
                "[table c" {$tablestart=$i}
                "[/table]" {$tablefinish=$i}
            }
        }
    }
    
    $body | Select-Object -Skip $tablestart -First ($tablefinish-$tablestart+1)
    

    Results

    [table class="MsoNormalTable" border="1" cellspacing="0" cellpadding="0" width="900" style="width:675.0pt;border:solid black 1.0pt"]
    random text 1
    [/table]
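
    A possible third variation, offered only as a sketch: join the lines with a newline instead of an empty string, so a single regex can be used and the line breaks survive. The sample lines are shortened and $lines, $text and $table are hypothetical names; (?s) lets . cross line breaks and .*? stops at the first [/table].

    ```powershell
    # Shortened, hypothetical body as an array of lines
    $lines = '[html]', 'whatever', '[table class="MsoNormalTable" border="1"]', 'random text 1', '[/table]', 'whatever', '[/html]'

    # Join with newlines so the original line structure is kept
    $text = $lines -join "`n"

    # Non-greedy match from the opening tag to the first closing tag
    $table = [regex]::Match($text, '(?s)\[table class=.*?\[/table\]').Value

    # Split back into lines if per-line processing is needed
    $tableLines = $table -split "`n"
    $tableLines
    ```

    If the body uses Windows line endings, each line may still carry a trailing carriage return that needs trimming.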
    
