Find text on a web page

This topic contains 15 replies, has 6 voices, and was last updated by Clarkcooke 5 days, 21 hours ago.

  • #16352
    bvi1998 .
    Participant

    Hi,
    I am accessing an internal web page that lists servers and their uuids. I would like to search it for a server and get its uuid.

    Each line on the html page looks like this:
    server1,7a063332-2e05-53338-ff5b-5843116ad838
    server2,5d883442-ab28-122b-9646-1cb44b8c344d
    server5,36c915bf-3d21-a222-b0a9-6f10d22d7b011

    After using the Invoke-RestMethod command and exporting to a CSV file, the exported output becomes a single line:
    server1,7a063332-2e05-53338-ff5b-5843116ad838server2,5d883442-ab28-122b-9646-1cb44b8c344dserver5,36c915bf-3d21-a222-b0a9-6f10d22d7b011

    #1 Is there an easy way to find the server and its uuid directly from the Invoke-RestMethod output?
    #2 If #1 cannot be done, how can I extract it from the CSV file?

    And no, I cannot get the uuid from the server at this point. I need this list.
    Thanks in advance!

  • #16355
    Don Jones
    Keymaster

    You might look into Invoke-WebRequest; I'm not sure exactly what you need to do, though.

    If the web server is returning that as raw text, rather than something XML- or JSON-encoded, then it can be difficult to detect the end-of-line characters, which is why you're getting a single line of output in your CSV. I'd try to solve that. If the items were coming through as individual lines, your task would be very easy – just use Import-Csv.

    Try looking at the raw data being returned by the server and see if it includes CRLF (carriage return, line feed) at the end of each line, or something else.
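
    A minimal sketch of that idea, assuming the response is plain text with one name,uuid pair per line. The $raw string is only a stand-in for whatever the server returns, and the Hostname/UUID header names are illustrative choices, not something the page provides:

```powershell
# Stand-in for the text returned by the server (assumed: CRLF or LF line endings).
$raw = "server1,7a063332-2e05-53338-ff5b-5843116ad838`r`nserver2,5d883442-ab28-122b-9646-1cb44b8c344d`nserver5,36c915bf-3d21-a222-b0a9-6f10d22d7b011"

# Split on CRLF or LF, drop empty lines, then let ConvertFrom-Csv build the objects.
$lines   = $raw -split '\r?\n' | Where-Object { $_.Trim() }
$servers = $lines | ConvertFrom-Csv -Header Hostname, UUID

# Look up one server and keep its uuid in a variable.
$uuid = ($servers | Where-Object { $_.Hostname -eq 'server2' }).UUID
$uuid
```

    If the real data uses a different separator between records, only the -split pattern should need to change.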

  • #16358
    Rob Simmers
    Participant

    You could probably use the HTML DOM to pull the data from the site. In IE, hitting F12 opens a DOM explorer that lets you determine which HTML object (DIV, SPAN, TABLE, TEXTBOX, etc.) contains your data. Then you can do something like this: http://powershell.com/cs/blogs/tips/archive/2013/09/02/importing-website-tables-into-excel.aspx

    In that blog post, they pull information out of a table. If the raw data is placed in a DIV, like so (standard tags replaced so the HTML would not render):

    {html}
    {head}
    {title}Awesome Website{/title}
    {/head}
    {body}
    {div id="div_Hdr"}Server Data{/div}
    {div id="div_ServerData"}
    server1,7a063332-2e05-53338-ff5b-5843116ad838{br/}
    server2,5d883442-ab28-122b-9646-1cb44b8c344d{br/}
    server5,36c915bf-3d21-a222-b0a9-6f10d22d7b011{br/}
    {/div}
    {/body}
    {/html}

    You could do something like:

    $content = @($data.ParsedHtml.getElementsByTagName("div"))[1].innerHTML

    The .getElementsByTagName call finds all DIVs on the page (there could be a LOT). In this example there are 2, so index 0 is div_Hdr and index 1 is div_ServerData (the result is an object array). You want the raw content contained in the DIV, which is .innerHTML. If you explore the DOM and see that the container has an ID, it's cleaner to just get the object by ID (or Name):

    $content = $data.ParsedHtml.getElementById("div_ServerData").innerHTML

    As Don alluded to, the real question is what format the data is in, so that you can parse it and turn it into a usable PowerShell object. If it's a DIV with {br /} line breaks, you can split the content on those and on the commas, and create a custom PSObject from the pieces. If it's a table, which is ideal, just search for 'parsing table dom powershell' and take it from there. With the built-in DOM explorer in IE (the feature is available in most browsers), you should be able to narrow down pretty quickly which container object holds the data and whether there is an easy way to identify it (e.g. ID or Name) as the starting point for your parsing. Good luck! If you figure it out, post some code for others to use.
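
    As a rough sketch of the split-on-line-breaks approach: the $content string below is a made-up stand-in for the container's innerHTML (it uses real angle-bracket BR tags, which is an assumption about what the page returns), and the Hostname/UUID property names are illustrative:

```powershell
# Stand-in for the container's innerHTML (assumed format: name,uuid followed by a BR tag).
$content = 'server1,7a063332-2e05-53338-ff5b-5843116ad838<BR>server2,5d883442-ab28-122b-9646-1cb44b8c344d<br/>server5,36c915bf-3d21-a222-b0a9-6f10d22d7b011<br />'

# Split on any spelling of the BR tag (-split is case-insensitive), skip blanks,
# then split each remaining line on the comma and build one object per server.
$servers = $content -split '<br\s*/?>' | Where-Object { $_.Trim() } | ForEach-Object {
    $name, $uuid = $_.Trim() -split ','
    New-Object -TypeName PSObject -Property @{ Hostname = $name; UUID = $uuid }
}

# Find a single server by name.
$match = $servers | Where-Object { $_.Hostname -eq 'server5' }
$match.UUID
```

    The regular expression tolerates BR, br/ and br / variants because browsers often normalize the tag differently from how it was written in the page source.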

    Edit: Forum guys, is there a proper way to show something like HTML literally, without rendering it? I tried pre, blockquote, and nothing at all, and every one of them rendered the HTML. Thanks.

  • #16359
    Rob Simmers
    Participant

    As a fun example, take this website. I hit F12 in my IE browser, clicked the element-selection cursor in the top middle of the toolbar, then clicked the forum topics list on the page. I could see that the container object holding all of the topics is a UL (unordered list) with an ID of bbp-forum-2683. Then I walked down through the objects until I arrived at the title, which was UL (bbp-forum-2683) > UL > LI > A > title. Note that the LI elements contain other A tags as well, but I only wanted the forum title, so I looked at the returned object attributes and saw a CSS class 'bbp-topic-title' that I used to filter the results:

    $URL = "https://powershell.org/forums/forum/windows-powershell-qa/"
    
    # reading website data:
    $data = Invoke-WebRequest -Uri $URL 
    
    # walk from the topic list container down to the title links and output each title:
    $data.ParsedHtml.getElementByID("bbp-forum-2683") | foreach{
        $_.getElementsByTagName("ul") | foreach{
            $_.getElementsByTagName("li") | Where{$_.className -eq 'bbp-topic-title'} | foreach{
                $_.getElementsByTagName("a") | foreach{
                    $_.Title
                }
            }
        }
    }
    

    Returns:

    Forums Tips and Guidelines
    Check these External Forums for Specific Topics
    Find text on a web page
    Help emailing formatted HTML – Please
    Need a script to use a threading.
    Powershell v3 & v2 Compatibilty
    Advanced Function Optional Parameter Problem
    Help Bulk Permission changes O365 Conferance Rooms
    Variable Output in Email
    Detecting param variable input only
    Scenario Training?
    How to Format a list to be emailed?
    how to temporarily un-protect certain objects??
    Powershell 2.0 and Sql server 2005 help
    import same.csv | For Each {
    Strange behavior with the Get-Alias cmdlet
    Redundancy of Code in Advanced Function

  • #16360
    Dave Wyatt
    Moderator

    Edit: Not finding a good way to post HTML code on the new forum plugin yet. It's even unwrapping double-escaped HTML.

  • #16361
    Rob Simmers
    Participant

    Dare I try to type in some JavaScript and see if that executes? It's impossible, right TweetDeck 🙂

  • #16362
    bvi1998 .
    Participant

    Wow thanks for the help everyone!

    I used F12 and browsed the DOM. It showed up like this, for example (written in poor format due to the limitations of posting HTML):

    after META, it says http equiv = content type content text/html

    all of the lines are within the body of the html

    Is the object body, or ? Hmm sorry I know nothing about html 🙁 but now it's on my list to learn 🙂

    What if I just take this file and remove the line breaks, or replace them?

    The result I am looking for is that I can search for the server name, then set a variable for its uuid found.

  • #16369
    Rob Simmers
    Participant

    It's possible it's just in the BODY. You can just try:

    $data.ParsedHtml.body.innerHTML

    See if that contains the server data. If it does, the rest is just parsing it.

  • #17787
    bvi1998 .
    Participant

    Hi,
    I'm back 🙂
    I now have an HTML file as output, and in it a table. It looks like this, with headers:

    hostname uuid
    server01 5d7b0d42-760d-cf02-4c0d-b00848b20a38
    server02 5d7b0d42-760d-cf02-4c0d-b00848b20a38

    What I am trying to do is search for server02, and set a variable to the server's uuid.

    Can someone help?
    Thanks!

  • #17795
    Adnan Rashid
    Participant

    Rob that is well good – pretty impressive what you can parse and how to get those details out.

  • #17796
    Rob Simmers
    Participant

    Once you have a PSObject, you are just using standard PowerShell commands and logic:

    $object = @()
    $object += New-Object -TypeName PSObject -Property @{Hostname="Server1";UUID="5d7b0d42-760d-cf02-4c0d-b00848b20a38"}
    $object += New-Object -TypeName PSObject -Property @{Hostname="Server2";UUID="5d7b0d42-760d-cf02-4c0d-b00848b20a38"}
    
    $object | Where{$_.HostName -eq "server1"} | foreach{ Some-Command -UUID $_.UUID }
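
    To get from a saved HTML table to objects like these, one possible sketch is below. It assumes a simple two-column table of hostname and uuid; the $html string stands in for the file contents (for example, what Get-Content would return as one string), and the regular expression is an illustrative shortcut that will not cope with nested or irregular markup:

```powershell
# Stand-in for the saved HTML file's contents (assumed: a plain two-column table).
$html = '<table><tr><th>hostname</th><th>uuid</th></tr>' +
        '<tr><td>server01</td><td>5d7b0d42-760d-cf02-4c0d-b00848b20a38</td></tr>' +
        '<tr><td>server02</td><td>5d7b0d42-760d-cf02-4c0d-b00848b20a38</td></tr></table>'

# Match every row made of two TD cells; the header row uses TH and is skipped.
$pattern = '(?is)<tr[^>]*>\s*<td[^>]*>(.*?)</td>\s*<td[^>]*>(.*?)</td>\s*</tr>'
$servers = [regex]::Matches($html, $pattern) | ForEach-Object {
    New-Object -TypeName PSObject -Property @{
        Hostname = $_.Groups[1].Value.Trim()
        UUID     = $_.Groups[2].Value.Trim()
    }
}

# Search for server02 and keep its uuid in a variable.
$uuid = ($servers | Where-Object { $_.Hostname -eq 'server02' }).UUID
$uuid
```

    A regular expression is used here only to keep the sketch free of extra dependencies; for anything beyond a flat table, a real HTML parser is the safer choice.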
  • #17811
    bvi1998 .
    Participant

    Thanks.
    But this is an HTML file I am searching in, so there is no PSObject. I think I am missing something?
    Thanks!

  • #17826
    Rob Simmers
    Participant

    You just create an empty object and then assign the pipeline output to it, like so:

    PS C:\> $URL = "https://powershell.org/forums/forum/windows-powershell-qa/"
     
    # reading website data:
    $data = Invoke-WebRequest -Uri $URL 
    $myPSObject = @() 
    # walk from the topic list container down to the title links and collect title and href:
    $myPSObject = $data.ParsedHtml.getElementByID("bbp-forum-2683") | foreach{
        $_.getElementsByTagName("ul") | foreach{
            $_.getElementsByTagName("li") | Where{$_.className -eq 'bbp-topic-title'} | foreach{
                $_.getElementsByTagName("a") | foreach{
                    $_ | Select Title, HREF
                }
            }
        }
    }
    
    $myPSObject
    
    title                                                  href                                                  
    -----                                                  ----                                                  
    Forums Tips and Guidelines                             https://powershell.org/forums/topic/forums-tips-a...
    Check these External Forums for Specific Topics        https://powershell.org/forums/topic/check-these-e...
    Little scripting help                                  https://powershell.org/forums/topic/little-script...
    WINRM authentication                                   https://powershell.org/forums/topic/winrm-authent...
    WINRM kerberos & Negotiate                             https://powershell.org/forums/topic/winrm-kerbero...
    WinRM with non-domain joined machine using Certs       https://powershell.org/forums/topic/winrm-with-no...
    Exchange cmdlet error change in PS 3 vs PS 4           https://powershell.org/forums/topic/exchange-cmdl...
    Find text on a web page                                https://powershell.org/forums/topic/find-text-on-...
    using select-object                                    https://powershell.org/forums/topic/using-select-...
    File Copy Access is denied                             https://powershell.org/forums/topic/file-copy-acc...
    Dell Warranty Information                              https://powershell.org/forums/topic/dell-warranty...
    Foreign Security Principals                            https://powershell.org/forums/topic/foreign-secur...
    Module review                                          https://powershell.org/forums/topic/module-review/  
    winrm g & e swithch diffrence                          https://powershell.org/forums/topic/winrm-g-e-swi...
    Help with setting up PSRemoting                        https://powershell.org/forums/topic/help-with-set...
    Help with adding a script method (and quite possibl... https://powershell.org/forums/topic/help-with-add...
    LOG for Copy-item                                      https://powershell.org/forums/topic/log-for-copy-...
    
    PS C:\> $myPSObject | Where{$_.Title -like "*module*"}
    
    title                                                  href                                                  
    -----                                                  ----                                                  
    Module review                                          https://powershell.org/forums/topic/module-review/  
    
    
  • #17913
    Dave Wyatt
    Moderator

    Do you know of a way to get this to work with a file on disk? If you pass a file:// URI to Invoke-WebRequest, you get back a different type of object which doesn't contain the parsed HTML objects.

    I've been able to use the HTML Agility Pack (http://htmlagilitypack.codeplex.com/) for this, which requires downloading an extra DLL (and it helps to be familiar with XPath), but is also quite fast compared to accessing the HTML DOM through the objects that Invoke-WebRequest returns:

    Add-Type -Path .\HtmlAgilityPack.dll
    
    $doc = New-Object HtmlAgilityPack.HtmlDocument
    $doc.Load("$pwd\forum.html")
    
    $links = $doc.DocumentNode.SelectNodes('//li[@class="bbp-topic-title"]/a')
    
    $properties = @(
        @{ Name = 'Title'; Expression = { $_.GetAttributeValue('Title', "") } }
        @{ Name = 'Href';  Expression = { $_.GetAttributeValue('href', "") } }
    )
    
    $links | Select-Object -Property $properties
    
  • #17914
    Dave Wyatt
    Moderator

    BTW, you can use the same library to handle parsing of live webpages as well. This has the benefit of keeping the faster performance of the HTML Agility Pack library, and consistent code regardless of where the HTML came from. To do so, use Invoke-WebRequest as before, and pass its Content property to the LoadHtml method of HtmlAgilityPack.HtmlDocument:

    Add-Type -Path .\HtmlAgilityPack.dll
    
    $URL = "https://powershell.org/forums/forum/windows-powershell-qa/" 
    $data = Invoke-WebRequest -Uri $URL 
    
    $doc = New-Object HtmlAgilityPack.HtmlDocument
    $doc.LoadHtml($data.Content)
    
    $links = $doc.DocumentNode.SelectNodes('//li[@class="bbp-topic-title"]/a')
    
    $properties = @(
        @{ Name = 'Title'; Expression = { $_.GetAttributeValue('Title', "") } }
        @{ Name = 'Href';  Expression = { $_.GetAttributeValue('href', "") } }
    )
    
    $links | Select-Object -Property $properties
    
    
  • #58897
    Clarkcooke
    Participant

    Use "Long path tool" software and keep yourself cool.
