Convert web page source to HTML object

  • #173263

    Hi all,

    I am rewriting a script that used the InternetExplorer.Application COM object to log in, navigate, and find elements (links) in the page source. I am now using the Selenium module with the Firefox driver to do the same.

    I can log in and navigate, but I can't find a way to convert the page source (an HTML string) into an object that I can filter by tag name and class to select hrefs.

    I found an example online that creates an HTML COM object and writes the page source into it to get a DOM object. However, when I look for 'a' tags in that object, nothing is found, as if the data was never added to it.

    I need your help converting a page source (HTML string) into an object that I can work with.

    Here is the sample code, which uses the Selenium module:

    
    $PSCred = Get-Credential
    $FFDriver = Start-SeFirefox
    $FFDriver.Navigate().GoToURL('https://login.somewebsite.com/')
    $FFDriver.FindElementByName('email').sendkeys($PSCred.Username)
    $FFDriver.FindElementByName('password').sendkeys($PSCred.GetNetworkCredential().password)
    $FFDriver.FindElementByName('password').submit()
    Start-Sleep -Seconds 3
    $FFDriver.Navigate().GoToURL('https://www.somewebsite.com/aaa/bbb/')
    $FFDriver.Title
    $Source = $FFDriver.PageSource
    # Create HTML file Object
    $HTML = New-Object -ComObject "HTMLFile"
    # Write HTML content according to DOM Level2
    $HTML.IHTMLDocument2_write($Source)
    # Keep only anchors whose href starts with the wanted prefix ($xxx) and whose class is 'xxx'
    $LinkElements = $HTML.getElementsByTagName('a') | Where-Object { $_.href -like "$xxx*" -and $_.className -eq 'xxx' }
    
    # cleanup
    Remove-Variable -Name PSCred
    $FFDriver.Close()
    $FFDriver.Quit()
    $FFDriver.Dispose()
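
    An alternative to parsing the page source is to ask the driver itself for the elements while the browser is still open. The following is a minimal sketch, assuming the Selenium 3.x .NET bindings (which the FindElementByName calls above suggest). Get-SeLinkByClass and its parameter names are hypothetical helpers for illustration and are not part of the Selenium module.

    ```powershell
    # Hypothetical helper: returns link text and href for anchors with a given class.
    # Assumes $Driver exposes FindElementsByCssSelector (Selenium 3.x RemoteWebDriver).
    function Get-SeLinkByClass {
        param(
            [Parameter(Mandatory)] $Driver,
            [Parameter(Mandatory)] [string] $ClassName,
            [string] $HrefPrefix = ''
        )
        # 'a.<class>' is a CSS selector for anchor elements carrying that class
        foreach ($element in $Driver.FindElementsByCssSelector("a.$ClassName")) {
            $href = $element.GetAttribute('href')
            # Skip anchors without an href or with an href outside the wanted prefix
            if ($href -and $href.StartsWith($HrefPrefix, [System.StringComparison]::OrdinalIgnoreCase)) {
                [pscustomobject]@{ Text = $element.Text; Href = $href }
            }
        }
    }
    ```

    It would be called before the cleanup step, for example as Get-SeLinkByClass -Driver $FFDriver -ClassName 'xxx' -HrefPrefix 'https://www.somewebsite.com/'. With newer Selenium bindings the lookup would probably need to go through FindElements and a By.CssSelector locator instead.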
    
    
  • #173272

    ofergnant,

    Please provide some examples of the data you are looking for; if you can point to a public site, that will help. Are you not able to use Invoke-WebRequest to call the website?

    Invoke-WebRequest -Uri https://www.google.com

     

  • #173416

    It would be better if I could use Invoke-WebRequest to log in. The website uses Auth0 (three redirects) with JavaScript and AJAX, and when I tried it no forms were created, I guess because they take longer to load while the JavaScript is pulled from an external source. I also wasn't sure whether I could keep the session logged in while parsing and following links, but I think the -SessionVariable parameter does that, so if I manage to log in I am good for the rest of the script.
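
    For reference, this is roughly how -SessionVariable and -WebSession carry cookies from one request to the next. It is only a sketch under stated assumptions: Get-PageWithSession is a hypothetical helper, the field names 'email' and 'password' are taken from the Selenium snippet above, and posting the credentials straight to the login URL is a guess about the form. It would only apply to a plain form-based login, not to a JavaScript-driven flow like the one described here.

    ```powershell
    # Hypothetical helper: log in with a plain form post, then fetch a page in the same session.
    function Get-PageWithSession {
        param(
            [Parameter(Mandatory)] [string] $LoginUri,
            [Parameter(Mandatory)] [string] $TargetUri,
            [Parameter(Mandatory)] [pscredential] $Credential
        )
        # The first request creates $session, which stores the cookies the server sets
        $null = Invoke-WebRequest -Uri $LoginUri -SessionVariable session

        # Assumed form fields; a real login form may use other names or another URL
        $body = @{
            email    = $Credential.UserName
            password = $Credential.GetNetworkCredential().Password
        }
        $null = Invoke-WebRequest -Uri $LoginUri -Method Post -Body $body -WebSession $session

        # Later requests reuse the same session object and therefore the same cookies
        Invoke-WebRequest -Uri $TargetUri -WebSession $session
    }
    ```

    Note that -SessionVariable takes the variable name without the dollar sign, while -WebSession takes the variable itself.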

    For now, I have found a workaround using HtmlAgilityPack, suggested here: https://powershell.org/forums/topic/find-text-on-a-web-page/ . I create the HTML object much as I did before, but its LoadHtml method does a better job of writing the page source into the object. Using the example from that thread, I managed to find links, filter them by class name, and pull out titles and hrefs.

    I still need to learn more about HtmlAgilityPack's filter options, as they use a syntax (XPath) I am not familiar with.
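
    The filter strings passed to SelectNodes are XPath expressions. The sketch below shows a few common patterns on a tiny, made-up snippet, using the XML parser built into .NET purely to demonstrate the syntax. Real pages are rarely well-formed XML, so this parser is not a substitute for HtmlAgilityPack, and the sample markup, class names, and URLs are all hypothetical.

    ```powershell
    # Made-up, well-formed sample markup for demonstrating XPath filters
    [xml]$sample = '<div>' +
        '<a class="headline" href="https://www.somewebsite.com/aaa/one" title="One">First</a>' +
        '<a class="headline featured" href="https://www.somewebsite.com/aaa/two" title="Two">Second</a>' +
        '<a class="footer" href="https://other.example/three" title="Three">Third</a>' +
        '</div>'

    # Exact match on the whole class attribute (misses "headline featured")
    $exact = $sample.SelectNodes('//a[@class="headline"]')

    # Substring match, useful when an element carries several classes
    $contains = $sample.SelectNodes('//a[contains(@class, "headline")]')

    # Conditions can be combined with "and" / "or"
    $combined = $sample.SelectNodes('//a[contains(@class, "headline") and starts-with(@href, "https://www.somewebsite.com/aaa/")]')

    # Print title and href of the combined matches
    $combined | ForEach-Object { '{0} -> {1}' -f $_.title, $_.href }
    ```

    Here // means "anywhere in the document", a is the tag name, and the part in square brackets is a condition on attributes, which are written with a leading @.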

    If anyone can suggest a simpler approach that gets Invoke-WebRequest to work, I would prefer it, as I don't like having many dependencies in my automation scripts.

    BTW, the website I am not able to log in to using Invoke-WebRequest is https://login.thetimes.co.uk

    Here is the code addition:

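    # One-time setup: register the NuGet source and install the package
    Register-PackageSource -Name MyNuGet -Location https://www.nuget.org/api/v2 -ProviderName NuGet
    Install-Package HtmlAgilityPack
    # Load the assembly so New-Object can find the type.
    # Assumption: the DLL is under lib\Net45 beside the installed .nupkg; adjust for your install.
    $PackageDir = Split-Path -Path (Get-Package -Name HtmlAgilityPack).Source
    Add-Type -Path (Join-Path -Path $PackageDir -ChildPath 'lib\Net45\HtmlAgilityPack.dll')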
    
    $Source = $FFDriver.PageSource
    # Create HTML object
    $doc = New-Object HtmlAgilityPack.HtmlDocument
    $doc.LoadHtml($Source)
    # Get all today's new links
    $Links = $doc.DocumentNode.SelectNodes('//a[@class="classname"]')
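
    Pulling the titles and hrefs out of the selected nodes could look like the sketch below. ConvertTo-LinkRecord is a hypothetical helper name. It relies on the GetAttributeValue method and InnerText property of HtmlAgilityPack nodes, and it guards against SelectNodes returning $null when nothing matches.

    ```powershell
    # Hypothetical helper: turn HtmlAgilityPack anchor nodes into simple objects.
    function ConvertTo-LinkRecord {
        param($Nodes)
        # SelectNodes returns $null rather than an empty collection when nothing matches
        if ($null -eq $Nodes) { return }
        foreach ($node in $Nodes) {
            [pscustomobject]@{
                # The second argument is the default used when the attribute is missing
                Title = $node.GetAttributeValue('title', '')
                Text  = $node.InnerText.Trim()
                Href  = $node.GetAttributeValue('href', '')
            }
        }
    }
    ```

    For example, ConvertTo-LinkRecord -Nodes $Links | Where-Object Href -like 'https://www.somewebsite.com/*' would keep only the links under that address.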
    
    
