Convert web page source to HTML object

Viewing 2 reply threads
  • Author
    Posts
    • #173263
      Participant
      Topics: 6
      Replies: 15
      Points: 14
      Rank: Member

      Hi all,

      I am rewriting a script that used the InternetExplorer.Application COM object to log in, navigate, and find elements (links) in the page source. I am now using the Selenium module with the Firefox driver to do the same.

      I have reached the point where I can log in and navigate, but I can't find a way to convert the page source (an HTML string) into an object that I can filter by tag name and class to select hrefs.

      I found an example on a website showing that I could create an HTML COM object and write the page source to it to get a DOM object. However, it still doesn't work: when I look for 'a' tags, nothing is found, as if the data was never added to the object.

      I need your help converting the page source (an HTML string) into an object that I can work with.

      Here is the sample code, which uses the 'selenium' module:

      
      $PSCred = Get-Credential
      $FFDriver = Start-SeFirefox
      $FFDriver.Navigate().GoToURL('https://login.somewebsite.com/')
      $FFDriver.FindElementByName('email').sendkeys($PSCred.Username)
      $FFDriver.FindElementByName('password').sendkeys($PSCred.GetNetworkCredential().password)
      $FFDriver.FindElementByName('password').submit()
      Start-Sleep -Seconds 3
      $FFDriver.Navigate().GoToURL('https://www.somewebsite.com/aaa/bbb/')
      $FFDriver.Title
      $Source = $FFDriver.PageSource
      # Create HTML file Object
      $HTML = New-Object -ComObject "HTMLFile"
      # Write HTML content according to DOM Level2
      $HTML.IHTMLDocument2_write($Source)
      $LinkElements = $HTML.getElementsByTagName('a') | Where-Object { $_.href -like "$xxx*" -and $_.className -eq 'xxx' }
      
      # cleanup
      Remove-Variable -Name PSCred
      $FFDriver.Close()
      $FFDriver.Quit()
      $FFDriver.Dispose()
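
As an alternative to converting the page source at all, Selenium can be asked for the matching elements directly. The sketch below assumes the Selenium 3 .NET bindings implied by the `FindElementByName` calls above, where the driver also exposes `FindElementsByCssSelector`. `Get-SeLink` is a hypothetical helper name and `a.xxx` is a placeholder selector, not something taken from the real site.

```powershell
# Hypothetical helper: returns the text and href of every element that
# matches a CSS selector, using the live browser DOM instead of PageSource.
function Get-SeLink {
    param(
        [Parameter(Mandatory)] $Driver,               # e.g. the object returned by Start-SeFirefox
        [Parameter(Mandatory)] [string] $CssSelector  # e.g. 'a.xxx' for <a class="xxx">
    )

    foreach ($element in $Driver.FindElementsByCssSelector($CssSelector)) {
        [pscustomobject]@{
            Text = $element.Text
            Href = $element.GetAttribute('href')
        }
    }
}

# Usage (run before closing the driver):
# Get-SeLink -Driver $FFDriver -CssSelector 'a.xxx' | Where-Object { $_.Href -like 'https://www.somewebsite.com/*' }
```

Because the lookup runs against the rendered page, links that JavaScript adds after load should be visible as well. On newer Selenium versions the equivalent call would probably be `FindElements` with a `By` locator instead.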
      
      
    • #173272
      Participant
      Topics: 0
      Replies: 115
      Points: 433
      Helping Hand
      Rank: Contributor

      ofergnant,

      Please provide some examples of the data you are looking for; if you can point to a public site, that will help. Are you not able to use Invoke-WebRequest to call the website?

      Invoke-WebRequest -Uri https://www.google.com

       

    • #173416
      Participant
      Topics: 6
      Replies: 15
      Points: 14
      Rank: Member

      It would be better if I could use Invoke-WebRequest to log in. The website uses Auth0 (3 redirects) with JavaScript and AJAX, and when I tried it no forms were returned, I guess because they take more time to load, pulling the JS from an external source. I also wasn't sure whether I could keep the session logged in while parsing and following links, but I think the -SessionVariable parameter does that, so if I manage to log in I am good for the rest of the script.
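
A minimal sketch of the session idea, assuming the login step itself can be made to work: `-SessionVariable` on the first request creates a session object, and passing it to later calls with `-WebSession` reuses the same cookies. `Get-PageLink` is a hypothetical helper name. The sketch also assumes each object in the response's `Links` collection exposes the tag's attributes (such as `class` and `href`) as properties, which is worth checking on your PowerShell version. Invoke-WebRequest does not execute JavaScript, so it may still not be enough for a login flow that is built client-side.

```powershell
# Hypothetical helper: fetch a page (optionally inside an existing session)
# and return the href of every link with the given class attribute.
function Get-PageLink {
    param(
        [Parameter(Mandatory)] [string] $Uri,
        [Parameter(Mandatory)] [string] $ClassName,
        $WebSession   # session object created earlier with -SessionVariable
    )

    $params = @{ Uri = $Uri }
    if ($WebSession) { $params.WebSession = $WebSession }

    $response = Invoke-WebRequest @params
    $response.Links |
        Where-Object { $_.class -eq $ClassName } |
        Select-Object -ExpandProperty href
}

# Usage (the login request is site-specific and only indicated here):
# Invoke-WebRequest -Uri 'https://login.somewebsite.com/' -SessionVariable session | Out-Null
# Get-PageLink -Uri 'https://www.somewebsite.com/aaa/bbb/' -ClassName 'xxx' -WebSession $session
```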

      For now, I have found a workaround using the HTML Agility Pack, suggested here: https://powershell.org/forums/topic/find-text-on-a-web-page/ . I create the HTML object the same way as before, only its LoadHtml method does a better job of writing the page source into the object. Using the example from that thread, I managed to find links, filter them by class name, and then pull out titles and hrefs.

      I still need to learn more about the Agility Pack's filter options, as they use a syntax I am not familiar with.

      If anyone can suggest a simpler approach that gets Invoke-WebRequest to work, I would prefer it, as I don't like having many dependencies in my automation scripts.

      BTW, the website I am not able to log in to using Invoke-WebRequest is https://login.thetimes.co.uk
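
The filter syntax that `SelectNodes` accepts is XPath. The sketch below tries a few common patterns using PowerShell's built-in XML support, so it runs without extra packages. The sample markup, class names, and URLs are made up for illustration, and the built-in parser needs well-formed markup, unlike the Agility Pack.

```powershell
# Small, well-formed sample document (hypothetical markup, not the real site)
$sample = [xml]'<html><body><a class="headline" href="https://example.com/news/1" title="First">One</a><a class="headline" href="https://example.com/sport/2" title="Second">Two</a><a class="footer" href="https://example.com/about">About</a></body></html>'

# //a                  : every <a> element anywhere in the document
# [@class="headline"]  : keep only those whose class attribute equals "headline"
$byClass = $sample.SelectNodes('//a[@class="headline"]')

# starts-with() and contains() are XPath 1.0 string functions
$byPrefix = $sample.SelectNodes('//a[starts-with(@href, "https://example.com/news/")]')

# Conditions can be combined with "and" / "or"
$both = $sample.SelectNodes('//a[@class="headline" and contains(@href, "/sport/")]')

# Attributes of each matched element are available as properties
$byClass | ForEach-Object { '{0} -> {1}' -f $_.title, $_.href }
```

Note that `@class="headline"` is an exact match on the whole attribute value; for elements carrying several classes, `contains(@class, "headline")` is a common looser test.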

      Here is the code addition:

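      Register-PackageSource -Name MyNuGet -Location https://www.nuget.org/api/v2 -ProviderName NuGet
      Install-Package HtmlAgilityPack
      # Install-Package only downloads the package; the DLL still has to be loaded.
      # Assumption: the package folder contains lib\Net45; adjust the subfolder to your runtime.
      $PackageDir = Split-Path (Get-Package HtmlAgilityPack).Source
      Add-Type -Path (Join-Path $PackageDir 'lib\Net45\HtmlAgilityPack.dll')
      
      $Source = $FFDriver.PageSource
      # Create HTML object
      $doc = New-Object HtmlAgilityPack.HtmlDocument
      $doc.LoadHtml($Source)
      # Get all today's new links (SelectNodes takes an XPath expression)
      $Links = $doc.DocumentNode.SelectNodes('//a[@class="classname"]')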
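
The last step described above, pulling titles and hrefs out of the matched nodes, could look roughly like this. `ConvertTo-LinkInfo` is a hypothetical helper name. The sketch assumes the nodes are Agility Pack `HtmlNode` objects, which expose `InnerText` and `GetAttributeValue(name, default)`, and that `SelectNodes` may return `$null` rather than an empty collection when nothing matches.

```powershell
# Hypothetical helper: turn matched <a> nodes into simple Title/Href objects.
function ConvertTo-LinkInfo {
    param([AllowNull()] $Nodes)

    # Nothing matched: emit nothing instead of failing on a null collection
    if ($null -eq $Nodes) { return }

    foreach ($node in $Nodes) {
        [pscustomobject]@{
            # Prefer the title attribute, fall back to the link text
            Title = $node.GetAttributeValue('title', $node.InnerText.Trim())
            Href  = $node.GetAttributeValue('href', '')
        }
    }
}

# Usage:
# ConvertTo-LinkInfo -Nodes $doc.DocumentNode.SelectNodes('//a[@class="classname"]')
```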
      
      
  • The topic ‘Convert web page source to HTML object’ is closed to new replies.