Help getting table data out of HTML (Scraping)

Welcome Forums General PowerShell Q&A Help getting table data out of HTML (Scraping)

This topic contains 9 replies, has 4 voices, and was last updated by

 
Participant
1 month, 2 weeks ago.

  • Author
    Posts
  • #117396

    Participant
    Points: 45
    Rank: Member

    Hey guys. I am totally stuck trying to pull a table out of some HTML.
    I am logging onto a website and pulling back the HTML information but for the life of me, I can't seem to get that data into a useful format.

    I'm scraping using 'New-Object -com' because I couldn't work out how to use 'Invoke-Webrequest' and log in so I fell back to InternetExplorer.Application as this was working for me (plus I can set it to visible and see it).

    After login I am collecting the information like so

    $tables = @($ie.Document.getElementsByTagName('table'))
    

    and inside that, I have a field called 'innerHTML' that contains my table. I just can't work out how to convert the information in that table into something readable and I'm wanting to perform a foreach at the end to upload some cells via an API to another site.

    I've tried converting to XML\JSON, and splitting on "`n"I've followed every link on the internet that I can find and tried all their solutions but none seem to work for me. I have also tried a bunch of functions sourced from the net but they also no data returned.

    I have put an edited copy of the table in this pastebin.. it only has 2 rooms, but my original data will have 50 (unless I can work out how to return all object and not just the first page, then it will have about 300)

    https://pastebin.com/R7FfHCvX
    

    If anyone could point me in the right direction as i'm all out of ideas?
    I think that this would be a fairly common task in PowerShell so I'm guessing that I just don't know the terminology to search for it on google.

  • #117406

    Participant
    Points: 45
    Rank: Member

    This is what the table looks like on the website if it helps.

    View post on imgur.com
  • #117531

    Participant
    Points: 10
    Rank: Member

    What site are you logged into and what information do you need from 'innerHTML'? Also, using the 'getElementById()' method will be faster.

    • #117552

      Participant
      Points: 45
      Rank: Member

      The site is zoom.us, its just a meeting room application and I'm accessing its dashboard to collect the status of the rooms in error so that I can update my CMDB and send alerts off it if required. The image in the link above will explain better what I am wanting to pull out of the table, but I am after the following Row\Cell information
      for Room Name, Health, Issue

      In the Pastebin, they are on the following lines.
      Room Name – Line 58
      Health – Line 64
      Issue – Lines 67,68

      I did try 'getElementById()' method but there is no ID for that information, just the table (unless I am missing something).

      I thought that I could just dump it all into an Object and then just access the information by dot sourcing it or maybe by using ConvertTo-XML first but that doesn't work. So I tried a whole heap of examples that I found online but none of the examples worked for me no matter what I tried. I can always get to the step of accessing the table, but whatever the next step is in the example, and no matter how I reworked the function it always failed for me.

    • #117553

      Participant
      Points: 45
      Rank: Member

      Ok, I was being stupid. I see what you mean about using the 'getElementById()' method.
      So I can pull just the first entry. I'll just have to work out how to pull the others, then I can do a replace or regex on the string or something.

      Edit:
      I still may not be able to do it that way as each section reuses the ID and according to this, it can't be reused.

      https://stackoverflow.com/questions/9435616/powershell-ie-automation-getelementbyid-with-multiple-entries
      
  • #117666

    Participant
    Points: 45
    Rank: Member

    If I use the 'getElementById()' method then I can access the data for the first row

    , but as each row all have the same name I can't get the others. e.g ID="RoomName".
    Then if I instead take the HTML and split on "< .?tr>" so that I can iterate through each row\tr then I can no longer access the data via the method (which would have solved my issue, as "outerText" has the info that I need in plain text).

    I'm still not sure how this should be grabbed.
    Do you have any other ideas for me?

    I managed to get invoke-webrequest to work, so will try with ParsedHtml to see if that make a difference.

  • #117733

    Participant
    Points: 307
    Helping Hand
    Rank: Contributor

    How are you walking the web site to ensure you are on / using the right object?

    For example, If I wanted to use say pwpush.com and interact with it.

    $url = 'https://pwpush.com'
    
    ($FormElements = Invoke-WebRequest -Uri $url -SessionVariable fe) 
    
    ($Form = $FormElements.Forms[0]) | Format-List -Force
    
    $Form | Get-Member
    
    $Form.Fields
    
    
    # Full pass
    
    Clear-Host
    
    $password = '1234'
    $loginUrl = 'https://pwpush.com'
    
    $ie = New-Object -com internetexplorer.application
    $ie.visible = $true
    $ie.navigate($loginUrl)
    
    while ($ie.Busy -eq $true) { Start-Sleep -Seconds 1 }
    
    ($ie.document.getElementById('password_payload') | select -first 1).value = $password
    Start-Sleep -Seconds 1 
    
    $ie.Document.getElementsByName('commit').Item().Click();
    Start-Sleep -Seconds 1 
    

    Now pulling table info from a page, you'd do similar stuff, using something like this...

    # get the first table found on the website and write it to disk:
    @($data.ParsedHtml.getElementsByTagName("table"))[0].OuterHTML
    

    … of course that is only after you've walked the site to get the correct element.

  • #117913

    Participant
    Points: 60
    Rank: Member

    Hey Gary, I've taken a look at their website, and think you might be making it a bit too tricky than it needs to be. They have an API available which you can use https://zoom.github.io/api/ , which will be able to give you all this information without the need for scraping. Judging by the number of rooms you're talking about, it sounds like you have a license which would allow you to use it with an API key and will give you data back to do with what you want and not need to scrape.

    For any websites, I recommend in taking a look using Chrome with Developer Tools to have a look at what's happening in the background. Take a look particularly for XHR requests, which are API calls. But either way, you get lots of information that help hugely to automate getting data.

    Scraping has its uses, but very quickly builds up technical debt that is as important to focus on as the coding itself.

    • #118311

      Participant
      Points: 45
      Rank: Member

      Wow. I had no idea they had an API. I will use that.

      For completeness, I was doing the following.

      $postParams = @{email="email@domain.com";password="password"}
      try{
      Invoke-WebRequest "https://zoom.us/signin" -SessionVariable Zoom -Body $postParams -Method Post
      $Dashboard = Invoke-WebRequest "https://zoom.us/account/metrics/zp" -WebSession $Zoom -Body $postParams -Method Post
      }
      catch {
      $err=$_.Exception}
      
      $splitTR = $Dashboard.ParsedHtml.documentElement.innerHTML -split "< .?tr>"
      
      

      But I will definitely switch to the API. It would have been good if they told me they had one when I last spoke with there support about there dashboard.

  • #118348

    Participant
    Points: 45
    Rank: Member

    So turns out that you can't see the dev options until you have registered on another site (https://developer.zoom.us).. not sure why they didn't tell me about the API when I was asking them questions about the dashboard but using the API obviously helped.

    Anyway, that was a simple solution, took me about 10min after I sorted that out.

    For completeness I was doing this to scrape (now changed to API call).

    $postParams = @{email="email@domain.com";password="password"}
    try{
        Invoke-WebRequest "https://zoom.us/signin" -SessionVariable Zoom -Body $postParams -Method Post
        $Dashboard = Invoke-WebRequest "https://zoom.us/account/metrics/zp" -WebSession $Zoom -Body $postParams -Method Post
        $Dashboard.ParsedHtml.nameProp}
    catch {
        $err=$_.Exception}
    $splitTR = $Dashboard.ParsedHtml.documentElement.innerHTML -split "< .?tr>"

You must be logged in to reply to this topic.