Extracting parent tables on HTML

Welcome Forums General PowerShell Q&A Extracting parent tables on HTML

Viewing 5 reply threads
  • Author
    Posts
    • #200177
      Participant
      Topics: 5
      Replies: 11
      Points: 67
      Rank: Member

      Hi, second post (soon I will become to be helpful to others),

      I’m trying to parse a local html report, to extract some data and output it to csv.

      The problem comes with a report with nested tables, inside tds…

      And what I want to do is get the tables on depth 0, and from there iterate row by row to extract the other data. I’ve been searching how to get that tables, but the documentation is… disperse, and I didn’t find anything nearly like that.

      Thanks for all the help!

      text 1 text2 text3
      text 4 text5 text6
      text7 text8
      text9.1 text9.2 text9.3
      text9.4 text9.5 text9.6
      text9.7 text9.8 text9.9
      • This topic was modified 2 months ago by Jcstdio83.
    • #200195
      Participant
      Topics: 12
      Replies: 1489
      Points: 1,987
      Helping Hand
      Rank: Community Hero

      You cannot place HTML or XML in forum directly, you need to use Gist to post that content. Traditionally, you want to get the data from the source and if you cannot the directly query that, then ideally you could get the report in another format like CSV or JSON that allow direct parsing. With that said, there are many examples of how to parse HTML and luckily there are pre-written functions\cmdlets to do the parse:

      https://www.powershellgallery.com/packages/PowerHTML/0.1.6/Content/Public%5CConvertFrom-HTML.ps1
      https://stackoverflow.com/questions/26984199/how-to-extract-specific-values-from-a-html-table-and-copy-them-horizontally-to-a

    • #200201
      Participant
      Topics: 5
      Replies: 11
      Points: 67
      Rank: Member

      Yes, sorry for the html, I changed it into the table format in the page.

      For the second part, I already saw that scripts, but neither of them say anything about working with the depth of the tables, or how to deal with the nested tables.

    • #200258
      Participant
      Topics: 12
      Replies: 1489
      Points: 1,987
      Helping Hand
      Rank: Community Hero

      Have you actually tested any of them? It is possible that it may not parse as expected, but it does show you how to do basic parsing so that you can modify it to work as required. There are many times I’ve gotten example code to just understand the fundamentals and then updated to do what I specifically need. If have specific questions about how to modify code and can explain what is or is not working, then post your questions with examples (in Gist)

    • #200369
      Participant
      Topics: 5
      Replies: 11
      Points: 67
      Rank: Member

      Hi, thanks for the reply, here I paste the code:

      foreach($n_file in $node_files){
      
          # create the object to treat the html document
      
          $html = New-Object -ComObject "HTMLFile"
      
          # read the content of the html
      
          $source = Get-Content -Path $n_file.FullName -Raw
      
          # assign the content to the object
      
          $html.IHTMLDocument2_write($source)
      
          # get the tables of the document
      
          $tables = $html.getElementsByTagName('table')
      
          # create the array to store the info
      
          $arrayTable = @(@())
      
          # iterate over all the tables in the html file
      
          foreach($table in $tables){
      
              # get all the tr elements in the current table
      
              $trs = $table.getElementsByTagName('tr')
      
              # iterate over the tr elements
      
              foreach($tr in $trs){
      
                  # get all the td elements in the current table
      
                  $tds = $tr.getElementsByTagName('td')
      
                  # create the string to contain the data of the row
      
                  $line = ""
      
                  # iterate over all the td elements in the current row
      
                  foreach($td in $tds){  
      
                      # add the current td element text to the string
      
                      $line += $td.innerText+"," 
      
                  }
      
                  # add the string as a row to the array
      
                  $arrayTable+=,(@($line))
      
              }
      
          }
      
          # create the variable with the ouput file
      
          $output_file = $rf+"\"+($n_file.Name -replace ".html", "")+"_file.csv"
      
          # iterate over each row of the array
      
          foreach($row in $arrayTable){
      
              # add the row to the outputfile
      
              Add-Content -Path $output_file -Value $row
      
          }
      
      }

      $node_files is varible that has stored in it the files which I want to parse, that work fine.

      All this part work, but nested tables displaces the information that goes after them.
      My goal, and still thinking about it, is to get only the parent tables in the structure, and not the nested tables which are inside the td tags.
    • #200576
      Participant
      Topics: 5
      Replies: 11
      Points: 67
      Rank: Member

      Ok, solved it in the most stupid way I can think.

      As the reports comes in the same format always, and the nested tables have a different size than the parent tables, I only need to count the cells on row of the table, and if it has the correct number of columns, is a parent table and I can extract the data.

       

Viewing 5 reply threads
  • You must be logged in to reply to this topic.