Scrape IBM Connections forum for posts

This topic contains 3 replies, has 2 voices, and was last updated by Sriram N A 3 weeks, 2 days ago.

  • #103058

    Sriram N A
    Participant

    I am trying to retrieve posts from an IBM Connections forum topic on the intranet. Here's where I've gotten so far, from an examination of the page with F12 in the browser:

    $URI = "https://.../forums/html/threadTopic?id=359b658c-54a4-4672-86e1-a6c7e2396fc7"
    $WebResponse = Invoke-WebRequest $URI -UseDefaultCredentials
    
    # Get authors: each stage drills one level deeper into the post markup
    $WebResponse.ParsedHtml.getElementsByClassName('hentry lotusPost  ') |
        ForEach-Object { $_.getElementsByClassName('lotusPostName') } |
        ForEach-Object { $_.getElementsByClassName('email') } |
        ForEach-Object { $_.textContent }
    
    # Get post content
    $WebResponse.ParsedHtml.getElementsByClassName('entry-content lotusPostDetails') |
        ForEach-Object { $_.innerText }
    

    The divs above share a common ancestor element (two levels up) that pairs each author with the corresponding post. My objective is to export the (long) list of authors and corresponding posts to a file. What is the best approach to get a paired array of these values for this purpose?
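
    One way to keep each author paired with their post is to walk the post containers one at a time and emit one object per container. The following is only a sketch: Get-ForumPost is a hypothetical helper name, and it assumes that both the 'email' element and the 'entry-content lotusPostDetails' element are descendants of the same 'hentry lotusPost' container, which the thread does not confirm.

    ```powershell
    # Hypothetical helper: returns one object per post container.
    # Assumption: author and content elements both live inside 'hentry lotusPost'.
    function Get-ForumPost {
        param(
            # The ParsedHtml document returned by Invoke-WebRequest
            [Parameter(Mandatory)] $Document
        )
        foreach ($post in $Document.getElementsByClassName('hentry lotusPost')) {
            # Take the first matching descendant of this container only
            $author  = $post.getElementsByClassName('email') | Select-Object -First 1
            $content = $post.getElementsByClassName('entry-content lotusPostDetails') |
                Select-Object -First 1
            [pscustomobject]@{
                Poster  = $author.textContent
                Content = $content.innerText
            }
        }
    }
    ```

    With the response from the question, this could be called as Get-ForumPost -Document $WebResponse.ParsedHtml and the result piped to Export-Csv. Because each object is built from a single container, a post with a missing author or body yields an empty field rather than shifting every later pair.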

  • #103147

    random commandline
    Participant

    I don't know what the page's markup looks like, but the getElementById method could be faster. Look for an id attribute next to the div classes in the DOM Explorer (for example, div class="text" id="text"). If an id is available, you should be able to read the element's innerText directly and parse it without ForEach-Object.

    # Example (assumes the page has an element with id="text")
    $test = $WebResponse.ParsedHtml.getElementById('text').innerText
    
    • #103150

      Sriram N A
      Participant

      Did look for that – no IDs to be found. At any rate, I have a handle on the data that I need.
      I have on hand two ordered arrays – one of authors and the other of the corresponding posts. My question was more about putting these together into an accessible table or two-dimensional array.
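
      For two parallel arrays like these, a minimal sketch of the pairing step is an index loop that builds one object per position. The sample values below are made up for illustration, and the length guard is an added precaution in case the two arrays ever differ in size.

      ```powershell
      # Made-up sample data standing in for the two ordered arrays
      $authors  = @('Alice', 'Bob')
      $postText = @('First post', 'Second post')

      # Stop at the shorter array so an index never runs past either one
      $count = [Math]::Min($authors.Count, $postText.Count)

      # Build one object per index; the objects behave like rows of a table
      $posts = for ($i = 0; $i -lt $count; $i++) {
          [pscustomobject]@{
              Poster  = $authors[$i]
              Content = $postText[$i]
          }
      }

      $posts | Format-Table -AutoSize
      ```

      The resulting objects can be filtered with Where-Object, indexed like an array, or passed straight to Export-Csv.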

    • #103400

      Sriram N A
      Participant

      This worked:

      # Get authors: collect the pipeline output directly into an array
      $authors = @(
          $WebResponse.ParsedHtml.getElementsByClassName('hentry lotusPost  ') |
              ForEach-Object { $_.getElementsByClassName('lotusPostName') } |
              ForEach-Object { $_.getElementsByClassName('email') } |
              ForEach-Object { $_.textContent }
      )
      
      # Get post content
      $PostText = @(
          $WebResponse.ParsedHtml.getElementsByClassName('entry-content lotusPostDetails') |
              ForEach-Object { $_.innerText }
      )
      
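      # Create a table: one row per index, pairing author and post
      $Posts = 0..($Authors.Length-1) | Select-Object @{n="Poster";e={$Authors[$_]}}, @{n="Content";e={$PostText[$_]}}
      
      # Create a CSV file (-NoTypeInformation omits the "#TYPE" header line that Windows PowerShell 5.1 would otherwise write)
      $Posts | Export-Csv -Path .\Posts.csv -NoTypeInformation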
      
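      # Parse the Content column for DL names mentioned
      $Regex = "\bMYCO\/\w+.*\b"
      # Emit the matched text only for posts that actually match; otherwise $Matches would be stale
      $Posts.Content | ForEach-Object { if ($_ -match $Regex) { $Matches[0] } }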
      

      The only thing left is to improve the regex so that it correctly fetches the distribution list names embedded in the various posts, but that's a task for another day.
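
      As a starting point for that regex work, here is a hedged sketch. It assumes a DL name is "MYCO/" followed by runs of word characters separated by single dots or hyphens; the real naming rules are not stated in this thread, and names containing spaces would need a different pattern. The sample posts are made up. Unlike -match, [regex]::Matches returns every occurrence in a post, not just the first.

      ```powershell
      # Assumed DL name format: MYCO/ + word characters, optionally joined by '.' or '-'
      $pattern = '\bMYCO/\w+(?:[.-]\w+)*'

      # Made-up sample posts for illustration
      $samplePosts = @(
          'Please add MYCO/Finance-Team and MYCO/HR.Ops.',
          'No lists mentioned here.',
          'Copying MYCO/Finance-Team again.'
      )

      # Collect every match from every post
      $dlNames = foreach ($text in $samplePosts) {
          [regex]::Matches($text, $pattern) | ForEach-Object { $_.Value }
      }

      # Remove duplicates so each DL is listed once
      $uniqueNames = $dlNames | Sort-Object -Unique
      $uniqueNames
      ```

      Because the pattern ends on a word character, trailing punctuation such as the period after "MYCO/HR.Ops." is left out of the match, which the original greedy ".*" would have swallowed along with the rest of the line.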
