Assistance with Invoke-WebRequest


    • #192160
      Participant
      Topics: 7
      Replies: 16
      Points: 107
      Rank: Participant

      Hi everyone,

      Hoping someone here can help me with some website scraping I'm trying to do using Invoke-WebRequest. I would like to scrape the comments section of a website I frequent. After poking around and doing counts and href lookups, I have figured out that it's a WordPress blog and the comments are in a #comments subdirectory. If I do the following:

      (Invoke-Webrequest -uri https://thewpblogsite.com/directory/#comments).Content, I get all the data from the comments section, but of course I only want the comments themselves. I've noticed that each comment is wrapped in the following tags...

      I've tried a ton of ways to narrow this down in a way that PS will understand but I'm still fairly new to Powershell so I'm not advanced enough to know how to extract this out.

      Any assistance is greatly appreciated. Thanks in advance.

      Nelson
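
      A side note on the URL above: the "#comments" part is a fragment rather than a subdirectory. A fragment is handled on the client side and is not sent to the server, so the request returns the whole page, which matches what the question describes. A minimal sketch of how the URL splits, using the .NET [uri] type (the variable names are only for illustration):

      ```powershell
      # The URL from the question; '#comments' is the fragment part
      $uri = [uri]'https://thewpblogsite.com/directory/#comments'

      # The fragment stays on the client; it only tells a browser where to scroll
      $uri.Fragment

      # The part that is actually requested from the server
      $requested = $uri.GetLeftPart([System.UriPartial]::Path)
      $requested
      ```

      Narrowing the result down to the comments therefore has to happen after the download, by parsing the returned HTML.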

    • #192169
      Senior Moderator
      Topics: 8
      Replies: 1153
      Points: 4,006
      Helping Hand
      Rank: Community Hero

      Can you share an example of the output? You can create a gist at gist.github.com and paste the URL here to share the XML snippet.

    • #192241
      Participant
      Topics: 7
      Replies: 16
      Points: 107
      Rank: Participant

      Sure.

      Here is the website whose comments I wish to scrape, in my variable.... $URI = "https://frugalvagabond.com/get-non-lucrative-residence-visa-spain/#comment"

      And my other variable....

      $HTML = Invoke-WebRequest -Uri $URI

      Based on the tag and class name I see for the comments I want to scrape, here is what I am entering....

      ($HTML.ParsedHtml.getElementsByTagName("SECTION") | Where{ $_.className -like "comment-content comment" } ).innertext

      But I get nothing in return.

      I have run Get-Member (GM) and Out-GridView (OGV) as well, and while I think I am choosing the right tag and class, the comment data doesn't come up. I assume it's a different tag and class, but I'm not sure which I should be choosing, since this is all that really stands out to me and all I want is the text of the actual comments on this webpage.

      Thank you in advance.

      Nelson

    • #192244
      Participant
      Topics: 10
      Replies: 1381
      Points: 1,509
      Helping Hand
      Rank: Community Hero

      <!-- .comment-content -->

      That looks more like an HTML comment naming a CSS class, but regardless, in order to assist we need an example of the HTML, XML or JSON that is being sent in the response so we can see what there is to parse.
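
      One possible contributor to the empty result from the earlier Where filter, offered as a guess since the page markup is not shown here: without wildcard characters, -like only matches when the whole string is identical to the pattern. The sketch below uses made-up class names (not taken from the site) to show the difference:

      ```powershell
      # Hypothetical class strings, standing in for what className might return
      $classNames = 'comment-content comment', 'comment-content', 'comment even thread-even'

      # Without wildcards, -like only matches the complete string
      $exact = @($classNames | Where-Object { $_ -like 'comment-content comment' })

      # With wildcards, any string containing the fragment matches
      $wild = @($classNames | Where-Object { $_ -like '*comment-content*' })

      "Exact pattern matched:    $($exact.Count)"
      "Wildcard pattern matched: $($wild.Count)"
      ```

      If the class on the real element differs even slightly from the pattern, or the tag is not SECTION, the exact form returns nothing.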

    • #192247
      Participant
      Topics: 7
      Replies: 16
      Points: 107
      Rank: Participant

      Thank you, Rob. Sorry but this is all fairly new to me. Happy to get this for you but how would I go about getting that output? I see kvprasoon suggested gist.github.com, went there but I don't see where I could enter the web address.

      A small nudge in the right direction would be greatly appreciated. Thanks!

      Nelson

    • #192256
      Senior Moderator
      Topics: 8
      Replies: 1153
      Points: 4,006
      Helping Hand
      Rank: Community Hero

      Create the gist and just paste the link here. This forum will pull it from gist and show it here.

    • #192316
      Participant
      Topics: 7
      Replies: 16
      Points: 107
      Rank: Participant

      Not sure if this is what you're looking for, but here is my gist of the pieces I'd like to be able to scrape....

      https://gist.github.com/nelsonsaenz/7f921bd16976c82195c32609a4b815c2

    • #192355
      Participant
      Topics: 10
      Replies: 1381
      Points: 1,509
      Helping Hand
      Rank: Community Hero

      Here is a start. HTML is nested and there are multiple layers, but just picking out the comments, you could do something like this:

      # Note: ParsedHtml is only available in Windows PowerShell (5.1 and earlier), not in PowerShell 6+
      $URI = "https://frugalvagabond.com/get-non-lucrative-residence-visa-spain/#comment"
      $HTML = Invoke-WebRequest -Uri $URI
      
      # Start at the element that contains all of the comments
      $div = $HTML.ParsedHtml.getElementById('comments')
      
      $results = foreach ($divElem in $div) { 
          #Get the title of the blog that is being commented on
          $title = $divElem.getElementsByTagName('h2') | Select -ExpandProperty innerText
      
          #Loop through all of the li elements, which is basically each post
          foreach ($liElem in $divElem.getElementsByTagName('li')) { 
              #Grab the LI element ID
              $id = $liElem.id
              $paragraph = @()
              #Loop through each P, which is each paragraph and create array
              foreach ($pElem in $liElem.getElementsByTagName('p')) {
                  $paragraph += $pElem.innerText
              }
      
              [pscustomobject]@{
                  Title   = $title
                  Id      = $id
                  Comment = $paragraph -join [environment]::NewLine
              }
          }
      }
      

      Output:

      PS C:\WINDOWS\system32> $results.Count
      1106
      
      PS C:\WINDOWS\system32> $results | Select -First 5
      
      Title                                                                  Id               Comment                                                                                                                                                                  
      -----                                                                  --               -------                                                                                                                                                                  
      1,106 thoughts on “How to Get a Spanish Non-Lucrative Residence Visa”  li-comment-22751 Congratulations! What a process that was and what a resource you created. Cannot wait to follow along!...                                                                
      1,106 thoughts on “How to Get a Spanish Non-Lucrative Residence Visa”  li-comment-22752 Woo hoo! Thank you! And thank you for being one of the early secret-keepers about this journey! I hope the post will help someone down the road (though I guess it's a...
      1,106 thoughts on “How to Get a Spanish Non-Lucrative Residence Visa”  li-comment-35849 Well its helping me,… and i would love to pay for it with a coffee or lunch...                                                                                           
      1,106 thoughts on “How to Get a Spanish Non-Lucrative Residence Visa”  li-comment-35903 Thanks, Imran. I'm thrilled it's helping! Once you make it to this end look me up and we'll grab some coffee                                                             
      1,106 thoughts on “How to Get a Spanish Non-Lucrative Residence Visa”  li-comment-41052 Do you need to have your visit within 3 months of leaving? If I want to go mid July do I have to wait mid April for an appointment or can I go in sooner?...             
      
    • #192382
      Participant
      Topics: 7
      Replies: 16
      Points: 107
      Rank: Participant

      Rob,

      Thank you so much for your help. I've been studying what you sent over and just had a few questions so that I can better understand how you went about this....

      So it looks like I was confusing what I think you said were CSS tags for HTML tags? I see that you ended up going with tagnames h2, li, etc. I would assume these are HTML tags and then also finding what would be the unique identifier for each comment which looks to be id.

      Thank you again, I will definitely study your code as a template moving forward. Really appreciate it.

      Nelson

       

    • #192484
      Participant
      Topics: 10
      Replies: 1381
      Points: 1,509
      Helping Hand
      Rank: Community Hero

      So it looks like I was confusing what I think you said were CSS tags for HTML tags? I see that you ended up going with tagnames h2, li, etc. I would assume these

      Your question trailed off there, but if you were searching the raw text, parsing would be more difficult. The parsing used here is the Document Object Model (DOM), which is designed for JavaScript, but we can still use its methods to programmatically parse HTML. If you are familiar with HTML (or XML), it is a nested structure of nodes, so there is a standard structure with HTML > BODY and then it's up to the developer.

      Typically, a good place to review the structure is the developer tools (F12) in the browser when you're on the page. They let you search and expand/collapse the HTML to narrow down where to start. I started at the element with the id 'comments', so when you ask for all P tags, you only get the ones under that element. P is a paragraph, so if a comment has multiple paragraphs, we put them in an array and join them with a newline. In summary, it's about narrowing things down as much as possible and looping through the nodes to get what you want.
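
      The narrowing-down idea can be tried offline with a small sample. This is only a sketch under assumptions: the markup below is invented and well-formed, so it can be loaded with [xml] and queried with XPath instead of ParsedHtml; real pages are often not valid XML, so this is not a drop-in replacement for the code posted earlier.

      ```powershell
      # Hypothetical, well-formed sample standing in for the real page
      $sample = '<html><body>' +
          '<div id="comments"><h2>Sample post</h2><ol>' +
          '<li id="li-comment-1"><p>First paragraph.</p><p>Second paragraph.</p></li>' +
          '<li id="li-comment-2"><p>Only paragraph.</p></li>' +
          '</ol></div>' +
          '<div id="sidebar"><p>Not a comment.</p></div>' +
          '</body></html>'
      $doc = [xml]$sample

      # Narrow down to the container first, so the sidebar paragraph is ignored
      $div = $doc.SelectSingleNode("//div[@id='comments']")
      $title = $div.SelectSingleNode('h2').InnerText

      $results = foreach ($li in $div.SelectNodes('.//li')) {
          # Collect every paragraph that belongs to this comment
          $paragraphs = foreach ($p in $li.SelectNodes('.//p')) { $p.InnerText }

          [pscustomobject]@{
              Title   = $title
              Id      = $li.GetAttribute('id')
              Comment = $paragraphs -join [Environment]::NewLine
          }
      }

      $results | Format-List
      ```

      Because every lookup starts from $div, the paragraph in the sidebar never shows up in $results, which is the same effect as starting from the 'comments' element in the DOM version.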

    • #192496
      Participant
      Topics: 10
      Replies: 1381
      Points: 1,509
      Helping Hand
      Rank: Community Hero

      Something odd is going on with this post. I think some of the HTML tags broke something, because the options to edit, quote, etc. are missing from every post in this thread, which is most likely what happened to your message as well.

    • #192625
      Participant
      Topics: 7
      Replies: 16
      Points: 107
      Rank: Participant

      Hi Rob,

      OK, that makes sense, but since this is way outside my wheelhouse, I also thought it was very possible I was asking my question in a confusing manner.

      In any event, thank you again!! Greatly appreciated.

       
