Assistance with Invoke-WebRequest


Viewing 11 reply threads
    • #192160
      Participant
      Topics: 9
      Replies: 21
      Points: 140
      Rank: Participant

      Hi everyone,

      Hoping someone here can help me with some website scraping I’m trying to do using Invoke-WebRequest. I would like to scrape the comments section of a website I frequent. After poking around and doing counts and href lookups, I have figured out that it’s a WordPress blog and the comments are in a #comments subdirectory. If I do the following:

      (Invoke-Webrequest -uri https://thewpblogsite.com/directory/#comments).Content, I get all the data from the comments section, but of course I only want the comments themselves. I’ve noticed that each comment is wrapped in the following tags…

      <section class="comment-content comment">

      </section><!-- .comment-content -->

      I’ve tried a ton of ways to narrow this down in a way that PowerShell will understand, but I’m still fairly new to PowerShell, so I’m not advanced enough to know how to extract this.

      Any assistance is greatly appreciated. Thanks in advance.

      Nelson
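
      A note on the "#comments" part of the address: it is a URL fragment rather than a subdirectory. Fragments are handled by the client and are not sent in the HTTP request, so the request returns the same page with or without it. A small sketch using the placeholder address from the post above (the variable names are only illustrative):

```powershell
# System.Uri splits a URL into its parts; the address is the placeholder from the post.
$uri = [uri]'https://thewpblogsite.com/directory/#comments'

$path       = $uri.AbsolutePath                            # the path portion of the URL
$fragment   = $uri.Fragment                                # the part starting at '#'
$requestUrl = $uri.GetLeftPart([System.UriPartial]::Path)  # the URL without the fragment

$path
$fragment
$requestUrl
```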

    • #192169
      Senior Moderator
      Topics: 9
      Replies: 1309
      Points: 4,784
      Helping Hand
      Rank: Community Hero

      Can you share an example of the output? You can create a gist at gist.github.com and paste the URL here to share the XML snippet.

    • #192241
      Participant
      Topics: 9
      Replies: 21
      Points: 140
      Rank: Participant

      Sure.

      Here is the website whose comments I wish to scrape, in my variable…. $URI = "https://frugalvagabond.com/get-non-lucrative-residence-visa-spain/#comment"

      And my other variable….

      $HTML = Invoke-WebRequest -Uri $URI

      Based on the tag and class name I see for the comments I want to scrape, here is what I am entering….

      ($HTML.ParsedHtml.getElementsByTagName("SECTION") | Where-Object { $_.className -like "comment-content comment" }).innerText

      But I get nothing in return.

      I have run Get-Member (GM) and Out-GridView (OGV) as well, and while I think I am choosing the right tag and class, the comment data doesn’t come up. I assume it’s a different tag and class, but I’m not sure which I should be choosing, since this is all that really stands out to me and all I want is the text of the actual comments on this webpage.

      Thank you in advance.

      Nelson
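
      One thing worth checking in a filter like the one above: -like with no wildcard characters only matches when the whole string is equal, so an element carrying any extra class name would be skipped. Whether that is what happens on this page is an assumption; the class values below are made up for illustration.

```powershell
# Hypothetical class attribute values; real pages often carry extra class names.
$classNames = 'comment-content comment', 'comment-content comment odd'

$results = foreach ($className in $classNames) {
    [pscustomobject]@{
        ClassName = $className
        # -like with no wildcard characters only matches the whole string.
        ExactLike = $className -like 'comment-content comment'
        # Wildcards on both sides match the text anywhere in the attribute.
        WildLike  = $className -like '*comment-content*'
        # Splitting on whitespace tests for one class token on its own.
        HasToken  = ($className -split '\s+') -contains 'comment-content'
    }
}

$results | Format-Table -AutoSize
```

      The wildcard form is the loosest of the three, while the token test avoids partial matches against longer class names.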

    • #192244
      Participant
      Topics: 15
      Replies: 1761
      Points: 3,167
      Helping Hand
      Rank: Community Hero

      <!-- .comment-content -->

      That looks more like an HTML comment naming a CSS class. Regardless, in order to assist we need an example of the XML or JSON that is being sent in the response, so we can see how it could be parsed.

    • #192247
      Participant
      Topics: 9
      Replies: 21
      Points: 140
      Rank: Participant

      Thank you, Rob. Sorry, but this is all fairly new to me. Happy to get this for you, but how would I go about getting that output? I see kvprasoon suggested gist.github.com. I went there, but I don’t see where I could enter the web address.

      A small nudge in the right direction would be greatly appreciated. Thanks!

      Nelson

    • #192256
      Senior Moderator
      Topics: 9
      Replies: 1309
      Points: 4,784
      Helping Hand
      Rank: Community Hero

      Create the gist and just paste the link here. This forum will pull it from gist and show it here.
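
      One way to capture the response text so it can be pasted into a gist, as a rough sketch. The file name and the sample string are assumptions; in a real session the text would come from the request shown in the comment.

```powershell
# In a real session the text would come from the request, for example:
#   $content = (Invoke-WebRequest -Uri $URI).Content
# A short hypothetical string stands in for it here so the sketch runs offline.
$content = '<section class="comment-content comment"><p>Sample comment.</p></section>'

# Write the raw response text to a file whose contents can be pasted into a gist.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'response-sample.html'
Set-Content -Path $path -Value $content -Encoding UTF8

# Quick check that the markup of interest is actually in the saved response.
$hits = @(Select-String -Path $path -Pattern 'comment-content' -SimpleMatch)
"Saved to $path with $($hits.Count) matching line(s)"
```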

    • #192316
      Participant
      Topics: 9
      Replies: 21
      Points: 140
      Rank: Participant

      Not sure if this is what you’re looking for, but here is my gist of the pieces I’d like to be able to scrape….

      https://gist.github.com/nelsonsaenz/7f921bd16976c82195c32609a4b815c2

    • #192355
      Participant
      Topics: 15
      Replies: 1761
      Points: 3,167
      Helping Hand
      Rank: Community Hero

      Here is a start. HTML is nested and there are multiple layers, but just picking out the comments, you could do something like this:

      Output:

    • #192382
      Participant
      Topics: 9
      Replies: 21
      Points: 140
      Rank: Participant

      Rob,

      Thank you so much for your help. I’ve been studying what you sent over and just had a few questions so that I can better understand how you went about this….

      So it looks like I was confusing what I think you said were CSS tags for HTML tags? I see that you ended up going with tagnames h2, li, etc. I assume these are HTML tags, and that you then found the unique identifier for each comment, which looks to be id.

      Thank you again, I will definitely study your code as a template moving forward. Really appreciate it.

      Nelson

       

    • #192484
      Participant
      Topics: 15
      Replies: 1761
      Points: 3,167
      Helping Hand
      Rank: Community Hero

      So it looks like I was confusing what I think you said were CSS tags for HTML tags? I see that you ended up going with tagnames h2, li, etc. I would assume these

      Your question trailed off there, but searching the raw text and parsing it yourself would be more difficult. The parsing used for HTML is the Document Object Model (DOM), which was designed for JavaScript, but regardless we can use its methods to programmatically parse HTML. If you are familiar with HTML (or XML), it is a nested structure of nodes: there is a standard structure of HTML > BODY, and after that it’s up to the developer. Typically, a good place to review the structure is the developer tools (F12) in the browser when you’re on the page. They let you search and expand/collapse the HTML to narrow down where to start. I started from the named DIV, so when you ask for all P tags, you only get the ones under that DIV. P is a paragraph, so if a comment has multiple paragraphs, we put them in an array and join them with a carriage return. In summary, it’s about narrowing things down as much as possible and looping through the nodes to get what you want.
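
      The narrowing approach described above can be sketched roughly as follows. This is not the code originally posted in the thread: it assumes a small, well-formed sample with hypothetical ids and text, and it uses PowerShell's [xml] cast with XPath in place of ParsedHtml so that it runs without a network request. Real-world HTML is often not valid XML, so the cast may fail on a live page.

```powershell
# Hypothetical, well-formed sample standing in for the real page's markup.
$sample = @(
    '<div id="comments">'
    '<h2>2 thoughts</h2>'
    '<ol>'
    '<li id="comment-101"><section class="comment-content comment">'
    '<p>First paragraph.</p><p>Second paragraph.</p>'
    '</section></li>'
    '<li id="comment-102"><section class="comment-content comment">'
    '<p>Only paragraph.</p>'
    '</section></li>'
    '</ol>'
    '</div>'
) -join ''

$doc = [xml]$sample

# Narrow to the named DIV first, then loop over each comment node under it.
$comments = foreach ($li in $doc.SelectNodes('//div[@id="comments"]//li')) {
    # Collect every P under this comment and join them with CR+LF.
    $paragraphs = foreach ($p in $li.SelectNodes('.//p')) { $p.InnerText }
    [pscustomobject]@{
        Id   = $li.GetAttribute('id')
        Text = $paragraphs -join "`r`n"
    }
}

$comments | Format-List
```

      Starting from the DIV keeps P tags elsewhere on the page out of the result, which is the same idea as calling getElementsByTagName on a single element rather than on the whole document.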

    • #192496
      Participant
      Topics: 15
      Replies: 1761
      Points: 3,167
      Helping Hand
      Rank: Community Hero

      Something odd is going on with this post. I think some of the HTML tags broke something, because the option to edit, quote, etc. posts is missing in this entire thread, which is most likely what happened to your message as well.

    • #192625
      Participant
      Topics: 9
      Replies: 21
      Points: 140
      Rank: Participant

      Hi Rob,

      OK, that makes sense, but since this is way outside my wheelhouse, I also thought it was very possible I was asking my question in a confusing manner.

      In any event, thank you again!! Greatly appreciated.

       
