Web scraping: I want to pull only one value from one of the tags in the XML.

This topic contains 9 replies, has 2 voices, and was last updated by  KAMALANATHAN DORAIRAJ 3 months ago.

  • #98328

    Hi,

    I am trying to scrape a web page to get one value for a particular message. Below is the only part of the page's XML that I am interested in. I need to find only the entry with href="browse.jsp;jsessionid=oillma25wtod1qzbhi3jenj5t?JMSDestination=Consumer.Siebel.VirtualTopic.catalog_changed_events" and, within that entry, I want only the first value, 0.

    I tried ParsedHtml by tag name for tr, and I tried the inner text, but nothing is working. Kindly let me know if anyone has any thoughts on this.

    ——————————

    Consumer.Siebel.VirtualTopic.catalog_changed_ev... Consumer.Siebel.VirtualTopic.catalog_changed_events

    0
    10
    0
    0

    Browse
    Active Consumers
    Active Producers

    Send To
    Purge
    Delete

    ———————-

  • #98392

    Fredrik Kacsmarck
    Participant

    As far as I can remember this forum doesn't deal well with XML code pasted into the post.
    So it's better if you paste the XML into Gist and add the Gist URL in the post.

    • #98968

      Thanks for your input. Below is the link for my xml.

  • #98965

    gist:5909bda28943fde8d80c475c09a5e09d

  • #99051

    Fredrik Kacsmarck
    Participant

    Not 100% sure what you're after.

    But if you have the above data in a variable (called htmlData in my example), you could do something like this:

    $htmlData = Get-Content test.html -Raw # I put your example into a file, so you would change this to whatever suits you.
    $htmlValue = $htmlData | ConvertFrom-String | Select P7
    

    You can skip the Select P7 just to see the layout of the data.

    This will only work if the data is consistent, meaning that the value will always end up in P7. Otherwise you would need something to identify the specific tag you're searching for.

    Another, somewhat crude, option would be:

    $htmlValues = $htmlData -split "</TD>"
    

    This creates an array by splitting the raw data on the closing /TD tag.
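
    The split approach can be sketched roughly as follows. The sample row here is made up for illustration (the real markup is only in the linked Gist), and the variable names $cells and $values are assumptions, not part of the original answer.

    ```powershell
    # Hypothetical sample row; the real markup from the page is not reproduced here.
    $htmlData = '<tr><td><a href="browse.jsp">Consumer.Example</a></td><td>0</td><td>10</td><td>0</td><td>0</td></tr>'

    # -split takes a regular expression and is case-insensitive by default,
    # so '</td>' also matches '</TD>'.
    $cells = $htmlData -split '</td>'

    # Remove any remaining tags from each piece, then drop the empty pieces.
    $values = $cells |
        ForEach-Object { ($_ -replace '<[^>]+>').Trim() } |
        Where-Object { $_ -ne '' }

    # Index 0 is the link text; index 1 is the first number after it.
    $firstNumber = $values[1]
    $firstNumber
    ```

    Like P7, the index is positional, so this only holds while the row layout stays the same.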

    Otherwise you may want to look at HTML parsers such as Html Agility Pack.
    But then you're more or less leaving the PowerShell realm and going into C#, XPath and LINQ.

    • #99061

      Wow, it worked. Can I ask for one last bit of help?

      Below is the output of your script:

      ————————
      P7

      0
      ————————

      I want just the value '0' that sits between the opening and closing tags. I tried a regular expression with -match and -pattern, but nothing is working. Below is the Get-Member output for the variable that stores the above value.

      PS C:\Users\kd****> $htmldata | Get-Member

      TypeName: Selected.System.Management.Automation.PSCustomObject

      Name        MemberType   Definition
      ----        ----------   ----------
      Equals      Method       bool Equals(System.Object obj)
      GetHashCode Method       int GetHashCode()
      GetType     Method       type GetType()
      ToString    Method       string ToString()
      P7          NoteProperty string P7=0

  • #99063

    gist:0d019e8c5050b352f3e189441086a6d2

  • #99066

  • #99073

    Fredrik Kacsmarck
    Participant

    You could do it in multiple ways; it depends on how readable you want it to be, and so forth.
    But here is an example.

    $htmlData = Get-Content test.html -Raw # I put your example into a file, so you would change this to whatever suits you.
    $htmlValue = $htmlData | ConvertFrom-String | Select -ExpandProperty P7
    $htmlTagValue = $htmlValue[4]
    

    So the extra step is -ExpandProperty, which returns just the content of P7, not the header itself.
    Then you can decide how you want to extract the value.
    The option above is quick and dirty, in the sense that if the data is not consistent every time (the same issue as with P7), you will get errors.
    What [4] does is take the fifth character from the string; strings can be indexed as if they were arrays of characters.
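
    To illustrate the indexing, here is a small sketch. It assumes P7 holds a table cell such as the one below (a four-character opening tag followed by the value); that content is an assumption inferred from the [4] index, and $other is a made-up second sample.

    ```powershell
    # Assumed content of P7: a four-character opening tag, then the value.
    $htmlValue = '<td>0</td>'

    # Indexing a string returns a single [char]; index 4 is the fifth character.
    $htmlTagValue = $htmlValue[4]
    $htmlTagValue

    # The weakness: a multi-digit value only yields its first digit.
    $other = '<td>10</td>'
    $firstCharOnly = $other[4]
    $firstCharOnly
    ```

    The second case shows why a fixed index is fragile: it silently returns a partial value rather than failing.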

    To make it a bit more robust, and if the value you want contains only digits, you could use a simple regex instead.

    $htmlTagValue = $htmlValue -replace '\D'
    

    But it depends on the possible values in the tag, what you need, what you can do, and so forth.
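
    As a rough comparison, the sketch below runs the -replace approach next to a capturing -match. The sample strings and the variable names ($digitsOnly, $tagValue, $heading, $mixed) are assumptions for illustration only.

    ```powershell
    # Assumed sample value; the real content of P7 may differ.
    $htmlValue = '<td>10</td>'

    # -replace with no replacement string deletes every match; \D is any non-digit.
    $digitsOnly = $htmlValue -replace '\D'

    # Alternative: capture only the digits that sit between '>' and '<'.
    $tagValue = $null
    if ($htmlValue -match '>\s*(\d+)\s*<') {
        $tagValue = [int]$Matches[1]
    }

    # If the tag name itself contains a digit, stripping non-digits mixes it in.
    $heading = '<h1>5</h1>'
    $mixed = $heading -replace '\D'

    $digitsOnly
    $tagValue
    $mixed
    ```

    The capture version ignores digits inside the tag names, at the cost of a slightly longer expression.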

  • #99088

    I sincerely thank you for your quick and detailed response. It worked perfectly. Many thanks, Mr. Fredrik Kacsmarck.
