Find text on a web page (General PowerShell Q&A forum)
This topic contains 15 replies and has 6 voices.
-
June 18, 2014 at 11:11 am #16352
Hi,
I am accessing an internal web page that lists servers and their UUIDs. I would like to search for a server and get its UUID. Each line on the HTML page looks like this:
server1,7a063332-2e05-53338-ff5b-5843116ad838
server2,5d883442-ab28-122b-9646-1cb44b8c344d
server5,36c915bf-3d21-a222-b0a9-6f10d22d7b011
After using the Invoke-RestMethod command and exporting to a CSV file, the exported format becomes:
server1,7a063332-2e05-53338-ff5b-5843116ad838server2,5d883442-ab28-122b-9646-1cb44b8c344dserver5,36c915bf-3d21-a222-b0a9-6f10d22d7b011
#1 Is there any way to easily find the server and UUID with Invoke-RestMethod?
#2 If #1 cannot be done, how can I extract it from the CSV file?
And no, I cannot get the UUID from the server at this point. I need this list.
Thanks in advance!
-
June 18, 2014 at 12:39 pm #16355
You might look into Invoke-WebRequest; I'm not sure exactly what you need to do, though.
If the web server is returning that as raw text, rather than something XML- or JSON-encoded, then it can be difficult to detect the end-of-line characters, which is why you're getting a single line of output in your CSV. I'd try to solve that. If the items were coming through as individual lines, your task would be very easy – just use Import-Csv.
Try looking at the raw data being returned by the server and see if it includes CRLF (carriage return, line feed) at the end of each line, or something else.
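As a rough sketch of that idea, assuming the raw response really is plain text with one "name,uuid" pair per line: the sample string below stands in for the response body, and the column names Hostname and UUID are made up for illustration.

```powershell
# Sample text standing in for the raw response body
# (assumption: one "name,uuid" pair per line, CRLF or LF line endings).
$raw = "server1,7a063332-2e05-53338-ff5b-5843116ad838`r`n" +
       "server2,5d883442-ab28-122b-9646-1cb44b8c344d`n" +
       "server5,36c915bf-3d21-a222-b0a9-6f10d22d7b011"

# Split on CRLF or bare LF and drop any empty lines.
$lines = $raw -split '\r?\n' | Where-Object { $_.Trim() -ne '' }

# The data has no header row, so supply column names (hypothetical names).
$servers = $lines | ConvertFrom-Csv -Header 'Hostname', 'UUID'

# Look up one server and keep its UUID in a variable.
$uuid = ($servers | Where-Object { $_.Hostname -eq 'server2' }).UUID
$uuid
```

If the lines arrive with some other separator, only the pattern passed to -split should need to change.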
-
June 19, 2014 at 4:44 am #16358
You could probably use the HTML DOM to pull the data from the site. In IE, pressing F12 opens a DOM explorer that you can use to determine which HTML object (DIV, SPAN, TABLE, TEXTBOX, etc.) contains your data. Then you can do something like this: http://powershell.com/cs/blogs/tips/archive/2013/09/02/importing-website-tables-into-excel.aspx
In that blog post, they pull information out of a table. If the raw data is placed in a DIV, like so (standard tag brackets replaced with braces so the HTML would not render):
{html}
{head}
{title}Awesome Website{/title}
{/head}
{body}
{div id="div_Hdr"}Server Data{/div}
{div id="div_ServerData"}
server1,7a063332-2e05-53338-ff5b-5843116ad838{br/}
server2,5d883442-ab28-122b-9646-1cb44b8c344d{br/}
server5,36c915bf-3d21-a222-b0a9-6f10d22d7b011{br/}
{/div}
{/body}
{/html}
You could do something like:
$content = ($data.ParsedHtml.getElementsByTagName("div"))[1].innerHTML
The getElementsByTagName method finds every DIV on the page (there could be a LOT), but in this example there are two, so index 0 is div_Hdr and index 1 is div_ServerData (the method returns a collection of objects). You want the raw content held inside the DIV, so you read its innerHTML property. If you explore the DOM and see that the container has an ID, it's cleaner to get the object by ID (or Name):
$content = $data.ParsedHtml.getElementById("div_ServerData").innerHTML
As Don alluded to, the real question is what format the data is in, so that you can parse it and turn it into a usable PowerShell object. If it's a DIV with {br /} line breaks, you can split the content on those and on the commas, and create a custom PSObject from the pieces. If it's a table, which is ideal, just search for 'parsing table dom powershell' and take it from there. With the built-in DOM explorer in IE (a similar feature is available in most browsers), you should be able to narrow down pretty quickly which container object holds the data and whether there is an easy way to identify it (e.g. ID or Name) as the starting point for your parsing. Good luck! If you figure it out, post some code for others to use.
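The split-and-build step for the DIV case might look roughly like this. It is only a sketch: the sample string stands in for the innerHTML of the DIV, the line breaks are assumed to be br tags in any of their usual spellings, and the property names Hostname and UUID are illustrative.

```powershell
# Sample string standing in for the DIV's innerHTML
# (assumption: entries are separated by <br> tags).
$content = 'server1,7a063332-2e05-53338-ff5b-5843116ad838<br/>' +
           'server2,5d883442-ab28-122b-9646-1cb44b8c344d<BR>' +
           'server5,36c915bf-3d21-a222-b0a9-6f10d22d7b011<br />'

$servers = foreach ($line in ($content -split '<br\s*/?>')) {
    # Skip empty fragments and anything that is not a "name,uuid" pair.
    if ($line -notmatch ',') { continue }

    # Split into at most two parts: the name and the UUID.
    $name, $id = $line.Trim() -split ',', 2

    New-Object -TypeName PSObject -Property @{
        Hostname = $name.Trim()
        UUID     = $id.Trim()
    }
}

# Find one server and store its UUID.
$uuid = ($servers | Where-Object { $_.Hostname -eq 'server5' }).UUID
$uuid
```

The -split operator is case-insensitive by default, so the pattern matches both br and BR.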
Edit: Forum folks, is there a proper way to show something like HTML literally without rendering it? I tried pre, blockquote, and nothing at all, and every one of them rendered the HTML. Thanks.
-
June 19, 2014 at 5:38 am #16359
As a fun example, take this website. I pressed F12 in IE, clicked the element-selection cursor at the top middle of the toolbar, placed it over the forum topics object, and clicked it. I could see that the container object holding all of the topics is a UL (unordered list) with an ID of bbp-forum-2683. Then I worked down through the objects until I finally arrived at the title, which was UL (bbp-forum-2683) > UL > LI > A > title. Note that when I parse the LI elements there are other A tags in the collection, but I only wanted the forum title, so I looked at the attributes of the returned objects and saw there was a CSS class 'bbp-topic-title' that I used to filter the results:
$URL = "https://powershell.org/forums/forum/windows-powershell-qa/"
# Read the website data:
$data = Invoke-WebRequest -Uri $URL
# Walk from the topic list (UL) down to the topic-title links and output each title:
$data.ParsedHtml.getElementByID("bbp-forum-2683") | foreach{
    $_.getElementsByTagName("ul") | foreach{
        $_.getElementsByTagName("li") | Where{$_.className -eq 'bbp-topic-title'} | foreach{
            $_.getElementsByTagName("a") | foreach{
                $_.Title
            }
        }
    }
}
Returns:
Forums Tips and Guidelines
Check these External Forums for Specific Topics
Find text on a web page
Help emailing formatted HTML – Please
Need a script to use a threading.
Powershell v3 & v2 Compatibilty
Advanced Function Optional Parameter Problem
Help Bulk Permission changes O365 Conferance Rooms
Variable Output in Email
Detecting param variable input only
Scenario Training?
How to Format a list to be emailed?
how to temporarily un-protect certain objects??
Powershell 2.0 and Sql server 2005 help
import same.csv | For Each {
Strange behavior with the Get-Alias cmdlet
Redundancy of Code in Advanced Function
-
June 19, 2014 at 5:49 am #16360
Edit: Not finding a good way to post HTML code on the new forum plugin yet. It's even unwrapping double-escaped HTML.
-
June 19, 2014 at 6:10 am #16361
Dare I try to type in some JavaScript and see if that executes? It's impossible, right TweetDeck 🙂
-
June 19, 2014 at 6:41 am #16362
Wow thanks for the help everyone!
I used F12 and browsed the DOM. It showed up like this, for example (written in a poor format due to the limitations of posting HTML):
after META, it says http equiv = content type content text/html
all of the lines are within the body of the html
Is the object body, or ? Hmm sorry I know nothing about html 🙁 but now it's on my list to learn 🙂
What if I just take this file and remove the line breaks, or replace them?
The result I am looking for is that I can search for the server name, then set a variable for its uuid found.
-
June 19, 2014 at 9:56 am #16369
It's possible it's just in the BODY. You can just try:
$data.ParsedHtml.getElementsByTagName("body").InnerHTML
See if that contains the server data. If it does, then it's just a matter of parsing it.
-
August 5, 2014 at 8:02 am #17787
Hi,
I'm back 🙂
I now have an HTML file as output, and in it a table. It looks like this, with headers:
hostname uuid
server01 5d7b0d42-760d-cf02-4c0d-b00848b20a38
server02 5d7b0d42-760d-cf02-4c0d-b00848b20a38
What I am trying to do is search for server02 and set a variable to the server's UUID.
Can someone help?
Thanks!
-
August 5, 2014 at 9:52 am #17795
Rob that is well good – pretty impressive what you can parse and how to get those details out.
-
August 5, 2014 at 10:31 am #17796
You are just using standard PowerShell commands and logic now that you have a PSObject:
$object = @()
$object += New-Object -TypeName PSObject -Property @{Hostname="Server1";UUID="5d7b0d42-760d-cf02-4c0d-b00848b20a38"}
$object += New-Object -TypeName PSObject -Property @{Hostname="Server2";UUID="5d7b0d42-760d-cf02-4c0d-b00848b20a38"}
# Some-Command is a placeholder for whatever needs the UUID:
$object | Where{$_.HostName -eq "server1"} | foreach{
    Some-Command -UUID $_.UUID
}
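To connect this to the table in the question, here is one possible sketch. It assumes you already have the rendered text of the table (a header row followed by whitespace-separated columns, as shown in the question); the sample text and the variable names are illustrative only. It builds a hashtable keyed by hostname, so the lookup is a single index operation.

```powershell
# Sample text standing in for the rendered table
# (assumption: a header row, then "hostname uuid" separated by whitespace).
$tableText = @"
hostname uuid
server01 5d7b0d42-760d-cf02-4c0d-b00848b20a38
server02 5d7b0d42-760d-cf02-4c0d-b00848b20a38
"@

# Build a hostname -> UUID lookup table, skipping the header row.
$uuidByHost = @{}
$tableText -split '\r?\n' |
    Select-Object -Skip 1 |
    Where-Object { $_.Trim() -ne '' } |
    ForEach-Object {
        $name, $id = $_.Trim() -split '\s+', 2
        $uuidByHost[$name] = $id
    }

# Set a variable to the UUID of the server you are searching for.
$uuid = $uuidByHost['server02']
$uuid
```

A hashtable created with @{} compares string keys case-insensitively, so 'Server02' would find the same entry.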
-
August 6, 2014 at 7:30 am #17811
Thanks.
But this is an html file I am searching in, so no psobject.. I think I am missing something?
Thanks!
-
August 6, 2014 at 12:17 pm #17826
You just create an empty object and then assign the output to it, like so:
PS C:\> $URL = "https://powershell.org/forums/forum/windows-powershell-qa/"
# Read the website data:
$data = Invoke-WebRequest -Uri $URL
$myPSObject = @()
# Collect the title and href of each topic link:
$myPSObject = $data.ParsedHtml.getElementByID("bbp-forum-2683") | foreach{
    $_.getElementsByTagName("ul") | foreach{
        $_.getElementsByTagName("li") | Where{$_.className -eq 'bbp-topic-title'} | foreach{
            $_.getElementsByTagName("a") | foreach{
                $_ | Select Title, HREF
            }
        }
    }
}
$myPSObject

title                                                  href
-----                                                  ----
Forums Tips and Guidelines                             https://powershell.org/forums/topic/forums-tips-a...
Check these External Forums for Specific Topics        https://powershell.org/forums/topic/check-these-e...
Little scripting help                                  https://powershell.org/forums/topic/little-script...
WINRM authentication                                   https://powershell.org/forums/topic/winrm-authent...
WINRM kerberos & Negotiate                             https://powershell.org/forums/topic/winrm-kerbero...
WinRM with non-domain joined machine using Certs       https://powershell.org/forums/topic/winrm-with-no...
Exchange cmdlet error change in PS 3 vs PS 4           https://powershell.org/forums/topic/exchange-cmdl...
Find text on a web page                                https://powershell.org/forums/topic/find-text-on-...
using select-object                                    https://powershell.org/forums/topic/using-select-...
File Copy Access is denied                             https://powershell.org/forums/topic/file-copy-acc...
Dell Warranty Information                              https://powershell.org/forums/topic/dell-warranty...
Foreign Security Principals                            https://powershell.org/forums/topic/foreign-secur...
Module review                                          https://powershell.org/forums/topic/module-review/
winrm g & e swithch diffrence                          https://powershell.org/forums/topic/winrm-g-e-swi...
Help with setting up PSRemoting                        https://powershell.org/forums/topic/help-with-set...
Help with adding a script method (and quite possibl... https://powershell.org/forums/topic/help-with-add...
LOG for Copy-item                                      https://powershell.org/forums/topic/log-for-copy-...

PS C:\> $myPSObject | Where{$_.Title -like "*module*"}

title         href
-----         ----
Module review https://powershell.org/forums/topic/module-review/
-
August 9, 2014 at 1:55 pm #17913
Do you know of a way to get this to work with a file on disk? If you pass a file:// URI to Invoke-WebRequest, you get back a different type of object which doesn't contain the parsed HTML objects.
I've been able to use the HTML Agility Pack for this, which requires downloading an extra DLL (and it helps to be familiar with XPath), but is also quite fast compared to accessing the HTML DOM through the objects that Invoke-WebRequest returns:
Add-Type -Path .\HtmlAgilityPack.dll
$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.Load("$pwd\forum.html")
$links = $doc.DocumentNode.SelectNodes('//li[@class="bbp-topic-title"]/a')
$properties = @(
    @{ Name = 'Title'; Expression = { $_.GetAttributeValue('Title', "") } }
    @{ Name = 'Href'; Expression = { $_.GetAttributeValue('href', "") } }
)
$links | Select-Object -Property $properties
-
August 9, 2014 at 2:06 pm #17914
BTW, you can use the same library to handle parsing of live webpages as well. This has the benefit of keeping the faster performance of the HTML Agility Pack library, and consistent code regardless of where the HTML came from. To do so, use Invoke-WebRequest as before, and pass its Content property to the LoadHtml method of HtmlAgilityPack.HtmlDocument:
Add-Type -Path .\HtmlAgilityPack.dll
$URL = "https://powershell.org/forums/forum/windows-powershell-qa/"
$data = Invoke-WebRequest -Uri $URL
$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.LoadHtml($data.Content)
$links = $doc.DocumentNode.SelectNodes('//li[@class="bbp-topic-title"]/a')
$properties = @(
    @{ Name = 'Title'; Expression = { $_.GetAttributeValue('Title', "") } }
    @{ Name = 'Href'; Expression = { $_.GetAttributeValue('href', "") } }
)
$links | Select-Object -Property $properties
-
The topic ‘Find text on a web page’ is closed to new replies.