Downloading Tables from webpages

This topic contains 33 replies, has 4 voices, and was last updated by  Rob Simmers 4 years ago.

  • Author
    Posts
  • #8690

    morpheus83
    Participant

    Hello,

    I am after creating a script where by I can download a table off a site and then look in a cell for a colour and if the colour matches red then sent a prompt stating the information in the cell next to it.

    Does anyone know how to do this?

    So far I can download the site however I cant pull the table or anything I have the following code :

    $webClient = new-object System.Net.WebClient
    $webClient.Headers.Add("user-agent", "PowerShell Script")
    $output = $webClient.DownloadString("http://www.google.co.uk")

     Obviously I have replaced google.co.uk with my actual site.

    Many Thanks

    James

  • #8696

    Don Jones
    Keymaster

    This is going to be HARD. Depending on how the page is built, it might be IMPOSSIBLE.

    In HTML, table cells are

    tags. The problem is, you can't just look at the tag to tell if it's a specific color. It might be but it might also be something like , with a style sheet applying the red color to that class. There's about a hundred ways a cell could end up being red – depending on how the page is built, it may be impossible to statically look at it and tell. The color may even be applied from a separate CSS file.
  • #8706

    morpheus83
    Participant

    Ah right. I have noticed now looking at the site that the word is in the colour as well so the colour of the cell is stated. So if the cell is green inside the cell it actually has the text Green. Would this make life a bit easier?

    Many Thanks

    James

  • #8713

    Don Jones
    Keymaster

    Well, "easier" is relative. You're dealing with a giant text document.

    If it's well-formed XHTML (and it should have a meta tag indicating that, near the top), then you can have the shell parse it as XML and manipulate it. That's easier.

    If it isn't, you're going to probably have to treat it as a giant hunk of text and use regular expressions to parse it. That's harder. You'll essentially have to construct a regex that looks for whatever text pattern you need (the table cell HTML) and captures the subexpression (the content of the table cell) into a variable.

    You MIGHT, if this isn't running on a server, be able to use the Internet Explorer COM object to parse the HTML string into a Document Object Model (DOM), which can be manipulated a bit like XML. Also not super-easier, but maybe easier than a regex.

    Sadly, I'm not your regex guy. If that's the direction you need to go, we'll try and elicit some more helpers for ya.

  • #8729

    morpheus83
    Participant

    Thanks for that I have had a look at the source and this is at the top which is what I presume you mean:

    !DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"

    html xmlns="http://www.w3.org/1999/xhtml"

    I presume this means that we can use XML?

    Many Thanks

    James

  • #8731

    morpheus83
    Participant

    I think I have progressed a bit from some searching but a bit stuck now.

    I can get the table from the page but thats about it currently:

    $ie = New-Object -com "InternetExplorer.Application"
    $ie.navigate(http://www.google.co.uk)
    $doc = $ie.Document
    ($doc.getElementsByTagName("table"))|%{$_.innerHTML}

    Any help would be greatly apprecited.

    Many Thanks

    James

  • #8776

    morpheus83
    Participant

    ok I Have got the following:

    $ie = New-Object -com "InternetExplorer.Application"
    $ie.navigate("http://www.google.co.uk")
    $doc = $ie.Document
    ($doc.getElementsByTagName("table"))|%{$_.innerText} | Out-File C:\Output.txt

    $Red = Select-String C:\Output.txt -pattern "Red"
    $Yellow = Select-String C:\Output.txt -pattern "Yellow"

    If(Select-String C:\Output.txt -pattern "Email and OWA"){
    $Service = Select-String C:\Output.txt -pattern "Email and OWA"
    $Test = $Service -split("Green")
    $Test
    $Test[0]
    $Test[1]
    $Test[2]
    }

    Which essentially dumps the table with no formatting into a text file and I am trying to sort out the information. However there are no spaces or anything between the columns of the table which makes things difficult.

    An extract of the information as shown is here:

    Email and OWAPlanned maintanance affecting external emailsGreenPlease be aware that on Wednesday the 31st July and Wednesday 7th August there will be interruptions to
    external e-mail flowing in and out of the organisation.
    The service will stop at 5.30pm and may be down until 9.00pm.

    Can anyone give any idea's on this?

    Many Thanks

    James

  • #8783

    Don Jones
    Keymaster

    I'd say you want to sort that out BEFORE you get rid of the HTML. The HTML tags are how you tell the difference between columns.

  • #8824

    morpheus83
    Participant

    Ahhh right how would I go about that? I cant seem to fathom it out.

    Many Thanks

    James

  • #8826

    Don Jones
    Keymaster

    Well, that's the thing. It's hard. You have to write a regex to capture the "

    whatever

    " and then a subexpression to capture just the contents of the cell. Again, I'm not especially good at that. Me, I'd try and parse the file as XML or use a DOM approach, because it's what I'm more comfortable with.

  • #8828

    morpheus83
    Participant

    Hmmm I have no idea which either of them mean so any information would be greatly appreciated. I did try to get it out as an XML however it didnt view correctly. I used the ConvertTo-XML cmdlet if thats any use?

    Many Thanks

    James

  • #8834

    Don Jones
    Keymaster

    No, ConvertTo-XML isn't right, and once you parse as XML it won't "display" at all. It's an object hierarchy. If your HTML document is in $html, you'd do "[xml]$xml = $html" to attempt to parse the HTML as XML and create an XML object model. If it errors, then the HTML isn't compliant with XML standards (which isn't unusual – HTML isn't exactly the same as XML, although they're close).

  • #8836

    morpheus83
    Participant

    Using the code above I have tried the code you suggested which in my code bove is:

    [xml]$xml = $doc

    and I get the following error:

    Cannot convert value "mshtml.HTMLDocumentClass" to type "System.Xml.XmlDocument". Error: "The specified node cannot be
    inserted as the valid child of this node, because the specified node is the wrong type."
    At line:1 char:1
    + [xml]$xml = $doc
    + ~~~~~~~~~~~~~~~~
    + CategoryInfo : MetadataError: (:) [], ArgumentTransformationMetadataException
    + FullyQualifiedErrorId : RuntimeException

    So does this mean that I am unable to do it? If this is the case how can I progress or is this where the regex stuff comes in?

    Many Thanks

    James

  • #8837

    Don Jones
    Keymaster

    That probably means it won't work, and in terms of how you progress with writing the regex – unfortunately, as I said, I'm not the regex master :(. I'm personally not going to be much help to you. But I'll try and get a couple of other folks to take a look at this.

    If you can, try to provide a BRIEF snippet of the actual HTML – you will need to attach that as a file, since you can't post actual HTML in the forums. It's often helpful to see what the actual data looks like.

    Then, if you can provide a concise explanation of what you want to extract from the HTML. I *think* what you want is something like, "I need to identify table cells that contain the word 'Red' and extract those contents into a variable," but it'll be better if you re-state the problem you're trying to solve. We've gone down a couple of paths in this thread, so it'd be good to make sure anyone coming in at this point can quickly understand what you're after.

  • #8838

    morpheus83
    Participant

    Ok thank you for your help.

    I am after trying to get the row which contains either Red or Yellow into a variable and then output the information in the cells of that row out so that it can be displayed to the user.

    I hope this makes sense.

    I have the following code:

    $ie = New-Object -com "InternetExplorer.Application"
    $ie.navigate("http://www.google.co.uk/")
    $doc = $ie.Document

    ($doc.getElementsByTagName("table"))|%{$_.innerhtml} | Out-File C:\Output.txt

    $Red = Select-String C:\Output.txt -pattern "Red"
    $Yellow = Select-String C:\Output.txt -pattern "Yellow"

    Which works however I need it better formatted which is what I am unable to work out how to do.

    An example of the Output.txt is attatched and the site has obviously changed

    Any Assistance would be greatly appreciated.

    James

  • #8849

    Rob Campbell
    Participant

    I'll take a stab at it.
    I generally approach this using a multi-line single-line regex, in a here-string (I call it "jumbo shrimp in a can"). Let me review the requirments and that data and see what I can come up with.

  • #8850

    Rob Campbell
    Participant

    First problem – there doesn't appear to be any "Red" or "Yellow" in the sample data. Are we looking for background color, or foreground color?

  • #8857

    Rob Simmers
    Participant

    I believe I've accomplished what you are looking for with HTML DOM parsing, which is the first attempt in Powershell and my head hurts a little from beating it on my desk. However, this looks like a dashboard for some internal status, so I would highly suggest trying to get access to a database\view\webservice or something to get raw data versus page crawling. This could work today and tomorrow it will not because you are the mercy of the developers and how the parsing is setup in the script at the moment.

    I took the Output1 and 2 text files and just put a starting and ending table tags and ran the script against a local copy. From reading through your conversation thus far, it looks like you are trying to get critical and warning status by filtering on the color during your parse. I would leverage Powershell and pull all results and then filter on those results versus trying to figure out all of the DOM parsing to only pull certain results. This code:

    $ie = New-Object -com “InternetExplorer.Application”
    $ie.navigate(“C:\Users\Rob\Desktop\output2.html”)

    $htmlResults = @()

    foreach($table in $ie.Document.getElementsByTagName(“table”)) {
    $tableHdr = @()

    foreach($th in $table.getElementsByTagName("th")) {
    $columnName = $th.getElementsByTagName("a").Item(0).innerHTML
    $tableHdr += $columnName
    }

    foreach($tr in $table.getElementsByTagName("tr")) {
    #filter to find the rows in the table where the data is
    if($tr.style.backgroundColor -eq "rgb(239, 243, 251)" -or $tr.style.backgroundColor -eq "white") {
    $tds = $tr.getElementsByTagName("td")
    $rowProps = New-Object PSObject
    for ($i=0; $i -lt $tds.Length; $i++) {
    if ($tds.Item($i).innerHTML -like "
    produces the following results (removed the Email and OWA for formatting):

    Council Services Description Status Information
    —————- ———– —— ———–
    Carefirst – Adults No major problems Green
    Citrix No major problems Green
    Civica APP No major problems Green
    Civica IBS No major problems Green
    website No major problems Green
    EDRMS No major issues Green
    Payroll System No major problems Green
    Frontline No major problems Green
    Internet No major problems Green
    Intranet No major problems Green
    Tribal – Childrens' No major problems Green
    Northgate Housing System No major issues Green
    Wireless network No major issues. Green
    SAP No major problems Green
    VOIP Telephone System No major problems Green
    VPN Homeworking Access No major problems Green
    PECOS No major problems Green
    PARIS No major problems. Green
    Other No major problems Green
    Mobile Telephony Service No major problems Green
    Office Online No major problems Green
    iTrent HR Payroll No major problems Green

    Now that you have that, if there were Red or Yellow status, you could just do simple filters:

    $htmlResults | Where {$_.Status -eq "Red"}

  • #8863

    Rob Campbell
    Participant

    Here's a regex solution that works with the posted test data. There appears to be 2 color settings per entry, and none of them have values of Red or Yellow. I'm not sure which one is supposed to change, so I parsed out both of them, and you can key off of whichever one you need.

    $data = gc .\Output1.txt -raw

    [regex]$regex1 = @'
    (?ms)

    .+? .+? (.+?) (.+?) }
    }

  • #8910

    morpheus83
    Participant

    Rob,

    Thank you for your reply the code you have supplied seems to work fine however it doesnt seem to pull all results from the table when I put it against the actual webpage.

    The results I get are:

    Citrix
    Civica IBS
    EDRMS
    Network
    Frontline
    Intranet
    Northgate Housing System
    SAP
    VPN Homeworking Access
    PARIS
    Mobile Telephony Service
    iTrent HR Payroll

    Which from your example isnt correct. The only thing I have changed is the address it looks at in the $ie.navigate and that was the result.

    Do you have any idea's?

    Many Thanks

    James

  • #8912

    Rob Simmers
    Participant

    As I mentioned, it's tough to build the code because if anything changes, no worky. Take a look at the tables on the page. I believe the issue is the looping through the tables versus only parsing the table you care about. If the table has a name or ID then you can do .GetElementByName or .GetElementByID versus looping through all tables with .GetElementsByTagName("table"). If the table has an inconsistent or no name\id, you can identify it by index or .Item(1) (which 1 indicates the second table on the page).


    $ie = New-Object -com “InternetExplorer.Application”
    $ie.navigate(“C:\Users\Rob\Desktop\output2.html”)

    $htmlResults = @()

    foreach($table in $ie.Document.getElementsByTagName(“table”)) {
    $table
    }

    That will get you all of the tables on the page and dump all of the attributes of that table. You are looking for a unique identifier to help filter out tables you don't want to parse. Use the rest of the code I posted and try to limit the table parsing to just the table you want to parse.

  • #8924

    Rob Campbell
    Participant

    It appears the sample data isn't the same as what's on the web page. The regex could probably be adjusted to work with the actual data, but I can't do that without knowing what's different.

  • #8946

    morpheus83
    Participant

    Hello,

    I dont know if it will help but here is the dump from your command about pulling the Tables from the page (attatched). I have tried the getElementByID and that returns the same information which is why I am confused.

    I presume I am missing something in the HTML?

    Many Thanks

    James

  • #8948

    Rob Campbell
    Participant

    What are you using to retrieve the data that you're using the regex on? That's made to work with a single, multiline string. The first coded you posted (using system.net.webclient) should do that.

  • #8949

    morpheus83
    Participant

    I am sorry I dont understand I am not using a regex as far as I am aware. The code I orgionally posted was from what I threw together after some googling and playing around.

    Could you please explain a little more and I maybe able to help?

    Many Thanks

    James

  • #8950

    Rob Campbell
    Participant

    Right now you have two solutions, one using regex and one using DOM. You also have two different methods of getting the data from the web page – one using system.net.webclient and one using IE. The regex solution I posted is appropriate for raw page source data, like the data stream you'd get back from reading a web page using system.net.webclient.

  • #8954

    morpheus83
    Participant

    I think it would be easier to use the IE method since this is on all machines. Although your regex I must admit I havent tried mainly because I am not sure about how to get the Output file each time for it to be used so although this may work better I dont fully understand it.

    Many Thanks

    James

  • #8957

    Rob Simmers
    Participant

    Try this:

    $ie = New-Object -com “InternetExplorer.Application”
    $ie.navigate(“C:\Users\Rob\Desktop\output2.html”)

    $htmlResults = @()

    #foreach($table in $ie.Document.getElementsByTagName(“table”)) {
    $table = $ie.Document.getElementsByTagName("table").Item(1)
    $tableHdr = @()

    foreach($th in $table.getElementsByTagName("th")) {
    $columnName = $th.getElementsByTagName("a").Item(0).innerHTML
    $tableHdr += $columnName
    }

    foreach($tr in $table.getElementsByTagName("tr")) {
    #filter to find the rows in the table where the data is
    if($tr.style.backgroundColor -eq "rgb(239, 243, 251)" -or $tr.style.backgroundColor -eq "white") {
    $tds = $tr.getElementsByTagName("td")
    $rowProps = New-Object PSObject
    for ($i=0; $i -lt $tds.Length; $i++) {
    if ($tds.Item($i).innerHTML -like "

    So, versus looping through each table in the page, we're statically setting it to the table with the index item. Indexing starts at 0, so table 0 in your page is just used for formatting the current date. Index 1 is the table you want to parse. If you look at your output3.txt, it shows a ASP.net generated ID (ctl00_ContentPlaceHolder2_GridView1), so it can be directly accessed that way too:

    $table = $ie.Document.getElementByID("ctl00_ContentPlaceHolder2_GridView1")

    That ID is a randomly generated ID by ASP.net, so it is not a good ID to hardcode as it can change. Hopefully this works for you and DOM parsing makes a little more sense.

  • #8968

    morpheus83
    Participant

    Hello,

    Thank you for that I am still only getting the following:

    Citrix
    Civica IBS
    EDRMS
    Network
    Frontline
    Intranet
    Northgate Housing System
    SAP
    VPN Homeworking Access
    PARIS
    Mobile Telephony Service
    iTrent HR Payroll

    However after having a look at the code on the page the colours are slighty different where you have:

    if($tr.style.backgroundColor -eq "rgb(239, 243, 251)"

    The page code as:

    if($tr.style.backgroundColor -eq "#eff3fb"

    So I have updated the code with the above and its working as it should.

    Yes the DOM approach is simpler and I am now guessing (as I will have to test) but I can search for anything which has a status of red or yellow and then I can just pop that to screen.

    Thank you very much for your help I appreciate it all from everyone in the thread.

    James

  • #8969

    Rob Simmers
    Participant

    Another mystery solved, glad it's working for you.

  • #8970

    morpheus83
    Participant

    Yes it is.

    I do however have one slight question which is confusing me...

    I have added in the following:

    `
    
    $Red = $htmlResults | Where{$_.Status -eq "Red"}
    $Yellow = $htmlResults | Where{$_.Status -eq "Yellow"}
    $Green = $htmlResults | Where{$_.Status -eq "Green"}
    
    If($Red -ne ""){
    ForEach-Object{
    Write-Host $Red."Services" 
    Write-Host $Red.Description 
    Write-Host $Red.Status
    Write-Host $Red.Information
     }
    }
    
    If($Yellow -ne ""){
    ForEach-Object{
    Write-Host $Yellow."Services" 
    Write-Host $Yellow.Description 
    Write-Host $Yellow.Status
    Write-Host $Yellow.Information
     }
    }
    
    `

    Which works fine and pulls out the relevant information however if there is more than one of them it throws all the information together.

    So if I output $Yellow (On test) I get:

    Services Description Status Information
    ——– ———– —— ———–
    Carefirst System upgrade planned for... Yellow System will not be availab...
    Internet Slow between 12 – 2pm Yellow Service affected: Internet...
    Intranet Slow between 12 – 2pm Yellow Service affected: Internet...

    However if I look at the output from the code above:

    `
    
    If($Yellow -ne ""){
    foreach($line in $lines){
    Write-Host $Yellow."Services" 
    Write-Host $Yellow.Description 
    Write-Host $Yellow.Status
    Write-Host $Yellow.Information
     }
    }
    
    `

    I get the following output (code tags due to html in output):

    `
    Carefirst Internet Intranet
    System upgrade planned for 4th - 6th Feb  Slow between 12 - 2pm Slow between 12 - 2pm
    Yellow Yellow Yellow
    System will not be available for 3 days.
    Please make provisions for this in your service areas. Service affected: In ternet Access particularly between 12 - 2pm.

    Who is affected? All Internet users including home workers and certa in remote offices.

    Problem description: General slowness of Internet access caused by greater use of the Internet at these times. Problems connecting remotely for home workers and site to site VPN attached offices.

    What is cur rently being done to resolve it? Internet traffic reviewed and a rationalisation taken place. New lines have been added. This has released additional capacity to be used until the internet feed is upgraded.

    Next steps: Upgrade curren t internet feed.

    Expected solution: Feb 2011 Service affected: Internet Access particularly between 12 - 2pm
    < BR>Who is affected? All Internet users including home workers and certain remote offices.

    Problem description: Ge neral slowness of Internet access caused by greater use of the Internet at these times. Problems connecting remotely for home workers and site to site VPN attached offices.

    What is currently being done to resolve it? Internet traffic reviewed and a rationalisation taken place. New lines have been added. This has released additional capacity to be used until the internet feed is upgraded.

    Next steps: Upgrade current internet feed

    Expected Solution: Feb 201 `

    Do you have any idea's what I have done wrong? I am thinking I need a foreach rather than a foreach-object but I cant get it to work...

    Many Thanks

    James

  • #8971

    Rob Simmers
    Participant

    The issue is you are looping through and pulling individual properties and writing them to your console as strings. If you type $yellow in the console, it is an object and will display properly. What are you trying to do with the individual results? This code has several issues:


    If($Yellow -ne ""){
    foreach($line in $lines){
    Write-Host $Yellow."Council Services"
    Write-Host $Yellow.Description
    Write-Host $Yellow.Status
    Write-Host $Yellow.Information
    }
    }

    What is $lines? The code is saying for every item ($line) in $lines, do x. Then $Yellow.ColumnName is going to dump the entire column as a string.

    Try something like this:

    $yellow | foreach{$_}

    or

    foreach ( $line in $yellow ) { $line }

    Another command you should look at is:

    $htmlResults | Group-Object -Property Status -NoElement

  • #8972

    morpheus83
    Participant

    Hello,

    Nevermind sorted it out after some fiddling and realising I was going about it wrong!

    $Red = $htmlResults | Where{$_.Status -eq "Red"}
    $Yellow = $htmlResults | Where{$_.Status -eq "Yellow"}
    $Green = $htmlResults | Where{$_.Status -eq "Green"}

    If($Red -ne ""){
    foreach($line in $Red) {

    $Start = "More Information: "
    $NewLine = $line.Information -replace "
    " , "`n"
    $Info = "$Start $NewLine"

    Write-Host "Service: " $line."Council Services"
    Write-Host "Description: " $line.Description
    Write-Host "Status: " $line.Status
    Write-Host $Info
    Write-Host " "
    Write-Host " "
    }
    }

    If($Yellow -ne ""){
    foreach($line in $Yellow){
    $Start = "More Information: "
    $NewLine = $line.Information -replace "
    " , "`n"
    $Info = "$Start $NewLine"
    Write-Host "Service: " $line."Council Services"
    Write-Host "Description: " $line.Description
    Write-Host "Status: " $line.Status
    Write-Host $Info
    Write-Host " "
    Write-Host " "
    }
    }

    Which got me the spaced output I required.

    Many Thanks

    James

  • #8973

    Rob Simmers
    Participant

    Try:

    $yellow | Format-Table -AutoSize -Wrap

You must be logged in to reply to this topic.