Scraping A Screen of Links - Saving The Content



    by Kudzu_Kid at 2012-12-21 10:10:05

    Hi All,

My first post, so be kind... The sign said "noobs & beginners welcome"... That would be me.

I began playing with PowerShell a month or so ago. I'm still slogging my way through "Windows PowerShell in Action" by Payette. I've done some small scripts for work, such as taking a CSV, parsing out the user IDs, and changing their memberships in AD, etc.

But this task is a 'personal' chore, and I'm not sure if it's ideal for PowerShell or not. I'm lazy, and PowerShell is s'posed to make my life easier, right? So that's the tool I intend to use. Enough of the disclaimers; here's some background:

    I subscribe to my old hometown newspaper (1400 miles away, so "hard copy" delivery from them isn't happening). They've entered the 21st century kicking and screaming, and now offer an "online" edition. In a nutshell, that means they offer a webpage with a bunch of JPG thumbnails of the day's edition pages. If you left-click on them, you'll get a partial screen full of a page. It's a sketchy JPG (of pictures AND text!), not a clear PDF. Soooo, if I RIGHT-click on a hyperlink under the thumbnail, I have the option to "Save Link As.." – which will save the page as a single PDF.

Here's a screenshot which shows what I'm talking about...

[user-uploaded image]

Here we go:

How can I tell PS to save ALL the available PDFs on the page? Some problems include:
1) The site where the paper is ACTUALLY hosted uses "virtual directories", so you don't know, issue by issue, where to look for the PDFs – you need to come in through the local paper's site, authenticate, etc. THEN (and only then) the real hosting site presents the contents in a frame on the local paper's "site".

2) Usually "today's" paper starts with "A01" and runs anywhere up to "A16" or "A32" (those correspond to page numbers). The page count is usually even, but it varies from issue to issue; 32 seems to be the highest number of pages in a weekday issue (unless it's a special weekend edition or whatever). Could be less.
I'm willing to manually intervene or 'help' the script as needed ("All done, or more to do?" If more, advance the browser to the next page, etc.).
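To make that concrete, here's the kind of brute-force loop I'm imagining – totally untested, and the base URL and file-name pattern below are made up, since I don't know the real hosting site's paths:

```powershell
# Untested sketch -- the base URL and file-name pattern are placeholders,
# since the real hosting site's paths aren't known until after login.
$baseUrl = 'http://hostingsite.example.com/issues/current'
$outDir  = 'C:\Newspaper\Today'
New-Item -ItemType Directory -Path $outDir -Force | Out-Null

foreach ($n in 1..32) {
    $page = 'A{0:D2}.pdf' -f $n            # A01.pdf, A02.pdf, ... A32.pdf
    try {
        Invoke-WebRequest -Uri "$baseUrl/$page" `
            -OutFile (Join-Path $outDir $page) -ErrorAction Stop
        Write-Host "Saved $page"
    }
    catch {
        # First missing page number = end of today's issue
        Write-Host "No $page -- assuming that's the end of today's edition."
        break
    }
}
```

That handles the "variable page count" problem by just stopping at the first request that fails.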

So all I want to do, initially, is just scrape out today's PDFs (there might be 16, or however many). I then use Adobe Acrobat's "Combine Files" to grind the individual files into one multi-page PDF, however many single PDFs there were. Does that make sense?

    Occasionally (but fairly rarely) the paper will span through the "Axx"'s into the "Bxx"'s – for example if they include some special "weekend" edition, or "holiday" or "vacation" stuff. But, for now, I say "K.I.S.S." applies (unless someone out there is REALLY generous with their time and mad coding skilz!).

So here's my question:

How can I intelligently scrape those "Axx"-named links off the page with PowerShell? Odds are fairly high that if a link name begins with "B01", I've gone beyond TODAY's edition. I'll manually muck around for anything besides "today's" edition if need be. If I can get PS to glean it for me, I'll be (VERY!) grateful. Can you give me an example of some code to at least start with? Note: for reasons that go far beyond the scope of this note, I'd prefer to use Firefox, but I can use IE if that makes a difference for some reason.
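I'm guessing something in this ballpark is what I want, but I don't know if I'm even close – untested guesswork, and it assumes the page's hrefs actually contain the "Axx...pdf" names:

```powershell
# Untested guesswork -- assumes Invoke-WebRequest can see the page after
# authentication, and that the hrefs contain the "Axx...pdf" file names.
$page = Invoke-WebRequest -Uri 'http://mypaper.example.com/online-edition'

# Keep only hrefs like A01.pdf .. A32.pdf; anything starting with "B" is skipped
$todaysPdfs = $page.Links |
    Where-Object { $_.href -match 'A\d{2}[^/]*\.pdf$' } |
    Select-Object -ExpandProperty href

foreach ($href in $todaysPdfs) {
    # (if the hrefs turn out to be relative, the site's base URL
    #  would need to be prepended here)
    $name = Split-Path $href -Leaf
    Invoke-WebRequest -Uri $href -OutFile "C:\Newspaper\Today\$name"
}
```

Is that regex-on-the-links approach the right general idea, or am I way off?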

    If anyone needs more specific details, shoot me an email at:
    powershellorg AT pferris DOT com
    And I'll be happy to answer any questions, comments, etc.

    Ideally, the script could be run from home -OR- work (AD / Domain)...

Yes, it WOULD BE sweet if they would just have a single link to click to download the entire paper as a PDF (who the heck would WANT to download an entire newspaper one file/page at a time? Yet, that's where I find myself...). When I emailed their "Chief Technology & Information Officer" (bet YOUR paper doesn't have one of THOSE, LOL!), he acted as if the concept of a complete PDF download was blasphemy, stating that no one else would want or need that – MORE work for him, etc. Personally, I'd rather download 1 file than 16, 18, 20... 32! He also tried to tell me that "Adobe Acrobat Pro is free and comes bundled with every system for years now". Wowww! Guess Adobe really ripped me off again, eh? So, I've ceased communication with him. %^}

Thanks so much for pointing me in the right direction, and thanks tons for making the valuable resource PowerShell.org available to neophytes like myself. Sorry if this is too convoluted! And as I said, this task may not lend itself to PS in the first place.

    Cheers,

    –Pete

    by Klaas at 2012-12-23 02:30:40

    Hi Pete

I wish everyone would formulate, describe, and illustrate a question or problem like you do. Very nice, and an interesting one too.
I'm not really experienced with this, but I guess I would start with Get-Help Invoke-WebRequest -Full | more
You can probably enter your credentials and apply a filter, or pipe the result to another cmdlet, to find the links you need.
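For the login part, maybe something along these lines – the URL and form field names here are just guesses, so you'd have to look at the paper's real login form to get the actual ones:

```powershell
# Rough sketch: log in once, keep the cookies in a web session, reuse it.
# The login URL and field names are guesses -- inspect the real form.
$cred  = Get-Credential    # prompts for your subscriber user name/password

$login = Invoke-WebRequest -Uri 'http://mypaper.example.com/login' -SessionVariable session
$form  = $login.Forms[0]
$form.Fields['username'] = $cred.UserName
$form.Fields['password'] = $cred.GetNetworkCredential().Password

Invoke-WebRequest -Uri $form.Action -WebSession $session `
    -Method Post -Body $form.Fields | Out-Null

# From here on, pass -WebSession $session and the requests are authenticated:
$edition = Invoke-WebRequest -Uri 'http://mypaper.example.com/online-edition' -WebSession $session
$edition.Links | Select-Object href
```

No guarantees – just the direction I'd start digging in.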

    by Kudzu_Kid at 2013-01-17 07:27:09

Thanks very much, Klaas, I appreciate you taking the time to answer my question.

    If I have success with anything, I'll let you know! Again, my thanks for your response!

    –Pete
