How to crawl same url using Scrapy?

How to scrape the Google cache

  • Had a plan to scrape a website, but now it's down indefinitely. Google has the site cached, but this makes things kind of complicated. Newbie questions about scraping websites and using the Google cache inside. I've read http://ask.metafilter.com/165937/How-do-I-download-an-entire-website-from-Googles-cache - should I be trying to use Warrick?

    I'm starting a small side project that involves scraping a site and doing some analysis/visualization of the resulting dataset. I'm planning on using Python and Beautiful Soup to do the scraping - the site is laid out very consistently, looks like it would be easy to scrape/parse, and would make for a good learning exercise (I'm new to this, so apologies for any incorrect terminology). Unfortunately, the timing of this idea coincided with the very recent (permanent) shutdown of the site's servers. I still want to follow through, but this additional complication has thrown a bit of a wrench into the works. Google has the site cached, and I'd like to see if I can piece together the data using what's archived.

    The site is basically structured like a blog with many, many posts. The first page shows the 20 latest posts, with older posts pushed back to subsequent pages. The URL structure is static and follows this format: if post 1 is the most recent post, then at any given time "website.com/page/2/" shows posts 21-40, "website.com/page/3/" shows posts 41-60, and so on. When a new post is submitted, it becomes post 1 and everything else is pushed back one. Posting frequency is probably ~30 posts/day.

    My problem is that not all of the pages were last crawled on the same date, resulting in some overlaps and some gaps in the data due to the way the page content shifts between archive dates. I'm not concerned with overlaps or having the most recent crawl, but in an ideal world I'd have a dataset that covers all posts made over a timeframe of ~1 year with no gaps in dates. I don't think this is possible, though.

    1) Can I scrape the Google cache the same way I'd scrape the original site (use urllib and point the script to "http://webcache.googleusercontent.com/search?q=cache:http://website.com/page/2/") for all existing pages? According to http://apps.ycombinator.com/item?id=271130, it looks like crawling the Google cache violates their Terms of Use. What's the next best alternative? Is using --wait=seconds kosher? (The Wayback Machine doesn't have nearly as much archived as Google does.)

    2) Is there any way to access earlier cached versions of the same URL, or are they overwritten? My reasoning is that I could perhaps eliminate some of the gaps if I could scrape all versions of "website.com/page/3/" and just get rid of duplicate entries (because they'd inevitably end up on page 4 or 5 in a later crawl).

    Like I said, I'm new to this whole area, so I'd also like to use this post as a sanity check - is anything I'm saying here wrong/impossible/etc.? Any other advice? Thanks for your help!

  • Answer:

    You may want to check archive.org's Wayback Machine (http://archive.org/web/web.php) as well, which could help cover question 2) above.

hot soup at Ask.Metafilter.Com
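
For readers who want to try the Wayback Machine route this answer suggests: archive.org exposes an availability API that returns the closest archived snapshot of a URL as JSON, which can be used to see which of the paginated archive URLs from the question were ever captured, and to probe for captures near a given date (which is roughly what question 2) is after). The sketch below is an illustration, not code from the thread; it assumes the question's placeholder domain website.com and standard-library Python 3.

    # Rough sketch (not from the thread): query the Wayback Machine's
    # availability API for each paginated archive URL from the question.
    # "website.com" is the asker's placeholder domain.
    import json
    import time
    import urllib.parse
    import urllib.request

    PAGE_URL = "http://website.com/page/{}/"       # placeholder domain from the question
    API = "https://archive.org/wayback/available"  # Wayback availability JSON API

    def closest_snapshot(url, timestamp=None):
        """Return the closest archived snapshot record for `url`, or None."""
        params = {"url": url}
        if timestamp:                  # e.g. "20120115" to bias toward a date
            params["timestamp"] = timestamp
        with urllib.request.urlopen(API + "?" + urllib.parse.urlencode(params)) as resp:
            data = json.load(resp)
        return data.get("archived_snapshots", {}).get("closest")

    for page in range(2, 7):           # a handful of archive pages as a demo
        snap = closest_snapshot(PAGE_URL.format(page))
        if snap and snap.get("available"):
            print(page, snap["timestamp"], snap["url"])
        else:
            print(page, "no snapshot found")
        time.sleep(1)                  # keep the request rate polite

Varying the timestamp parameter is what lets you look for older captures of the same page; the Google cache, as Rhomboid notes below, only exposes a single version.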


Other answers

I just tried curl and Python's urllib2, and got a 403 Forbidden error with both when trying to retrieve from the Google cache. You'll need to change/spoof the User-Agent.

Depending on how the page was originally structured, you'll need to do some URL rewriting when you extract data from the cached post listing page. If the original page was coded with relative links, you'll need to add both the Google cache preamble as well as the site URL. If they are absolute URLs, you'll need to prepend just the Google cache prefix.

Note that Google adds a <base> element in the <head> pointing to the original URL of the content, which means that in a browser relative URLs will try to go to the site instead of the Google cache. But if you're scraping, you probably won't see that behavior, because things like BeautifulSoup don't interpret such tags, as far as I know.

If Google does internally store multiple revisions of the same page, they don't make them available in any way, AFAIK, so what you see is what you get.

Rhomboid
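
As a concrete illustration of the two points above (spoofing the User-Agent and rewriting links back into the cache), here is a minimal Python 3 sketch. The thread's urllib2 is Python 2, so urllib.request is used instead, and BeautifulSoup 4 is assumed to be installed; the cache URL format and website.com come from the question, and the User-Agent string is just an example.

    # Sketch of Rhomboid's advice: send a browser-like User-Agent to avoid the
    # 403, then rewrite each link so it points back into the Google cache.
    import urllib.request
    from urllib.parse import urljoin

    from bs4 import BeautifulSoup   # assumes beautifulsoup4 is installed

    CACHE_PREFIX = "http://webcache.googleusercontent.com/search?q=cache:"
    ORIGINAL = "http://website.com/page/2/"   # placeholder URL from the question

    req = urllib.request.Request(
        CACHE_PREFIX + ORIGINAL,
        headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0"},
    )
    with urllib.request.urlopen(req) as resp:
        html = resp.read()

    soup = BeautifulSoup(html, "html.parser")

    # urljoin resolves relative links against the original site URL and leaves
    # absolute links untouched; prepending CACHE_PREFIX then covers both cases
    # Rhomboid describes.
    for a in soup.find_all("a", href=True):
        original_target = urljoin(ORIGINAL, a["href"])
        a["href"] = CACHE_PREFIX + original_target

    print(soup.title.string if soup.title else "no <title> found")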

Are you sure that's the only archive URL scheme available? For instance, if the site were published using Tumblr, there would also be canonical monthly archives like "website.com/archive/2012/1" - this would at least simplify the issue with duplicate posts.

bcwinters

Warrick's gotten a bit of an overhaul - it won't work with the Google cache, I think, but it can use Memento interfaces with a variety of archives, not just the Wayback Machine. It's probably best to let Warrick do its job, then let your software work on local files.

Pronoiac
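
If you go the route Pronoiac suggests (let Warrick mirror the pages, then point your own code at the local copies), the analysis step might look something like the sketch below. The output directory, the <article> markup, and the permalink convention are all made up for illustration; the real selectors depend on how the site's HTML is actually structured.

    # Hypothetical post-processing after Warrick (or any mirroring tool) has
    # saved the archive pages locally: parse each file and deduplicate posts
    # that appear on more than one archived page.
    from pathlib import Path

    from bs4 import BeautifulSoup   # assumes beautifulsoup4 is installed

    MIRROR_DIR = Path("warrick-output")   # hypothetical output directory
    posts = {}                            # permalink -> post text; dict keys dedupe overlaps

    for html_file in sorted(MIRROR_DIR.glob("**/*.htm*")):
        soup = BeautifulSoup(html_file.read_text(errors="replace"), "html.parser")
        # Hypothetical markup: each post is an <article> whose first link is its permalink.
        for article in soup.find_all("article"):
            link = article.find("a", href=True)
            if link is None:
                continue
            posts.setdefault(link["href"], article.get_text(" ", strip=True))

    print(f"collected {len(posts)} unique posts from {MIRROR_DIR}")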
