Screen scraping etiquette
-
Looking to start a 20k+ request screen scraping project - what sort of guidelines (in addition to those I plan to implement) do I need to follow to avoid having the hounds sent out after me?

A corporate site has a collection of ~20k freely available pages that I'd like to download for a personal database. This database wouldn't be for anything but personal use. Their robots.txt lists only their sitemap URL and User-agent: *. Their terms of service are the standard rigmarole. Only two points make me think they'd have a problem with me scraping them: "Other than connecting to Service Provider's servers by http requests using a Web browser, you may not attempt to gain access to Service Provider's servers by any means - including, without limitation, by using administrator passwords or by masquerading as an administrator while using the Service or otherwise." and "You may not in any way make commercial or other unauthorized use, by publication, re-transmission, distribution, performance, caching, or otherwise, of material obtained through the Service, except as permitted by the Copyright Act or other law or as expressly permitted in writing by this Agreement, Service Provider or the Service." Emphasis added.

So, while they want you to access the service with a web browser, there is no specific prohibition against automated methods - crawlers, spiders, robots, etc. My other concern is the "caching", which I technically would be doing. Since Google has indexed and cached their site, it seems that this isn't really a problem, though.

I've never scraped that many pages before, but my plan was to be nice about it: put in a 15-30 second delay between requests (I don't care if it takes a long time), only run the script during off-peak hours, and set the user-agent to announce myself and contact info. They don't have a "webmaster" e-mail address, only feedback. Should I bother sending them an email to "ask permission"? Also, might I be better off not announcing myself with the user-agent?
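As a rough illustration of the plan described in the question (not anything the site specifies), here is a minimal sketch assuming Python with the requests library; the URL list, contact address, and off-peak window are placeholders:

```python
import random
import time
from datetime import datetime

import requests

# Hypothetical inputs: replace with the real page list and your own contact info.
URLS = ["https://example.com/page/%d" % i for i in range(1, 20001)]
HEADERS = {"User-Agent": "PersonalArchiveBot/0.1 (contact: you@example.com)"}
OFF_PEAK_HOURS = range(1, 7)  # assume 01:00-06:59 local time counts as off-peak

session = requests.Session()
session.headers.update(HEADERS)

for url in URLS:
    # Only run during the assumed off-peak window; otherwise sleep and re-check.
    while datetime.now().hour not in OFF_PEAK_HOURS:
        time.sleep(3600)

    resp = session.get(url, timeout=30)
    if resp.ok:
        # ... save resp.text to the personal database ...
        pass

    # The 15-30 second pause between requests, as planned.
    time.sleep(random.uniform(15, 30))
```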
-
Answer:
A couple of extra tips, from someone who's both scraped lots of data and defended a website from scrapers:
Delay is good. Random delay is better. If the requests don't show up like clockwork every N seconds, they're harder to identify.
Be very, very certain your scraper does not have some failure mode where it re-fetches the request immediately if the request fails. That quickly leads to misery.
You're more likely to succeed if you emulate an MSIE user-agent string. If you want to be polite and don't care if you're caught, by all means put your email address in the user-agent.
For particularly unfriendly sites I've needed to scrape, I've gotten pages via Tor. It's much slower and less reliable, but now your requests are coming from a bunch of different IPs.
NormandyJack at Ask.Metafilter.Com
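A hedged sketch of the "never re-fetch immediately on failure" advice above, again assuming Python's requests library; the starting delay and attempt count are arbitrary choices, not anything the answer prescribes:

```python
import random
import time

import requests

def polite_get(session, url, max_attempts=3):
    """Fetch a URL, backing off instead of hammering the server on failure."""
    delay = 60  # wait a full minute after the first failure (assumed value)
    for attempt in range(max_attempts):
        try:
            resp = session.get(url, timeout=30)
            if resp.status_code == 200:
                return resp
        except requests.RequestException:
            pass  # network error: fall through to the backoff below
        # Never retry immediately; wait longer (plus jitter) after each failure.
        time.sleep(delay + random.uniform(0, 30))
        delay *= 2
    return None  # give up on this page rather than looping forever
```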
Other answers
For what it's worth, that sounds like a pretty standard terms of service. How can you be expected not to cache pages? There's nothing really illicit about it as long as you don't deny service to others or republish the data. I'd do it via Tor, space out the requests to be nice, and let it run for a couple of days. Just use a browser's user-agent as well. This idea might conflict with others' scruples, but I feel my answer is the best balance of ethics and utility.
cellphone
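For the Tor suggestion above, one common approach (an assumption here, not something the answer spells out) is to point the scraper at a locally running Tor client's SOCKS proxy; this needs PySocks installed (pip install "requests[socks]"):

```python
import requests

# Assumes a Tor client is listening on the default SOCKS port 9050.
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",   # socks5h: resolve DNS through Tor too
    "https": "socks5h://127.0.0.1:9050",
}

session = requests.Session()
session.proxies.update(TOR_PROXIES)

# Requests now exit through the Tor network instead of your own IP.
resp = session.get("https://example.com/some-page", timeout=60)
print(resp.status_code)
```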
Don't email them, just go ahead and do it. You're being more than nice enough already by doing it during off-peak hours and spacing it out. But don't announce yourself with the user-agent, just in case they're insane.
equalpants
I'd say hold off on providing your info (use a default user-agent) until (if) they complain. You're already being nice about it, spreading the requests and all, so I think you're in "ask for forgiveness, not permission" territory.
inigo2
I'll echo the "just go ahead and run it off-hours" comments. Either they're going to limit you very, very quickly, or they won't at all. At least, that was my experience when extensively screen-scraping del.icio.us to gather a corpus for a class project; they didn't take kindly to our assault and locked us down pretty fast. So try it, see if it works - it probably will - and be 'nicer' only if you have to be.
Tomorrowful
Tomorrowful has it. Throttle your requests and do it during off-peak hours. If they are unhappy with your visits, there won't be any emails or conversation; they'll just block your IP. This is one of those instances where "don't ask, don't tell" is actually a good policy. Residents of my house have done this repeatedly with 20K+ record scrapes from a variety of sources, and this has consistently worked to achieve, err, the desired end.
DarlingBri
I have scraped some very large sites in the past -- and usually do something under a hit a minute. The only time I have been threatened was when I used a non-standard user agent (that had my email address in it). I would discourage doing that.
SirStan
From the moral PoV, it's pretty clear that they don't want you to grab their content by automated means. You can twist your interpretation of their ToS however you like, but don't kid yourself - you will be connecting using something other than a web browser, and you will be making use of the content in a way they will see as contrary to their copyright.

Having said that, your plan sounds OK, and more respectful of their server load and bandwidth than they would be if the situation were reversed. Just remember that their concern isn't the load on their servers; it's controlling their data. Because of this, I'd fake a real browser ID and not give them your name or contact details. If they catch on and want that info, they can hassle your ISP for it. Depending on your ISP and the relevant laws, that's a layer of protection between you and them.

I do a daily screen scrape of around 200-400 pages from a site which is actively and overtly hostile to such activity: javascript randomisation and encryption of all the data on the pages, requests limited to a handful per day from any one IP address, and both the site owner and data supply companies actively pursuing federal court action against people even suspected of running scrapers for personal use only. A bit of lateral thinking led to a way that gets around even this level of corporate paranoia and protectionism to grab ~300 pages/hr - but I'm afraid that if I explain it they'll notice, close that loophole, and set their QCs on me...
Pinback
As a guy who has run a bunch of websites, I suspect that a 30-second delay between requests will lead to absolutely nobody noticing - you'll be below the noise level. However, be aware that if you are only running the script for 12 hours of every day (to stay off-peak), and are letting 30 seconds elapse between each page request, you'll take the better part of two weeks to get everything. Also, will you be pulling down any referenced graphics as well? If so, you may have more than 20,000 files to pull down, which will make this take longer.
jenkinsEar
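A quick check of the "better part of two weeks" estimate above, assuming 20,000 pages, a 30-second gap, and a 12-hour daily window:

```python
pages = 20_000
delay_s = 30                  # seconds between requests
window_h = 12                 # off-peak hours available per day

total_hours = pages * delay_s / 3600   # ~166.7 hours of scraping
days = total_hours / window_h          # ~13.9 days
print(f"{total_hours:.1f} hours -> about {days:.0f} days")
```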
I'd use a non-generic user-agent (e.g. RoboBot 1.0). If they block that, then you know they have a problem with it, and you can decide what to do from there (either mask the UA or announce your intentions). Odds are they won't notice. (Incidentally, if you do decide to mask the UA, I suggest using GoogleBot.)
meta_eli