What are the best resources to learn about web crawling and scraping?

How do you go about automating/simulating (I think) an HTTP request?

  • I want to write a script to automate doing a search, then retrieving and parsing the search results from a website (a booking site similar to the search on www.hilton.com). My (extremely) rough understanding is that I should write a script to mimic the request the form is sending, and that I can use something like Firebug or Fiddler to capture what my browser is sending. I am way out of my depth here but am pretty committed to doing this as an educational project, so I'd appreciate any direction on getting started and figuring out how all of this is done. I'm familiar with basic Python scripting and have used urllib and BeautifulSoup to do basic web scraping, but I don't really understand how all these pieces fit together or how to get started. Pointers to good resources would help tremendously, as I've found information on StackOverflow but am having trouble deciphering it. I'm a little more comfortable with what I need to do to parse the HTML once I get it back. Also, am I even right in assuming that what I want to do is send an HTTP GET request? Or is this search done with JavaScript (it seems like if that's the case, this becomes more difficult)? How do I tell what's really going on? I've messed around with Fiddler but am having trouble. Please bear in mind that I barely understand the words that I'm using, but I'd really like to learn. Thanks!

  • Answer:

    Scrapy (http://scrapy.org/) should hide some of these issues from you and also handle the next steps really well. To answer some of your other questions, though, what I'd suggest is playing around with some web inspector tutorials such as http://dannguyen.github.com/NICAR/2012/02/25/nicar-2012-inspect-the-web-with-your-browsers-web-inspector/ (or Firebug tutorials or Tamper Data tutorials) until you understand the mechanics of the HTTP request.
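    As a rough illustration of what to look for in the inspector: when a search form is submitted with GET, the request shows up as a URL whose query string carries the form fields. Here is a minimal sketch using only Python's standard library. The captured URL and its field names (city, checkin, nights) are made up for the example and do not come from any real booking site.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL, as it might be copied from the browser's network panel
# after submitting a search form. The host and field names are invented.
captured = "http://www.example.com/search?city=Boston&checkin=2012-03-01&nights=2"

# Split the URL into its components (scheme, host, path, query string).
parts = urlsplit(captured)

# parse_qs returns a list of values per field; keep the first value of each.
fields = {name: values[0] for name, values in parse_qs(parts.query).items()}

print(parts.netloc)  # the host the form talks to
print(parts.path)    # the path that handles the search
print(fields)        # the form fields your script would need to supply
```

    Once you can read a captured request this way, mimicking it from a script mostly means sending the same fields to the same path.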

hot soup at Ask.Metafilter.Com


Other answers

Er, here's the tutorial (http://ruby.bastardsbook.com/chapters/web-inspecting-html/) I had in mind; you may find this easier to work with than Fiddler.

Monsieur Caution

HTTP requests are computers sending plain-text instructions back and forth. When you go to a website, your computer sends the other computer text like this:

    GET /index.html HTTP/1.1
    Host: www.example.com

The other computer then knows to reply with index.html.

GET requests use only the URL, and add parameters by putting stuff like ?user=me&color=blue at the end of the URL. POST requests can basically push across a multi-line document, so they can carry longer data. Additionally, by convention, POST requests can change things (by creating or deleting a blog post, for example) and GET requests can't.

The de facto standard HTTP library for Python (a third-party package, not part of the standard library) is called Requests (http://docs.python-requests.org/en/latest/). The section on passing parameters in URLs (http://docs.python-requests.org/en/latest/user/quickstart/#passing-parameters-in-urls) may help you. Basically, it lets you make requests like this:

    import requests
    r = requests.get("http://google.com")
    r.text  # the HTML content of the page

Searches can be done with JavaScript in a variety of ways, but if the script talks to the server, it's still using HTTP POST or GET. If it doesn't talk to the server, it's just showing and hiding data that's already on the page (with CSS, for example).

Scrapy looks nice, but I would personally recommend the ScraperWiki tutorial (https://scraperwiki.com/docs/python/python_intro_tutorial/). They will run your code for you and can store your results. It is more focused on the data extraction part, though.
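To make the GET versus POST difference described above concrete, here is a minimal sketch that builds both kinds of request with Python's standard urllib, without sending anything over the network. The endpoint http://www.example.com/search and the helper names build_get and build_post are hypothetical, chosen only for illustration; the parameters reuse the user=me&color=blue example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical search endpoint and form fields, for illustration only.
BASE_URL = "http://www.example.com/search"
params = {"user": "me", "color": "blue"}


def build_get(url, params):
    """GET: the parameters are appended to the URL as a query string."""
    return Request(url + "?" + urlencode(params), method="GET")


def build_post(url, params):
    """POST: the parameters travel in the request body, not in the URL."""
    body = urlencode(params).encode("ascii")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


get_req = build_get(BASE_URL, params)
post_req = build_post(BASE_URL, params)

# Nothing has been sent yet; these are just descriptions of the requests.
print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

Passing either object to urllib.request.urlopen would actually send it. The Requests library covers the same ground with the params= argument of requests.get and the data= argument of requests.post.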

Hotel booking sites are not going to take kindly to your data scraping. Of all the sites on the web, they are some of the most likely to use technical measures to make this quite difficult.

ryanrs

If you want to see exactly how your browser and the hotel site are talking to each other, there is no substitute for Wireshark (http://www.wireshark.org/). It has a Follow TCP Stream function (http://www.wireshark.org/docs/wsug_html_chunked/ChAdvFollowTCPSection.html) that's just perfect for looking at Web conversations, assuming those are not taking place over HTTPS, in which case reverse engineering gets rather harder.

flabdablet
