What's the best way to harvest information from a website?
-
What is the best way to automatically harvest information from a website? I'm an attorney. As part of my law practice, I spend many hours a week looking up court cases on a court website so that I can contact by mail people who may need my services. The process is time-consuming and tedious as hell. I know there must be a way to write a program that would automate it.

My current process works like this: I go to the court website, select (using a pull-down menu) the type of search I want to do ("docket search"), enter a date in a text box in a different frame of the page, and get a results page listing the names of all individuals who have court cases in a certain court on the date I entered, along with the docket number assigned to each. That list is presented in an HTML table. I copy it and drop it into Excel (so that each person, with his or her corresponding docket number, occupies a row in the spreadsheet). Then I have to look up each case, one by one, using a different kind of search ("case search by docket number") to get the full information listing for that person, which is also presented in an HTML table. I copy and paste the information I need (the person's home address) onto that person's line in my Excel spreadsheet. The end result is that I print out a batch of mailing labels in MS Word, using the mail merge function.

It would have been so easy for the programmer of this court system to let you just click a docket number on the original results list to pull up the full data for each person, cutting out the need for a separate search per docket number, but it does not work that way.

The ideal application would have a simple interface that let me enter the docket date and the court division, press a button, and automatically harvest the information for each person who has business on that date's docket, outputting it (docket number, name, home address, etc.) in a plain-text format that could be dropped into Excel. Even better would be a program that accepted a date range and harvested all the information for all dates, and all court divisions, in that range. Is it possible to do this? What programming language or technology would be most appropriate for this purpose?
-
Answer:
Exactly the kind of task you describe is often done using a scripting language with a good website-automation library. It would be easy to do in Perl or Python.
anonymous at Ask.Metafilter.Com
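A minimal sketch of that approach in Python, using the mechanize library mentioned later in this thread. The URL, form name, and field names here are invented; you would need to inspect the court site's HTML to find the real ones:

import mechanize

# Drive the site the way a browser would: open the search page,
# fill in the "docket search" form, and submit it.
br = mechanize.Browser()
br.open("http://courts.example.gov/search")   # hypothetical URL

br.select_form(name="docketsearch")           # hypothetical form name
br["searchtype"] = ["docket"]                 # select controls take a list of values
br["date"] = "1/1/05"
response = br.submit()

results_html = response.read()  # the results page, ready to be parsed for docket numbers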
Other answers
curl/wget + perl would be great for this. You could output the info in a .csv to be used with mail merge. Depending on how the labels are done, you might even be able to do it with LaTeX. Contact info in the profile.
devilsbrigade
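If you go the Python route instead, the standard csv module produces a mail-merge-ready file in a few lines. The records here are placeholders for whatever the scrape returns:

import csv

records = [
    ("05-12345", "Jane Doe", "123 Main St, Springfield"),
    ("05-12346", "John Roe", "456 Oak Ave, Springfield"),
]

with open("docket.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["docket", "name", "address"])  # header row becomes the merge fields
    writer.writerows(records)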
Oops -- I just noticed that the end of my comment was lost... When using Perl for this kind of task I often write my own parser for added flexibility and robustness, but it is very easy to whip up something quick-and-dirty using the WWW::Mechanize module from CPAN (http://search.cpan.org/~petdance/WWW-Mechanize/).
RichardP
Excel has some built-in ability to automatically download and parse web pages. I'm not much of an Excel jockey myself, but I know this feature is used in some spreadsheets at my office. I think the feature is called "Get External Data."
mbrubeck
What you are looking to do is often called screen scraping or spidering. I'd recommend the book Spidering Hacks (http://www.oreilly.com/catalog/spiderhks/), which provides lots of code and tips -- including the importance of spidering etiquette (e.g., don't overburden the server with tons of requests without a delay between them). I read the book a couple of years ago and applied what I learned to some code I was writing in Visual Basic -- which you really shouldn't use for this if you can use Perl, Python, PHP, or Ruby, all of which are much better suited to the task.
i love cheese
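That etiquette point is worth a concrete illustration: the per-docket search is the part that hits the server once per person, so put the pause there. A rough Python sketch, where fetch_case() is a stand-in for whatever performs the "case search by docket number":

import time

def fetch_case(docket_number):
    ...  # hypothetical: submit the docket-number search and return the page HTML

for docket_number in ["05-12345", "05-12346"]:
    html = fetch_case(docket_number)  # parse the address out of html here
    time.sleep(2)  # wait a couple of seconds so the court server isn't hammered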
As you don't have access to their database directly, you are limited to reading the HTML (called "screen scraping" or "web scraping") and converting it into a useful format. Typically that's done by downloading the page (wget), cleaning it up so that other software can more easily use it (HTML Tidy), and then extracting and converting the parts you want (XSLT). You can automate this conversion process in almost any language. That's what you can do right now with just the web pages. Alternatively, you could ask them to open up their database, e.g. by providing daily dumps of their data to a server. Getting the data at the source this way is preferable and much easier to work with.
holloway
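The same download/clean/extract pipeline can live in one Python script: lxml's HTML parser tolerates messy real-world markup (the HTML Tidy step) and XPath pulls out the parts you want (the XSLT step). The URL and table layout below are assumptions:

import urllib.request
from lxml import html

raw = urllib.request.urlopen("http://courts.example.gov/results").read()  # hypothetical URL
doc = html.fromstring(raw)

# Grab every row of the first table on the page and print its cell text.
for row in doc.xpath("//table[1]//tr"):
    cells = [td.text_content().strip() for td in row.xpath("td")]
    print(cells)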
For what it's worth, you're right -- it would be nice if the courts and all the other government agencies started publishing their information in easily consumable formats. It's entirely possible that they already have this capability, so you might want to drop a quick email to the administrator of the site you're going to spider to find out what they've got under the hood. You never know unless you ask...
ph00dz
Definitely use Perl and WWW::Mechanize. It rocks for this kind of thing. Once you've got to the requisite page, there are also Perl modules which specialise in the analysis of HTML tables, like HTML::TableContentParser and HTML::TableExtract. By the way, this part -- "I go to the court website, select (using a pull-down menu) the type of search I want to do ('docket search'), enter a date in a text box in a different frame of the page, and get a results page" -- can possibly be automated easily. If your results page has a URL like domain.com/script.cgi?search=docket&date=1/1/05, you could set up bookmarks, or a JavaScript bookmarklet which automatically goes to today's date, for instance.
AmbroseChapel
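Building that URL by hand is a one-liner in most languages. A Python sketch using AmbroseChapel's hypothetical script.cgi pattern -- if the site really does accept plain GET requests like this, no form automation is needed at all:

import urllib.parse
from datetime import date

today = date.today()
date_str = "%d/%d/%02d" % (today.month, today.day, today.year % 100)  # e.g. 1/1/05

params = {"search": "docket", "date": date_str}
url = "http://domain.com/script.cgi?" + urllib.parse.urlencode(params)
print(url)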
I've done this in Perl (scraping eMusic.com) and in Firefox using JavaScript (scraping a "best of the web" community web site); each program had several hundred satisfied users. In Perl, as RichardP notes, WWW::Mechanize provided an easy start. Doing it in Firefox is easier (but requires, obviously, Firefox). Now, I don't usually do my own legal work: I hire a lawyer because he can do it better and faster than I can. (This is just a trivial application of Ricardo's law of comparative advantage.) I'd suggest that it's ultimately cheaper for you not to distract yourself from making money as a lawyer: hire a coder to do this for you.
orthogonality
If you want to go with Python (probably easier to learn than Perl), go with BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) or mechanize (http://wwwsearch.sourceforge.net/mechanize/ -- yes, named after the Perl module). But only do this if you think programming is a fun hobby anyway; if not, I agree with orthogonality: just hire somebody to code this for you.
davar
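A small sketch of the BeautifulSoup approach (installed nowadays as the beautifulsoup4 package), pulling name/docket pairs out of a results table. The table structure here is an assumption -- adjust it to the real markup:

from bs4 import BeautifulSoup

# Stand-in for the HTML fetched from the court's results page.
results_html = """
<table>
  <tr><td>Jane Doe</td><td>05-12345</td></tr>
  <tr><td>John Roe</td><td>05-12346</td></tr>
</table>
"""

soup = BeautifulSoup(results_html, "html.parser")
for tr in soup.find("table").find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) >= 2:
        name, docket = cells[0], cells[1]
        print(docket, name)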