What are some good tools for interactive web scraping and automation?

Freely Accessible Etymology Database? Or tools to help create one?

  • I have an idea for a project that would require the ability to search a dictionary of words and find the year of its known introduction (as close as possible). I am aware of etymology-online (love that site), but as far as I'm aware it's just a site, and the compilers don't have a publicly accessible database. Does anybody know of a site that actually WOULD have a freely available database (either queryable through a web API, or downloadable to self-host)? If there isn't one, does anybody have an idea who I might be able to contact? Would it be prudent to approach the etymology online folks? I feel like scraping their pages for the data would be reckless and kind of jerky in terms of bandwidth, but then again, it's mostly text and these days bandwidth is fairly cheap... I suppose it'd be possible to do that as a last resort and script something to pull the data into a database. Any ideas on what would be good tools to do such a thing?

  • Answer:

    Ask first. And I recommend Python for scraping. Maybe you can ask on opendata on Reddit and Stack Exchange too :) Regards and good luck.

symbioid at Ask.Metafilter.Com
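
If the site owner does give permission, the "Python for scraping" suggestion could look roughly like the sketch below. It uses only the standard library: it checks robots.txt, spaces requests out, and keeps the raw pages in SQLite so nothing has to be fetched twice. The user agent string, the five second delay, and the `pages` table are all assumptions made for illustration, not anything the site requires or documents.

```python
import sqlite3
import time
import urllib.request
import urllib.robotparser
from urllib.parse import urlsplit

# Hypothetical identifier; use real contact details so the site owner can reach you.
USER_AGENT = "etymology-hobby-project (contact: you@example.com)"


def allowed_by_robots(url):
    """Return True if the site's robots.txt permits fetching this URL."""
    parts = urlsplit(url)
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()  # performs a network request
    return parser.can_fetch(USER_AGENT, url)


def polite_fetch(url, delay_seconds=5.0):
    """Fetch one page, then sleep so consecutive requests stay far apart."""
    if not allowed_by_robots(url):
        raise PermissionError(f"robots.txt disallows {url}")
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        html = response.read().decode("utf-8", errors="replace")
    time.sleep(delay_seconds)  # assumed delay; agree on a rate with the owner
    return html


def open_store(path=":memory:"):
    """Open (or create) the SQLite file that caches fetched pages."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT NOT NULL)"
    )
    return conn


def save_page(conn, url, html):
    """Insert a page, replacing any earlier copy stored under the same URL."""
    conn.execute("INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", (url, html))
    conn.commit()
```

Storing the untouched HTML first and parsing it later means a mistake in the parsing step never costs the site another round of requests.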


Other answers

Oh, actually that doesn't have years, so it might NOT be what you want. But it's still awesome.

aubilenon

The Wordnik API (http://developer.wordnik.com/docs.html) has etymologies, though they don't (as far as I can tell) generally have an easily extractable year of introduction. For relatively recent words, you could use Google Ngram data (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html) to find the first year a given word appeared in published books.

aparrish
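
A minimal sketch of the Ngram idea, assuming the version 2 1-gram files, where each line is tab-separated as ngram, year, match_count, volume_count. The `min_matches` threshold is an assumption added here to skip years with only a handful of hits, and the sample rows in the usage note are invented for illustration, not real Ngram counts.

```python
def first_years(lines, min_matches=5):
    """Map each 1-gram to the earliest year with at least min_matches occurrences."""
    earliest = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip anything that is not ngram, year, match_count, volume_count
        ngram, year, match_count, _volume_count = fields
        if int(match_count) < min_matches:
            continue  # too few hits to trust this year
        year = int(year)
        if ngram not in earliest or year < earliest[ngram]:
            earliest[ngram] = year
    return earliest


# Made-up sample rows in the assumed format.
sample = [
    "blog\t1995\t2\t2\n",
    "blog\t1999\t40\t12\n",
    "blog\t2003\t900\t300\n",
    "selfie\t2005\t7\t3\n",
]
print(first_years(sample))
```

The downloadable files are gzip-compressed, so in practice the lines would come from something like `gzip.open(path, "rt", encoding="utf-8")` rather than a list.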

"For relatively recent words, you could use Google Ngram data to find the first year a given word appeared in published books." (http://ask.metafilter.com/270788/Freely-Accessible-Etymology-Database-Or-tools-to-help-create-one#3931441) Bad idea; Google's metadata is notoriously unreliable.

languagehat

I think my best bet is to ask the etymonline guy if he has data he's willing to share or that I could pay for, or, if not, whether he minds if I scrape his pages. Looks like most etymology stuff doesn't really have dates. And it doesn't have to be super accurate, just close enough. I wonder how he ended up getting dates; perhaps he used Google's ngram stuff and it's just as unreliable?

symbioid
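
For "close enough" dates, one option is to pull the first date-like token out of an entry's text. The sketch below assumes dates are written in forms such as "late 14c.", "mid-15c.", "1590s" or "c. 1300"; that is a guess about how entries are phrased, and the offsets for "early", "mid" and "late" are arbitrary choices, so treat the result as a rough estimate only.

```python
import re

# Either a century such as "late 14c." or a plain year / decade such as "1590s".
DATE = re.compile(
    r"\b(?:(early|mid|late)[- ])?(\d{1,2})c\.|\b(\d{3,4})s?\b",
    re.IGNORECASE,
)

# Assumed offsets into the century; adjust to taste.
OFFSET = {None: 0, "early": 0, "mid": 50, "late": 75}


def approximate_year(text):
    """Return a rough year for the first date-like token in text, or None."""
    match = DATE.search(text)
    if match is None:
        return None
    modifier, century, year = match.groups()
    if century is not None:
        key = modifier.lower() if modifier else None
        # The 14th century starts in 1300, hence the minus one.
        return (int(century) - 1) * 100 + OFFSET[key]
    return int(year)
```

Any three or four digit number will be read as a year, so text that mentions other numbers before the date would need a stricter pattern.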
