How to extract text from a web page?

Has anyone used NLTK to extract noun phrases embedded in web page content? How much work does it take to write NLTK code for extracting noun phrases from a web page? Also, do I need to train it in order to extract noun phrases from English text, or can I use the default module without any training? Tha

  • How would one compare NLTK to OpenNLP? I am a newbie to text analysis. I need to write code for extracting and identifying noun phrases from web pages. Any suggestions or input are appreciated. Thanks.

  • Answer:

    For noun phrase extraction, I think OpenNLP is much easier to use than NLTK. Since you are starting with web pages, finding noun phrases will involve the following steps: crawling, HTML parsing, sentence tokenization, word tokenization, POS tagging, phrase chunking, and finally filtering out the NP chunks. OpenNLP comes with trained models to do this, so it's just a matter of writing some boilerplate Java code. With NLTK you will need to build your own POS tagger and phrase chunker out of smaller components (regex, backoff/classifier hybrids, etc.). NLTK does come with its own HTML-to-text converter.
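The early stages of that pipeline (HTML parsing, sentence tokenization, word tokenization) can be sketched with the Python standard library alone. This is only an illustration, not code from the answer: the names `TextExtractor`, `html_to_text`, `split_sentences` and `tokenize` are made up for the example, and the regex-based splitters are crude stand-ins for the trained tokenizers that OpenNLP or NLTK would provide.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse the runs of whitespace left behind by the markup.
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def split_sentences(text):
    # Naive rule (an assumption): a sentence ends at ., ! or ? plus whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenize(sentence):
    # Words (allowing internal apostrophes or hyphens) or single punctuation marks.
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", sentence)


page = """<html><head><style>p {color: red}</style></head>
<body><p>The little yellow dog barked at the cat.</p>
<script>var x = 1;</script><p>It ran away!</p></body></html>"""

text = html_to_text(page)
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
```

The naive sentence rule will break on abbreviations such as "Dr. Smith", which is one reason a trained sentence detector is usually preferred for real pages.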

Sujit Pal at Quora

Other answers

NLTK has a very good tutorial on the noun phrase extraction process, which can be found here: http://www.nltk.org/book/ch07.html . You can use Beautiful Soup to extract text from HTML pages. OpenNLP and Stanford NLP both have noun phrase extraction methods, but I preferred the OpenNLP one when I did this.
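The chunking idea in that tutorial is to match a pattern over POS tags rather than over words. Below is a minimal pure-Python sketch of it, assuming Penn Treebank tags and a simple pattern (optional determiner, any number of adjectives, one singular noun). The helper `np_chunks` is a name invented for this example, and the tags are assigned by hand here, whereas in practice they would come from a POS tagger. In NLTK itself, the comparable tool is `nltk.RegexpParser` with a grammar such as `NP: {<DT>?<JJ>*<NN>}`.

```python
import re

# Tag pattern: optional determiner, any adjectives, then one singular noun.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")


def np_chunks(tagged):
    """Return the noun phrases found in a list of (word, tag) pairs."""
    # Map each character offset in the tag string back to a token index.
    starts, pos = {}, 0
    for i, (_, tag) in enumerate(tagged):
        starts[pos] = i
        pos += len(tag) + 2  # the tag plus its "<" and ">"
    starts[pos] = len(tagged)
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)

    phrases = []
    for m in NP_PATTERN.finditer(tag_string):
        first, last = starts[m.start()], starts[m.end()]
        phrases.append(" ".join(word for word, _ in tagged[first:last]))
    return phrases


# Hand-tagged example sentence (the tags are an assumption, not tagger output).
sentence = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]

for phrase in np_chunks(sentence):
    print(phrase)
```

A pattern this small misses plural nouns (`NNS`), proper nouns (`NNP`) and nested phrases, so it would need extending before being used on real web text.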

Rivu Roy
