How to extract information from text file in Python?

What's the best way to extract phrases from a corpus of text using Python?

Vineet Yadav at Quora


Other answers

It depends on how you define a phrase. My answer concerns collocations, which are roughly token combinations that appear in a text more often than chance would predict. For example, in many texts the phrase "San Francisco" appears more often than the individual frequencies of the tokens "San" and "Francisco" would lead you to expect.

For some theory, see Chapter 5 of Foundations of Statistical Natural Language Processing (Manning & Schütze): http://nlp.stanford.edu/fsnlp/promo/colloc.pdf

https://radimrehurek.com/gensim/models/phrases.html and http://www.nltk.org/howto/collocations.html show how to find collocations using gensim and NLTK, respectively.

This blog post by Mark Needham gives a nice explanation of using gensim to find phrases: http://www.markhneedham.com/blog/2015/02/12/pythongensim-creating-bigrams-over-how-i-met-your-mother-transcripts/
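
To make the collocation idea concrete, here is a minimal sketch that uses only the standard library. It is not the gensim or NLTK API. It scores adjacent word pairs by pointwise mutual information (PMI), one common association measure. The function name find_collocations, the regex tokenizer, the min_count threshold and the sample text are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def find_collocations(text, min_count=2, top_n=5):
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    # Naive tokenizer (an assumption): lowercase runs of ASCII letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scored = []
    for (first, second), count in bigrams.items():
        # Skip rare pairs: PMI is unreliable for very low counts.
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_first = unigrams[first] / n_tokens
        p_second = unigrams[second] / n_tokens
        # PMI > 0 means the pair co-occurs more often than independence predicts.
        pmi = math.log2(p_pair / (p_first * p_second))
        scored.append(((first, second), pmi))

    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Hypothetical sample text, chosen so that one pair clearly repeats.
sample = (
    "I moved to San Francisco last year. San Francisco is foggy. "
    "The fog in San Francisco rolls in every year. I like it."
)
collocations = find_collocations(sample)
for pair, score in collocations:
    print(" ".join(pair), round(score, 2))
```

On a real corpus you would likely raise min_count, since PMI tends to overrate pairs of rare words. The libraries linked above also handle tokenization and scoring more carefully than this sketch does.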

Yuval Feinstein

How about the solution found here: https://github.com/cirlabs/citizen-quotes/ It's a wonderful solution by https://twitter.com/chasedavis

Shola Smith

There are quite a lot of text processing examples at http://streamhacker.com, and they use NLTK.

Dima Kuchin

After many hours of checking various APIs, we decided to go with TextRazor.

Quality: the NLP phrase extraction and classification results are superb. TextRazor uses Freebase and DBpedia (among other repositories), which allows it to extract and categorize phrases such as "computer security" correctly as one entity. Many other APIs classify this example incorrectly, as one class for "computer" and another for "security". Programmatic control over which terms TextRazor will and will not use is, again, very simple.

Speed: TextRazor is amazingly fast. If I understand correctly, it uses parallel computing on many (hundreds? thousands?) of Amazon on-demand machines.

Cost: we compared it to others and did an in-depth analysis against one of their competitors (a very large three-letter company), and TextRazor's pricing is definitely competitive and reasonable.

Integration: using their API from Python was (relatively) straightforward, except for a minor issue with HTTPS when working locally on the Web2Py framework. If you hit an obstacle while using TextRazor on Web2Py locally, feel free to ping me and I'll gladly share our solution.

Service and support: almost instantaneous. They usually reply to all inquiries within 12 hours.

Disclosure: I have no interests, shares or other financial benefits related to TextRazor. We are actually still on their free plan, so we have not yet paid them for their API services.

Dan Toren
