Natural Language Processing: Which is the best text extraction library in Python?
I have come across http://redmine.djity.net/projects/pythontika/wiki, which is a Python interface to http://tika.apache.org/. Are there any native Python libraries for extracting text from PDF, PS, Word documents, HTML, etc.?
Answer:
If your objective is merely to extract text from PDF, PS, Word, or HTML files, I would suggest you give AbiWord a try. In command-line mode, AbiWord can extract text by converting between file formats. Also note that if you wish to retain tables in the text output, use the w3c text browser to convert HTML to text. Having helped with a friend's data conversion project, I have come to the conclusion that the best results are obtained by scripting MS Office itself to do the conversion. You can try IronPython and OLE.
Ankur Gupta at Quora
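For anyone who wants to drive the AbiWord conversion mentioned above from Python, here is a minimal sketch. It assumes the `abiword` executable is on your PATH, that it accepts a `--to=txt` option, and that it writes the result next to the input file with a `.txt` extension. Check all three against your installed version. The function names are made up for this example.

```python
import shutil
import subprocess
from pathlib import Path


def build_abiword_command(source, target_format="txt"):
    """Build the argv list for an AbiWord conversion and guess the output path.

    Assumption: AbiWord writes the converted file next to the input,
    using the same base name and the target format as the extension.
    """
    source = Path(source)
    expected_output = source.with_suffix("." + target_format)
    command = ["abiword", "--to=" + target_format, str(source)]
    return command, expected_output


def extract_text(source):
    """Convert a document to plain text with AbiWord and return the text."""
    if shutil.which("abiword") is None:
        raise RuntimeError("abiword executable not found on PATH")
    command, expected_output = build_abiword_command(source)
    # check=True raises CalledProcessError if AbiWord exits with an error.
    subprocess.run(command, check=True)
    return expected_output.read_text(encoding="utf-8", errors="replace")
```

Calling `extract_text("report.doc")` would then return the contents of `report.txt`, provided the assumptions above hold on your system.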
Other answers
What is meant by text extraction here: all text or only the main text? If you want to extract all the text of a web page, go for lxml. I highly recommend this library because it is built on the C libraries libxml2 and libxslt, which gives it extremely good performance. If you need to extract the main text of a web page without the clutter, you can try http://ooyuz.com/devextrac.php (we have a Python port available). Extracting the main text from a web page is not as easy as one might think; it is not just a matter of finding the largest div. It takes a set of algorithms based on observations across a few thousand websites.
Akshay Bhatt
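To illustrate the "all text" case from the answer above, here is a rough sketch that pulls every piece of visible text out of an HTML string with lxml, dropping `script` and `style` elements first. If lxml is not installed, it falls back to the standard library's `html.parser`. The function and class names are invented for this example, and it makes no attempt at main-text extraction, which needs the heavier heuristics described above.

```python
from html.parser import HTMLParser

try:
    import lxml.html  # third-party; used when available
except ImportError:
    lxml = None

SKIPPED_TAGS = ("script", "style")


class _TextCollector(HTMLParser):
    """Standard-library fallback: collect text outside script/style."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def extract_all_text(html):
    """Return all visible text of an HTML string, whitespace-normalized."""
    if lxml is not None:
        doc = lxml.html.fromstring(html)
        # Remove script and style elements so their contents are not returned.
        for element in doc.xpath("//script | //style"):
            element.drop_tree()
        chunks = list(doc.itertext())
    else:
        parser = _TextCollector()
        parser.feed(html)
        parser.close()
        chunks = parser.chunks
    # Join the pieces and collapse runs of whitespace into single spaces.
    return " ".join(" ".join(chunks).split())
```

Note that this returns everything on the page, including navigation and footer text, so it only covers the first of the two cases.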