Natural Language Processing: Which is the best text extraction library in Python?

  • I have come across http://redmine.djity.net/projects/pythontika/wiki, which is a Python interface to Apache Tika (http://tika.apache.org/). Are there any native Python libraries for extracting text from PDF, PS, Word documents, HTML, etc.?

  • Answer:

    If your objective is merely to extract text from PDF, PS, Word documents, and HTML, I would suggest giving AbiWord a try. In command-line mode, AbiWord can extract plain text by converting between file formats. Also note that if you wish to retain tables in text mode, you can use the w3c text browser to convert HTML to text. Having helped with a friend's data conversion project, I have come to the conclusion that the best results are obtained by scripting MS Office itself to do the conversion. You can try IronPython and OLE for that.
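The AbiWord suggestion above can be driven from Python with a small wrapper. This is only a minimal sketch: it assumes an `abiword` binary on the PATH whose version accepts the `--to` and `--to-name` command-line options, and the helper names `build_abiword_command` and `convert_to_text` are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_abiword_command(source, target_format="txt"):
    """Build the argument list for a headless AbiWord conversion.

    Assumes the installed AbiWord accepts --to and --to-name.
    """
    source = Path(source)
    output = source.with_suffix("." + target_format)
    return [
        "abiword",
        f"--to={target_format}",   # target format, e.g. plain text
        f"--to-name={output}",     # where the converted file is written
        str(source),
    ]


def convert_to_text(source):
    """Run AbiWord on `source` and return the converted text."""
    command = build_abiword_command(source, "txt")
    # check=True raises CalledProcessError if AbiWord exits with an error.
    subprocess.run(command, check=True)
    output = Path(source).with_suffix(".txt")
    return output.read_text(encoding="utf-8", errors="replace")


# Example use (needs AbiWord installed and a real input file):
# print(convert_to_text("report.doc"))
```

Keeping the command construction separate from the call makes it easy to log or inspect the exact command before running a batch conversion.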

Ankur Gupta at Quora

Other answers

What is meant by text extraction here: all text, or only the main text? If you want to extract all the text of a web page, go for lxml. I highly recommend this library because it is built on the C libraries libxml2 and libxslt, which gives it extremely good performance. If you need to extract only the main text of a web page, without the surrounding clutter, you can try http://ooyuz.com/devextrac.php (we have a Python port available). Extracting the main text from a web page is not as easy as one might think; it is not enough to find the largest div or something similar. It requires a set of algorithms based on observations across a few thousand websites.
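For the "all text" case, a minimal lxml sketch could look like the following. It assumes lxml is installed; the function name `extract_all_text` and the sample HTML are made up for illustration, and dropping `script` and `style` elements is a choice of this sketch rather than something the answer prescribes.

```python
import lxml.html


def extract_all_text(html):
    """Return every visible text fragment of an HTML document, space-joined."""
    doc = lxml.html.fromstring(html)
    # Scripts and stylesheets contain text nodes too, so remove them first.
    for element in doc.xpath("//script | //style"):
        element.drop_tree()
    # itertext() walks all remaining text nodes in document order.
    fragments = (fragment.strip() for fragment in doc.itertext())
    return " ".join(fragment for fragment in fragments if fragment)


sample = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><h1>Heading</h1><p>First paragraph</p>"
    "<script>var x = 1;</script></body></html>"
)
print(extract_all_text(sample))
```

This returns navigation, footers, and other clutter along with the article body, which is why main-text extraction needs the additional heuristics mentioned above.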

Akshay Bhatt