Computer Vision: What algorithms can be used to classify scanned documents based on content?
-
I need to classify large sets of scanned or photographed images of office documents: invoices, vendor supply lists, and so on. The documents are, of course, in different orientations and resolutions. The final goal is grouping by type (for example, matching together each month's supplies from a single vendor). Documents of the same type follow a similar format ("view"): the company's paper, logos, generic structure and the like. My idea was either to find software that does this (no luck) or to write it myself. For writing it, I did some preliminary research on computer vision, matching, etc. The field is a big one, however, and I'd like some pointers on how it should be done. I was considering CBIR (content-based image retrieval) in general or, on a more basic level, MSER algorithms (at least for matching documents regardless of scale and orientation). I have no idea, though, how to overcome slight (or not so slight) differences between variations: a product list might be a single item or many items long, for example, making "similar" documents quite different in terms of image data. Thanks in advance for any pointers on how to tackle this.
-
Answer:
What you need is Locality Sensitive Hashing (LSH) (http://en.wikipedia.org/wiki/Locality-sensitive_hashing). Since your documents are scanned, you can use OCR features, Shape Context features, or simply points on edges as features that can be learned. First categorize a set of documents yourself to create a training set, then index that training set with LSH, which hashes similar feature vectors into the same bin. Given a test document (a new document), the trained LSH assigns it to one of the trained categories. LSH is basically an approximate nearest-neighbor algorithm: based on a bunch of features, it answers "which bin am I closest to?" and picks that bin. LSH and Principal Component Analysis (PCA and kernel PCA) are dimensionality reduction algorithms that have a good shot at working for your application. With PCA, the projections of each document's features onto the leading eigenvectors (the directions of highest variance) can be used to train a Support Vector Machine (SVM) or Random Forest classifier.
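As a rough illustration of the "which bin am I closest to?" idea, here is a minimal sketch of random-hyperplane LSH in plain Python. It is only one of several LSH schemes and is not taken from the answer above. The class name HyperplaneLSH, the 4-dimensional feature vectors, and the labels "invoice" and "supply_list" are all assumptions made up for the example. Real feature vectors would come from whatever OCR, shape or edge features you extract.

```python
import math
import random
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class HyperplaneLSH:
    """Hypothetical helper: vectors at a small angle tend to share a bucket."""

    def __init__(self, dim, n_bits=8, seed=0):
        rng = random.Random(seed)
        # Each random hyperplane contributes one bit of the hash signature.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(n_bits)]
        self.buckets = defaultdict(list)  # signature -> [(label, vector), ...]

    def signature(self, vec):
        # Bit i records which side of hyperplane i the vector falls on.
        return tuple(
            int(sum(p * x for p, x in zip(plane, vec)) >= 0.0)
            for plane in self.planes
        )

    def add(self, label, vec):
        self.buckets[self.signature(vec)].append((label, vec))

    def classify(self, vec):
        """Label of the most similar stored vector in the closest bucket(s)."""
        if not self.buckets:
            raise ValueError("index is empty")
        sig = self.signature(vec)

        def hamming(other):
            return sum(a != b for a, b in zip(sig, other))

        # Fall back to the nearest occupied buckets if the exact one is empty.
        best = min(hamming(s) for s in self.buckets)
        candidates = [item for s, items in self.buckets.items()
                      if hamming(s) == best for item in items]
        label, _ = max(candidates, key=lambda item: cosine(vec, item[1]))
        return label


# Made-up 4-dimensional layout features for a hand-labelled training set.
training = [
    ("invoice", [0.9, 0.1, 0.8, 0.2]),
    ("invoice", [0.8, 0.2, 0.9, 0.1]),
    ("supply_list", [0.1, 0.9, 0.2, 0.7]),
    ("supply_list", [0.2, 0.8, 0.1, 0.9]),
]

index = HyperplaneLSH(dim=4, n_bits=8, seed=42)
for label, vec in training:
    index.add(label, vec)

# A uniformly scaled copy of the first invoice points in the same direction,
# so it falls on the same side of every hyperplane and lands in the same bucket.
query = [2 * x for x in training[0][1]]
print(index.classify(query))  # prints: invoice
```

The signature depends only on the direction of the feature vector, not its magnitude, which is why the scaled query still matches. Handling rotated scans or product lists of different lengths would have to happen earlier, in how the features themselves are computed.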
Ankur Kumar at Quora
Other answers
This feature is most commonly found in corporate document management systems (DMS), but it generally comes with a high price tag. We at DINO are working on exactly that and are looking for alpha testers for our document recognition technology! Don't hesitate to contact me; I'll hook you up with a free account and we'll see what we can do for you.
Damien Buty
I guess you are looking for an easy way to classify documents while digitizing them. A lot of companies have come up with software that works with your data repository and can classify and sort scanned documents, images, etc. Ephesoft is one such company. The attached image shows the complete workflow, so you can see where it could be implemented in your process.
Natasha Woreen