Information Retrieval: What technique can be used to avoid the limitations of vector space models that perform poorly with longer documents?
-
The limitations are listed on the Wikipedia page: http://en.wikipedia.org/wiki/Vector_space_model#Limitations. I am interested in tackling the first limitation: long documents are poorly represented because they have poor similarity values (a small scalar product, http://en.wikipedia.org/wiki/Scalar_product, and a large dimensionality, http://en.wikipedia.org/wiki/Curse_of_dimensionality).

EDIT: I think there is some confusion about the question, so let me explain it with an example. Let's assume:

Query tf-idf vector: Q
Doc 1 tf-idf vector: D1
Doc 2 tf-idf vector: D2

Assume that Q ∩ D1 is the same as Q ∩ D2, but len(D2) >> len(D1), and also that D1 ⊂ D2. For simplicity, assume as well that the tf-idf scores of the terms in the intersection are the same for D1 and D2.

When we compute cosineSim(D1, Q) and cosineSim(D2, Q), the numerator (the dot product) is the same, because Q ∩ D1 = Q ∩ D2. But the denominator for D2 is much larger: D2 has many more non-zero terms, which gives a larger |D2| and therefore a lower cosine similarity.

So the problem is that, despite Q ∩ D1 = Q ∩ D2, cosineSim(Q, D1) > cosineSim(Q, D2). Ideally both documents should be equally relevant, since their overlap with the query terms is the same. This is a well-known problem according to the Wikipedia page, but I do not know how to solve it. I hope my question is clear now.
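To make the example concrete, here is a minimal Python sketch of the situation described in the question. The term names and weights are made up for illustration (they are not real tf-idf scores), and `cosine_sim` is a hypothetical helper written for this sketch, not a library function.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical weights, chosen only to illustrate the effect.
Q = {"vector": 1.0, "space": 1.0}
D1 = {"vector": 1.0, "space": 1.0, "model": 1.0}

# D2 contains every term of D1 with the same weights, plus 97 extra terms,
# so the overlap with Q is identical for D1 and D2.
D2 = dict(D1)
D2.update({f"extra{i}": 1.0 for i in range(97)})

sim_d1 = cosine_sim(Q, D1)  # dot product 2, |D1| = sqrt(3)
sim_d2 = cosine_sim(Q, D2)  # dot product 2, |D2| = sqrt(100) = 10

print(f"cosineSim(Q, D1) = {sim_d1:.4f}")
print(f"cosineSim(Q, D2) = {sim_d2:.4f}")
```

Both documents share the same dot product with Q, yet D2 scores far lower purely because its norm is larger.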
-
Answer:
You might want to normalize the vectors and use the tf-idf score of each word instead of its raw count. The tf-idf score accounts for both how frequent a word is in the document and how rare it is across the collection. You can then eliminate words with low scores, which contribute little to the document and are practically noise. I hope this improves your system.
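A rough Python sketch of the pruning idea, for illustration only. This is not the answer author's code: the weights, the `min_weight` threshold, and the helper names are all assumptions made up for the example. Note that cosine similarity already divides by the vector norms, so no separate normalization step is shown.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def prune_low_scores(vec, min_weight):
    """Keep only the terms whose weight is at least min_weight."""
    return {t: w for t, w in vec.items() if w >= min_weight}

# Hypothetical tf-idf-like weights: two informative terms plus 200
# low-scoring filler terms that inflate the norm of the long document.
query = {"vector": 2.0, "space": 2.0}
long_doc = {"vector": 2.0, "space": 2.0}
long_doc.update({f"filler{i}": 0.3 for i in range(200)})

before = cosine_sim(query, long_doc)

# The threshold 0.5 is arbitrary; in practice it would need tuning.
pruned = prune_low_scores(long_doc, min_weight=0.5)
after = cosine_sim(query, pruned)

print(f"before pruning: {before:.4f}")
print(f"after pruning:  {after:.4f}")
```

Pruning only helps when the extra terms in the long document really do have low scores. A long document with many high-scoring terms unrelated to the query would still be penalized.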
Ayushi Dalmia at Quora