Information Retrieval: What technique can be used to avoid the limitations of vector space models that perform poorly with longer documents?

  • The limitations are listed on the Wikipedia page: http://en.wikipedia.org/wiki/Vector_space_model#Limitations. I am interested in tackling the first limitation: long documents are poorly represented because they have poor similarity values (a small scalar product, http://en.wikipedia.org/wiki/Scalar_product, and a large dimensionality, http://en.wikipedia.org/wiki/Curse_of_dimensionality).

    EDIT: I think there is some confusion about the question, so let me explain it with an example. Let:

    Query tf-idf vector: Q
    Doc 1 tf-idf vector: D1
    Doc 2 tf-idf vector: D2

    Assume that Q ∩ D1 is the same as Q ∩ D2, but len(D2) >> len(D1), and that D1 ⊂ D2. For simplicity, also assume that the tf-idf scores of the terms in the intersection are the same for D1 and D2.

    When we compute cosineSim(D2, Q) and cosineSim(D1, Q), the numerator is the same because Q ∩ D1 = Q ∩ D2. But the denominator for D2 is much larger: D2 has many more nonzero terms, so |D2| is larger and the cosine similarity is lower.

    So the problem is that, despite Q ∩ D1 = Q ∩ D2, cosineSim(Q, D1) > cosineSim(Q, D2). Ideally, both documents should be equally relevant, since their intersection with the query terms is the same. This is a well-known problem according to the Wikipedia page, but I do not know how to solve it. I hope my question is clear now.
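    The effect described in the question can be reproduced with a small sketch. The term names and weights below are hypothetical values chosen only for illustration, and `cosine_sim` is a helper written for this example, not part of any library.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical tf-idf weights (assumed values, not computed from real text).
q = {"vector": 1.0, "space": 1.0}
d1 = {"vector": 1.0, "space": 1.0, "model": 1.0}

# D2 contains every term of D1 with the same weights, plus many extra terms.
d2 = dict(d1)
d2.update({f"extra{i}": 1.0 for i in range(97)})

sim_d1 = cosine_sim(q, d1)  # dot product 2, |D1| = sqrt(3)
sim_d2 = cosine_sim(q, d2)  # same dot product 2, but |D2| = sqrt(100) = 10

print(sim_d1 > sim_d2)
```

    Both documents share exactly the same terms and weights with the query, so the dot product is identical; only the larger norm of D2 lowers its score.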

  • Answer:

    You might want to normalize the vectors and use the tf-idf score for each word instead of the raw count. The tf-idf score accounts for both how frequent a word is in the document and how rare it is across the collection. You can then eliminate words with low scores, which contribute little to the document and are practically noise. I hope this improves your system.
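    A minimal sketch of this suggestion, assuming documents are already tokenized into lists of words. The function names, the toy documents, and the `min_weight` threshold are all hypothetical choices for illustration; the tf-idf variant used here (relative term frequency times log of inverse document frequency) is one common formulation among several.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        if not doc:
            vectors.append({})
            continue
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors

def prune_and_normalize(vec, min_weight):
    """Drop terms whose weight is below min_weight, then scale to unit length."""
    kept = {t: w for t, w in vec.items() if w >= min_weight}
    norm = math.sqrt(sum(w * w for w in kept.values()))
    if norm == 0:
        return {}
    return {t: w / norm for t, w in kept.items()}

# Toy corpus (assumed data). "the" occurs in every document, so its idf is 0.
docs = [
    ["vector", "space", "model", "the"],
    ["vector", "space", "the", "the", "ranking", "the"],
    ["cooking", "recipe", "the"],
]

vectors = tfidf_vectors(docs)
pruned = prune_and_normalize(vectors[1], min_weight=0.05)  # threshold is an assumption
print(sorted(pruned))
```

    Pruning removes low-weight terms before the norm is computed, so a long document padded with uninformative words ends up with a smaller norm than it would otherwise have. How much this helps depends on the threshold and on how many of the extra terms are actually low-scoring.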

Ayushi Dalmia at Quora
