How do you rate the IDF?

How do I calculate tf-idf at the corpus level?

  • I can calculate tf-idf for a term within a single document, where tf is (frequency of the term in the document / total terms in the document). Can I do something similar to get tf-idf at the corpus level? I am thinking of a formula like: tf(term), at the corpus level = (frequency of the term in the corpus / total terms in the corpus), and idf(term) = log_e(total number of documents / number of documents containing the term), with tf-idf(term) being their product. Would this even be correct?

  • Answer:

    IDF is already defined at corpus scope, so TF-IDF needs a corpus; you cannot compute it from a single document in isolation. TF (term frequency) is the number of times a word appears in a document, often normalized by the document's length, and IDF (inverse document frequency) reduces the weight of terms that occur in many documents. The point is to give more weight to words that distinguish one document from the rest. If you compute TF over all the documents in the corpus, you lose exactly that: the weight of a word relative to a specific document. A TF value calculated over the whole corpus only tells you how prominent the term is in the corpus overall, and multiplying it by IDF then discounts that same corpus-wide prevalence. The result is not a useful metric.

Vignesh Natarajan at Quora
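
As a rough illustration of the answer above, here is a minimal Python sketch that computes both the usual per-document tf-idf and the corpus-level variant proposed in the question. The toy corpus and the function names (`idf`, `tf_idf`, `corpus_tf_idf`) are made up for this example, and the formulas are the plain unsmoothed ones from the question; real libraries often use smoothed variants.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration; each document is a list of tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell".split(),
]

def idf(term, docs):
    """ln(total documents / documents containing the term)."""
    n_containing = sum(1 for doc in docs if term in doc)
    # Assumes the term occurs in at least one document (no smoothing).
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    """Per-document weight: relative frequency in doc times corpus IDF."""
    return Counter(doc)[term] / len(doc) * idf(term, docs)

def corpus_tf_idf(term, docs):
    """Corpus-level variant from the question: pooled TF times IDF."""
    all_tokens = [tok for doc in docs for tok in doc]
    return Counter(all_tokens)[term] / len(all_tokens) * idf(term, docs)

for term in ("the", "cat", "market"):
    per_doc = [round(tf_idf(term, doc, docs), 3) for doc in docs]
    print(term, per_doc, round(corpus_tf_idf(term, docs), 3))
```

In this toy corpus, "market" gets a non-zero per-document weight only in the third document, which is what makes it useful for identifying that document. The corpus-level value collapses to one number per term and no longer says anything about which document the term characterizes.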
