What is the significance of document vector weighting? Hello Dr. Ballings,
Can you explain why it is important to give rare terms
more weight in a document by term matrix?
What would happen if we chose not to do this? Also,
what are some examples of rare terms?
Answers and follow-up questions

Answer or follow-up question 1
"Can you explain why it is important to give rare terms
more weight in a document by term matrix?"
Whereas the term frequency (tf) allows us to discriminate between documents, the inverse
document frequency (idf) enables us to discriminate between terms. By multiplying tf by idf
we give more weight to rare terms (i.e., terms that appear in fewer documents) and less
weight to terms that appear in virtually all documents. For example, if we are analyzing
reviews about cars, then we might find that all reviews contain the word car. Therefore the
word car should be less important (because it will not allow us to discriminate between documents),
and multiplying by idf achieves this by scaling down the tf of the word car in every document by the same constant.
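To make this concrete, here is a minimal sketch of tf-idf weighting using a made-up three-document mini-corpus of car reviews (the documents and the plain log(N/df) formula are illustrative assumptions; real toolkits often use smoothed variants):

```python
import math

# Hypothetical mini-corpus of car reviews (illustrative only)
docs = [
    "car engine great car",
    "car seats comfortable",
    "car noisy engine",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tf(term, doc):
    # raw term frequency: how often the term appears in one document
    return doc.count(term)

def idf(term):
    # inverse document frequency: log(N / df), where df is the number
    # of documents containing the term
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(n_docs / df)

# "car" appears in every document, so idf("car") = log(3/3) = 0,
# and its tf-idf weight collapses to 0 no matter how often it occurs.
print(tf("car", tokenized[0]) * idf("car"))        # -> 0.0
# "engine" appears in only 2 of 3 documents, so it keeps a positive weight.
print(tf("engine", tokenized[0]) * idf("engine"))
```

Note how the ubiquitous word car gets weight zero in every document, while the rarer word engine retains discriminative weight.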
"What would happen if we chose not to do this? "
Some algorithms are sensitive to the size of the values in a variable (e.g., k-nearest neighbors and
neural networks) and can therefore exploit the additional information. Other algorithms are not
sensitive to the size (e.g., trees) and applying idf does not change the results.
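The sensitivity of distance-based methods to the size of values can be sketched with a toy 1-nearest-neighbor search (the data points here are made up purely to illustrate the effect):

```python
import math

# Hypothetical points with two features on very different scales
points = [(1.0, 300.0), (5.0, 100.0)]
query = (1.2, 100.0)

def nearest(pts, q):
    # index of the point with the smallest Euclidean distance to q
    return min(range(len(pts)), key=lambda i: math.dist(pts[i], q))

# With raw values, the large-scale second feature dominates the
# distance, so the query's nearest neighbor is point 1.
print(nearest(points, query))  # -> 1

# After rescaling the second feature (dividing by 100), the first
# feature matters again and the nearest neighbor flips to point 0.
scaled = [(x, y / 100.0) for x, y in points]
print(nearest(scaled, (query[0], query[1] / 100.0)))  # -> 0
```

A tree-based model, by contrast, only compares values within one variable at a time to choose split points, so multiplying a variable by a constant leaves its splits (and hence the predictions) unchanged.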
"Also, what are some examples of rare terms? "
This really depends on the corpus. In a corpus about plants, the word car is rare. In a corpus about
cars, the word car is very common.