TF-IDF

Term Frequency - Inverse Document Frequency

Consider a bag-of-words vector representation of a document. Common words such as "the" or "a" will have large values, but that does not make them important to the document. Rather, the most characteristic words of a document are those that appear often in it and in few other documents. TF-IDF weights each word by how specific it is to that document, relative to the other documents in the collection.
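As a small illustration of the problem with raw counts, here is a minimal sketch in Python. The sentence and the whitespace tokenization are made up for this example; real text would need proper tokenization.

```python
from collections import Counter

# Hypothetical toy document, tokenized naively by lowercasing and splitting on spaces.
doc = "the cat sat on the mat and the dog sat by the door"

# A bag of words is just a count per distinct word.
bag = Counter(doc.lower().split())

# The most frequent word is a function word, not a content word.
print(bag.most_common(1))
```

The largest entry in the vector belongs to "the", which says nothing about what the document is about.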

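So, given a term \(t\), a document \(d\) (treated as a sequence of words), and a set of documents \(D\): \[ \text{tfidf}(t,d,D) = \frac{|\{ x \in d \mid x=t \}|}{|d|} \cdot \log \frac{|D|}{ |\{y \in D \mid t \in y\}| } \] The left factor is the term frequency: the fraction of the words in \(d\) that are equal to \(t\). The right factor is the log of the inverse document frequency: \(|D|\) divided by the number of documents that contain \(t\). A word that occurs in every document gets \(\log 1 = 0\), so its weight vanishes no matter how often it appears in \(d\).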
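The formula above can be sketched directly in Python. This is a minimal illustration, not a reference implementation: the function name `tfidf`, the toy corpus, the use of the natural logarithm, and the choice to return 0 when no document contains the term are all assumptions made for this example.

```python
import math
from collections import Counter

def tfidf(term, doc, corpus):
    """Sketch of tfidf(t, d, D); doc is a non-empty list of tokens,
    corpus is a list of such token lists."""
    # Term frequency: occurrences of the term divided by the document length.
    tf = Counter(doc)[term] / len(doc)
    # Document frequency: number of documents that contain the term.
    df = sum(1 for other in corpus if term in other)
    if df == 0:
        # Assumption: the formula is undefined here, so return 0.
        return 0.0
    # Assumption: natural log; the base only rescales all weights.
    return tf * math.log(len(corpus) / df)

# Hypothetical three-document corpus, tokenized by whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
d = corpus[0]

for word in ("the", "cat", "mat"):
    print(word, tfidf(word, d, corpus))
```

In this corpus "the" occurs in all three documents and scores 0, "cat" occurs in two and gets a small weight, and "mat" occurs only in the first document and gets the largest weight of the three.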
