Mahout: Clustering - Representing data

Transforming data into vectors

In Mahout, vectors are implemented as three different classes, each optimized for a different usage pattern:

DenseVector can be thought of as an array of doubles, whose size is the number of features in the data. Because all the entries in the array are preallocated regardless of whether the value is 0 or not, we call it dense.
RandomAccessSparseVector is implemented as a HashMap between an integer and a double, where only nonzero-valued features are allocated. Hence, it's called a sparse vector.
SequentialAccessSparseVector is implemented as two parallel arrays, one of integers and the other of doubles. Only nonzero-valued entries are kept in it. Unlike the RandomAccessSparseVector, which is optimized for random access, this one is optimized for linear reading.
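
To make the three storage layouts concrete, here is a minimal sketch in plain Java. It does not use the Mahout classes: the class and method names (VectorLayouts, denseGet, and so on) are made up for illustration, and the binary-search lookup for the parallel-array layout is an assumption about how such a structure is typically read, not a statement about Mahout's internals.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class VectorLayouts {

    // Dense layout: one preallocated slot per feature, zeros included.
    static double denseGet(double[] values, int index) {
        return values[index];
    }

    // Random-access sparse layout: hash map from feature index to nonzero value.
    static double randomAccessGet(Map<Integer, Double> entries, int index) {
        return entries.getOrDefault(index, 0.0);
    }

    // Sequential-access sparse layout: two parallel arrays, sorted by index.
    // A single lookup needs a search (binary search here, as an assumption).
    static double sequentialGet(int[] indices, double[] values, int index) {
        int pos = Arrays.binarySearch(indices, index);
        return pos >= 0 ? values[pos] : 0.0;
    }

    // The parallel arrays shine on a linear pass, e.g. a dot product
    // against a dense vector that touches each nonzero entry once, in order.
    static double sequentialDot(int[] indices, double[] values, double[] dense) {
        double sum = 0.0;
        for (int i = 0; i < indices.length; i++) {
            sum += values[i] * dense[indices[i]];
        }
        return sum;
    }

    public static void main(String[] args) {
        // The same 6-feature vector [0, 2.5, 0, 0, 4.0, 0] in all three layouts.
        double[] dense = {0.0, 2.5, 0.0, 0.0, 4.0, 0.0};

        Map<Integer, Double> randomAccess = new HashMap<>();
        randomAccess.put(1, 2.5);
        randomAccess.put(4, 4.0);

        int[] indices = {1, 4};
        double[] values = {2.5, 4.0};

        for (int i = 0; i < dense.length; i++) {
            System.out.println(i + ": " + denseGet(dense, i)
                + " " + randomAccessGet(randomAccess, i)
                + " " + sequentialGet(indices, values, i));
        }
        System.out.println("dot = " + sequentialDot(indices, values, dense));
    }
}
```

The dense array stores six doubles for two nonzero values, while both sparse layouts store only two entries. The trade-off is in access: the hash map answers arbitrary lookups cheaply, and the parallel arrays are cheapest when every entry is visited in order.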

One possible problem with our chosen mappings to dimension values is that the values in dimension 1 are much larger than the others. If we applied a simple distance-based metric to determine similarity between these vectors, color differences would dominate the results: a relatively small color difference of 10 nm would be treated as equal to a huge size difference of 10. Weighting the different dimensions solves this problem.
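
The effect of weighting can be sketched with a weighted Euclidean distance in plain Java. Everything concrete here is assumed for illustration: the feature order [weight, color wavelength in nm, size], the sample values, and the weight of 0.01 on the color dimension. The class and method names are hypothetical too.

```java
public class WeightedDistance {

    // Weighted Euclidean distance: sqrt(sum_i w[i] * (a[i] - b[i])^2).
    // With all weights equal to 1 this is the ordinary Euclidean distance.
    static double weightedEuclidean(double[] a, double[] b, double[] w) {
        if (a.length != b.length || a.length != w.length) {
            throw new IllegalArgumentException("vectors and weights must have equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += w[i] * diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Hypothetical items, feature order: [weight, color (nm), size].
        double[] base       = {0.11, 510.0, 1.0};
        double[] colorShift = {0.11, 520.0, 1.0};  // differs by 10 nm in color
        double[] sizeShift  = {0.11, 510.0, 11.0}; // differs by 10 in size

        double[] unweighted = {1.0, 1.0, 1.0};
        double[] weighted   = {1.0, 0.01, 1.0};    // assumed weight: damp the color dimension

        System.out.println("unweighted, color shift: " + weightedEuclidean(base, colorShift, unweighted));
        System.out.println("unweighted, size shift:  " + weightedEuclidean(base, sizeShift, unweighted));
        System.out.println("weighted, color shift:   " + weightedEuclidean(base, colorShift, weighted));
        System.out.println("weighted, size shift:    " + weightedEuclidean(base, sizeShift, weighted));
    }
}
```

Without weights the two shifts are equally far from the base item. With the color dimension damped, the size difference counts for much more than the color difference. Good weights depend on the data and are not something this sketch can suggest.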

Representing text documents as vectors

The vector space model (VSM) is the common way of vectorizing text documents. First, imagine the set of all words that could be encountered in a series of documents being vectorized. This set might be all words that appear at least once in any of the documents. Imagine each word being assigned a number, which is the dimension it’ll occupy in document vectors.

Term frequency (TF): The value of the vector dimension for a word is usually the number of occurrences of the word in the document. This is known as term frequency (TF) weighting.
Term frequency–inverse document frequency (TF-IDF): TF-IDF weighting is a widely used improvement on simple term-frequency weighting. The IDF part is the improvement; instead of simply using term frequency as the value in the vector, this value is multiplied by the inverse of the term's document frequency, that is, of the number of documents in which the term appears. As a result, the value is reduced more for words used frequently across all the documents in the dataset than for infrequently used words.
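
As a rough illustration of the two weightings, the sketch below computes a TF weight and a TF-IDF weight in plain Java. It assumes the common textbook form tf * log(N / df), where N is the number of documents and df is the term's document frequency. The exact formula Mahout applies may differ, and the class name, method names, and sample counts are hypothetical.

```java
public class TfIdfWeight {

    // TF weighting: the raw number of occurrences of the term in the document.
    static double tf(int termCount) {
        return termCount;
    }

    // TF-IDF weighting, assuming the textbook form tf * ln(N / df).
    // A term that occurs in every document gets weight 0.
    static double tfIdf(int termCount, int docFreq, int numDocs) {
        if (docFreq <= 0 || docFreq > numDocs) {
            throw new IllegalArgumentException("docFreq must be in 1..numDocs");
        }
        return termCount * Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        int numDocs = 1000; // hypothetical collection size

        // A very common word: 20 occurrences in this document, present in all documents.
        System.out.println("common word: tf=" + tf(20) + " tfidf=" + tfIdf(20, 1000, numDocs));

        // A rare word: 3 occurrences in this document, present in only 10 documents.
        System.out.println("rare word:   tf=" + tf(3) + " tfidf=" + tfIdf(3, 10, numDocs));
    }
}
```

Under plain TF the common word dominates the vector. Under TF-IDF its weight drops to zero and the rare word, despite fewer occurrences, carries the larger value.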

The basic assumption of the vector space model (VSM) is that the words are dimensions and therefore are orthogonal to each other. In other words, VSM assumes that the occurrences of words are independent of each other, in the same sense that a point's x coordinate is entirely independent of its y coordinate in two dimensions. Intuitively, you know that this assumption is wrong in many cases. For example, the word Cola has a higher probability of occurring along with the word Coca, so these words aren't completely independent. Other models try to consider word dependencies. One well-known technique is latent semantic indexing (LSI), which detects dimensions that seem to go together and merges them into a single one.

In Mahout, the DictionaryVectorizer class converts text documents to vectors using TF-IDF weighting and n-gram collocation.
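
To show what an n-gram is in this context, here is a small plain-Java sketch that turns a token list into word n-grams. It only enumerates candidate n-grams. It does not score or filter them, and it is not the DictionaryVectorizer code: the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NGrams {

    // Returns every run of n consecutive tokens, joined by a single space.
    // With n = 1 the result is just the list of unigrams.
    static List<String> ngrams(List<String> tokens, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            result.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("coca", "cola", "is", "sweet");
        System.out.println(ngrams(tokens, 1));
        System.out.println(ngrams(tokens, 2));
    }
}
```

Treating a bigram such as "coca cola" as its own dimension is one way to capture some of the word dependence that single-word dimensions ignore.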

Generating vectors from documents

mvn -e -q exec:java \
  -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" \
  -Dexec.args="reuters/ reuters-extracted/"

mahout seqdirectory -c UTF-8 \
  -i examples/reuters-extracted/ -o reuters-seqfiles

mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors -ow

In the first step, the text documents are tokenized: they're split into individual words using the Lucene StandardAnalyzer and stored in the tokenized-documents/ folder.
The second step is word counting. This is the n-gram generation step (which in this case only counts unigrams); it iterates through the tokenized documents and generates a set of important words from the collection.
The third step converts the tokenized documents into vectors using the term-frequency weight, thus creating TF vectors. By default, the vectorizer uses the TF-IDF weighting, so two more steps happen after this: the document-frequency (DF) counting job, and the TF-IDF vector creation.
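
The stages described above can be mimicked in memory with a short plain-Java sketch. This is not what Mahout runs: the real pipeline consists of Hadoop jobs over sequence files. The tokenizer below is a crude stand-in for the Lucene StandardAnalyzer, the tf * ln(N / df) weight is an assumed textbook form, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class VectorizePipeline {

    // Step 1: tokenize (lowercase, split on anything that is not a letter or digit).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Step 2: build the dictionary, mapping each word to its vector dimension
    // in order of first appearance.
    static Map<String, Integer> buildDictionary(List<List<String>> docs) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String t : doc) dict.putIfAbsent(t, dict.size());
        }
        return dict;
    }

    // Step 3: TF vector, stored sparsely as dimension -> occurrence count.
    static Map<Integer, Double> tfVector(List<String> tokens, Map<String, Integer> dict) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String t : tokens) {
            Integer dim = dict.get(t);
            if (dim != null) vector.merge(dim, 1.0, Double::sum);
        }
        return vector;
    }

    // Step 4: document frequency, the number of documents containing each dimension.
    static Map<Integer, Integer> documentFrequencies(List<Map<Integer, Double>> tfVectors) {
        Map<Integer, Integer> df = new HashMap<>();
        for (Map<Integer, Double> v : tfVectors) {
            for (Integer dim : v.keySet()) df.merge(dim, 1, Integer::sum);
        }
        return df;
    }

    // Step 5: TF-IDF vector, assuming the weight tf * ln(N / df).
    static Map<Integer, Double> tfIdfVector(Map<Integer, Double> tf,
                                            Map<Integer, Integer> df, int numDocs) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (Map.Entry<Integer, Double> e : tf.entrySet()) {
            double idf = Math.log((double) numDocs / df.get(e.getKey()));
            vector.put(e.getKey(), e.getValue() * idf);
        }
        return vector;
    }

    public static void main(String[] args) {
        List<List<String>> docs = new ArrayList<>();
        docs.add(tokenize("Coca Cola, cola!"));
        docs.add(tokenize("Pepsi cola"));

        Map<String, Integer> dict = buildDictionary(docs);
        List<Map<Integer, Double>> tfVectors = new ArrayList<>();
        for (List<String> doc : docs) tfVectors.add(tfVector(doc, dict));
        Map<Integer, Integer> df = documentFrequencies(tfVectors);

        System.out.println("dictionary: " + dict);
        for (Map<Integer, Double> tf : tfVectors) {
            System.out.println("tf=" + tf + " tfidf=" + tfIdfVector(tf, df, docs.size()));
        }
    }
}
```

In this toy collection the word "cola" appears in both documents, so its TF-IDF weight is zero in each vector, while the words unique to one document keep a positive weight.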