Thursday, October 27, 2011

How to correctly generate bigrams from text in Rapidminer

Bigrams (two consecutive tokens with no punctuation mark in between) are useful for representing terms that, taken together, mean something different than they do individually (e.g. “military intelligence”). The improvement in classification results achieved by incorporating bigrams varies from one document collection to another, and also depends on the bigram filtering method. “Using Bigrams in Text Categorization” is an excellent evaluation (if limited in scope) of bigram effectiveness.
Although the classification improvement from bigrams is not as dramatic as one might expect, they are worth trying.

In order to create meaningful bigrams, we need to consider:
  1. only token pairs that appear consecutively within a text, and without intermediate punctuation marks that otherwise would create false pairs.
  2. bigrams that are significant both in frequency and in their discriminative value. 
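The first requirement can be illustrated with a short, hypothetical Python sketch (not part of the Rapidminer workflow): split the text into punctuation-free segments first, then generate bigrams only within each segment, so no bigram ever spans a punctuation mark.

```python
import re

def segment_bigrams(text):
    """Generate bigrams only within punctuation-free word sequences,
    so that no bigram spans a punctuation mark."""
    # Split on any run of characters that is neither a word character
    # nor whitespace (i.e. punctuation), creating separate segments.
    segments = re.split(r"[^\w\s]+", text.lower())
    bigrams = []
    for seg in segments:
        tokens = seg.split()
        # Pair each token with its immediate successor within the segment.
        bigrams.extend(zip(tokens, tokens[1:]))
    return [" ".join(pair) for pair in bigrams]

print(segment_bigrams("Military intelligence reports; analysts disagree."))
# Note that the false pair "reports analysts" is never generated,
# because the semicolon splits the text into two segments.
```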
Here is a Rapidminer workflow that achieves all that. It takes two or more document sets and transforms them into a table (an ExampleSet, in Rapidminer parlance), where each attribute is a word (unigram) or a bigram that is not class-independent:

Rapidminer workflow to generate unigrams and bigrams (click here to download this Rapidminer workflow)
The hard part here is generating only bigrams that do not include stopwords, and avoiding cases where two words should not really form a bigram (e.g. the words before and after a semicolon).

This workflow works as follows:
  1. It reads 2 document sets through the Read Documents from Files operator. This operator also filters out all unigrams and bigrams that appear below or above a certain frequency threshold.
  2. It splits each document into many word sequences using the Cut Document operator. A word sequence ends whenever anything other than a word character or whitespace (i.e. a punctuation mark) appears.
  3. It then splits each word sequence into tokens, and generates unigrams and bigrams. Here it is not possible to remove stopwords before generating the bigrams (otherwise a bigram could join two non-consecutive words that originally had a stopword between them). We also do not want to keep stopwords as unigrams. Therefore we perform two steps:
    1. We eliminate all stopwords (unigrams) with Filter Stopwords after generating the bigrams.
    2. We eliminate most of the bigrams that include stopwords, by discarding all bigrams that start or end with a one-, two-, or three-letter word. Some stopwords will still appear inside bigrams, but most bigrams containing short function words (the, our, my, etc.) will be discarded.
  4. After tokenizing and generating all unigrams and bigrams from the separate sequences coming from the same document, we merge those sequences back into one document, with the Combine Documents operator applied to the output of Cut Documents. Remember that we needed to cut the document into sequences to generate bigrams that do not cross punctuation marks.
  5. After obtaining a table where each row stands for a document, and each attribute for a unigram or bigram, we eliminate all n-grams that are not useful for prediction. We do this by applying a χ² test with a threshold for (number of classes - 1) × (number of bins - 1) degrees of freedom, at α = 0.05. We selected 2 bins, so the χ² DoF = number of classes - 1.
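The two stopword-related filtering steps can be sketched in Python as follows; the stopword list and function names here are illustrative, not part of the workflow:

```python
# Illustrative subset of a stopword list, not a complete one.
STOPWORDS = {"the", "our", "my", "a", "an", "of", "in", "and"}

def filter_ngrams(unigrams, bigrams, min_len=4):
    """Mimic the two filtering steps: drop stopword unigrams, and drop
    bigrams whose first or last word has fewer than min_len letters."""
    kept_uni = [u for u in unigrams if u not in STOPWORDS]
    kept_bi = [b for b in bigrams
               if all(len(w) >= min_len
                      for w in (b.split()[0], b.split()[-1]))]
    return kept_uni, kept_bi

uni, bi = filter_ngrams(["the", "military", "intelligence"],
                        ["the military", "military intelligence"])
# "the" is dropped as a unigram, and "the military" is dropped because
# its first word has only three letters; "military intelligence" survives.
```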
The result of this workflow is a table where each document is represented by those unigrams and bigrams that appear in the document and are not class-independent.
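The χ² filtering step in point 5 can be sketched as follows. This is a hand-rolled presence/absence (2-bin) χ² statistic, not Rapidminer's implementation; the function name and example counts are made up, and the critical value 3.841 is the standard χ² threshold for 1 degree of freedom (two classes) at α = 0.05:

```python
def chi2_presence(presence_by_class):
    """Chi-square statistic for one term, using 2 bins (document contains
    the term / does not contain it) across classes.  presence_by_class
    maps each class name to (docs_containing_term, total_docs_in_class)."""
    total_docs = sum(n for _, n in presence_by_class.values())
    total_present = sum(p for p, _ in presence_by_class.values())
    chi2 = 0.0
    for present, n in presence_by_class.values():
        # Compare observed vs. expected counts for both bins.
        for observed, overall in ((present, total_present),
                                  (n - present, total_docs - total_present)):
            expected = n * overall / total_docs
            if expected:
                chi2 += (observed - expected) ** 2 / expected
    return chi2

# Two classes -> DoF = (2 - 1) x (2 - 1) = 1; critical value at alpha=0.05.
CRITICAL_1DF_05 = 3.841
stat = chi2_presence({"spam": (40, 50), "ham": (5, 50)})
keep = stat > CRITICAL_1DF_05  # the term's presence depends on the class
```

A term whose presence rate is identical across classes would score near zero and be discarded as class-independent.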

You will notice that, compared to custom code in Java or Python, pre-processing text in Rapidminer is slow. This is not a problem with Rapidminer alone; R suffers from it too. On the other hand, you can try out the effectiveness of new ideas much more quickly in Rapidminer or R; for us, these systems are our playground.
A note of caution: including both a bigram and its two constituent words as features will negatively affect classification accuracy for any method that assumes (conditional or unconditional) independence of features (e.g. Naive Bayes), since it increases the number of independence violations (a bigram and its words are clearly not independent). Bigrams themselves may also be non-independent; for example, the 2 bigrams resulting from "linux system administrator" (linux system, system administrator) are most likely non-independent. This does not affect classification methods that make no such assumption (e.g. Logistic Regression). Pages 1-4 and 7-11 of Naive Bayes and Logistic Regression are a good introduction to this topic.