Deep Learning in Action

On Wednesday at Hochschule München, Fakultät für Informatik and Mathematik I presented about Deep Learning (nbviewer, github, pdf).

Mainly concepts (what’s “deep” in Deep Learning, backpropagation, how to optimize …) and architectures (Multi-Layer Perceptron, Convolutional Neural Network, Recurrent Neural Network), but also demos and code examples (mainly using TensorFlow).

It was/is a lot material to cover in 90 minutes, and conceptual understanding / developing intuition was the main point. Of course, there is great online material to make use of, and you’ll see my preferences in the cited sources ;-).

Next year, having covered the basics, I hope to be developing use cases and practical applications showing applicability of Deep Learning even in non-Google-size (resp: Facebook, Baidu, Apple…) environments.
Stay tuned!

Analyse de sentiments de critiques cinématographiques – version française

This is just an ultra short post saying that last Tuesday, I had the honor of presenting my “Sentiment Analysis of Movie Reviews” talk at Swiss Data Forum – in French 😉 Thanks again guys for having me, and for your patience 🙂

So here’s a link to the French version of the talk – toute la magie de word2vec et doc2vec, en français:-) Enjoy!

Sentiment Analysis of Movie Reviews (3): doc2vec

This is the last – for now – installment of my mini-series on sentiment analysis of the Stanford collection of IMDB reviews.
So far, we’ve had a look at classical bag-of-words models and word vectors (word2vec).
We saw that from the classifiers used, logistic regression performed best, be it in combination with bag-of-words or word2vec.
We also saw that while the word2vec model did in fact model semantic dimensions, it was less successful for classification than bag-of-words, and we explained that by the averaging of word vectors we had to perform to obtain input features on review (not word) level.
So the question now is: How would distributed representations perform if we did not have to throw away information by averaging word vectors?

Document vectors: doc2vec

Shortly after word2vec, Le and Mikolov developed paragraph (document) vector models.
The basic models are

  • Distributed Memory Model of Paragraph Vectors (PV-DM) and
  • Distributed Bag of Words (PV-DBOW)

In PV-DM, in addition to the word vectors, there is a paragraph vector that keeps track of the whole document:

Fig.1: Distributed Memory Model of Paragraph Vectors (PV-DM) (from: Distributed Representations of Sentences and Documents)

With distributed bag-of-words (PV-DBOW), there even aren’t any word vectors, there’s just a paragraph vector trained to predict the context:

Fig.2: Distributed Bag of Words (PV-DBOW) (from: Distributed Representations of Sentences and Documents)

Like word2vec, doc2vec in Python is provided by the gensim library. Please see the gensim doc2vec tutorial for example usage and configuration.

doc2vec: performance on sentiment analysis task

I’ve trained 3 models, with parameter settings as in the above-mentioned doc2vec tutorial: 2 distributed memory models (with word & paragraph vectors averaged or concatenated, respectively), and one distributed bag-of-words model. Here, without further ado, are the results. I’m just referring results for logistic regression as again, this was the best-performing classifier:

test vectors inferred test vectors from model
Distributed memory, vectors averaged (dm/m) 0.81 0.87
Distributed memory, vectors concatenated (dm/c) 0.80 0.82
Distributed bag of words (dbow) 0.90 0.90

Hoorah! We’ve finally beaten bag-of-words … but only by a tiny little 0.1 percent, and we won’t even ask if that’s significant 😉
What should we conclude from that? In my opinion, there’s no reason to be sarcastic here (even if you might have thought I’d made it sound like that ;-)). With doc2vec, we’ve (at least) reached bag-of-words performance for classification, plus we now have semantic dimensions at our disposal. Speaking of which – let’s check what doc2vec thinks is similar to awesome/awful. Will the results be equivalent to those had with word2vec?
These are the words found most similar to awesome (note: the model we’re asking this question isn’t the one that performed best with Logistic Regression (PV-DBOW), as distributed bag-of-words doesn’t train word vectors, – this is instead obtained from the best-performing PV-DMM model):

model.most_similar('awesome', topn=10)

[(u'incredible', 0.9011116027832031),
(u'excellent', 0.8860622644424438),
(u'outstanding', 0.8797732591629028),
(u'exceptional', 0.8539372682571411),
(u'awful', 0.8104138970375061),
(u'astounding', 0.7750493884086609),
(u'alright', 0.7587056159973145),
(u'astonishing', 0.7556235790252686),
(u'extraordinary', 0.743841290473938)]

So, what we see is very similar to the output of word2vec – including the inclusion of awful. Same for what’s judged similar to awful:

model.most_similar('awful', topn=10)

[(u'abysmal', 0.8371909856796265),
(u'appalling', 0.8327066898345947),
(u'atrocious', 0.8309577703475952),
(u'horrible', 0.8192445039749146),
(u'terrible', 0.8124841451644897),
(u'awesome', 0.8104138970375061),
(u'dreadful', 0.8072893023490906),
(u'horrendous', 0.7981990575790405),
(u'amazing', 0.7926105260848999), 
(u'incredible', 0.7852109670639038)]

To sum up – for now – we’ve explored how three models: bag-of-words, word2vec, and doc2vec – perform on sentiment analysis of IMDB movie reviews, in combination with different classifiers the most successful of which was logistic regression. Very similar (around 10% error rate) performance was reached by bag-of-words and doc2vec.
From this you may of course conclude that as of today, there’s no reason not to stick with the straightforward bag-of-words approach.But you could also view this differently. While word2vec appeared in 2013, it was succeeded by doc2vec already in 2014. Now it’s 2016, and things have happened in the meantime, are happening at present, right now. It’s a fascinating field, and even if – in sentiment analysis – we don’t see impressive output yet, impressive output is quite likely to appear sooner or later. I’m curious what we’re going to see!

Sentiment Analysis of Movie Reviews (2): word2vec

This is the continuation of my mini-series on sentiment analysis of movie reviews. Last time, we had a look at how well classical bag-of-words models worked for classification of the Stanford collection of IMDB reviews. As it turned out, the “winner” was Logistic Regression, using both unigrams and bigrams for classification. The best classification accuracy obtained was .89 – not bad at all for sentiment analysis (but see the caveat regarding what’s good or bad in the first post).

Bag-of-words: limitations

So, bag-of-words models may be surprisingly successful, but they are limited in what they can do. First and foremost, with bag-of-words models, words are encoded using one-hot-encoding. One-hot-encoding means that each word is represented by a vector, of length the size of the vocabulary, where exactly one bit is “on” (1). Say we wanted to encode that famous quote by Hadley Wickham

Tidy datasets are all alike but every messy dataset is messy in its own way.

It would look like this:

alike all but dataset every in is its messy own tidy way
0 0 0 0 0 0 0 0 0 0 0 1 0
1 0 0 0 1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 1 0 0 0 0 0
3 0 1 0 0 0 0 0 0 0 0 0 0
4 1 0 0 0 0 0 0 0 0 0 0 0
5 0 0 1 0 0 0 0 0 0 0 0 0
6 0 0 0 0 1 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 1 0 0 0
8 0 0 0 0 0 1 0 0 0 0 0 0
9 0 0 0 0 0 0 0 1 0 0 0 0
10 0 0 0 0 0 0 0 0 0 1 0 0
11 0 0 0 0 0 0 0 0 0 0 0 1

Now the thing is – with this representation, all words are basically equidistant from each other!
If we want to find similarities between words, we have to look at a corpus of texts, build a co-occurrence matrix and perform dimensionality reduction (using, e.g., singular value decomposition).
Let’s take a second example sentence, like this one Lev Tolstoj wrote after having read Hadley 😉

Happy families are all alike; every unhappy family is unhappy in its own way.

This is what the beginning of a co-occurrence matrix would look like for the two sentences:

tidy dataset is all alike but every messy in its own way happy family unhappy
tidy 0 2 2 1 1 1 1 2 1 1 1 1 0 0 0
dataset 2 0 2 1 1 1 1 2 1 1 1 1 0 0 0
is 1 2 1 1 1 1 1 2 1 1 1 1 1 2 1

As you can imagine, in reality such a co-occurrence matrix will quickly become very, very big. And after dimensionality reduction, we will have lost a lot of information about individual words.
So the question is, is there anything else we can do?

As it turns out, there is. Words do not have to be represented as one-hot vectors. Using so-called distributed representations, a word can be represented as a vector of (say 100, 200, … whatever works best) real numbers.
And as we will see, with this representation, it is possible to model semantic relationships between words!

Before I proceed, an aside: In this post, my focus is on walking through a real example, and showing real data exploration. I’ll just pick two of the most famous recent models and work with them, but this doesn’t mean that there aren’t others (there are!), and also there’s a whole history of using distributed representations for NLP. For an overview, I’d recommend Sebastian Ruder’s series on the topic, starting with the first part on the history of word embeddings.


The first model I’ll use is the famous word2vec developed by Mikolov et al. 2013 (see Efficient estimation of word representations in vector space).
word2vec arrives at word vectors by training a neural network to predict

  • a word in the center from its surroundings (continuous bag of words, CBOW), or
  • a word’s surroundings from the center word(!) (skip-gram model)

This is easiest to understand looking at figures from the Mikolov et al. article cited above:

Fig.1: Continuous Bag of Words (from: Efficient estimation of word representations in vector space)


Fig.2: Skip-Gram (from: Efficient estimation of word representations in vector space)

Now, to see what is meant by “model semantic relationships”, have a look at the following figure:

Fig. 3: Semantic-Syntactic Word Relationship test (from: Efficient estimation of word representations in vector space)

So basically what these models can do (with much higher-than-random accuracy) is answer questions like “Athens is to Greece what Oslo is to ???”. Or: “walking” is to “walked” what “swam” is to ???
Fascinating, right?
Let’s try this on our dataset.

Word embeddings for the IMDB dataset

In Python, word2vec is available through the gensim NLP library. There is a very nice tutorial how to use word2vec written by the gensim folks, so I’ll jump right in and present the results of using word2vec on the IMDB dataset.

I’ve trained a CBOW model, with a context size of 20, and a vector size of 100. Now let’s explore our model! Now that we have a notion of similarity for our words, let’s first ask: Which are the words most similar to awesome and awful, respectively? Start with awesome:

model.most_similar('awesome', topn=10)

[(u'amazing', 0.7929322123527527),
(u'incredible', 0.7127916812896729),
(u'awful', 0.7072071433067322),
(u'excellent', 0.6961393356323242),
(u'fantastic', 0.6925109624862671),
(u'alright', 0.6886886358261108),
(u'cool', 0.679090142250061),
(u'outstanding', 0.6213874816894531),
(u'astounding', 0.613292932510376),
(u'terrific', 0.6013768911361694)]

Interesting! This makes a lot of sense: So amazing is most similar, then we have words like excellent and outstanding.
But wait! What is awful doing there? I don’t really have a satisfying explanation. The most straightforward explanation would be that both occur in similar positions, and it is positions that the network learns. But then I’m still wondering why it is just awful appearing in this list, not any other positive word. Another important aspect might be emotional intensity. No doubt both words are very similar in intensity. (I’ve also checked co-occurrences, but those do not yield a simple explanation either.) Anyway, let’s keep an eye on it when further exploring the model. Of course we have to also keep in mind that for training a word2vec model, normally much bigger datasets are used.

These are the words most similar to awful:

model.most_similar('awful', topn=10)

[(u'terrible', 0.8212785124778748),
(u'horrible', 0.7955455183982849),
(u'atrocious', 0.7824822664260864),
(u'dreadful', 0.7722172737121582),
(u'appalling', 0.7244443893432617),
(u'horrendous', 0.7235419154167175),
(u'abysmal', 0.720653235912323),
(u'amazing', 0.708114743232727),
(u'awesome', 0.7072070837020874),
(u'bad', 0.6963905096054077)]

Again, this makes a lot of sense! And here we have the awesomeawful relationship again. On a close look, there’s also amazing appearing in the “awful list”.

For curiosity, I’ve tried to “subtract out” awful from the “awesome list”. This is what I ended up with:

model.most_similar(positive=['awesome'], negative=['awful'])

[(u'jolly', 0.3947059214115143),
(u'midget', 0.38988131284713745),
(u'knight', 0.3789686858654022),
(u'spooky', 0.36937469244003296),
(u'nice', 0.3680706322193146),
(u'looney', 0.3676275610923767),
(u'ho', 0.3594890832901001),
(u'gotham', 0.35877227783203125),
(u'lookalike', 0.3579031229019165),
(u'devilish', 0.35554438829421997)]

Funny … We seem to have entered some realm of fantasy story … But I’ll refrain from interpreting this result any further 😉

Now, I don’t want you to get the impression the model just did some crazy stuff and that’s all. Overall, the similar words I’ve checked make a lot of sense, and also the “what doesn’t match” is answered quite well, in the area we’re interested in (sentiment). Look at this:

model.doesnt_match("good bad awful terrible".split())
model.doesnt_match("awesome bad awful terrible".split())
model.doesnt_match("nice pleasant fine excellent".split())

In the last case, I wanted the model to sort out the outlier on the intensity dimension (excellent), and that is what it did!

Exploring word vectors is fun, but we need to get back to our classification task. Now that we have word vectors, how do we use them for sentiment analysis?
We need one vector per review, not per word. What is done most often is to average vectors, and when we input these averaged vectors to our classification algorithms (the same as in the last post), this is the result we get (listing the bag-of-words results again, too, for easy comparison):

Bag of words word2vec
Logistic Regression 0.89 0.83
Support Vector Machine 0.84 0.70
Random Forest 0.84 0.80

So cool as our word2vec model is, it actually performs worse on classification than bag-of-words. Thinking about it, this is not so surprising, given that averaging over word vectors, we lose a whole lot of information – information we spent quite some calculation effort to obtain before… What if we already had vectors for a complete review, and didn’t have to average? Enter .. doc2vec – document (or paragraph) vectors! But this is already the topic for the next post … stay tuned 🙂

Sentiment Analysis of Movie Reviews (talk)

Yesterday at Tech Event, which was great as always, I presented on sentiment analysis, taking as example movie reviews. I intend to write a little series of blog posts on this, but as I’m not sure when exactly I’ll get to this, here are the pdf version and a link to the notebook.

The focus was not on the classification algorithms per se (treating text as just another domain for classification), but on the difficulties emerging from this being language: Can it really work to look at just single words? Do I need bigrams? Trigrams? More? Can we tackle the complexity using word embeddings – word vectors? Or better, paragraph vectors?

I had a lot of fun exploring this topic, and I really hope to write some posts on this – stay tuned 🙂