This is just an ultra short post saying that last Tuesday, I had the honor of presenting my “Sentiment Analysis of Movie Reviews” talk at Swiss Data Forum – in French 😉 Thanks again guys for having me, and for your patience 🙂
So here’s a link to the French version of the talk – toute la magie de word2vec et doc2vec, en français:-) Enjoy!
This is the last – for now – installment of my mini-series on sentiment analysis of the Stanford collection of IMDB reviews.
So far, we’ve had a look at classical bag-of-words models and word vectors (word2vec).
We saw that from the classifiers used, logistic regression performed best, be it in combination with bag-of-words or word2vec.
We also saw that while the word2vec model did in fact model semantic dimensions, it was less successful for classification than bag-of-words, and we explained that by the averaging of word vectors we had to perform to obtain input features on review (not word) level.
So the question now is: How would distributed representations perform if we did not have to throw away information by averaging word vectors?
I’ve trained 3 models, with parameter settings as in the above-mentioned doc2vec tutorial: 2 distributed memory models (with word & paragraph vectors averaged or concatenated, respectively), and one distributed bag-of-words model. Here, without further ado, are the results. I’m just referring results for logistic regression as again, this was the best-performing classifier:
test vectors inferred
test vectors from model
Distributed memory, vectors averaged (dm/m)
Distributed memory, vectors concatenated (dm/c)
Distributed bag of words (dbow)
Hoorah! We’ve finally beaten bag-of-words … but only by a tiny little 0.1 percent, and we won’t even ask if that’s significant 😉
What should we conclude from that? In my opinion, there’s no reason to be sarcastic here (even if you might have thought I’d made it sound like that ;-)). With doc2vec, we’ve (at least) reached bag-of-words performance for classification, plus we now have semantic dimensions at our disposal. Speaking of which – let’s check what doc2vec thinks is similar to awesome/awful. Will the results be equivalent to those had with word2vec?
These are the words found most similar to awesome (note: the model we’re asking this question isn’t the one that performed best with Logistic Regression (PV-DBOW), as distributed bag-of-words doesn’t train word vectors, – this is instead obtained from the best-performing PV-DMM model):
To sum up – for now – we’ve explored how three models: bag-of-words, word2vec, and doc2vec – perform on sentiment analysis of IMDB movie reviews, in combination with different classifiers the most successful of which was logistic regression. Very similar (around 10% error rate) performance was reached by bag-of-words and doc2vec.
From this you may of course conclude that as of today, there’s no reason not to stick with the straightforward bag-of-words approach.But you could also view this differently. While word2vec appeared in 2013, it was succeeded by doc2vec already in 2014. Now it’s 2016, and things have happened in the meantime, are happening at present, right now. It’s a fascinating field, and even if – in sentiment analysis – we don’t see impressive output yet, impressive output is quite likely to appear sooner or later. I’m curious what we’re going to see!
Yesterday at Tech Event, which was great as always, I presented on sentiment analysis, taking as example movie reviews. I intend to write a little series of blog posts on this, but as I’m not sure when exactly I’ll get to this, here are the pdf version and a link to the notebook.
The focus was not on the classification algorithms per se (treating text as just another domain for classification), but on the difficulties emerging from this being language: Can it really work to look at just single words? Do I need bigrams? Trigrams? More? Can we tackle the complexity using word embeddings – word vectors? Or better, paragraph vectors?
I had a lot of fun exploring this topic, and I really hope to write some posts on this – stay tuned 🙂