Sentiment Analysis of Movie Reviews – French version

This is just an ultra short post saying that last Tuesday, I had the honor of presenting my “Sentiment Analysis of Movie Reviews” talk at Swiss Data Forum – in French 😉 Thanks again guys for having me, and for your patience 🙂

So here’s a link to the French version of the talk – all the magic of word2vec and doc2vec, in French :-) Enjoy!

R for SQListas (1): Welcome to the Tidyverse

R for SQListas, what’s that about?

This is the 2-part blog version of a talk I gave at DOAG Conference this week. I’ve also uploaded the slides (no ppt; just pretty R presentation 😉 ) to the articles section, but if you’d like a little text, I encourage you to read on. That is, if you’re in the target group for this post/talk.
For this post, let me assume you’re a SQL girl (or guy). You’re comfortable with SQL (an expert, probably), you know how to get and manipulate your data, and no nesting of subselects scares you ;-). And now there’s this R language people are talking about, and it can do so many things, they say, so you’d like to make use of it too. Does this mean you have to start from scratch and learn not only a new language, but a whole new paradigm? Turns out … ok. So that’s the context for this post.

Let’s talk about the weather

So in this post, I’d like to show you how nice R is to use if you come from SQL. But this isn’t going to be a syntax-only post. We’ll be looking at real datasets and trying to answer a real question.
Personally I’m very interested in how the weather’s going to develop in the future, especially in the nearer future, and especially regarding the area where I live (I know. It’s egocentric.). Specifically, what worries me are warm winters, and I’ll be clutching at any straw that tells me it’s not going to get warmer still 😉
So I’ve downloaded / prepared two datasets, both climate / weather-related. The first is the average global temperatures dataset from the Berkeley Earth Surface Temperature Study, nicely packaged by Kaggle (a website for data science competitions). It contains measurements from 1743 up to 2013. The monthly averages have been obtained using sophisticated scientific procedures, described on the Berkeley Earth website.
The second is daily weather data for Munich. This dataset was retrieved manually, and the period was chosen so as to not contain too many missing values. The measurements range from 1997 to 2015, and have been aggregated by taking a monthly average.
Let’s start our journey through R land, reading in and looking at the beginning of the first dataset:

library(tidyverse)   # readr, dplyr, tidyr, ggplot2, ...
library(lubridate)   # year() and month(), used further below
df <- read_csv('data/GlobalLandTemperaturesByCity.csv')
head(df)

## 1 1743-11-01 6.068 1.737 Århus
## 2 1743-12-01 NA NA Århus
## 3 1744-01-01 NA NA Århus
## 4 1744-02-01 NA NA Århus
## 5 1744-03-01 NA NA Århus
## 6 1744-04-01 5.788 3.624 Århus
## # ... with 3 more variables: Country , Latitude ,
## # Longitude

Now we’d like to explore the dataset. With SQL, this is easy: We use WHERE to filter rows, SELECT to select columns, GROUP BY to aggregate by one or more variables…And of course, we often need to JOIN tables, and sometimes, perform set operations. Then there’s all kinds of analytic functions, such as LAG() and LEAD(). How do we do all this in R?

Entering the tidyverse

Luckily for the SQLista, writing elegant, functional, and often rather SQL-like code in R is easy. All we need to do is … enter the tidyverse. Actually, we’ve already entered it – doing library(tidyverse) – and used it to read in our csv file (read_csv)!
The tidyverse is a set of packages, developed by Hadley Wickham, Chief Scientist at RStudio, designed to make working with R easier and more consistent (and more fun). We load data from files using readr, clean up datasets that are not in third normal form using tidyr, manipulate data with dplyr, and plot them with ggplot2.
For our task of data exploration, it is dplyr we need. Before we even begin, let’s rename the columns so they have shorter names:

df <- rename(df, avg_temp = AverageTemperature, avg_temp_95p = AverageTemperatureUncertainty, city = City, country = Country, lat = Latitude, long = Longitude)
head(df)

## # A tibble: 6 × 7
## dt avg_temp avg_temp_95p city country lat long
## 1 1743-11-01 6.068 1.737 Århus Denmark 57.05N 10.33E
## 2 1743-12-01 NA NA Århus Denmark 57.05N 10.33E
## 3 1744-01-01 NA NA Århus Denmark 57.05N 10.33E
## 4 1744-02-01 NA NA Århus Denmark 57.05N 10.33E
## 5 1744-03-01 NA NA Århus Denmark 57.05N 10.33E
## 6 1744-04-01 5.788 3.624 Århus Denmark 57.05N 10.33E

distinct() (SELECT DISTINCT)

Good. Now that we have this new dataset containing temperature measurements, really the first thing we want to know is: What locations (countries, cities) do we have measurements for?
To find out, just do distinct():

distinct(df, country)

## # A tibble: 159 × 1
## country
## 1 Denmark
## 2 Turkey
## 3 Kazakhstan
## 4 China
## 5 Spain
## 6 Germany
## 7 Nigeria
## 8 Iran
## 9 Russia
## 10 Canada
## # ... with 149 more rows

distinct(df, city)

## # A tibble: 3,448 × 1
## city
## 1 Århus
## 2 Çorlu
## 3 Çorum
## 4 Öskemen
## 5 Ürümqi
## 6 A Coruña
## 7 Aachen
## 8 Aalborg
## 9 Aba
## 10 Abadan
## # ... with 3,438 more rows

filter() (WHERE)

OK. Now as I said I’m really first and foremost curious about measurements from Munich, so I’ll have to restrict the rows. In SQL I’d need a WHERE clause, in R the equivalent is filter():

filter(df, city == 'Munich')
## # A tibble: 3,239 × 7
## dt avg_temp avg_temp_95p city country lat long
## 1 1743-11-01 1.323 1.783 Munich Germany 47.42N 10.66E
## 2 1743-12-01 NA NA Munich Germany 47.42N 10.66E
## 3 1744-01-01 NA NA Munich Germany 47.42N 10.66E
## 4 1744-02-01 NA NA Munich Germany 47.42N 10.66E
## 5 1744-03-01 NA NA Munich Germany 47.42N 10.66E
## 6 1744-04-01 5.498 2.267 Munich Germany 47.42N 10.66E
## 7 1744-05-01 7.918 1.603 Munich Germany 47.42N 10.66E

If we have more than one condition, as in a WHERE clause, we pass them as separate arguments. Conditions separated by commas are combined with AND, while | expresses OR:

filter(df, city == 'Munich', year(dt) > 2000)
## # A tibble: 153 × 7
## dt avg_temp avg_temp_95p city country lat long
## 1 2001-01-01 -3.162 0.396 Munich Germany 47.42N 10.66E
## 2 2001-02-01 -1.221 0.755 Munich Germany 47.42N 10.66E
## 3 2001-03-01 3.165 0.512 Munich Germany 47.42N 10.66E
## 4 2001-04-01 3.132 0.329 Munich Germany 47.42N 10.66E
## 5 2001-05-01 11.961 0.150 Munich Germany 47.42N 10.66E
## 6 2001-06-01 11.468 0.377 Munich Germany 47.42N 10.66E
## 7 2001-07-01 15.037 0.316 Munich Germany 47.42N 10.66E
## 8 2001-08-01 15.761 0.325 Munich Germany 47.42N 10.66E
## 9 2001-09-01 7.897 0.420 Munich Germany 47.42N 10.66E
## 10 2001-10-01 9.361 0.252 Munich Germany 47.42N 10.66E
## # ... with 143 more rows

# OR
filter(df, city == 'Munich' | year(dt) > 2000)

## # A tibble: 540,116 × 7
## dt avg_temp avg_temp_95p city country lat long
## 1 2001-01-01 1.918 0.381 Århus Denmark 57.05N 10.33E
## 2 2001-02-01 0.241 0.328 Århus Denmark 57.05N 10.33E
## 3 2001-03-01 1.310 0.236 Århus Denmark 57.05N 10.33E
## 4 2001-04-01 5.890 0.158 Århus Denmark 57.05N 10.33E
## 5 2001-05-01 12.016 0.351 Århus Denmark 57.05N 10.33E
## 6 2001-06-01 13.944 0.352 Århus Denmark 57.05N 10.33E
## 7 2001-07-01 18.453 0.367 Århus Denmark 57.05N 10.33E
## 8 2001-08-01 17.396 0.287 Århus Denmark 57.05N 10.33E
## 9 2001-09-01 13.206 0.207 Århus Denmark 57.05N 10.33E
## 10 2001-10-01 11.732 0.200 Århus Denmark 57.05N 10.33E
## # ... with 540,106 more rows

select() (SELECT)

Now, often we don’t want to see all the columns/variables. In SQL we SELECT what we’re interested in, and it’s select() in R, too:

select(filter(df, city == 'Munich'), avg_temp, avg_temp_95p)

## # A tibble: 3,239 × 2
## avg_temp avg_temp_95p
## 1 1.323 1.783
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 5.498 2.267
## 7 7.918 1.603
## 8 11.070 1.584
## 9 12.935 1.653
## 10 NA NA
## # ... with 3,229 more rows

arrange() (ORDER BY)

How about ordered output? This can be done using arrange():

arrange(select(filter(df, city == 'Munich'), dt, avg_temp), avg_temp)

## # A tibble: 3,239 × 2
## dt avg_temp
## 1 1956-02-01 -12.008
## 2 1830-01-01 -11.510
## 3 1767-01-01 -11.384
## 4 1929-02-01 -11.168
## 5 1795-01-01 -11.019
## 6 1942-01-01 -10.785
## 7 1940-01-01 -10.643
## 8 1895-02-01 -10.551
## 9 1755-01-01 -10.458
## 10 1893-01-01 -10.381
## # ... with 3,229 more rows

Do you think this is starting to get difficult to read? What if we add FILTER and GROUP BY operations to this query? Fortunately, with dplyr it is possible to avoid paren hell as well as stepwise assignment using the pipe operator, %>%.

Meet: %>% – the pipe

The pipe transforms an expression of the form x %>% f(y) into f(x, y) and so allows us to write the above operation like this:

df %>% filter(city == 'Munich') %>% select(dt, avg_temp) %>% arrange(avg_temp)

This looks a lot like the fluent API design popular in some object oriented languages, or the bind operator, >>=, in Haskell.
It also looks a lot more like SQL. However, keep in mind that while SQL is declarative, the order of operations matters when you use the pipe (as the name says, the output of one operation is piped to another). You cannot, for example, write this (trying to emulate SQL’s SELECT – WHERE – ORDER BY): df %>% select(dt, avg_temp) %>% filter(city == 'Munich') %>% arrange(avg_temp). This can’t work because after a new data frame has been returned from the select, the column city is no longer available.

group_by() and summarise() (GROUP BY)

Now that we’ve introduced the pipe, on to group by. This is achieved in dplyr using group_by() (for grouping, obviously) and summarise() for aggregation.
Let’s find the countries we have most – and least, respectively – records for:

# most records
df %>% group_by(country) %>% summarise(count=n()) %>% arrange(count %>% desc())

## # A tibble: 159 × 2
## country count
## 1 India 1014906
## 2 China 827802
## 3 United States 687289
## 4 Brazil 475580
## 5 Russia 461234
## 6 Japan 358669
## 7 Indonesia 323255
## 8 Germany 262359
## 9 United Kingdom 220252
## 10 Mexico 209560
## # ... with 149 more rows

# least records
df %>% group_by(country) %>% summarise(count=n()) %>% arrange(count)

## # A tibble: 159 × 2
## country count
## 1 Papua New Guinea 1581
## 2 Oman 1653
## 3 Djibouti 1797
## 4 Eritrea 1797
## 5 Botswana 1881
## 6 Lesotho 1881
## 7 Namibia 1881
## 8 Swaziland 1881
## 9 Central African Republic 1893
## 10 Congo 1893

How about finding the average, minimum and maximum temperatures per month, looking at just records from Germany, and that originate after 1949?

df %>% filter(country == 'Germany', !is.na(avg_temp), year(dt) > 1949) %>% group_by(month(dt)) %>% summarise(count = n(), avg = mean(avg_temp), min = min(avg_temp), max = max(avg_temp))

## # A tibble: 12 × 5
## `month(dt)` count avg min max
## 1 1 5184 0.3329331 -10.256 6.070
## 2 2 5184 1.1155843 -12.008 7.233
## 3 3 5184 4.5513194 -3.846 8.718
## 4 4 5184 8.2728137 1.122 13.754
## 5 5 5184 12.9169965 5.601 16.602
## 6 6 5184 15.9862500 9.824 21.631
## 7 7 5184 17.8328285 11.697 23.795
## 8 8 5184 17.4978752 11.390 23.111
## 9 9 5103 14.0571383 7.233 18.444
## 10 10 5103 9.4110645 0.759 13.857
## 11 11 5103 4.6673114 -2.601 9.127
## 12 12 5103 1.3649677 -8.483 6.217

In this way, aggregation queries can be written that are powerful and very readable at the same time. So at this point, we know how to do basic selects with filtering and grouping. How about joins?


Joins

dplyr provides inner_join(), left_join(), right_join() and full_join() operations, as well as semi_join() and anti_join(). From the SQL viewpoint, these work exactly as expected.
To demonstrate a join, we’ll now load the second dataset, containing daily weather data for Munich, and aggregate it by month:

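# daily_1997_2015: the daily Munich weather data, read in with read_csv(); grouping by floor_date() is an assumed reconstruction
(monthly_1997_2015 <- daily_1997_2015 %>% group_by(month = floor_date(day, "month")) %>% summarise(mean_temp = mean(mean_temp)))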

## # A tibble: 228 × 2
## month mean_temp
## 1 1997-01-01 -3.580645
## 2 1997-02-01 3.392857
## 3 1997-03-01 6.064516
## 4 1997-04-01 6.033333
## 5 1997-05-01 13.064516
## 6 1997-06-01 15.766667
## 7 1997-07-01 16.935484
## 8 1997-08-01 18.290323
## 9 1997-09-01 13.533333
## 10 1997-10-01 7.516129
## # ... with 218 more rows

Fine. Now let’s join the two datasets on the date column (their respective keys), telling R that this column is named dt in one dataframe, month in the other:

df %>% select(dt, avg_temp) %>% filter(year(dt) > 1949) %>%
  inner_join(monthly_1997_2015, by = c("dt" = "month"))

## # A tibble: 705,510 × 3
## dt avg_temp mean_temp
## 1 1997-01-01 -0.742 -3.580645
## 2 1997-02-01 2.771 3.392857
## 3 1997-03-01 4.089 6.064516
## 4 1997-04-01 5.984 6.033333
## 5 1997-05-01 10.408 13.064516
## 6 1997-06-01 16.208 15.766667
## 7 1997-07-01 18.919 16.935484
## 8 1997-08-01 20.883 18.290323
## 9 1997-09-01 13.920 13.533333
## 10 1997-10-01 7.711 7.516129
## # ... with 705,500 more rows

As we see, average temperatures obtained for the same month differ a lot from each other. Evidently, the methods of averaging used (by us and by Berkeley Earth) were very different. We will have to use every dataset separately for exploration and inference.

Set operations

Having looked at joins, on to set operations. The set operations known from SQL can be performed using dplyr’s intersect(), union(), and setdiff() functions. For example, let’s combine the Munich weather data from before 2016 and from 2016 in one data frame:

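# daily_2016: the 2016 daily Munich weather data, read in with read_csv()
union(daily_1997_2015, daily_2016) %>% arrange(day)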

## # A tibble: 7,195 × 23
## day max_temp mean_temp min_temp dew mean_dew min_dew max_hum
## 1 1997-01-01 -8 -12 -16 -13 -14 -17 92
## 2 1997-01-02 0 -8 -16 -9 -13 -18 92
## 3 1997-01-03 -4 -6 -7 -6 -8 -9 93
## 4 1997-01-04 -3 -4 -5 -5 -6 -6 93
## 5 1997-01-05 -1 -3 -6 -4 -5 -7 100
## 6 1997-01-06 -2 -3 -4 -4 -5 -6 93
## 7 1997-01-07 0 -4 -9 -6 -9 -10 93
## 8 1997-01-08 0 -3 -7 -7 -7 -8 100
## 9 1997-01-09 0 -3 -6 -5 -6 -7 100
## 10 1997-01-10 -3 -4 -5 -4 -5 -6 100
## # ... with 7,185 more rows, and 15 more variables: mean_hum ,
## # min_hum , max_hpa , mean_hpa , min_hpa ,
## # max_visib , mean_visib , min_visib , max_wind ,
## # mean_wind , max_gust , prep , cloud ,
## # events , winddir

Window (AKA analytic) functions

Joins, set operations, that’s pretty cool to have but that’s not all. Additionally, a large number of analytic functions are available in dplyr. We have the familiar-from-SQL ranking functions (e.g., dense_rank(), row_number(), ntile(), and cume_dist()):

# 5% hottest days
filter(daily_2016, cume_dist(desc(mean_temp)) < 0.05) %>% select(day, mean_temp)

## # A tibble: 5 × 2
## day mean_temp
## 1 2016-06-24 22
## 2 2016-06-25 22
## 3 2016-07-09 22
## 4 2016-07-11 24
## 5 2016-07-30 22

# days with the 3 lowest mean temperatures (ties included)
filter(daily_2016, dense_rank(mean_temp) <= 3) %>% select(day, mean_temp) %>% arrange(mean_temp)

## # A tibble: 4 × 2
## day mean_temp
## 1 2016-01-22 -10
## 2 2016-01-19 -8
## 3 2016-01-18 -7
## 4 2016-01-20 -7

We have lead() and lag():

# consecutive days where mean temperature changed by more than 5 degrees:
daily_2016 %>% mutate(yesterday_temp = lag(mean_temp)) %>% filter(abs(yesterday_temp - mean_temp) > 5) %>% select(day, mean_temp, yesterday_temp)

## # A tibble: 6 × 3
## day mean_temp yesterday_temp
## 1 2016-02-01 10 4
## 2 2016-02-21 11 3
## 3 2016-06-26 16 22
## 4 2016-07-12 18 24
## 5 2016-08-05 14 21
## 6 2016-08-13 19 13

We also have lots of aggregation functions that, where already provided in base R, come with enhancements in dplyr, such as choosing the column that dictates accumulation order. New in dplyr is, e.g., cummean(), the cumulative mean:

daily_2016 %>% mutate(cum_mean_temp = cummean(mean_temp)) %>% select(day, mean_temp, cum_mean_temp)

## # A tibble: 260 × 3
## day mean_temp cum_mean_temp
## 1 2016-01-01 2 2.0000000
## 2 2016-01-02 -1 0.5000000
## 3 2016-01-03 -2 -0.3333333
## 4 2016-01-04 0 -0.2500000
## 5 2016-01-05 2 0.2000000
## 6 2016-01-06 2 0.5000000
## 7 2016-01-07 3 0.8571429
## 8 2016-01-08 4 1.2500000
## 9 2016-01-09 4 1.5555556
## 10 2016-01-10 3 1.7000000
## # ... with 250 more rows

OK. Wrapping up so far, dplyr should make it easy to do data manipulation if you’re used to SQL. So why not just use SQL, what can we do in R that we couldn’t do before?


Visualization

Well, one thing R excels at is visualization. First and foremost, there is ggplot2, Hadley Wickham’s famous plotting package, the realization of a “grammar of graphics”. ggplot2 predates the tidyverse, but became part of it once it came to life. We can use ggplot2 to plot the average monthly temperatures from Berkeley Earth for selected cities and time ranges, like this:

library(ggthemes)  # provides theme_solarized()
cities <- c("Munich", "Bern", "Oslo")
df_cities <- df %>% filter(city %in% cities, year(dt) > 1949, !is.na(avg_temp))
(p_1950 <- ggplot(df_cities, aes(dt, avg_temp, color = city)) + geom_point() + xlab("") + ylab("avg monthly temp") + theme_solarized())

Average monthly temperatures for three selected cities

While this plot is two-dimensional (with axes time and temperature), a third “dimension” is added via the color aesthetic (aes (…, color = city)).

We can easily reuse the same plot, zooming in on a shorter time frame:

start_time <- as.Date("1992-01-01")
end_time <- as.Date("2013-08-01")
limits <- c(start_time,end_time)
(p_1992 <- p_1950 + (scale_x_date(limits=limits)))

The same plot, but zooming in on a shorter time period

It seems like overall, Bern is warmest, Oslo is coldest, and Munich is in the middle somewhere.
We can add smoothing lines to see this more clearly (by default, confidence intervals would also be displayed, but I’m suppressing them here so as to show the three lines more clearly):

(p_1992 <- p_1992 + geom_smooth(se = FALSE))

Adding smoothing lines to more clearly see the trend

Good. Now that we have these lines, can we rely on them to obtain a trend for the temperature? Because that is, ultimately, what we want to find out about.
From here on, we’re zooming in on Munich. Let’s display that trend line for Munich again, this time with the 95% confidence interval added:

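# assumed definition of p_munich_1950, mirroring p_1950 but restricted to Munich
p_munich_1950 <- ggplot(filter(df_cities, city == "Munich"), aes(dt, avg_temp)) + geom_point() + xlab("") + ylab("avg monthly temp") + theme_solarized()
p_munich_1992 <- p_munich_1950 + scale_x_date(limits = limits)
p_munich_1992 + stat_smooth()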

Smoothing line with confidence interval added

Calling stat_smooth() without specifying a smoothing method uses Local Polynomial Regression Fitting (LOESS). However, we could as well use another smoothing method, for example, we could fit a line using lm(). Let’s compare them both:

loess <- p_munich_1992 + stat_smooth(method = "loess", colour = "red") + labs(title = 'loess')
lm <- p_munich_1992 + stat_smooth(method = "lm", color = "green") + labs(title = 'lm')
grid.arrange(loess, lm, ncol=2)  # grid.arrange() comes from the gridExtra package

Comparing two different smoothing methods (LOESS vs. LM)

The two fits behave quite differently, especially as regards the shape of the confidence interval near the end (and beginning) of the time range. If we want to form an opinion regarding a possible trend, we will have to do more than just look at the graphs – time to do some time series analysis!
Given this post has become quite long already, we’ll continue in the next one. So, how about next winter? Stay tuned 🙂

Sentiment Analysis of Movie Reviews (3): doc2vec

This is the last – for now – installment of my mini-series on sentiment analysis of the Stanford collection of IMDB reviews.
So far, we’ve had a look at classical bag-of-words models and word vectors (word2vec).
We saw that of the classifiers used, logistic regression performed best, be it in combination with bag-of-words or word2vec.
We also saw that while the word2vec model did in fact model semantic dimensions, it was less successful for classification than bag-of-words, and we explained that by the averaging of word vectors we had to perform to obtain input features on review (not word) level.
So the question now is: How would distributed representations perform if we did not have to throw away information by averaging word vectors?

Document vectors: doc2vec

Shortly after word2vec, Le and Mikolov developed paragraph (document) vector models.
The basic models are

  • Distributed Memory Model of Paragraph Vectors (PV-DM) and
  • Distributed Bag of Words (PV-DBOW)

In PV-DM, in addition to the word vectors, there is a paragraph vector that keeps track of the whole document:

Fig.1: Distributed Memory Model of Paragraph Vectors (PV-DM) (from: Distributed Representations of Sentences and Documents)

With distributed bag-of-words (PV-DBOW), there aren’t even any word vectors; there’s just a paragraph vector trained to predict the context:

Fig.2: Distributed Bag of Words (PV-DBOW) (from: Distributed Representations of Sentences and Documents)

Like word2vec, doc2vec in Python is provided by the gensim library. Please see the gensim doc2vec tutorial for example usage and configuration.

doc2vec: performance on sentiment analysis task

I’ve trained 3 models, with parameter settings as in the above-mentioned doc2vec tutorial: 2 distributed memory models (with word & paragraph vectors averaged or concatenated, respectively), and one distributed bag-of-words model. Here, without further ado, are the results. I’m just reporting results for logistic regression as, again, this was the best-performing classifier:

                                                  test vectors inferred   test vectors from model
Distributed memory, vectors averaged (dm/m)                        0.81                      0.87
Distributed memory, vectors concatenated (dm/c)                    0.80                      0.82
Distributed bag of words (dbow)                                    0.90                      0.90

Hoorah! We’ve finally beaten bag-of-words … but only by a tiny little 0.1 percent, and we won’t even ask if that’s significant 😉
What should we conclude from that? In my opinion, there’s no reason to be sarcastic here (even if you might have thought I’d made it sound like that ;-)). With doc2vec, we’ve (at least) reached bag-of-words performance for classification, plus we now have semantic dimensions at our disposal. Speaking of which – let’s check what doc2vec thinks is similar to awesome/awful. Will the results be equivalent to those had with word2vec?
These are the words found most similar to awesome (note: the model we’re asking isn’t the one that performed best with logistic regression (PV-DBOW), as distributed bag-of-words doesn’t train word vectors; the results are instead obtained from the best-performing PV-DM model):

model.most_similar('awesome', topn=10)

[(u'incredible', 0.9011116027832031),
(u'excellent', 0.8860622644424438),
(u'outstanding', 0.8797732591629028),
(u'exceptional', 0.8539372682571411),
(u'awful', 0.8104138970375061),
(u'astounding', 0.7750493884086609),
(u'alright', 0.7587056159973145),
(u'astonishing', 0.7556235790252686),
(u'extraordinary', 0.743841290473938)]

So, what we see is very similar to the output of word2vec – including the inclusion of awful. Same for what’s judged similar to awful:

model.most_similar('awful', topn=10)

[(u'abysmal', 0.8371909856796265),
(u'appalling', 0.8327066898345947),
(u'atrocious', 0.8309577703475952),
(u'horrible', 0.8192445039749146),
(u'terrible', 0.8124841451644897),
(u'awesome', 0.8104138970375061),
(u'dreadful', 0.8072893023490906),
(u'horrendous', 0.7981990575790405),
(u'amazing', 0.7926105260848999), 
(u'incredible', 0.7852109670639038)]

To sum up – for now – we’ve explored how three models (bag-of-words, word2vec, and doc2vec) perform on sentiment analysis of IMDB movie reviews, in combination with different classifiers, the most successful of which was logistic regression. Very similar performance (around 10% error rate) was reached by bag-of-words and doc2vec.
From this you may of course conclude that as of today, there’s no reason not to stick with the straightforward bag-of-words approach. But you could also view this differently. While word2vec appeared in 2013, it was succeeded by doc2vec already in 2014. Now it’s 2016, and things have happened in the meantime, are happening at present, right now. It’s a fascinating field, and even if – in sentiment analysis – we don’t see impressive output yet, impressive output is quite likely to appear sooner or later. I’m curious what we’re going to see!

Sentiment Analysis of Movie Reviews (2): word2vec

This is the continuation of my mini-series on sentiment analysis of movie reviews. Last time, we had a look at how well classical bag-of-words models worked for classification of the Stanford collection of IMDB reviews. As it turned out, the “winner” was Logistic Regression, using both unigrams and bigrams for classification. The best classification accuracy obtained was .89 – not bad at all for sentiment analysis (but see the caveat regarding what’s good or bad in the first post).

Bag-of-words: limitations

So, bag-of-words models may be surprisingly successful, but they are limited in what they can do. First and foremost, with bag-of-words models, words are encoded using one-hot-encoding. One-hot-encoding means that each word is represented by a vector, of length the size of the vocabulary, where exactly one bit is “on” (1). Say we wanted to encode that famous quote by Hadley Wickham

Tidy datasets are all alike but every messy dataset is messy in its own way.

It would look like this:

    alike  all  but  dataset  every  in  is  its  messy  own  tidy  way
 0      0    0    0        0      0   0   0    0      0    0     1    0
 1      0    0    0        1      0   0   0    0      0    0     0    0
 2      0    0    0        0      0   0   1    0      0    0     0    0
 3      0    1    0        0      0   0   0    0      0    0     0    0
 4      1    0    0        0      0   0   0    0      0    0     0    0
 5      0    0    1        0      0   0   0    0      0    0     0    0
 6      0    0    0        0      1   0   0    0      0    0     0    0
 7      0    0    0        0      0   0   0    0      1    0     0    0
 8      0    0    0        0      0   1   0    0      0    0     0    0
 9      0    0    0        0      0   0   0    1      0    0     0    0
10      0    0    0        0      0   0   0    0      0    1     0    0
11      0    0    0        0      0   0   0    0      0    0     0    1

Now the thing is – with this representation, all words are basically equidistant from each other!
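To make the equidistance claim concrete, here is a minimal sketch in plain Python. It assumes the same normalised vocabulary as the table above; the helper names (one_hot, euclidean) are made up for illustration and do not come from any library.

```python
import math
from itertools import combinations

# Vocabulary of the example sentence, sorted as in the table above.
vocab = sorted(["tidy", "dataset", "is", "all", "alike", "but",
                "every", "messy", "in", "its", "own", "way"])

def one_hot(word):
    """Return a vector of len(vocab) with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Collect the distances between all pairs of distinct words.
distances = {euclidean(one_hot(a), one_hot(b)) for a, b in combinations(vocab, 2)}
print(distances)  # a single value: every pair is sqrt(2) apart
```

Whichever two words we pick, the distance is the same, so the encoding alone carries no information about which words are related.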
If we want to find similarities between words, we have to look at a corpus of texts, build a co-occurrence matrix and perform dimensionality reduction (using, e.g., singular value decomposition).
Let’s take a second example sentence, like this one Lev Tolstoj wrote after having read Hadley 😉

Happy families are all alike; every unhappy family is unhappy in its own way.

This is what the beginning of a co-occurrence matrix would look like for the two sentences:

         tidy  dataset  is  all  alike  but  every  messy  in  its  own  way  happy  family  unhappy
tidy        0        2   2    1      1    1      1      2   1    1    1    1      0       0        0
dataset     2        0   2    1      1    1      1      2   1    1    1    1      0       0        0
is          1        2   1    1      1    1      1      2   1    1    1    1      1       2        1

As you can imagine, in reality such a co-occurrence matrix will quickly become very, very big. And after dimensionality reduction, we will have lost a lot of information about individual words.
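As a rough illustration of how such counts come about and how quickly the matrix grows, here is a small sketch in plain Python. The counting scheme is an assumption (every pair of different words within the same sentence counts once per pair of positions), so it will not necessarily reproduce the exact numbers in the table above; the tokenisation is done by hand.

```python
from collections import Counter
from itertools import permutations

# The two example sentences, tokenised and normalised by hand
# (assumption: "datasets" -> "dataset", "are" -> "is", "families" -> "family").
sentences = [
    "tidy dataset is all alike but every messy dataset is messy in its own way".split(),
    "happy family is all alike every unhappy family is unhappy in its own way".split(),
]

cooc = Counter()
for tokens in sentences:
    # Count each ordered pair of positions holding two different words.
    for i, j in permutations(range(len(tokens)), 2):
        if tokens[i] != tokens[j]:
            cooc[(tokens[i], tokens[j])] += 1

vocab = sorted({w for s in sentences for w in s})
print(len(vocab), "words ->", len(vocab) ** 2, "matrix cells")
print(cooc[("tidy", "dataset")], cooc[("is", "family")])
```

The number of cells grows with the square of the vocabulary size, which is why a realistic corpus needs dimensionality reduction (not shown here) before the matrix becomes usable.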
So the question is, is there anything else we can do?

As it turns out, there is. Words do not have to be represented as one-hot vectors. Using so-called distributed representations, a word can be represented as a vector of (say 100, 200, … whatever works best) real numbers.
And as we will see, with this representation, it is possible to model semantic relationships between words!

Before I proceed, an aside: In this post, my focus is on walking through a real example, and showing real data exploration. I’ll just pick two of the most famous recent models and work with them, but this doesn’t mean that there aren’t others (there are!), and also there’s a whole history of using distributed representations for NLP. For an overview, I’d recommend Sebastian Ruder’s series on the topic, starting with the first part on the history of word embeddings.


word2vec

The first model I’ll use is the famous word2vec developed by Mikolov et al. 2013 (see Efficient estimation of word representations in vector space).
word2vec arrives at word vectors by training a neural network to predict

  • a word in the center from its surroundings (continuous bag of words, CBOW), or
  • a word’s surroundings from the center word(!) (skip-gram model)

This is easiest to understand looking at figures from the Mikolov et al. article cited above:

Fig.1: Continuous Bag of Words (from: Efficient estimation of word representations in vector space)


Fig.2: Skip-Gram (from: Efficient estimation of word representations in vector space)
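The difference between the two training setups can also be sketched in a few lines of plain Python. This only shows how inputs and prediction targets are paired up, not the neural network itself; the function name training_pairs and the window size of 2 are assumptions made for illustration.

```python
def training_pairs(tokens, window=2):
    """Yield (context_words, center_word) for every position in tokens."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, center

tokens = "every unhappy family is unhappy in its own way".split()

# CBOW: the whole context is the input, the center word is the target.
cbow = [(context, center) for context, center in training_pairs(tokens)]

# Skip-gram: the center word is the input, each context word is a separate target.
skipgram = [(center, c) for context, center in training_pairs(tokens) for c in context]

print(cbow[2])
print(skipgram[:3])
```

For the same text, skip-gram produces many more (input, target) pairs than CBOW, since every context word becomes a training example of its own.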

Now, to see what is meant by “model semantic relationships”, have a look at the following figure:

Fig. 3: Semantic-Syntactic Word Relationship test (from: Efficient estimation of word representations in vector space)

So basically what these models can do (with much higher-than-random accuracy) is answer questions like “Athens is to Greece what Oslo is to ???”. Or: “walking” is to “walked” what “swimming” is to ???
Fascinating, right?
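Questions of this kind are typically answered with vector arithmetic plus a similarity measure. Here is a toy sketch in plain Python; the 2-dimensional vectors are made up by hand purely for illustration (a trained model would have learned them), and analogy() is a hypothetical helper, not a gensim function.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up vectors: first coordinate ~ "is a country", second ~ "which country".
vectors = {
    "greece":    [1.0, 1.0],
    "athens":    [0.0, 1.0],
    "norway":    [1.0, 3.0],
    "oslo":      [0.0, 3.0],
    "sweden":    [1.0, 5.0],
    "stockholm": [0.0, 5.0],
}

def analogy(a, b, c):
    """Answer 'a is to b what c is to ?' via the nearest neighbour of b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("athens", "greece", "oslo"))
```

With these hand-picked numbers the offset between a capital and its country is the same for every pair, which is the kind of regularity the semantic-syntactic test probes for in learned vectors.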
Let’s try this on our dataset.

Word embeddings for the IMDB dataset

In Python, word2vec is available through the gensim NLP library. There is a very nice tutorial on how to use word2vec written by the gensim folks, so I’ll jump right in and present the results of using word2vec on the IMDB dataset.

I’ve trained a CBOW model, with a context size of 20, and a vector size of 100. Now let’s explore our model! Now that we have a notion of similarity for our words, let’s first ask: Which are the words most similar to awesome and awful, respectively? Start with awesome:

model.most_similar('awesome', topn=10)

[(u'amazing', 0.7929322123527527),
(u'incredible', 0.7127916812896729),
(u'awful', 0.7072071433067322),
(u'excellent', 0.6961393356323242),
(u'fantastic', 0.6925109624862671),
(u'alright', 0.6886886358261108),
(u'cool', 0.679090142250061),
(u'outstanding', 0.6213874816894531),
(u'astounding', 0.613292932510376),
(u'terrific', 0.6013768911361694)]

Interesting! This makes a lot of sense: So amazing is most similar, then we have words like excellent and outstanding.
But wait! What is awful doing there? I don’t really have a satisfying explanation. The most straightforward explanation would be that both occur in similar positions, and it is positions that the network learns. But then I’m still wondering why it is just awful appearing in this list, not any other positive word. Another important aspect might be emotional intensity. No doubt both words are very similar in intensity. (I’ve also checked co-occurrences, but those do not yield a simple explanation either.) Anyway, let’s keep an eye on it when further exploring the model. Of course we have to also keep in mind that for training a word2vec model, normally much bigger datasets are used.

These are the words most similar to awful:

model.most_similar('awful', topn=10)

[(u'terrible', 0.8212785124778748),
(u'horrible', 0.7955455183982849),
(u'atrocious', 0.7824822664260864),
(u'dreadful', 0.7722172737121582),
(u'appalling', 0.7244443893432617),
(u'horrendous', 0.7235419154167175),
(u'abysmal', 0.720653235912323),
(u'amazing', 0.708114743232727),
(u'awesome', 0.7072070837020874),
(u'bad', 0.6963905096054077)]

Again, this makes a lot of sense! And here we have the awesome/awful relationship again. On a close look, there’s also amazing appearing in the “awful list”.

Out of curiosity, I’ve tried to “subtract out” awful from the “awesome list”. This is what I ended up with:

model.most_similar(positive=['awesome'], negative=['awful'])

[(u'jolly', 0.3947059214115143),
(u'midget', 0.38988131284713745),
(u'knight', 0.3789686858654022),
(u'spooky', 0.36937469244003296),
(u'nice', 0.3680706322193146),
(u'looney', 0.3676275610923767),
(u'ho', 0.3594890832901001),
(u'gotham', 0.35877227783203125),
(u'lookalike', 0.3579031229019165),
(u'devilish', 0.35554438829421997)]

Funny … We seem to have entered some realm of fantasy story … But I’ll refrain from interpreting this result any further 😉

Now, I don’t want you to get the impression the model just did some crazy stuff and that’s all. Overall, the similar words I’ve checked make a lot of sense, and also the “what doesn’t match” is answered quite well, in the area we’re interested in (sentiment). Look at this:

model.doesnt_match("good bad awful terrible".split())
model.doesnt_match("awesome bad awful terrible".split())
model.doesnt_match("nice pleasant fine excellent".split())

In the last case, I wanted the model to sort out the outlier on the intensity dimension (excellent), and that is what it did!

Exploring word vectors is fun, but we need to get back to our classification task. Now that we have word vectors, how do we use them for sentiment analysis?
We need one vector per review, not per word. What is done most often is to average vectors, and when we input these averaged vectors to our classification algorithms (the same as in the last post), this is the result we get (listing the bag-of-words results again, too, for easy comparison):

Algorithm                Bag of words   word2vec
Logistic Regression      0.89           0.83
Support Vector Machine   0.84           0.70
Random Forest            0.84           0.80
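The averaging step mentioned above can be sketched in a few lines. This is a minimal illustration, assuming a plain dictionary of invented 3-dimensional vectors in place of the trained word2vec model; review_vector is a hypothetical helper name, and skipping unknown words is my assumption, not necessarily what the notebook does.

```python
import numpy as np

# Invented 3-dimensional word vectors, standing in for a trained word2vec model.
word_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "film": np.array([0.0, 1.0, 0.0]),
    "dreadful": np.array([-1.0, 2.0, 1.0]),
}


def review_vector(tokens, word_vectors, dim=3):
    """Average the vectors of all known tokens; unknown tokens are skipped."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return np.zeros(dim)  # no known word at all: fall back to the zero vector
    return np.mean(known, axis=0)


vec = review_vector("a good film but dreadful acting".split(), word_vectors)
empty = review_vector("nothing known here".split(), word_vectors)
```

Note how word order and negation disappear completely in the mean: every review collapses into a single point, however long it is.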

Cool as our word2vec model is, it actually performs worse on classification than bag-of-words. Thinking about it, this is not so surprising, given that by averaging over word vectors, we lose a whole lot of information – information we spent quite some calculation effort to obtain before… What if we already had vectors for a complete review, and didn’t have to average? Enter … doc2vec – document (or paragraph) vectors! But this is already the topic for the next post … stay tuned 🙂

Sentiment Analysis of Movie Reviews (1): Word Count Models

Imagine I show you a book review. Imagine I hide the number of stars – all you get to see is the review text. And now I’m asking you: that review, is it good or bad? Just two categories, good or bad. That’s easy, right?
Well, it should be easy, for humans (although depending on the input there can be lots of disagreement between humans, too.) But if you want to do it automatically, it turns out to be surprisingly difficult.

This is the start of a short series on sentiment analysis, based on my TechEvent presentation. My focus will be more on data exploration than on achieving the best possible accuracy; more on getting a feeling for the difficulties than on juggling with parameters; more on the Natural Language Processing (NLP) aspect than on generic classification. And even though the series will be short (for now – given time constraints ;-)), it’s definitely a topic to be pursued (looking at the rapidity of developments in the field).

Let’s jump right in. So why would one do sentiment analysis? Because out there, not every relevant text comes labeled as “good” or “bad”. Take emails, blog posts, support tickets. Is that guy actually angry at us (our service desk/team/company)? Is she disappointed by our product? We’d like to know.

So while sentiment analysis is undoubtedly useful, the quality of the results will rely on having a big enough training set – and someone will have to sit down and categorize all those tweets/reviews/whatever. (Datasets are available where labeling of the training set was done automatically, see e.g. the Stanford Sentiment140 dataset, but this approach must induce biases, to say the least.) In our case, we care more about how things work than about the actual accuracies; still, keep in mind, when looking at the accuracy numbers, that especially for the models discussed in later posts, a bigger dataset might achieve much better performance.

The data

Our dataset consists of 25,000 labeled training reviews, plus 25,000 test reviews (also labeled). In both the training and the test set, 12,500 reviews have been rated positive and 12,500 negative by human annotators.
The dataset was originally used in Maas et al. (2011), Learning Word Vectors for Sentiment Analysis.
Preprocessing follows the example of the gensim doc2vec notebook (we will describe doc2vec in a later post).

Good or bad?

Let’s load the preprocessed data and have a look at the very first training review. (For better readability, I’ll try not to clutter this text with too much code, so there’ll only be code when there’s a special point in showing it. For more code, see the notebook for the original talk.)

a reasonable effort is summary for this film . a good sixties film but lacking any sense of achievement . maggie smith gave a decent performance which was believable enough but not as good as she could have given , other actors were just dreadful ! a terrible portrayal . it wasn't very funny and so it didn't really achieve its genres as it wasn't particularly funny and it wasn't dramatic . the only genre achieved to a satisfactory level was romance . target audiences were not hit and the movie sent out confusing messages . a very basic plot and a very basic storyline were not pulled off or performed at all well and people were left confused as to why the film wasn't as good and who the target audiences were etc . however maggie was quite good and the storyline was alright with moments of capability . 4 . \n

Looking at this text, we already see complexity emerging. As a human reader, I’m sure you’ll say this is a negative review, and undoubtedly there are some clearly negative words (“dreadful”, “confusing”, “terrible”). But to a high degree, negativity comes from negated positive words: “lacking achievement”, “wasn’t very funny”, “not as good as she could have given”. So clearly we cannot just look at single words in isolation, but at sequences of words – n-grams (bigrams, trigrams, …) as they say in natural language processing.


The question is though, at how many consecutive words should we look? Let’s step through an example. “Funny” (unigram) is positive, “very funny” (bigram) even more so. “Not very funny” (trigram) is negative. If it were “not so very funny” we’d need 4-grams … How about “I didn’t think it was so very funny”? And this could probably go on like that… So how many adjacent words do we need to consider? There evidently is no clear border… how can we decide? Fortunately, we won’t have to decide upfront. We’ll do that as part of our search for the optimal classification algorithm.
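To see what unigrams, bigrams and trigrams look like in practice, here is a minimal sketch in plain Python; the ngrams helper is a hypothetical name chosen for illustration, not a function from any library.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "it was not very funny".split()

unigrams = ngrams(tokens, 1)  # single words
bigrams = ngrams(tokens, 2)   # pairs of adjacent words
trigrams = ngrams(tokens, 3)  # triples of adjacent words
```

Only the trigrams contain “not very funny” as one feature; with unigrams alone, all the classifier sees is “not”, “very” and “funny” as three unrelated counts.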

So in general, how can automatic sentiment analysis work? The simplest approach is via word counts. Basically, we count the positive words, giving them different weights according to how positive they are. Same with the negative words. And then, the side with the highest score “wins”.

But – no-one’s gonna sit there and categorize all those words! The algorithm has to figure that out itself. How can it do that? Via the labeled training samples. For them, we have the sentiment as well as the information how often each word occurred, e.g., like this:

review     sentiment  beautiful  bad  awful  decent  horrible  ok  awesome
review 1   0          0          1    2      1       1         0   0
review 2   1          1          0    0      0       0         0   1
review 3   0          0          0    0      1       1         0   0

From this, the algorithm can determine the words’ polarities and weights, and arrive at something like:

word     beautiful  bad   awful  decent  horrible  ok    awesome
weight   3.4        -2.9  -5.6   -0.2    -4.9      -0.1  5.2
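With weights like these, scoring a review is just a weighted sum of its word counts. Here is a minimal sketch using the illustrative weights and the first two example reviews from above; the score function and the “sign of the sum” decision rule are simplifications of mine (a real logistic regression would add an intercept and squash the sum into a probability).

```python
# Illustrative weights from the example above.
weights = {
    "beautiful": 3.4, "bad": -2.9, "awful": -5.6, "decent": -0.2,
    "horrible": -4.9, "ok": -0.1, "awesome": 5.2,
}


def score(counts, weights):
    """Weighted sum of word counts; words without a weight contribute nothing."""
    return sum(weights.get(word, 0.0) * n for word, n in counts.items())


# Word counts of the first two example reviews (zero counts omitted).
review_1 = {"bad": 1, "awful": 2, "decent": 1, "horrible": 1}
review_2 = {"beautiful": 1, "awesome": 1}

score_1 = score(review_1, weights)  # the negative side wins
score_2 = score(review_2, weights)  # the positive side wins
```

Both toy reviews come out on the side their sentiment label says they should.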

Now, what we can do is run a grid search over combinations of

  • classification algorithms,
  • parameters for those algorithms (algorithm-dependent),
  • different ngrams,

and record the combinations that work best on the test set.
Algorithms included in the search were logistic regression (with different settings for regularization), support vector machines, and random forests (with different settings for the maximum tree depth). In that way, both linear and non-linear procedures were present. All aforementioned combinations of algorithms and parameters were tried with unigrams, unigrams + bigrams, and unigrams + bigrams + trigrams as features.
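For the curious, such a search could look roughly like the following scikit-learn sketch. Mind the assumptions: the tiny corpus is invented, the concrete classes (LinearSVC for the support vector machine) and parameter values are placeholders rather than the settings of the original notebook, and GridSearchCV picks the winner by cross-validation on the training data instead of by accuracy on a separate test set.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny invented corpus, only there to make the sketch runnable.
texts = [
    "very funny and good", "not funny at all", "a good film", "not good , dreadful",
    "wonderful and funny", "dreadful , not worth it", "good and wonderful", "not wonderful at all",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ("counts", CountVectorizer()),                # word / n-gram counts
    ("clf", LogisticRegression(max_iter=1000)),   # placeholder, replaced by the grid
])

# unigrams, unigrams + bigrams, unigrams + bigrams + trigrams
ngram_ranges = [(1, 1), (1, 2), (1, 3)]

# One sub-grid per algorithm, each with its own algorithm-dependent parameters.
param_grid = [
    {"counts__ngram_range": ngram_ranges,
     "clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1.0, 10.0]},
    {"counts__ngram_range": ngram_ranges,
     "clf": [LinearSVC()], "clf__C": [0.1, 1.0, 10.0]},
    {"counts__ngram_range": ngram_ranges,
     "clf": [RandomForestClassifier(n_estimators=20, random_state=0)],
     "clf__max_depth": [5, None]},
]

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
best = search.best_params_  # winning combination of algorithm, parameters, n-grams
```

On real data one would of course use more folds and a wider grid; the point here is only how algorithms, their parameters and the n-gram range are searched together.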

And so after long computations, the winner is … wait! I didn’t yet say anything about stopword removal. Without filtering, the most frequent words in this dataset are stopwords to a high degree, so we will definitely want to remove noise. The Python nltk library provides a stopword list, but this contains words like ‘not’, ‘nor’, ‘no’, ‘wasn’, ‘ain’, etc., words that we definitely do NOT want to remove when doing sentiment analysis. So I’ve used a subset of the nltk list where I’ve removed all negations / negated forms.
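In code, the idea is simply a set difference. This is a sketch with a small hand-made word list standing in for the full nltk stopword list; the names are mine.

```python
# A small illustrative stopword list (NOT the full nltk list).
stopwords = {"the", "a", "is", "it", "and", "was", "not", "no", "nor", "wasn", "ain"}

# Negations / negated forms we want to keep for sentiment analysis.
negations = {"not", "no", "nor", "wasn", "ain"}

sentiment_stopwords = stopwords - negations


def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]


tokens = "it was not a good film".split()
naive = remove_stopwords(tokens, stopwords)               # negation is lost
filtered = remove_stopwords(tokens, sentiment_stopwords)  # negation survives
```

With the unmodified list, “it was not a good film” shrinks to “good film”, which is exactly the opposite of what the reviewer said.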

Here, then, are the accuracies obtained on the test set. For each classifier, I’m displaying the one with the most successful parameter settings (without detailing them here, in order not to distract from the main topic of the post) and the most successful n-gram configuration.

with stopword filtering
no stopword filtering
Logistic Regression 0.89
Support Vector Machine 0.84
Random Forest 0.84

Overall, these accuracies look astonishingly good, given that in general, for sentiment analysis, something around 80% is seen as a to-be-expected value for accuracy. However, I find it difficult to talk about a to-be-expected value here: the accuracy achieved will very much depend on the dataset in question! So we really would need to know, for the exact dataset used, what accuracies have been achieved by other algorithms, and most importantly: what is the agreement between human annotators here? If humans agree 100% on whether items of a dataset are positive or negative, then 80% accuracy for a classifier sounds rather bad! But if agreement between humans is only 85%, the picture is totally different. (And then there’s a totally different angle, extremely important but not the focus of this post: say we achieve 90% accuracy where others achieve 80% and humans agree in 90% of cases. Technically we’re doing great! But we’re still misclassifying one in ten texts! Depending on why we’re doing this at all, and what automated action we’re planning to take based on the results, getting one in ten wrong might turn out to be catastrophic!)

Having said that, I find the results interesting for two reasons. For one, logistic regression, a linear classifier, does best here. This just confirms something that is often seen in machine learning: logistic regression is a simple but very powerful algorithm. Secondly, the best logistic regression result was reached when including bigrams as features, whereas trigrams did not bring any further improvement. A great thing about logistic regression is that you can peek into the classifier’s brain and see what features it decided are important, by looking at the coefficients. Let’s inspect what words make a review positive. The most positive features, in order:

index  coef      word
2969   0.672635  excellent
6681   0.563958  perfect
9816   0.521026  wonderful
8646   0.520818  superb
3165   0.505146  favorite
431    0.502118  amazing
5923   0.481505  must see
5214   0.461807  loved
3632   0.458645  funniest
2798   0.453481  enjoyable

Pretty much makes sense, doesn’t it? And we do see a bigram among these: “must see”. How about other bigrams contributing to the plus side?

index  coef      word
5923   0.481505  must see
3      0.450675  10 10
6350   0.421314  one best
9701   0.389081  well worth
5452   0.371277  may not
6139   0.329485  not bad
6970   0.323805  pretty good
2259   0.307238  definitely worth
5208   0.303380  love movie
9432   0.301404  very good

These mostly make a lot of sense, too. How about words / n-grams that make a review negative? First, the “overall ranking” – last one is worst:

index  coef       word
6864   -0.564446  poor
2625   -0.565503  dull
9855   -0.575060  worse
4267   -0.588133  horrible
2439   -0.596302  disappointing
6866   -0.675187  poorly
1045   -0.681608  boring
2440   -0.688024  disappointment
702    -0.811184  awful
9607   -0.838195  waste

So we see that worst of all is when it’s a waste of time. I can agree with that!
Now, this time, there are no bigrams among the 10 worst ranked features. Let’s look at them in isolation:

index  coef       word
6431   -0.247169  only good
3151   -0.250090  fast forward
9861   -0.264564  worst movie
6201   -0.324169  not recommend
6153   -0.332796  not even
6164   -0.333147  not funny
6217   -0.357056  not very
6169   -0.368976  not good
6421   -0.437750  one worst
9609   -0.451138  waste time

Evidently, it was worth keeping the negations! So, the way this classifier works pretty much makes sense, and we seem to have reached acceptable accuracy (I hesitate to write this because … what is acceptable depends on … see above). If we take a less simple approach – move away from basically just counting (weighted) words, where every word is a one-hot-encoded vector – can we do any better?
With that cliffhanger, I end for today … stay tuned for the continuation, where we dive into the fascinating world of word vectors … 🙂 See you there!

Sentiment Analysis of Movie Reviews (talk)

Yesterday at Tech Event, which was great as always, I presented on sentiment analysis, using movie reviews as the example. I intend to write a little series of blog posts on this, but as I’m not sure when exactly I’ll get to this, here are the pdf version and a link to the notebook.

The focus was not on the classification algorithms per se (treating text as just another domain for classification), but on the difficulties emerging from this being language: Can it really work to look at just single words? Do I need bigrams? Trigrams? More? Can we tackle the complexity using word embeddings – word vectors? Or better, paragraph vectors?

I had a lot of fun exploring this topic, and I really hope to write some posts on this – stay tuned 🙂

Doing Data Science

This is another untechnical post – actually, it’s even a personal one. If you’ve been to this blog lately, you may have noticed something, that is, the absence of something – the absence of posts over nearly six months …
I’ve been busy catching up, reading up, immersing myself in things I’ve been interested in for a long time – but which I never imagined could have a real relation to, let alone make part of, my professional life.
But recently, things changed. I’ll try to be doing “for real” what would have seemed just a dream a year ago: data science, machine learning, applied statistics (Bayesian statistics, preferably).

Doing Data Science … why?

Well, while this may look like quite a change of interests to a reader of this blog, it really is not. I’ve been interested in statistics, probability and “data mining” (as it was called at the time) long before I even wound up in IT. Actually, I have a diploma in psychology, and I’ve never studied computer science (which of course I’ve often regretted for having missed so many fascinating things).
Sure, at that time, in machine learning, much of the interesting stuff was there, too. Neural nets were there, of course. But that was before the age of big data and the boost distributed computing brought to machine learning, before craftsman-like “data mining” became sexy “data science”…
Those were the olden days, when statistics (in psychology, at least) was (1) ANOVA, (2) ANOVA, and (3) … you name it. Whereas today, students (if they are lucky) might be learning statistics from a book like Richard McElreath’s “Statistical Rethinking”.
That was before the advent of deep learning, which fundamentally changed not just what seems possible but also the way it is approached. Take natural language processing, for example (just check out the materials for Stanford’s Deep Learning for Natural Language Processing course for a great introduction).
While I’m at it … where some people see it as “machine learning versus statistics”, or “machine learning instead of statistics”, for me there’s no antagonism there. Perhaps that’s because of my bio. For me, some of the books I admire most – especially the very well-written, very accessible ISLR (Introduction to Statistical Learning) and its big brother, Elements of Statistical Learning – are the perfect synthesis.
Returning to the original topic – I’ve even wondered whether I should start a new blog on machine learning and data science, to avoid people asking the above question (you know, the “why data science” one). But then, your bio is something you can never undo; all you can do is change the narrative, try to make the narrative work. The narrative works fine for me; I hope I’ve made it plausible to the outside world, too 😉 .
(BTW I’m lucky with the blog title I chose a few years ago – no need to change that.)
And probably, it doesn’t hurt for a data scientist to know how to get data from databases, how to manipulate it in various programming languages, and quite a bit about IT architectures behind.
OK, that was the justification. The other question now is …

Doing Data Science …how?

Well, luckily, I’m not isolated at all with these interests at Trivadis. We’ve already had a strong focus on big data and streaming analytics for quite some time (just see the blog of my colleague Guido, who is an internationally renowned expert on these topics), but now there’s additionally a highly motivated group of data scientists ready to turn data into insight 🙂
If you’re reading this you might be a potential customer, so I can’t finish without a sales pitch:

It’s not about the tools. It’s not about the programming languages you use (though some make it easier than others, and I decidedly like the friendly and inspiring, open source Python and R ecosystems). It’s about discovering patterns, detecting underlying structure, uncovering the unknown. About finding out what you didn’t (necessarily) hypothesize, before. And most importantly: about assessing if what you found is valid and will generalize, to the future, to different conditions. If what you found is “real”. There’s a lot more to it than looking at extrapolated forecast curves.

Before I end the sales pitch, let me say that in addition to our consulting services we also offer courses on getting started with Data Science, using either R or Python (see Data Science with Python and Advanced Analytics with R). The two courses complement each other well, as they work with different data sets and build up their own narratives.
OK, I think that’s it, for narratives. Upcoming posts will be technical again, just this time technical will mostly mean: on machine learning and data science.