Yesterday, the Munich datageeks Data Day took place. It was a totally fun event – great to see how much is going on, data-science-wise, in and around Munich, and how many people are interested in the topic! (By the way, I think that more than half the talks were about deep learning!)

I also had a talk, “Time series shootout: ARIMA vs. LSTM” (slides on RPubs, github).

Whatever the title, it was really about showing a systematic comparison of forecasting using ARIMA and LSTM, on synthetic as well as real datasets. I find it amazing how little is needed to get a very decent result with LSTM – how little data, how little hyperparameter tuning, how few training epochs.

Of course, it gets most interesting when we look at datasets where ARIMA has problems, as with multiple seasonality. I have such an example in the talk (in fact, it’s the main climax ;-)), but it’s definitely also an interesting direction for further experiments.

Welcome to part 2 of my “R for SQListas” series. Last time, it was all about how to get started with R if you’re a SQL girl (or guy)- and that basically meant an introduction to Hadley Wickham’s dplyr and the tidyverse. The logic being: Don’t fear, it’s not that different from what you’re used to.
This (and upcoming) times it will be about the other side of the coin – if R was “basically just like SQL”, why not stick with SQL in the first place?
So now, it’s about things you cannot do with SQL, things R excels at – those things you’re learning R for :-). Remember in the last post, I said I was interested in future developments of weather/climate, and we explored the Kaggle Earth dataset (as well as another one, daily data measured at weather station Munich airport)? In this post, we’ll finally try to find out what’s going to happen to future winters. We’ll go beyond adding trend lines to measurements, and do some real time series analysis!

Inspecting the time series

First, we create a time series object from the dataframe and plot it – time series objects have their own plot() methods:
start_time <- as.Date("1992-01-01")
ts_1950 <- ts(df_munich$avg_temp, start = c(1950,1), end=c(2013,8), frequency = 12)

Time series decomposition

Now, let’s decompose the time series into its components: trend, seasonal effect, and remainder. We clearly expect there to be seasonality – the influence of the month we’re in should be clearly visible – but as stated before we’re mostly interested in the trend.

The third row is the trend. Basically, there seems to be no trend – no long-term changes in the temperature level. However, by default, the trend displayed is rather „wiggly“.

We can experiment with different settings for the smoothing window of the trend. Let‘s use two different degrees of smoothing, both more „flattening“ than the default:

From these decompositions, it does not seem like there’s a significant trend. Let’s see if we can corroborate the visual impression by some statistical data. Let’s forecast the weather!
We will use two prominent approaches in time series modeling/forecasting: exponential smoothing and ARIMA.

Exponential smoothing

With exponential smoothing, the value at each point in time is basically seen as a weighted average, where more distant points weigh less and nearer points weigh more. In the simplest realization, a value at time t(n+1) is modeled as weighted average of the value at time t and the incoming observation at time t(n+1):

More complex models exist that factor in trends and seasonal effects.
For our case of a model with both trend and seasonal effects, the Holt-Winters exponential smoothing method generates point forecasts. Equivalently (conceptually that is, not implementation-wise; see http://robjhyndman.com/hyndsight/estimation2/), we can use the State Space Model (http://www.exponentialsmoothing.net/) that additionally generates prediction intervals.

The State Space Model is implemented in R by the ets() function in the forecast package. When we call ets() without any parameters, the method will determine a suitable model using maximum likelihood estimation. Let’s see the model chosen by ets():

fit <- ets(ts_1950)
summary(fit)

## ETS(A,N,A)
##
## Call:
## ets(y = ts_1950)
##
## Smoothing parameters:
## alpha = 0.0202
## gamma = 1e-04
##
## Initial states:
## l = 5.3364
## s=-8.3652 -4.6693 0.7114 5.4325 8.8662 9.3076
## 7.3288 4.1447 -0.677 -4.3463 -8.2749 -9.4586
##
## sigma: 1.7217
##
## AIC AICc BIC
## 5932.003 5932.644 6001.581
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.02606379 1.72165 1.345745 22.85555 103.3956 0.7208272
## ACF1
## Training set 0.09684748

The model chosen does not contain a smoothing parameter for the trend (beta) – in fact, it is an A,N,A model, which is the acronym for Additive errors, No trend, Additive seasonal effects.
Let’s inspect the decomposition corresponding to this model – there is no trend line here:

plot(fit)

Now, let’s forecast the next 36 months!

plot(forecast(fit, h=36))

The forecasts look rather „shrunken to the mean“, and the prediction intervals – dark grey indicates the 95%, light grey the 80% prediction interval – seem rather narrow. Indeed, narrow prediction intervals are often a problem with time series, because there are many sources of error that aren’t factored in the model (see http://robjhyndman.com/hyndsight/narrow-pi/).

ARIMA

The second approach to forecasting we will use is ARIMA. With ARIMA, there are three important concepts:

AR(p): If a process is autoregressive, future values are obtained as linear combinations of past values, i.e.

where e is white noise. A process that uses values from up to p time points back is called an AR(p) process.

I(d):This refers to the number of times differencing (= subtraction of consecutive values) has to be applied to obtain a stationary series.
If a time series yt is stationary, then for all s, the distribution of (y(t),…, y(t+s)) does not depend on t. Stationarity can be determined visually, inspecting the autocorrelation and partial autocorrelation functions, as well as using statistical tests.

MA(q): An MA(q) process combines past forecast errors from time points up to p times back to forecast the current value:

While we can feed R’s Arima() function with our assumptions about the parameters (from data exploration or prior knowledge), there’s also auto.arima() which will determine the parameters for us.
Before calling auto.arima() though, let’s inspect the autocorrelation properties of our series. We clearly expect there to be autocorrelation – for one, adjacent months are similar in temperature, and second, we have similar temperatures every 12 months.

tsdisplay(ts_1950)

So we clearly see that temperatures for adjacent months are positively correlated, while months in „opposite“ seasons are negatively correlated. The partial autocorrelations (where correlations at lower lags are controlled for) are quite strong, too. Does this mean the series is non-stationary?
Not really. A seasonal series of temperatures can be seen as a cyclostationary process (https://en.wikipedia.org/wiki/Cyclostationary_process) , where mean and variance are constant for seasonally corresponding measurements. We can check for stationarity using a statistical test, too:

adf.test(ts_1950)

##
## Augmented Dickey-Fuller Test
##
## data: ts_1950
## Dickey-Fuller = -11.121, Lag order = 9, p-value = 0.01
## alternative hypothesis: stationary

According to the Augmented Dickey-Fuller test, the null hypothesis onf non-stationarity should be rejected.
So now let’s run auto.arima on our time series.

fit <- auto.arima(ts_1950)
summary(fit)

## Series: ts_1950
## ARIMA(1,0,5)(0,0,2)[12] with non-zero mean
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4 ma5 sma1 sma2
## 0.6895 -0.0794 -0.0408 -0.1266 -0.3003 -0.2461 0.3972 0.3555
## s.e. 0.0409 0.0535 0.0346 0.0389 0.0370 0.0347 0.0413 0.0342
## intercept
## 5.1667
## s.e. 0.1137
##
## sigma^2 estimated as 7.242: log likelihood=-1838.47
## AIC=3696.94 AICc=3697.24 BIC=3743.33
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.001132392 2.675279 2.121494 23.30576 116.9652 1.136345
## ACF1
## Training set 0.03740769

The model chosen by auto.arima is ARIMA(1,0,5)(0,0,2)[12] where the first triple of parameters refers to the non-seasonal part of ARIMA, the second to the seasonal one, and the subscript designates seasonality (12 in our case). So in both parts, no differences are applied (d=0). The non-seasonal part has an autoregressive and a moving average component, the seasonal one is moving average only.

Now, let’s get forecasting! Well, not so fast. ARIMA forecasts are based on the assumption that the residuals (errors) are uncorrelated and normally distributed. Let’s check this:

res <- fit$residuals
acf(res)

While normality of the errors is not a problem here, the ACF does not look good:

Clearly, the errors are not uncorrelated over time. We can improve auto.arima performance (at the cost of prolonged runtime) by allowing for a higher maximum number of parameters (max.order, which by default equals 5) and setting stepwise=FALSE:

fit <- auto.arima(ts_1950, max.order = 10, stepwise=FALSE)
summary(fit)

## Series: ts_1950
## ARIMA(5,0,1)(0,0,2)[12] with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 sma1 sma2
## 0.8397 -0.0588 -0.0691 -0.2087 -0.2071 -0.4538 0.0869 0.1422
## s.e. 0.0531 0.0516 0.0471 0.0462 0.0436 0.0437 0.0368 0.0371
## intercept
## 5.1709
## s.e. 0.0701
##
## sigma^2 estimated as 4.182: log likelihood=-1629.11
## AIC=3278.22 AICc=3278.51 BIC=3324.61
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.003793625 2.033009 1.607367 -0.2079903 110.0913 0.8609609
## ACF1
## Training set -0.0228192

The less constrained model indeed performs better (judging by AIC, which drops from to 3696 to 3278). Autocorrelation of errors also is reduced overall.

Now, with the improved models, let’s finally get forecasting!

fit <- auto.arima(ts_1997, stepwise=FALSE, max.order = 10)
summary(fit)

Comparing this forecast with that from exponential smoothing, we see that exponential smoothing seems to deal better with smaller training samples, as well as with longer extrapolation periods.
Now we could go on and use still more sophisticated methods, or use hybrid models that combine different forecast methods, analogously to random forests, or even use deep learning – but I’ll leave that for another time 🙂 Thanks for reading!

This is the 2-part blog version of a talk I’ve given at DOAG Conference this week. I’ve also uploaded the slides (no ppt; just pretty R presentation 😉 ) to the articles section, but if you’d like a little text I’m encouraging you to read on. That is, if you’re in the target group for this post/talk.
For this post, let me assume you’re a SQL girl (or guy). With SQL you’re comfortable (an expert, probably), you know how to get and manipulate your data, no nesting of subselects has you scared ;-). And now there’s this R language people are talking about, and it can do so many things they say, so you’d like to make use of it too – so now does this mean you have to start from scratch and learn – not only a new language, but a whole new paradigm? Turns out … ok. So that’s the context for this post.

Let’s talk about the weather

So in this post, I’d like to show you how nice R is to use if you come from SQL. But this isn’t going to be a syntax-only post. We’ll be looking at real datasets and trying to answer a real question.
Personally I’m very interested in how the weather’s going to develop in the future, especially in the nearer future, and especially regarding the area where I live (I know. It’s egocentric.). Specifically, what worries me are warm winters, and I’ll be clutching to any straw that tells me it’s not going to get warmer still 😉
So I’ve downloaded / prepared two datasets, both climate / weather-related. The first is the average global temperatures dataset from the Berkeley Earth Surface Temperature Study, nicely packaged by Kaggle (a website for data science competitions; https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data). This contains measurements from 1743 on, up till 2013. The monthly averages have been obtained using sophisticated scientific procedures available on the Berkeley Earth website (http://berkeleyearth.org/).
The second is daily weather data for Munich, obtained from http://www.wunderground.com. This dataset was retrieved manually, and the period was chosen so as to not contain too many missing values. The measurements range from 1997 to 2015, and have been aggregated by taking a monthly average.
Let’s start our journey through R land, reading in and looking at the beginning of the first dataset:
library(tidyverse)
library(lubridate)
df <- read_csv('data/GlobalLandTemperaturesByCity.csv')
head(df)
df <- read_csv('data/GlobalLandTemperaturesByCity.csv')
head(df)

## 1 1743-11-01 6.068 1.737 Århus
## 2 1743-12-01 NA NA Århus
## 3 1744-01-01 NA NA Århus
## 4 1744-02-01 NA NA Århus
## 5 1744-03-01 NA NA Århus
## 6 1744-04-01 5.788 3.624 Århus
## # ... with 3 more variables: Country , Latitude ,
## # Longitude

Now we’d like to explore the dataset. With SQL, this is easy: We use WHERE to filter rows, SELECT to select columns, GROUP BY to aggregate by one or more variables…And of course, we often need to JOIN tables, and sometimes, perform set operations. Then there’s all kinds of analytic functions, such as LAG() and LEAD(). How do we do all this in R?

Entering the tidyverse

Luckily for the SQLista, writing elegant, functional, and often rather SQL-like code in R is easy. All we need to do is … enter the tidyverse. Actually, we’ve already entered it – doing library(tidyverse) – and used it to read in our csv file (read_csv)!
The tidyverse is a set of packages, developed by Hadley Wickham, Chief Scientist at Rstudio, designed to make working with R easier and more consistent (and more fun). We load data from files using readr, clean up datasets that are not in third normal form using tidyr, manipulate data with dplyr, and plot them with ggplot2.
For our task of data exploration, it is dplyr we need. Before we even begin, let’s rename the columns so they have shorter names:
df <- rename(df, avg_temp = AverageTemperature, avg_temp_95p = AverageTemperatureUncertainty, city = City, country = Country, lat = Latitude, long = Longitude)
head(df)

## # A tibble: 6 × 7
## dt avg_temp avg_temp_95p city country lat long
##
## 1 1743-11-01 6.068 1.737 Århus Denmark 57.05N 10.33E
## 2 1743-12-01 NA NA Århus Denmark 57.05N 10.33E
## 3 1744-01-01 NA NA Århus Denmark 57.05N 10.33E
## 4 1744-02-01 NA NA Århus Denmark 57.05N 10.33E
## 5 1744-03-01 NA NA Århus Denmark 57.05N 10.33E
## 6 1744-04-01 5.788 3.624 Århus Denmark 57.05N 10.33E

distinct() (SELECT DISTINCT)

Good. Now that we have this new dataset containing temperature measurements, really the first thing we want to know is: What locations (countries, cities) do we have measurements for?
To find out, just do distinct():
distinct(df, country)

## # A tibble: 159 × 1
## country
##
## 1 Denmark
## 2 Turkey
## 3 Kazakhstan
## 4 China
## 5 Spain
## 6 Germany
## 7 Nigeria
## 8 Iran
## 9 Russia
## 10 Canada
## # ... with 149 more rows

distinct(df, city)

## # A tibble: 3,448 × 1
## city
##
## 1 Århus
## 2 Çorlu
## 3 Çorum
## 4 Öskemen
## 5 Ürümqi
## 6 A Coruña
## 7 Aachen
## 8 Aalborg
## 9 Aba
## 10 Abadan
## # ... with 3,438 more rows

filter() (WHERE)

OK. Now as I said I’m really first and foremost curious about measurements from Munich, so I’ll have to restrict the rows. In SQL I’d need a WHERE clause, in R the equivalent is filter():
filter(df, city == 'Munich')
## # A tibble: 3,239 × 7
## dt avg_temp avg_temp_95p city country lat long
##
## 1 1743-11-01 1.323 1.783 Munich Germany 47.42N 10.66E
## 2 1743-12-01 NA NA Munich Germany 47.42N 10.66E
## 3 1744-01-01 NA NA Munich Germany 47.42N 10.66E
## 4 1744-02-01 NA NA Munich Germany 47.42N 10.66E
## 5 1744-03-01 NA NA Munich Germany 47.42N 10.66E
## 6 1744-04-01 5.498 2.267 Munich Germany 47.42N 10.66E
## 7 1744-05-01 7.918 1.603 Munich Germany 47.42N 10.66E

This is how we combine conditions if we have more than one of them in a where clause:
# AND
filter(df, city == 'Munich', year(dt) > 2000)
## # A tibble: 153 × 7
## dt avg_temp avg_temp_95p city country lat long
##
## 1 2001-01-01 -3.162 0.396 Munich Germany 47.42N 10.66E
## 2 2001-02-01 -1.221 0.755 Munich Germany 47.42N 10.66E
## 3 2001-03-01 3.165 0.512 Munich Germany 47.42N 10.66E
## 4 2001-04-01 3.132 0.329 Munich Germany 47.42N 10.66E
## 5 2001-05-01 11.961 0.150 Munich Germany 47.42N 10.66E
## 6 2001-06-01 11.468 0.377 Munich Germany 47.42N 10.66E
## 7 2001-07-01 15.037 0.316 Munich Germany 47.42N 10.66E
## 8 2001-08-01 15.761 0.325 Munich Germany 47.42N 10.66E
## 9 2001-09-01 7.897 0.420 Munich Germany 47.42N 10.66E
## 10 2001-10-01 9.361 0.252 Munich Germany 47.42N 10.66E
## # ... with 143 more rows

# OR
filter(df, city == 'Munich' | year(dt) > 2000)

Now, often we don’t want to see all the columns/variables. In SQL we SELECT what we’re interested in, and it’s select() in R, too: select(filter(df, city == 'Munich'), avg_temp, avg_temp_95p)

## # A tibble: 3,239 × 2
## avg_temp avg_temp_95p
##
## 1 1.323 1.783
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 5.498 2.267
## 7 7.918 1.603
## 8 11.070 1.584
## 9 12.935 1.653
## 10 NA NA
## # ... with 3,229 more rows

arrange() (ORDER BY)

How about ordered output? This can be done using arrange():
arrange(select(filter(df, city == 'Munich'), dt, avg_temp), avg_temp)

Do you think this is starting to get difficult to read? What if we add FILTER and GROUP BY operations to this query? Fortunately, with dplyr it is possible to avoid paren hell as well as stepwise assignment using the pipe operator, %>%.

Meet: %>% – the pipe

The pipe transforms an expression of form x %>% f(y) into f(x, y) and so, allows us write the above operation like this:
df %>% filter(city == 'Munich') %>% select(dt, avg_temp) %>% arrange(avg_temp)

This looks a lot like the fluent API design popular in some object oriented languages, or the bind operator, >>=, in Haskell.
It also looks a lot more like SQL. However, keep in mind that while SQL is declarative, the order of operations matters when you use the pipe (as the name says, the output of one operation is piped to another). You cannot, for example, write this (trying to emulate SQL‘s SELECT – WHERE – ORDER BY ): df %>% select(dt, avg_temp) %>% filter(city == ‘Munich’) %>% arrange(avg_temp). This can’t work because after a new dataframe has been returned from the select, the column city is not longer available.

arrange() (GROUP BY)

Now that we’ve introduced the pipe, on to group by. This is achieved in dplyr using group_by() (for grouping, obviously) and summarise() for aggregation.
Let’s find the countries we have most – and least, respectively – records for:
# most records
df %>% group_by(country) %>% summarise(count=n()) %>% arrange(count %>% desc())

## # A tibble: 159 × 2
## country count
##
## 1 India 1014906
## 2 China 827802
## 3 United States 687289
## 4 Brazil 475580
## 5 Russia 461234
## 6 Japan 358669
## 7 Indonesia 323255
## 8 Germany 262359
## 9 United Kingdom 220252
## 10 Mexico 209560
## # ... with 149 more rows

# least records
df %>% group_by(country) %>% summarise(count=n()) %>% arrange(count)

How about finding the average, minimum and maximum temperatures per month, looking at just records from Germany, and that originate after 1949?
df %>% filter(country == 'Germany', !is.na(avg_temp), year(dt) > 1949) %>% group_by(month(dt)) %>% summarise(count = n(), avg = mean(avg_temp), min = min(avg_temp), max = max(avg_temp))

In this way, aggregation queries can be written that are powerful and very readable at the same time. So at this point, we know how to do basic selects with filtering and grouping. How about joins?

JOINs

Dplyr provides inner_join(), left_join(), right_join() and full_join() operations, as well as semi_join() and anti_join(). From the SQL viewpoint, these work exactly as expected.
To demonstrate a join, we’ll now load the second dataset, containing daily weather data for Munich, and aggregate it by month:
daily_1997_2015 % summarise(mean_temp = mean(mean_temp))
monthly_1997_2015

Fine. Now let’s join the two datasets on the date column (their respective keys), telling R that this column is named dt in one dataframe, month in the other:
df % select(dt, avg_temp) %>% filter(year(dt) > 1949)
df %>% inner_join(monthly_1997_2015, by = c("dt" = "month"), suffix )

As we see, average temperatures obtained for the same month differ a lot from each other. Evidently, the methods of averaging used (by us and by Berkeley Earth) were very different. We will have to use every dataset separately for exploration and inference.

Set operations

Having looked at joins, on to set operations. The set operations known from SQL can be performed using dplyr’s intersect(), union(), and setdiff() methods. For example, let’s combine the Munich weather data from before 2016 and from 2016 in one data frame:
daily_2016 % arrange(day)

Joins, set operations, that’s pretty cool to have but that’s not all. Additionally, a large number of analytic functions are available in dplyr. We have the familiar-from-SQL ranking functions (e.g., dense_rank(), row_number(), ntile(), and cume_dist()):
# 5% hottest days
filter(daily_2016, cume_dist(desc(mean_temp)) % select(day, mean_temp)

We have lead() and lag():
# consecutive days where mean temperature changed by more than 5 degrees:
daily_2016 %>% mutate(yesterday_temp = lag(mean_temp)) %>% filter(abs(yesterday_temp - mean_temp) > 5) %>% select(day, mean_temp, yesterday_temp)

We also have lots of aggregation functions that, if already provided in base R, come with enhancements in dplyr. Such as, choosing the column that dictates accumulation order. New in dplyr is e.g., cummean(), the cumulative mean:
daily_2016 %>% mutate(cum_mean_temp = cummean(mean_temp)) %>% select(day, mean_temp, cum_mean_temp)

OK. Wrapping up so far, dplyr should make it easy to do data manipulation if you’re used to SQL. So why not just use SQL, what can we do in R that we couldn’t do before?

Visualization

Well, one thing R excels at is visualization. First and foremost, there is ggplot2, Hadley Wickham‘s famous plotting package, the realization of a “grammar of graphics”. ggplot2 predates the tidyverse, but became part of it once it came to life. We can use ggplot2 to plot the average monthly temperatures from Berkeley Earth for selected cities and time ranges, like this:
cities = c("Munich", "Bern", "Oslo")
df_cities % filter(city %in% cities, year(dt) > 1949, !is.na(avg_temp))
(p_1950 <- ggplot(df_cities, aes(dt, avg_temp, color = city)) + geom_point() + xlab("") + ylab("avg monthly temp") + theme_solarized())

While this plot is two-dimensional (with axes time and temperature), a third “dimension” is added via the color aesthetic (aes (…, color = city)).

We can easily reuse the same plot, zooming in on a shorter time frame:
start_time <- as.Date("1992-01-01")
end_time <- as.Date("2013-08-01")
limits <- c(start_time,end_time)
(p_1992 <- p_1950 + (scale_x_date(limits=limits)))

It seems like overall, Bern is warmest, Oslo is coldest, and Munich is in the middle somewhere.
We can add smoothing lines to see this more clearly (by default, confidence intervals would also be displayed, but I’m suppressing them here so as to show the three lines more clearly):
(p_1992 <- p_1992 + geom_smooth(se = FALSE))

Good. Now that we have these lines, can we rely on them to obtain a trend for the temperature? Because that is, ultimately, what we want to find out about.
From here on, we’re zooming in on Munich. Let’s display that trend line for Munich again, this time with the 95% confidence interval added:
p_munich_1992 <- p_munich_1950 + (scale_x_date(limits=limits))
p_munich_1992 + stat_smooth()

Calling stat_smooth() without specifying a smoothing method uses Local Polynomial Regression Fitting (LOESS). However, we could as well use another smoothing method, for example, we could fit a line using lm(). Let’s compare them both:
loess <- p_munich_1992 + stat_smooth(method = "loess", colour = "red") + labs(title = 'loess')
lm <- p_munich_1992 + stat_smooth(method = "lm", color = "green") + labs(title = 'lm')
grid.arrange(loess, lm, ncol=2) (p_1992 <- p_1950 + (scale_x_date(limits=limits)))

Both fits behave quite differently, especially as regards the shape of the confidence interval near the end (and beginning) of the time range. If we want to form an opinion regarding a possible trend, we will have to do more than just look at the graphs – time to do some time series analysis!
Given this post has become quite long already, we’ll continue in the next – so how about next winter? Stay tuned 🙂

This is another untechnical post – actually, it’s even a personal one. If you’ve been to this blog lately, you may have noticed something, that is, the absence of something – the absence of posts over nearly six months …
I’ve been busy catching up, reading up, immersing myself in things I’ve been interested in for a long time – but which I never imagined could have a real relation to, let alone make part of, my professional life.
But recently, things changed. I’ll try to be doing “for real” what would have seemed just a dream a year ago: data science, machine learning, applied statistics (Bayesian statistics, preferredly).

Doing Data Science … why?

Well, while this may look like quite a change of interests to a reader of this blog, it really is not. I’ve been interested in statistics, probability and “data mining” (as it was called at the time) long before I even wound up in IT. Actually, I have a diploma in psychology, and I’ve never studied computer science (which of course I’ve often regretted for having missed so many fascinating things).
Sure, at that time, in machine learning, much of the interesting stuff was there, too. Neural nets were there, of course. But that was before the age of big data and the boost distributed computing brought to machine learning, before craftsman-like “data mining” became sexy “data science”…
Those were the olden days, when statistics (in psychology, at least), was (1) ANOVA, (2) ANOVA, and (3) … you name it. Whereas today, students (if they are lucky) might be learning statistics from a book like Richard McElreath’s “Statistical Rethinking” (http://xcelab.net/rm/statistical-rethinking/).
That was before the advent of deep learning, which fundamentally changed not just what seems possible but also the way it is approached. Take natural language processing, for example (just check out the materials for Stanford’s Deep Learning for Natural Language Processing course for a great introduction).
While I’m at it … where some people see it as “machine learning versus statistics”, or “machine learning instead of statistics”, for me there’s no antagonism there. Perhaps that’s because of my bio. For me, some of the books I admire most – especially the very well-written, very accessible ISLR – Introduction to Statistical Learning – and its big brother, Elements of Statistical Learning, – are the perfect synthesis.
Returning to the original topic – I’ve even wondered should I start a new blog on machine learning and data science, to avoid people asking the above question (you know, the why data science one, above). But then, your bio is something you can never undo, – all you can do is change the narrative, try to make the narrative work. The narrative works fine for me, I hope I’ve made it plausible to the outside world, too 😉 .
(BTW I’m lucky with the blog title I chose, a few years ago – no need to change that (see https://en.wikipedia.org/wiki/Markov_chain))
And probably, it doesn’t hurt for a data scientist to know how to get data from databases, how to manipulate it in various programming languages, and quite a bit about IT architectures behind.
OK, that was the justification. The other question now is …

Doing Data Science …how?

Well, luckily, I’m not isolated at all with these interests at Trivadis. We’ve already had a strong focus on big data and streaming analytics for quite some time (just see my colleague Guido’s blog who is an internationally renowned expert on these topics), but now additionally there’s a highly motivated group of data scientists ready to turn data into insight 🙂 ).
If you’re reading this you might be a potential customer, so I can’t finish without a sales pitch:

It’s not about the tools. It’s not about the programming languages you use (though some make it easier than others, and I decidedly like the friendly and inspiring, open source Python and R ecosystems). It’s about discovering patterns, detecting underlying structure, uncovering the unknown. About finding out what you didn’t (necessarily) hypothesize, before. And most importantly: about assessing if what you found is valid and will generalize, to the future, to different conditions. If what you found is “real”. There’s a lot more to it than looking at extrapolated forecast curves.

Before I end the sales pitch, let me say that in addition to our consulting services we also offer courses on getting started with Data Science, using either R or Python (see Data Science with Python and Advanced Analytics with R). Both courses are a perfect combination as they work with different data sets and build up their own narratives.
OK, I think that’s it, for narratives. Upcoming posts will be technical again, just this time technical will mostly mean: on machine learning and data science.

This is my third and last post covering my presentations from last week’s Tech Event, and it’s no doubt the craziest one :-).
Again I had the honour to be invited by my colleague Ludovico Caldara to participate in the cool “Oracle Database Lightning Talks” session. Now, there is a connection to Oracle here, but mainly, as you’ll see, it’s about a Reverend in a Casino.

So, imagine one of your customers calls you. They are going to use Oracle Transparent Data Encryption (TDE) in their main production OLTP database. They are in a bit of panic. Will performance get worse? Much worse?

Well, you can help. You perform tests using Swingbench Order Entry benchmark, and soon you can assure them: No problem. The below chart shows average response times for one of the transaction types of Order Entry, using either no encryption or TDE with one of AES-128 or AES-256. Wait a moment – doesn’t it even look as though response times are even lower with TDE!?

Well, they might be, or they might not be … the Swingbench results include how often a transaction was performed, and the standard deviation … so how about we ask somebody who knows something about statistics?

Cool, but … as you can see, for example, in this xkcd comic, there is no such thing as an average statistician … In short, there’s frequentist statistics and there’s Bayesian, and while there is a whole lot you can say about the respective philosophical backgrounds it mainly boils down to one thing: Bayesians take into account not just the actual data – the sample – but also the prior probability.

In short, Bayes Theorem says that the probability of a hypothesis, given my measured data (POSTERIOR), is equal to the the probability of the data, given the hypothesis is correct (LIKELIHOOD), times the prior probability of the hypothesis (PRIOR), divided by the overall probability of the data (EVIDENCE). (The evidence may be neglected if we’re interested in proportionality only, not equality.)

So … why should we care? Well, from what we know how TDE works, it cannot possibly make things faster! In the best case, we’re servicing all requests from the buffer cache and so, do not have to decrypt the data. Then we shouldn’t incur any performance loss. But I cannot imagine how TDE could cause a performance boost.

So we go to our colleague the Bayesian statistian, give him our data and tell him about our prior beliefs.(Prior belief sounds subjective? It is, but prior assumptions are out in the open, to be questioned, discussed and justified.).

Now harmless as Bayes theorem looks, in practice it may be difficult to compute the posterior probability. Fortunately, there is a solution:Markov Chain Monte Carlo (MCMC). MCMC is a method to obtain the parameters of the posterior distribution not by calculating them directly, but by sampling, performing a random walk along the posterior distribution.

We assume our data is gamma distributed, the gamma distribution generally being adequate for response time data (for motivation and background, see chapter 3.5 of Analyzing Computer System Performance with Perl::PDQ by Neil Gunther). Using R, JAGS (Just Another Gibbs Sampler), and rjags, we (oh, our colleague the statistian I meant, of course) go to work and create a model for the likelihood and the prior.
We have two groups, TDE and “no encryption”. For both, we define a gamma likelihood function, each having its own shape and rate parameters. (Shape and rate parameters can easily be calculated from mean and standard deviation.)

model {
for ( i in 1:Ntotal ) {
y[i] ~ dgamma( sh[x[i]] , ra[x[i]] )
}
sh[1] <- pow(m[1],2) / pow(sd[1],2)
sh[2] <- pow(m[2],2) / pow(sd[2],2)
ra[1] <- m[1] / pow(sd[1],2)
ra[2] <- m[2] / pow(sd[2],2
[...]

Second, we define prior distributions on the means and standard deviations of the likelihood functions:

As we are convinced that TDE cannot possibly make it run faster, we assume that the means of both likelihood functions are distributed according to prior distributions with the same mean. This mean is calculated from the data, averaging over all response times from both groups. As a consequence, in the above code, priorSha1 = priorSha2 and priorRa1 = priorRa2. On the other hand, we have no prior assumptions about the likelihood functions’ deviations, so we model them as uniformly distributed, choosing noninformative priors.

Now we start our sampler. Faites vos jeux … rien ne va plus … and what do we get?

Here we see the outcome of our random walks, the posterior distributions of means for the two conditions. Clearly, the modes are different. But what to conclude? The means of the two data sets were different too, right?

What we need to look at is the distribution of the difference of both parameters. What we see here is that the 95% highest density interval (HDI) [3] of the posterior distribution of the difference ranges from -0.69 to +1.89, and thus, includes 0. (HDI is a concept similar, but superior to the classical frequentist confidence interval, as it is composed of the 95% values with the highest probability.)

From this we conclude that statistically, our observed difference of means is not significant – what a shame. Wouldn’t it have been nice if we could speed up our app using TDE 😉

Recently, for a customer, I conducted performance tests comparing performance under TDE tablespace encryption to a baseline. While I won’t present the results of those tests here I will describe two test series I ran in my lab environment.
So you’re an application owner – or a cautious DBA who wants to avoid trouble 😉 – and you want to know, is there an overhead if you use TDE tablespace encryption? (I think there’s some material available on this and I think it rather goes in the “no difference” direction. But it’s always nice to test for yourself, and you really need an answer – yes or no? ;-)).

For both test series, the design was the same: 2*3-factorial with factors

encryption: none, AES 128, AES 256, and

absence/presence of AES-NI kernel module (on suitable hardware)

So let’s see…

No there’s not!

The first test series were runs of the widely used Swingbench Order Entry Benchmark.
Measurement duration (excluding warm-up) per condition was 20 minutes, which was enough – in this isolated environment – to get reproducible results (as confirmed by sample).
The setup and environmental characteristics were the following:

Oracle 12.1.0.2 on OEL 7

Order Entry schema size 6G

SGA 3G

2 (v)CPUs

Data files on ASM

4 concurrent users

Let’s look at the results:

At first, it even looks like test runs using encryption had higher throughput. However, with standard deviations ranging between 25 and 30, for the single Order Entry transaction types, this must be attributed to chance. (Also, it would be quite difficult to argue why throughput would actually increase with TDE… let alone be highest when using an algorithm that requires more rounds of encryption…)
So from this test series, we would conclude that their is no performance impact of TDE tablespace encryption. But then, most of the time, at this ratio of schema size to buffer cache, blocks are mostly found in the cache:

Logical read (blocks) per transaction: 105.9

Physical read (blocks) per transaction: 2.6

Buffer Hit %: 97.53

Now with TDE tablespace encryption, blocks are stored in the SGA unencrypted, which means only a process that has to get the block from disk has to perform decryption.
Encryption, on the other hand, is taken care of by database writer, and so happens in the background anyway. So in this setup, it’s not astonishing at all not to find throughput differences due to absence/presence of encryption.

Yes there is…

A second type of test was run against the same database, on the same host. These were SQL Loader direct path loads of a 14G file, done serially. In this case, the server process has to read every line from disk, encrypt it, and write it to disk. This is what we see in this case:

Now suddenly we DO have a substantial impact of TDE. Loading the file and storing data to disk now takes ~1.5 times as long as without encryption. And we see – which was to be expected, too – that this increase in elapsed time is mostly due to increased cpu consumption.

By the way, now that we have such a clear effect, do we see usage of AES-NI speeding up the load? No:

This is surprising and probably something deserving further investigation, but not the topic of this post…

Conclusion

So what’s the point? Whether or not you will experience negative impact on performance will depend on your workload. No simple yes-or-no here… (but of course, no mystique either. It just means we have to phrase our questions a little bit more precisely.)

When creating a resource profile for a session or a group of sessions, one usually has to decide how to handle the so-called idle wait events, first and foremost, SQL*Net message from client.

One possibility would be to simply remove (physically or logically) those lines from the trace, thereby avoiding the question of whether some wait should be classified as client (= application) response time, or as think time.
Another would be be to define a threshold and exclude only those lines with wait times above the threshold. In that case, of course, I have to know what makes for a sensible threshold. I could base that threshold on network latencies but then, I might miss information about how the application behaves – what, from the database server’s point of view, is the response time of the application?

So here I want to try another approach, which is applying common outlier detection methods and find out how far that gets me ;-). My test material, this time, will be “real life”, non-artificial trace files, my tools will be the mining infrastructure based on external tables described in the latest three blog posts (starting with Mining SQL Trace Files Using External Tables Part 1: Getting the Timeline and, for ease of use, uploaded on github, and the outlier detection methods used will be four common ones described in http://d-scholarship.pitt.edu/7948/1/Seo.pdf.

Outlier detection methods

The four methods chosen were selected for being both commonly used as well as easily applicable. None of them are based on hypothesis testing as this would require knowledge of the underlying distribution.

Standard deviation method

One commonly used method is classifying as outliers those measurements whose distance from the mean is more than 2 or 3 standard deviations. We may already suspect that a method based on the normality assumption might not be the most effective here, but we’ll have a look anyway, be it only for establishing a baseline for the other methods.

Boxplot

As per Tukey’s boxplot method, values are seen as outliers when either below the first quartile minus 1.5* (3*, respectively) the interquartile range or above the third quartile plus 1.5 * (3*, respectively) the interquartile range (i.e., x < Q1 – 1.5[3]*IQR or x > Q3 + 1.5[3]*IQR).

MAD_{e}

Using the MAD_{e} method, values are classified as outliers when there distance from the median is more than 2 times (3 times, respectively) the median of absolute differences from the median, scaled by 1.483 (this scaled median of absolute differences is the so-called MAD_{e}).
I.e., x < median – 2[3] MAD_{e} or x > median – 2[3] MAD_{e}.

Median method

Using the median method, values are classified as outliers when they are below/above the median by more than n*IQR, when n is adjustable and chosen as 2.3 in the above mentioned paper.

Outlier detection methods applied

Inspecting the data

First, for all four trace files, let me indicate some summary statistics for the SQL*Net message from client event and present the data graphically.
All trace files have about 50.000 lines for this wait event, with minimum values of ~ 0.35-0.55 milliseconds, a median value of ~ 0.55-0.75 ms, an average of ~1.5-1.8 ms and a maximum ranging from 2.2 to 12 seconds approximately.

SQL> select min(elapsed) min, median(elapsed) median, trunc(avg(elapsed)) avg, trunc(stddev(elapsed)) stddev, max(elapsed) max, count(elapsed) cnt from wait where event='SQL*Net message from client';
-- file 1
MIN MEDIAN AVG STDDEV MAX CNT
---------- ---------- ---------- ---------- ---------- ----------
368 541 1606 33961 3005346 53124
-- file 2
MIN MEDIAN AVG STDDEV MAX CNT
---------- ---------- ---------- ---------- ---------- ----------
523 683 1843 33805 2269024 51485
-- file 3
MIN MEDIAN AVG STDDEV MAX CNT
---------- ---------- ---------- ---------- ---------- ----------
339 530 1530 68874 11794066 47152
-- file 4
MIN MEDIAN AVG STDDEV MAX CNT
---------- ---------- ---------- ---------- ---------- ----------
547 770 1482 27795 3306050 52608

The following diagrams should help visualize how the data is distributed. I’ve chosen a bucket size of 250 microseconds/0.25 milliseconds, which after some experimentation seemed to work best.
(This means that the numbers on the x axis have to be multiplied by 0.25 to give the elapsed time in microseconds).

For all methods, I’m interested in the upper threshold only. Now let’s see what the different methods yield.

Standard deviation method

with sd_data as
(select avg(elapsed) avg, stddev(elapsed) stddev from wait where event='SQL*Net message from client'),
bounds as
(select avg + 2*stddev upper_2sd, avg + 3*stddev upper_3sd from sd_data),
candidates_2sd as
(select elapsed from wait w join bounds b on (w.event='SQL*Net message from client' and w.elapsed > b.upper_2sd)),
candidates_3sd as
(select elapsed from wait w join bounds b on (w.event='SQL*Net message from client' and w.elapsed > b.upper_3sd))
select ' 2sd' "method", count(elapsed) "num_outliers", round(min(elapsed)/1000, 1) "smallest_outlier_ms", round(max(elapsed)/1000, 1) "max_value_ms" from candidates_2sd
union all
select ' 3sd', count(elapsed), round(min(elapsed)/1000, 1), round(max(elapsed)/1000, 1) from candidates_3sd;
-- file 1
method num_outliers smallest_outlier_ms max_value_ms
------ ------------ ------------------- -------------
2sd 64 75.2 3005.3
3sd 62 106.6 3005.3
-- file 2
method num_outliers smallest_outlier_ms max_value_ms
------ ------------ ------------------- ------------
2sd 88 70.5 2269
3sd 84 116.3 2269
-- file 3
method num_outliers smallest_outlier_ms max_value_ms
------ ------------ ------------------- ------------
2sd 31 145.9 11794.1
3sd 26 221.7 11794.1
-- file 4
method num_outliers smallest_outlier_ms max_value_ms
------ ------------ ------------------- ------------
2sd 64 62.4 3306.1
3sd 55 93 3306.1

Not surprisingly (given the non-normality of the data), the standard deviation method does not yield useful results. We’ll certainly want to classify as think time values well below 100 ms.

Boxplot

with iqr_data as
(select percentile_cont(0.25) within group (order by elapsed) q1, percentile_cont(0.75) within group (order by elapsed) q3, abs(percentile_cont(0.25) within group (order by elapsed) - percentile_cont(0.75) within group (order by elapsed)) iqr from wait where event='SQL*Net message from client'),
bounds as
(select q3 + 1.5 * iqr lower_bound, q3 + 3 * iqr upper_bound from iqr_data),
candidates_inner as
(select elapsed from wait w join bounds b on (w.event='SQL*Net message from client' and w.elapsed > b.lower_bound)),
candidates_outer as
(select elapsed from wait w join bounds b on (w.event='SQL*Net message from client' and w.elapsed > b.upper_bound))
select 'outside inner bound' "criterion", count(elapsed) "num_outliers", round(min(elapsed)/1000, 1) "smallest_outlier_ms", round(max(elapsed)/1000, 1) "max_value_ms" from candidates_inner
union all
select 'outside outer bound', count(elapsed), round(min(elapsed)/1000, 1), round(max(elapsed)/1000, 1) from candidates_outer
;
-- file 1
criterion num_outliers smallest_outlier_ms max_value_ms
------------------- ------------ ------------------- ------------
outside inner bound 5370 .9 3005.3
outside outer bound 3983 1.1 3005.3
-- file 2
criterion num_outliers smallest_outlier_ms max_value_ms
------------------- ------------ ------------------- ------------
outside inner bound 3869 1 2269
outside outer bound 2264 1.2 2269
-- file 3
criterion num_outliers smallest_outlier_ms max_value_ms
------------------- ------------ ------------------- ------------
outside inner bound 2026 .8 11794.1
outside outer bound 809 .9 11794.1
-- file 4
criterion num_outliers smallest_outlier_ms max_value_ms
------------------- ------------ ------------------- ------------
outside inner bound 3836 1.1 3306.1
outside outer bound 1418 1.3 3306.1

This looks better! If we choose the outer bound, waits have to exceed 1 ms to be classified as think time. However, looking at the diagrams, I’m not sure if this is not too low. Let’s compare with median-based methods.

MAD_{e}

with median as
(select median(elapsed) median from wait where event='SQL*Net message from client'),
abs_differences as
(select abs(elapsed - median) absdiff from wait, median),
mad_e as
(select median(absdiff) mad_e from abs_differences),
bounds as
(select median + 2*mad_e upper_2mad_e, median + 3*mad_e upper_3mad_e from median, mad_e),
candidates_2mad_e as
(select elapsed from wait w join bounds b on (w.event='SQL*Net message from client' and w.elapsed > b.upper_2mad_e)),
candidates_3mad_e as
(select elapsed from wait w join bounds b on (w.event='SQL*Net message from client' and w.elapsed > b.upper_3mad_e))
select '2mad_e' "method", count(elapsed) "num_outliers", round(min(elapsed)/1000, 1) "smallest_outlier_ms", round(max(elapsed)/1000, 1) "max_value_ms" from candidates_2mad_e
union all
select '3mad_e', count(elapsed), round(min(elapsed)/1000, 1), round(max(elapsed)/1000, 1) from candidates_3mad_e;
-- file 1
method num_outliers smallest_outlier_ms max_value_ms
------- ------------ ------------------- ------------
2mad_e 2740 1.6 3005.3
3mad_e 1523 2.2 3005.3
-- file 2
method num_outliers smallest_outlier_ms max_value_ms
------ ------------ ------------------- ------------
2mad_e 287 2 2269
3mad_e 208 2.7 2269
-- file 3
method num_outliers smallest_outlier_ms max_value_ms
------ ------------ ------------------- ------------
2mad_e 236 1.6 11794.1
3mad_e 142 2.1 11794.1
-- file 4
method num_outliers smallest_outlier_ms max_value_ms
------ ------------ ------------------- ------------
2mad_e 487 2.3 3306.1
3mad_e 380 3.1 3306.1

Now this – again choosing the outer bound, so we end up at a threshold of ~ 2-2.5 ms, looks much better to me. Perhaps it is still a little low, from visual inspection?

Median method

with iqr_data as
(select median(elapsed) median, percentile_cont(0.25) within group (order by elapsed) q1, percentile_cont(0.75) within group (order by elapsed) q3, abs(percentile_cont(0.25) within group (order by elapsed) - percentile_cont(0.75) within group (order by elapsed)) iqr from wait where event='SQL*Net message from client'),
bounds as
(select median + 2.3 * iqr bound from iqr_data),
candidates as
(select elapsed from wait w join bounds b on (w.event='SQL*Net message from client' and w.elapsed > b.bound))
select 'outside bound' "criterion", count(elapsed) "num_outliers", round(min(elapsed)/1000, 1) "smallest_outlier_ms", round(max(elapsed)/1000, 1) "max_value_ms" from candidates;
-- file 1
criterion num_outliers smallest_outlier_ms max_value_ms
------------- ------------ ------------------- ------------
outside bound 5065 .9 3005.3
-- file 2
criterion num_outliers smallest_outlier_ms max_value_ms
------------- ------------ ------------------- ------------
outside bound 3575 1 2269
-- file 3
criterion num_outliers smallest_outlier_ms max_value_ms
------------- ------------ ------------------- ------------
outside bound 1729 .8 11794.1
-- file 4
criterion num_outliers smallest_outlier_ms max_value_ms
------------- ------------ ------------------- ------------
outside bound 3316 1.1 3306.1

The median method, at least for the trace files given, seems to yield results very similar to boxplot. Same as with boxplot, it might be too strict though.

Conclusion

Of course, four trace files, taken from the same application and under similar conditions, do not make for a representative sample. However, the idea here was to demonstrate the approach, i.e., looking at think time from an outlier detection perspective. For the given sample, the MAD_{e} method seems to yield a plausible classification. However, what classification makes sense is, in the end, a question of (at least) two things – the application we’re looking at (a real application? or end users, sitting at their terminals?) and what we want to do with the data (diagnose the database only or the client, too).

I guess it’s the latter case when this approach will be the most useful.