I’m a developer, why should I care about matrices or calculus? (talk at MLConference 2017)

Yesterday at ML Conference, which took place this year for the first time, I had a talk on cool bits of calculus and linear algebra that are useful and fun to know if you’re writing code for deep learning and/or machine learning.

Originally, the title was something like “What every interested ML/DL developer should know about matrices and calculus”, but then really I didn’t like the schoolmasterly tone that had, as really what I’ve wanted to convey was the fun and the fascination of it …

So, without further ado, here are the slides and the raw presentation on github.

Thanks for reading!

Deep Learning with Keras – using R (talk)

This week in Kassel, [R]Kenntnistage 2017 took place, organised by EODA. It was all about Data Science (with R, mostly, as you could guess): Speakers presented interesting applications in industry, manufacturing, ecology, journalism and other fields, including use cases such as predictive maintenance, forecasting and risk analysis.

I had the honour to have a talk too (thanks guys!), combining two of my favorite topics – deep learning and R. The slides are on RPubs as usual, and the source code (including complete examples) can be found on github.

Last not least, it’s great to see data science, and R, gaining momentum like that (this is Europe, so I can still write such a sentence ;-))
If you allow me to include an advertisement here – if you’re wondering what insight might come out of your data: At Trivadis, we’re a (yet) smallish but super motivated team of data scientists and machine learning practitioners happy to help!

Time series shootout: ARIMA vs. LSTM (talk)

Yesterday, the Munich datageeks Data Day took place. It was a totally fun event – great to see how much is going on, data-science-wise, in and around Munich, and how many people are interested in the topic! (By the way, I think that more than half the talks were about deep learning!)

I also had a talk, “Time series shootout: ARIMA vs. LSTM” (slides on RPubs, github).

Whatever the title, it was really about showing a systematic comparison of forecasting using ARIMA and LSTM, on synthetic as well as real datasets. I find it amazing how little is needed to get a very decent result with LSTM – how little data, how little hyperparameter tuning, how few training epochs.

Of course, it gets most interesting when we look at datasets where ARIMA has problems, as with multiple seasonality. I have such an example in the talk (in fact, it’s the main climax ;-)), but it’s definitely also an interesting direction for further experiments.

Thanks for reading!

Automatic Crack Detection – with Deep Learning

On Friday at DOAG Big Data Days, I presented one possible application of deep learning: using deep learning for automatic crack detection – with some background theory, a Keras model trained from scratch, and the use of VGG16 pretrained on Imagenet. The amount of input data really was minimal, and the resulting accuracy, under these circumstances, not bad at all! Here are the slides.

If you’re interested, I’ll have a webcast on this as part of the Trivadis tricast series (registration). The talk will be in German though, so I guess some working knowledge of German would be helpful 🙂

Thanks for reading!

Deep Learning, deeplearning4j and Outlier Detection: Talks at Trivadis Tech Event

Last weekend, another edition of Trivadis Tech Event took place. As usual, it was great fun and a great source of inspiration.
I had the occasion to talk about deep learning twice: One talk was an intro to DL4J (deeplearning4j), zooming in on a few aspects I’ve found especially nice and useful while trying to provide a general introduction to deep learning at the same time. The audience was great, and the framework really is fun to work with, so this was a totally pleasant experience! Here are the slides, and here’s the example code.

The second talk was a joint session with my colleague Olaf on outlier / anomaly detection. We covered both ML and DL algorithms. For DL, I focused on variational autoencoders, the special challenge being to successfully apply the algorithm to datasets other than MNIST… and especially, datasets with a mix of categorical and continuous variables of different scale. As I say in the habitual “conclusion” slide, I don’t think I’ve arrived at a conclusion yet… any comments / suggestions are very welcome! Here’s the VAE presentation on RPubs, and here on github.
Thanks for reading!

Haskell, R, and HaskellR: Combining the best of two worlds (talk at UseR! 2017)

Earlier today, I presented at UseR! 2017 about HaskellR: a great piece of software, developed by Tweag I/O, that allows to seemlessly use R from Haskell.

It was my first UseR!, it was a great experience, and if I had the time I’d like to write a separate blog post about it, as there were things that did not quite align with my prior expectations… Stuff for thought, but not the topic of this post. (Mainly this would be about how the academic talks compared to the non-academic ones.)

So, why HaskellR? If you allow me one personal note… For the ex-psychologist, ex-software-developer, ex-database administrator, now “in over my head” data scientist and machine learning/deep learning person that I am (see this post for that story), there has always been some fixed point of interest (ideal, you could say), and that is the elegance of functional programming. It all started with SICP, which I first read as a (Java) programmer and recently read again (partly) when preparing R 4 hackers, a talk focused to a great part on the functional programming features of R.

For a database administrator, unless you’re very lucky, it’s hard to integrate use of a functional programming language into your work. How about deep learning and/or data science?
For deep learning, there’s Chris Olah’s wonderful blog post linking deep networks to functional programs, but the reality (of widely used frameworks) looks different: TensorFlow, Keras, PyTorch… it’s mostly Python around there, and whatever the niceties of Python (readability, list comprehensions…) writing Python certainly does not feel like writing FP code at all (much less than writing R!).

So in practice, the connections between data science/machine learning/deep learning and functional programming are scarce. If you look for connections, you will quickly stumble upon the Tweag I/O guys’ work: They’ve not just created HaskellR, they’ve also made Haskell run on Spark, thus enabling Haskell applications to use Spark’s MLLib for large-scale machine learning.

What, then, is HaskellR? It’s a way to seemlessly mix R code and Haskell code, with full interoperability in both directions. You can do that in source files, of course, but you can also quickly play around in the interpreter, appropriately called H (no, I was not thinking of its addictive potential here ;-)), and even use Jupyter notebook with HaskellR! In fact, that’s what I did in the demos.

If you’re interested in the technicalities of the implementation, you’ll find that documented in great detail on the HaskellR website (and even more, in their IFL 2014 paper), but otherwise I suggest you take a look at the demos from my talk: First, there’s a notebook showing how to use HaskellR, how to get values from Haskell to R and vice versa, and then, there’s the trading app scenario notebook: Suppose you have a trading app written in Haskell – it’s gotta be lightning fast and as bug-free as possible, right?
But, how about nice visualizations, time series diagnostics, all kinds of sophisticated statistical and machine learning algorithms… Chances are, someone’s implemented that algorithm in R, already! Let’s take ARIMA – one line of code with R.J. Hyndman’s auto.arima package! Visualization? ggplot2, of course! And last not least, an easy way to do deep learning with R’s keras package (interfacing to Python Keras).

Besides the notebooks, you might also want to check out the slides, especially if you’re an R user who hasn’t had much contact with Haskell. Ever wondered why the pipe looks the way it looks, or what the partial and compose functions are doing?

Last not least, a thousand thanks to the guys over at Tweag I/O, who’ve been incredibly helpful in getting the whole setup to run (the best way to get it up and running on Fedora is using nix, which I didn’t have any prior experience with… just at a second level of parentheses, I think I’d like to know more about nix, the package manager and the OS, now too ;-)). This is really the great thing about open source, the cool stuff people build and how helpful they are! So thanks again, guys – I hope to be doing things “at the interface” of ML/DL and FP more often in the future!

Update:
The talk was recorded, and can be viewed here.

Time series prediction – with deep learning

More and more often, and in more and more different areas, deep learning is making its appearance in the world around us.
Many small and medium businesses, however, will probably still think – Deep Learning, that’s for Google, Facebook & co., for the guys with big data and even bigger computing power (barely resisting the temptation to write “yuge power” here).

Partly this may be true. Certainly when it comes to running through immense permutations of hyperparameter settings. The question however is if we can’t obtain good results in more usual dimensions, too – in areas where traditional methods of data science / machine learning prevail. Prevail, as of today, that is.

One such area is time series prediction, with ARIMA & co. top on the leader board. Can deep learning be a serious competitor here? In what cases? Why? Exploring this is like starting out on an unknown road, fascinated by the magical things that may await us 😉
In any case, I’ve started walking down the road (not running!), in a rather take-your-time-and-explore-the-surroundings way. That means there’s much still to come, and it’s really just a beginning.

Here, anyway, is the travel report – the presentation slides, I mean: best viewed on RPubs, as RMarkdown on github, or downloadable as pdf).
Enjoy!

R 4 hackers

Yesterday at Trivadis Tech Event, I talked about R for Hackers. It was the first session slot on Sunday morning, it was a crazy, nerdy topic, and yet there were, like, 30 people attending! An emphatic thank you to everyone who came!

R a crazy, nerdy topic, – why that, you’ll be asking? What’s so nerdy about using R?
Well, it was about R. But it was neither an introduction (“how to get things done quickly with R”), nor was it even about data science. True, you do get things done super efficiently with R, and true, R is great for data science – but this time, it really was about R as a language!

Because as a language, too, R is cool. In contrast to most object oriented languages, it (at least in it’s most widely used version, S3) uses generic OO, not message-passing OO (ok, I don’t know if this is cool, but it’s really instructive to see how little you need to implement an OO system!).

What definitely is cool though is how R is, quite a bit, a functional programming language! Even using base R, you can write in a functional style, and then there’s Hadley Wickham’s purrr that implements things like function composition or partial application.

Finally, the talk goes into base object internals – closures, builtins, specials… and it ends with a promise … 😉
So, here’s the talk: rpubs, pdf, github. Enjoy!

R for SQListas (3): Classifying Digits with TensorFlow

Yesterday at PASS Meetup Munich, I talked about R for SQListas – thanks again for your interest and attention guys, it was a very nice evening!
Actually, in addition to the content from that original presentation, which I’ve also covered in two recent blog posts (R for SQListas(1): Welcome to the tidyverse and R for SQListas(2): Forecasting the future), there was a new, third part this time: an introduction to machine learning with R, by example of the most classical of examples: MNIST, with a special focus on using rstudio’s tensorflow package for R.
While I hope I’ll find the time to write a post on this part too, I’m not too sure when this will be, so I’ve uploaded the slides already and added links to the pdf, github repo and publication on rpubs to the Presentations/Papers section. Enjoy!

R for SQListas (2): Forecasting the Future

R for SQListas, part 2

Welcome to part 2 of my “R for SQListas” series. Last time, it was all about how to get started with R if you’re a SQL girl (or guy)- and that basically meant an introduction to Hadley Wickham’s dplyr and the tidyverse. The logic being: Don’t fear, it’s not that different from what you’re used to.
This (and upcoming) times it will be about the other side of the coin – if R was “basically just like SQL”, why not stick with SQL in the first place?
So now, it’s about things you cannot do with SQL, things R excels at – those things you’re learning R for :-). Remember in the last post, I said I was interested in future developments of weather/climate, and we explored the Kaggle Earth dataset (as well as another one, daily data measured at weather station Munich airport)? In this post, we’ll finally try to find out what’s going to happen to future winters. We’ll go beyond adding trend lines to measurements, and do some real time series analysis!

Inspecting the time series

First, we create a time series object from the dataframe and plot it – time series objects have their own plot() methods:

start_time <- as.Date("1992-01-01")
ts_1950 <- ts(df_munich$avg_temp, start = c(1950,1), end=c(2013,8), frequency = 12)

Plotting the time series with the default plotting method

Time series decomposition

Now, let’s decompose the time series into its components: trend, seasonal effect, and remainder. We clearly expect there to be seasonality – the influence of the month we’re in should be clearly visible – but as stated before we’re mostly interested in the trend.


fit1 <- stlplus(ts_1950, s.window="periodic")
fit2 <- stlplus(ts_1992, s.window="periodic")
grid.arrange(plot(fit1), plot(fit2), ncol = 2)

The third row is the trend. Basically, there seems to be no trend – no long-term changes in the temperature level. However, by default, the trend displayed is rather „wiggly“.

Time series decomposition

We can experiment with different settings for the smoothing window of the trend. Let‘s use two different degrees of smoothing, both more „flattening“ than the default:


fit1 <- stlplus(ts_1950, s.window="periodic", t.window=37)
fit2 <- stlplus(ts_1950, s.window="periodic", t.window=51)
grid.arrange(plot(fit1), plot(fit2), ncol = 2, top = 'Time series decomposition, 1950 - 2013')

Time series decomposition, different degrees of smoothing

From these decompositions, it does not seem like there’s a significant trend. Let’s see if we can corroborate the visual impression by some statistical data. Let’s forecast the weather!
We will use two prominent approaches in time series modeling/forecasting: exponential smoothing and ARIMA.

Exponential smoothing

With exponential smoothing, the value at each point in time is basically seen as a weighted average, where more distant points weigh less and nearer points weigh more. In the simplest realization, a value at time t(n+1) is modeled as weighted average of the value at time t and the incoming observation at time t(n+1):


More complex models exist that factor in trends and seasonal effects.
For our case of a model with both trend and seasonal effects, the Holt-Winters exponential smoothing method generates point forecasts. Equivalently (conceptually that is, not implementation-wise; see http://robjhyndman.com/hyndsight/estimation2/), we can use the State Space Model (http://www.exponentialsmoothing.net/) that additionally generates prediction intervals.

The State Space Model is implemented in R by the ets() function in the forecast package. When we call ets() without any parameters, the method will determine a suitable model using maximum likelihood estimation. Let’s see the model chosen by ets():


fit <- ets(ts_1950)
summary(fit)

## ETS(A,N,A)
##
## Call:
## ets(y = ts_1950)
##
## Smoothing parameters:
## alpha = 0.0202
## gamma = 1e-04
##
## Initial states:
## l = 5.3364
## s=-8.3652 -4.6693 0.7114 5.4325 8.8662 9.3076
## 7.3288 4.1447 -0.677 -4.3463 -8.2749 -9.4586
##
## sigma: 1.7217
##
## AIC AICc BIC
## 5932.003 5932.644 6001.581
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.02606379 1.72165 1.345745 22.85555 103.3956 0.7208272
## ACF1
## Training set 0.09684748

The model chosen does not contain a smoothing parameter for the trend (beta) – in fact, it is an A,N,A model, which is the acronym for Additive errors, No trend, Additive seasonal effects.
Let’s inspect the decomposition corresponding to this model – there is no trend line here:


plot(fit)

Time series decomposition using ets()

Now, let’s forecast the next 36 months!


plot(forecast(fit, h=36))

Forecast from ets()

The forecasts look rather „shrunken to the mean“, and the prediction intervals – dark grey indicates the 95%, light grey the 80% prediction interval – seem rather narrow. Indeed, narrow prediction intervals are often a problem with time series, because there are many sources of error that aren’t factored in the model (see http://robjhyndman.com/hyndsight/narrow-pi/).

ARIMA

The second approach to forecasting we will use is ARIMA. With ARIMA, there are three important concepts:

  1. AR(p): If a process is autoregressive, future values are obtained as linear combinations of past values, i.e.


    where e is white noise. A process that uses values from up to p time points back is called an AR(p) process.

  2. I(d):This refers to the number of times differencing (= subtraction of consecutive values) has to be applied to obtain a stationary series.
    If a time series yt is stationary, then for all s, the distribution of (y(t),…, y(t+s)) does not depend on t. Stationarity can be determined visually, inspecting the autocorrelation and partial autocorrelation functions, as well as using statistical tests.
  3. MA(q): An MA(q) process combines past forecast errors from time points up to p times back to forecast the current value:


While we can feed R’s Arima() function with our assumptions about the parameters (from data exploration or prior knowledge), there’s also auto.arima() which will determine the parameters for us.
Before calling auto.arima() though, let’s inspect the autocorrelation properties of our series. We clearly expect there to be autocorrelation – for one, adjacent months are similar in temperature, and second, we have similar temperatures every 12 months.


tsdisplay(ts_1950)

Autocorrelation check

So we clearly see that temperatures for adjacent months are positively correlated, while months in „opposite“ seasons are negatively correlated. The partial autocorrelations (where correlations at lower lags are controlled for) are quite strong, too. Does this mean the series is non-stationary?
Not really. A seasonal series of temperatures can be seen as a cyclostationary process (https://en.wikipedia.org/wiki/Cyclostationary_process) , where mean and variance are constant for seasonally corresponding measurements. We can check for stationarity using a statistical test, too:


adf.test(ts_1950)

##
## Augmented Dickey-Fuller Test
##
## data: ts_1950
## Dickey-Fuller = -11.121, Lag order = 9, p-value = 0.01
## alternative hypothesis: stationary

According to the Augmented Dickey-Fuller test, the null hypothesis onf non-stationarity should be rejected.
So now let’s run auto.arima on our time series.


fit <- auto.arima(ts_1950)
summary(fit)

## Series: ts_1950
## ARIMA(1,0,5)(0,0,2)[12] with non-zero mean
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4 ma5 sma1 sma2
## 0.6895 -0.0794 -0.0408 -0.1266 -0.3003 -0.2461 0.3972 0.3555
## s.e. 0.0409 0.0535 0.0346 0.0389 0.0370 0.0347 0.0413 0.0342
## intercept
## 5.1667
## s.e. 0.1137
##
## sigma^2 estimated as 7.242: log likelihood=-1838.47
## AIC=3696.94 AICc=3697.24 BIC=3743.33
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.001132392 2.675279 2.121494 23.30576 116.9652 1.136345
## ACF1
## Training set 0.03740769

The model chosen by auto.arima is ARIMA(1,0,5)(0,0,2)[12] where the first triple of parameters refers to the non-seasonal part of ARIMA, the second to the seasonal one, and the subscript designates seasonality (12 in our case). So in both parts, no differences are applied (d=0). The non-seasonal part has an autoregressive and a moving average component, the seasonal one is moving average only.

Now, let’s get forecasting! Well, not so fast. ARIMA forecasts are based on the assumption that the residuals (errors) are uncorrelated and normally distributed. Let’s check this:


res <- fit$residuals
acf(res)

While normality of the errors is not a problem here, the ACF does not look good:

ACF of auto.arima() residuals

Clearly, the errors are not uncorrelated over time. We can improve auto.arima performance (at the cost of prolonged runtime) by allowing for a higher maximum number of parameters (max.order, which by default equals 5) and setting stepwise=FALSE:


fit <- auto.arima(ts_1950, max.order = 10, stepwise=FALSE)
summary(fit)

## Series: ts_1950
## ARIMA(5,0,1)(0,0,2)[12] with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 sma1 sma2
## 0.8397 -0.0588 -0.0691 -0.2087 -0.2071 -0.4538 0.0869 0.1422
## s.e. 0.0531 0.0516 0.0471 0.0462 0.0436 0.0437 0.0368 0.0371
## intercept
## 5.1709
## s.e. 0.0701
##
## sigma^2 estimated as 4.182: log likelihood=-1629.11
## AIC=3278.22 AICc=3278.51 BIC=3324.61
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.003793625 2.033009 1.607367 -0.2079903 110.0913 0.8609609
## ACF1
## Training set -0.0228192

The less constrained model indeed performs better (judging by AIC, which drops from to 3696 to 3278). Autocorrelation of errors also is reduced overall.

Now, with the improved models, let’s finally get forecasting!


fit <- auto.arima(ts_1997, stepwise=FALSE, max.order = 10)
summary(fit)

Time series forecast from auto.arima()

Comparing this forecast with that from exponential smoothing, we see that exponential smoothing seems to deal better with smaller training samples, as well as with longer extrapolation periods.
Now we could go on and use still more sophisticated methods, or use hybrid models that combine different forecast methods, analogously to random forests, or even use deep learning – but I’ll leave that for another time 🙂 Thanks for reading!