Doing Data Science

This is another non-technical post – actually, it’s even a personal one. If you’ve visited this blog lately, you may have noticed something, or rather, the absence of something – the absence of posts for nearly six months …
I’ve been busy catching up, reading up, immersing myself in things I’ve been interested in for a long time – but which I never imagined could have a real relation to, let alone become part of, my professional life.
But recently, things have changed. I’ll now be doing “for real” what would have seemed just a dream a year ago: data science, machine learning, applied statistics (preferably Bayesian).

Doing Data Science … why?

Well, while this may look like quite a change of interests to a reader of this blog, it really is not. I was interested in statistics, probability and “data mining” (as it was called at the time) long before I even wound up in IT. Actually, I have a diploma in psychology, and I’ve never studied computer science (which, of course, I’ve often regretted, having missed out on so many fascinating things).
Sure, even at that time, much of the interesting stuff in machine learning was already there. Neural nets were there, of course. But that was before the age of big data and the boost distributed computing gave to machine learning, before craftsman-like “data mining” became sexy “data science” …
Those were the olden days, when statistics (in psychology, at least) was (1) ANOVA, (2) ANOVA, and (3) … you name it. Whereas today, students (if they are lucky) might be learning statistics from a book like Richard McElreath’s “Statistical Rethinking” (http://xcelab.net/rm/statistical-rethinking/).
That was before the advent of deep learning, which fundamentally changed not just what seems possible but also the way it is approached. Take natural language processing, for example (just check out the materials for Stanford’s Deep Learning for Natural Language Processing course for a great introduction).
While I’m at it … where some people see “machine learning versus statistics”, or “machine learning instead of statistics”, I see no antagonism at all. Perhaps that’s because of my bio. Some of the books I admire most – especially the very well-written, very accessible ISLR (An Introduction to Statistical Learning) and its big brother, The Elements of Statistical Learning – are the perfect synthesis of the two.
Returning to the original topic – I’ve even wondered whether I should start a new blog on machine learning and data science, to avoid people asking the above question (you know, the “why data science” one). But then, your bio is something you can never undo – all you can do is change the narrative and try to make it work. The narrative works fine for me, and I hope I’ve made it plausible to the outside world, too 😉 .
(BTW, I was lucky with the blog title I chose a few years ago – no need to change that; see https://en.wikipedia.org/wiki/Markov_chain.)
And it probably doesn’t hurt for a data scientist to know how to get data from databases, how to manipulate it in various programming languages, and quite a bit about the IT architectures behind it all.
OK, that was the justification. The other question now is …

Doing Data Science … how?

Well, luckily, I’m not at all isolated with these interests at Trivadis. We’ve had a strong focus on big data and streaming analytics for quite some time (just see the blog of my colleague Guido, an internationally renowned expert on these topics), and now there’s additionally a highly motivated group of data scientists ready to turn data into insight 🙂.
If you’re reading this, you might be a potential customer, so I can’t finish without a sales pitch:

It’s not about the tools. It’s not about the programming languages you use (though some make it easier than others, and I decidedly like the friendly and inspiring, open source Python and R ecosystems). It’s about discovering patterns, detecting underlying structure, uncovering the unknown. About finding out what you didn’t (necessarily) hypothesize beforehand. And most importantly: about assessing whether what you found is valid and will generalize – to the future, to different conditions. Whether what you found is “real”. There’s a lot more to it than looking at extrapolated forecast curves.
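To make that last point a bit more concrete, here is a minimal, purely illustrative sketch in Python (scikit-learn on simulated data – nothing from a real project): a model’s accuracy on the very data it was fit to looks rosier than the cross-validated estimate of how well the pattern will actually generalize.

```python
# A toy sketch: checking whether a "discovered" pattern generalizes,
# instead of judging a model by its fit on the data it was trained on.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Simulated data: 2 informative features plus 20 pure-noise features.
n = 200
X_informative = rng.normal(size=(n, 2))
X_noise = rng.normal(size=(n, 20))
X = np.hstack([X_informative, X_noise])
y = (X_informative[:, 0] + X_informative[:, 1]
     + rng.normal(scale=0.5, size=n) > 0).astype(int)

model = LogisticRegression(max_iter=1000)

# In-sample accuracy is optimistic ...
in_sample = model.fit(X, y).score(X, y)

# ... cross-validated accuracy estimates how well the pattern generalizes.
cv_scores = cross_val_score(model, X, y, cv=5)

print(f"in-sample accuracy:       {in_sample:.3f}")
print(f"cross-validated accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```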

Before I end the sales pitch, let me say that in addition to our consulting services we also offer courses on getting started with Data Science, using either R or Python (see Data Science with Python and Advanced Analytics with R). The two courses complement each other perfectly, as they work with different data sets and build up their own narratives.
OK, I think that’s it for narratives. Upcoming posts will be technical again – only this time, technical will mostly mean: machine learning and data science.

IT Myths (1): Best Practices

I don’t know about you, but I feel uncomfortable when asked about “Best Practices”. I manage to give the expected answers, but I still feel uncomfortable. Now, if you’re one of the people who liked – or retweeted – this tweet, you don’t need to be convinced that “Best Practices” are a dubious thing. Still, you might find it difficult to communicate to others, who do not share your instinctive doubts, what the problem is. Here, I’ll try to explain, in a few words, what the problem is in my view.

As I see it, this juxtaposition of words cannot be interpreted in a meaningful way. First, let’s stay in the realm of IT.
Assume you’ve bought expensive software, and now you’re going to set up your system. The software comes with default parameter settings. Should you have to follow “Best Practices” in choosing parameter values?

You shouldn’t have to. You should be fully entitled to trust the vendor to ship their software with sensible parameters. The defaults should, in general, make sense. Of course there often are things you have to adapt to your environment, but in principle you should be able to rely on sensible defaults.

One (counter-)example: the Oracle initialization parameter db_block_checking. This parameter governs whether, and to what extent, Oracle performs logical consistency checks on database blocks. (For details, see Performance overhead of db_block_checking and db_block_checksum non-default settings.)
Even as of version 12.1.0.2, the default value of this parameter is none. If it is set to medium or full, Oracle will either repair a corrupt block or – if that is not possible – at least prevent the corruption from spreading in memory. The Reference advises setting the parameter to full if the performance overhead is acceptable. Why, then, is the default none? This, in my opinion, sends the wrong signal. The database administrator now has to justify her choice of medium, because it might, depending on the workload, have a negative impact on performance. But she shouldn’t have to invoke “Best Practices”. While performance issues can be addressed in multiple ways, nobody wants corrupt data in their database. Again, the software should ship with defaults that make such discussions unnecessary.
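Just to make this concrete, here is a small hypothetical sketch in Python, assuming the python-oracledb driver, placeholder connection details and a user with the necessary privileges (the same check is of course trivial to do directly in SQL*Plus): look up the current setting in v$parameter and, if it is still at the shipped default, raise it to medium.

```python
# Hypothetical sketch: inspect db_block_checking and, if it is still at the
# shipped default, raise it to MEDIUM. Credentials/DSN are placeholders.
import oracledb

# Placeholder connection details -- adapt to your environment.
conn = oracledb.connect(user="system", password="change_me", dsn="dbhost/orclpdb1")
cur = conn.cursor()

cur.execute(
    "SELECT value, isdefault FROM v$parameter WHERE name = :p",
    p="db_block_checking",
)
value, isdefault = cur.fetchone()
print(f"db_block_checking = {value} (still at default: {isdefault})")

if isdefault == "TRUE":
    # db_block_checking is a dynamic parameter, so no restart is needed.
    cur.execute("ALTER SYSTEM SET db_block_checking = MEDIUM SCOPE = BOTH")
    print("db_block_checking raised to MEDIUM")

conn.close()
```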

Second, imagine you hire a consultant to set up your system. Do you want him to follow “Best Practices”? You surely don’t: you want him to know exactly what he is doing. It’s his job to get the information he needs to set up the system correctly, in the given environment and with the given requirements. You don’t pay him to do things that “work well on average”.

Third, if you’re an administrator or a developer, the fact that you stay informed and up to date with current developments, that you try to understand how things “work”, means that you’re doing more than just following “Best Practices”. You’re trying to be knowledgeable enough to make the correct decisions in given, concrete circumstances.

So that was IT, from different points of view. How about “life”? In real life, we don’t follow “Best Practices” either. (We may employ heuristics, most of the time, but that’s another topic.)
If it’s raining outside, or if there’s an x% chance (fill in your own threshold here ;-)) that it will rain, I’m gonna take / put on my rain gear for my commute … but I’m not going to take it every day, “just in case”. In “real life”, things are either too natural to be called “Best Practices”, or they need a little more reflection than that.

Finally, let’s end with philosophy 😉 Imagine we were ruled by a Platonic philosopher king (or queen) … we’d want him or her to do a bit more than just follow “Best Practices”, wouldn’t we 😉