Sentiment Analysis of Movie Reviews (talk)

Yesterday at Tech Event, which was great as always, I presented on sentiment analysis, using movie reviews as my example. I intend to write a little series of blog posts on this, but as I’m not sure when exactly I’ll get to it, here are the PDF version and a link to the notebook.

The focus was not on the classification algorithms per se (treating text as just another domain for classification), but on the difficulties that arise from the data being language: Can it really work to look at just single words? Do I need bigrams? Trigrams? More? Can we tackle the complexity using word embeddings – word vectors? Or better, paragraph vectors?

I had a lot of fun exploring this topic, and I really hope to write some posts on this – stay tuned🙂

Doing Data Science

This is another non-technical post – actually, it’s even a personal one. If you’ve been to this blog lately, you may have noticed something – or rather, the absence of something: the absence of posts over nearly six months …
I’ve been busy catching up, reading up, immersing myself in things I’ve been interested in for a long time – but which I never imagined could have a real relation to, let alone become part of, my professional life.
But recently, things changed. I’ll be doing “for real” what would have seemed just a dream a year ago: data science, machine learning, applied statistics (Bayesian statistics, preferably).

Doing Data Science … why?

Well, while this may look like quite a change of interests to a reader of this blog, it really is not. I’d been interested in statistics, probability and “data mining” (as it was called at the time) long before I even wound up in IT. Actually, I have a diploma in psychology, and I’ve never studied computer science (which, of course, I’ve often regretted, having missed so many fascinating things).
Sure, at that time, much of the interesting stuff in machine learning was already there. Neural nets existed, of course. But that was before the age of big data and the boost distributed computing brought to machine learning, before craftsman-like “data mining” became sexy “data science” …
Those were the olden days, when statistics (in psychology, at least) was (1) ANOVA, (2) ANOVA, and (3) … you name it. Whereas today, students (if they are lucky) might be learning statistics from a book like Richard McElreath’s “Statistical Rethinking” (http://xcelab.net/rm/statistical-rethinking/).
That was before the advent of deep learning, which fundamentally changed not just what seems possible but also the way it is approached. Take natural language processing, for example (just check out the materials for Stanford’s Deep Learning for Natural Language Processing course for a great introduction).
While I’m at it … where some people see “machine learning versus statistics”, or “machine learning instead of statistics”, I see no antagonism at all. Perhaps that’s because of my bio. For me, some of the books I admire most – especially the very well-written, very accessible ISLR (Introduction to Statistical Learning) and its big brother, Elements of Statistical Learning – are the perfect synthesis.
Returning to the original topic – I’ve even wondered whether I should start a new blog on machine learning and data science, to avoid people asking the above question (you know, the “why data science?” one). But then, your bio is something you can never undo – all you can do is change the narrative and try to make the narrative work. The narrative works fine for me; I hope I’ve made it plausible to the outside world, too 😉.
(BTW, I’m lucky with the blog title I chose a few years ago – no need to change that; see https://en.wikipedia.org/wiki/Markov_chain.)
And it probably doesn’t hurt for a data scientist to know how to get data from databases, how to manipulate it in various programming languages, and quite a bit about the IT architectures behind it all.
OK, that was the justification. The other question now is …

Doing Data Science … how?

Well, luckily, I’m not isolated at all with these interests at Trivadis. We’ve already had a strong focus on big data and streaming analytics for quite some time (just see the blog of my colleague Guido, an internationally renowned expert on these topics), but now there’s additionally a highly motivated group of data scientists ready to turn data into insight 🙂.
If you’re reading this, you might be a potential customer, so I can’t finish without a sales pitch:

It’s not about the tools. It’s not about the programming languages you use (though some make it easier than others, and I decidedly like the friendly and inspiring, open-source Python and R ecosystems). It’s about discovering patterns, detecting underlying structure, uncovering the unknown. About finding out what you didn’t (necessarily) hypothesize beforehand. And most importantly: about assessing whether what you found is valid and will generalize – to the future, to different conditions. Whether what you found is “real”. There’s a lot more to it than looking at extrapolated forecast curves.

Before I end the sales pitch, let me say that in addition to our consulting services we also offer courses on getting started with Data Science, using either R or Python (see Data Science with Python and Advanced Analytics with R). The two courses complement each other nicely, as each works with different data sets and builds up its own narrative.
OK, I think that’s it for narratives. Upcoming posts will be technical again – only this time, technical will mostly mean: on machine learning and data science.

IT Myths (1): Best Practices

I don’t know about you, but I feel uncomfortable when asked about “Best Practices”. I manage to give the expected answers, but I still feel uncomfortable. Now, if you’re one of the people who liked – or retweeted – this tweet, you don’t need to be convinced that “Best Practices” are a dubious thing. Still, you might find it difficult to communicate to others, who do not share your instinctive doubts, what the problem is. Here, I’ll try to explain, in a few words, what the problem is in my view.

As I see it, this juxtaposition of words cannot be interpreted in a meaningful way. First, let’s stay in the realm of IT.
Assume you’ve bought expensive software, and now you’re going to set up your system. The software comes with default parameter settings. Should you have to follow “Best Practices” when choosing parameter values?

You shouldn’t have to. You should be fully entitled to trust the vendor to ship their software with sensible parameters. The defaults should, in general, make sense. Of course there often are things you have to adapt to your environment, but in principle you should be able to rely on sensible defaults.

One (counter-)example: the Oracle initialization parameter db_block_checking. This parameter governs whether and to what extent Oracle performs logical consistency checks on database blocks. (For details, see Performance overhead of db_block_checking and db_block_checksum non-default settings.)
Still as of version 12.1.0.2, the default value of this parameter is none. If it is set to medium or full, Oracle will either repair a corrupt block or – if that is not possible – at least prevent the corruption from spreading in memory. The Reference advises setting the parameter to full if the performance overhead is acceptable. Why, then, is the default none? This, in my opinion, sends the wrong signal. The database administrator now has to justify her choice of medium, because it might, depending on the workload, have a negative impact on performance. But she shouldn’t have to invoke “Best Practices”. While performance issues can be addressed in multiple ways, nobody wants corrupt data in their database. Again, the software should be shipped with defaults that make such discussions unnecessary.

Second, imagine you hire a consultant to set up your system. Do you want him to follow “Best Practices”? You surely don’t: you want him to know exactly what he is doing. It’s his job to get the information he needs to set up the system correctly, in the given environment and with the given requirements. You don’t pay him to do things that “work well on average”.

Third, if you’re an administrator or a developer, the fact that you stay informed and up to date with current developments, that you try to understand how things “work”, means that you’re doing more than just following “Best Practices”. You’re trying to be knowledgeable enough to make the correct decisions in given, concrete circumstances.

So that was IT, from different points of view. How about “life”? In real life, we don’t follow “Best Practices” either. (We may employ heuristics, most of the time, but that’s another topic.)
If it’s raining outside, or there’s an x% chance it will rain (fill in your own threshold here ;-)), I’m gonna take / put on my rain clothes for my commute … but I’m not going to take them every day, “just in case”. In “real life”, things are either too natural to be called “Best Practices”, or they need a little more reflection than that.

Finally, let’s end with philosophy 😉 Imagine we were ruled by a Platonic philosopher king (or queen) … we’d want him/her to do a bit more than just follow “Best Practices”, wouldn’t we? 😉

Better call Thomas – Bayes in the Database

This is my third and last post covering my presentations from last week’s Tech Event, and it’s no doubt the craziest one🙂.
Again I had the honour to be invited by my colleague Ludovico Caldara to participate in the cool “Oracle Database Lightning Talks” session. Now, there is a connection to Oracle here, but mainly, as you’ll see, it’s about a Reverend in a Casino.

So, imagine one of your customers calls you. They are going to use Oracle Transparent Data Encryption (TDE) in their main production OLTP database. They are in a bit of a panic. Will performance get worse? Much worse?

Well, you can help. You perform tests using the Swingbench Order Entry benchmark, and soon you can assure them: no problem. The chart below shows average response times for one of the Order Entry transaction types, using either no encryption or TDE with AES-128 or AES-256.
Wait a moment – doesn’t it look as though response times are even lower with TDE!?

Swingbench Order Entry response times (Customer Registration)

Well, they might be, or they might not be … the Swingbench results include how often a transaction was performed, and the standard deviation … so how about we ask somebody who knows something about statistics?

Cool, but … as you can see, for example, in this xkcd comic, there is no such thing as an average statistician … In short, there’s frequentist statistics and there’s Bayesian statistics, and while there is a whole lot you can say about the respective philosophical backgrounds, it mainly boils down to one thing: Bayesians take into account not just the actual data – the sample – but also the prior probability.

This is where the Reverend enters the game. The foundation of Bayesian statistics is Bayes theorem, named after Reverend Thomas Bayes.

Bayes theorem

In short, Bayes theorem says that the probability of a hypothesis, given my measured data (POSTERIOR), is equal to the probability of the data, given the hypothesis is correct (LIKELIHOOD), times the prior probability of the hypothesis (PRIOR), divided by the overall probability of the data (EVIDENCE). (The evidence may be neglected if we’re interested in proportionality only, not equality.)
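Written as a formula, with H denoting the hypothesis and D the data, that is:

P(H | D) = P(D | H) * P(H) / P(D)

or, in the words used above: POSTERIOR = LIKELIHOOD * PRIOR / EVIDENCE.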

So … why should we care? Well, from what we know about how TDE works, it cannot possibly make things faster! In the best case, we’re servicing all requests from the buffer cache and so do not have to decrypt the data. In that case we shouldn’t incur any performance loss. But I cannot imagine how TDE could cause a performance boost.

So we go to our colleague the Bayesian statistician, give him our data and tell him about our prior beliefs. (Prior belief sounds subjective? It is, but prior assumptions are out in the open, to be questioned, discussed and justified.)

Now, harmless as Bayes theorem looks, in practice it may be difficult to compute the posterior probability. Fortunately, there is a solution: Markov Chain Monte Carlo (MCMC). MCMC is a method to obtain the parameters of the posterior distribution not by calculating them directly, but by sampling, performing a random walk along the posterior distribution.

We assume our data is gamma distributed, the gamma distribution generally being adequate for response time data (for motivation and background, see chapter 3.5 of Analyzing Computer System Performance with Perl::PDQ by Neil Gunther).
Using R, JAGS (Just Another Gibbs Sampler), and rjags, we (oh, our colleague the statistician, I mean, of course) go to work and create a model for the likelihood and the prior.
We have two groups, TDE and “no encryption”. For both, we define a gamma likelihood function, each having its own shape and rate parameters. (Shape and rate parameters can easily be calculated from mean and standard deviation.)

model {
    for ( i in 1:Ntotal ) {
      y[i] ~ dgamma( sh[x[i]] , ra[x[i]] )
    }
    sh[1] <- pow(m[1],2) / pow(sd[1],2)
    sh[2] <- pow(m[2],2) / pow(sd[2],2)
    ra[1]  <- m[1] / pow(sd[1],2)
    ra[2]  <- m[2] / pow(sd[2],2)
    [...]

Second, we define prior distributions on the means and standard deviations of the likelihood functions:

    [...]
    m[1] ~ dgamma(priorSha1, priorRa1)
    m[2] ~ dgamma(priorSha2, priorRa2)
    sd[1] ~ dunif(0,100)
    sd[2] ~ dunif(0,100)
  }

As we are convinced that TDE cannot possibly make it run faster, we assume that the means of both likelihood functions are distributed according to prior distributions with the same mean. This mean is calculated from the data, averaging over all response times from both groups. As a consequence, in the above code, priorSha1 = priorSha2 and priorRa1 = priorRa2. On the other hand, we have no prior assumptions about the likelihood functions’ standard deviations, so we model them as uniformly distributed, choosing noninformative priors.
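For the curious, here is a minimal sketch of how such a model might be driven from R with rjags. The names modelString and dataList, as well as the chain and iteration settings, are illustrative assumptions, not the exact code used for the talk:

library(rjags)

# modelString holds the JAGS model shown above; dataList would contain
# y (the response times), x (the group index, 1 or 2), Ntotal, and the
# prior parameters priorSha1/priorSha2 and priorRa1/priorRa2
jagsModel <- jags.model(textConnection(modelString), data = dataList,
                        n.chains = 3, n.adapt = 1000)
update(jagsModel, n.iter = 1000)   # burn-in
codaSamples <- coda.samples(jagsModel,
                            variable.names = c("m", "sd"),
                            n.iter = 10000)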

Now we start our sampler. Faites vos jeux … rien ne va plus … and what do we get?

Posterior distribution of means

Here we see the outcome of our random walks, the posterior distributions of means for the two conditions. Clearly, the modes are different. But what to conclude? The means of the two data sets were different too, right?

What we need to look at is the distribution of the difference of both parameters. What we see here is that the 95% highest density interval (HDI) [3] of the posterior distribution of the difference ranges from -0.69 to +1.89 and thus includes 0. (The HDI is a concept similar to, but superior to, the classical frequentist confidence interval, as it comprises the 95% of values with the highest probability density.)

Posterior distribution of difference between means
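If you want to reproduce this kind of summary yourself, here is a rough sketch of how the difference of means and its HDI could be extracted from the MCMC output – it assumes the codaSamples object from the rjags sketch above, and that group 1 is “no encryption” and group 2 is TDE:

# combine the chains into one matrix and compute the difference of the group means
mcmcMat   <- as.matrix(codaSamples)
diffMeans <- mcmcMat[, "m[2]"] - mcmcMat[, "m[1]"]

# 95% highest density interval of the difference (coda is loaded along with rjags)
HPDinterval(as.mcmc(diffMeans), prob = 0.95)

# posterior probability that the TDE mean response time is the higher one
mean(diffMeans > 0)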

From this we conclude that statistically, our observed difference of means is not significant – what a shame. Wouldn’t it have been nice if we could speed up our app using TDE😉

Tune the App, Not the SQL – DBA Sherlock’s Adventures in Hibernate/jOOQ Land

Last weekend at Trivadis Tech Event, in addition to my session on Oracle 12c Client High Availability for Java (“Application Continuity, Transaction Guard … choose your degrees of freedom”), I gave a quite different talk called “Tune the App, Not the SQL – DBA Sherlock’s Adventures in Hibernate/jOOQ Land”.

In a nutshell, what is it all about? Well, if you’re a DBA, you might sometimes (or often, or seldom – depends on your job ;-)) do what we call “SQL Tuning”. It might look like this: You get a call that the application “is slow”. Fortunately, you’re able to narrow that down quickly, and it’s just “the click on that button” that is slow. Fortunately, too, the application is well instrumented, and getting traces for specific tasks is not a problem. So you get a SQL trace file, run tkprof on it, and obtain the problematic SQLs in order of elapsed time. Then you analyse the “Top SQL”. You may find it should be rewritten, or that it would benefit from an index. Perhaps adding additional data structures might help, such as creating a materialized view. Perhaps you’re unlucky and only an ugly hint will fix it.

So you tell the developers “please rewrite that, like this”, or you create that index. And afterwards, it’s a matter of waiting and hoping that performance will improve – noticeably, that is. Because tuning that single statement might – if you’re unlucky – not make that much of a difference.

There is, however, another approach to tuning application performance (always talking about the DB-related part of it here, of course), and that has to do with the application logic, not single statements. There is an excellent demonstration of this in Stéphane Faroult’s highly recommendable book “Refactoring SQL Applications”; in fact, it is right at the beginning of the book and something that immediately “draws you in” (it was like that for me :-)).

Application logic affects what, when, and how much data is fetched. Of course, there are many aspects to this – for example, it may simply not be possible to replace two DB calls by one because another service has to be queried in between. Also, there will have to be a tradeoff between performance and readability/maintainability of the code. But often there will be a choice. And as you will see in the presentation, combining two queries into one does not always result in better performance.

In fact, it all depends. So the first conclusion is the ubiquitous “don’t just assume, but test”.

There is another aspect to this, though. While Stéphane Faroult, in his test case, uses plain JDBC, I am using – and comparing, in a way – two commonly used frameworks: Hibernate and jOOQ. (For an “interdisciplinary introduction” to Hibernate, see my talk from a previous Tech Event, Object Relational Mapping Tools – let’s talk to each other!. Quite a contrast: jOOQ is a lightweight, elegant wrapper framework providing type safety and near-100% control over the generated SQL.)

Now, while it will always be a challenge for a DBA looking at a raw trace file to “reverse engineer” the application logic from the SQL sent to the database (even if plain JDBC or jOOQ is being used), the difficulties rise to a new dimension with Hibernate 🙂. In the talk, I show a DBA – who doesn’t need to be convinced any more of the importance of application logic – trying to make sense of the flow of statements in a trace file: the only information he’s got. And it turns out to be very difficult …

But as he’s not just “some DBA”, but Sherlock, he of course has his informants and doesn’t remain stuck with his DB-only view of the world – which brings me to one of my ceterum censeo’s, which is “DBAs and developers, let’s talk to each other”🙂.

The slides are here.

Oracle 12c Transaction Guard, Application Continuity … choose your degrees of freedom!

Last Saturday at Trivadis Tech Event, I presented on “Application Continuity, Transaction Guard … choose your degrees of freedom”.
What a strange title, you may be asking! What do Application Continuity (AC) and Transaction Guard (TG) have to do with freedom? 😉
And why talk about Transaction Guard – isn’t that just a technology used by AC? Surely it’s not something you would care to implement yourself – if the language you’re working with is Java?

Well, there are – in addition to the extensive, but not really easy to digest Oracle documentation – some very helpful resources (blog articles, presentations) on the net, but these mostly focus on infrastructural aspects: How does AC work with RAC, with Data Guard, with RAC One Node? What does the DBA have to do to enable AC (or TG)?

The demos are mostly designed to demonstrate that AC “just works”, in different environments. As the client code is not in focus, often the easiest way of implementing AC on the client is chosen: using Oracle Universal Connection Pool (UCP). (I’ve encountered one notable exception, which is Laurent Leturgez’ very interesting post on AC resource usage.)

However, in real life, much will depend on the developer teams: Are they comfortable with making the necessary modifications? Do they trust the technology? What if they, for whatever reasons, use their own connection pool, and so can’t use UCP?

In this presentation, the focus is on the developers’ part. How the code looks / might look, and what pitfalls there are – what errors you might see if you don’t do it right, and what they mean. This is for both AC and TG.

Let’s assume, however, that you’re VERY impatient and just want to know what the “main thing” is here😉 … I’d say it’s about TG.

As of today, I’m not aware of any Java code on the web implementing TG that is NOT from Oracle documentation / whitepapers. Of course, as the topic is not easy and probably a bit “scary”, we are thankful for the example code Oracle provides. In Transaction Guard with Oracle Database 12c, Oracle provides the following code example, which shows how it works:


Connection jdbcConnection = getConnection();
boolean isJobDone = false;
while (!isJobDone) {
    try {
        updateEmployeeSalaries(jdbcConnection);
        isJobDone = true;
    } catch (SQLRecoverableException recoverableException) {
        // the connection is presumed dead: close it, ignoring any error
        try {
            jdbcConnection.close();
        } catch (Exception ex) {}
        // get a new connection, obtain the LTXID of the dead connection
        // and ask the server for the outcome of that transaction
        Connection newJDBCConnection = getConnection();
        LogicalTransactionId ltxid = ((OracleConnection) jdbcConnection).getLogicalTransactionId();
        isJobDone = getTransactionOutcome(newJDBCConnection, ltxid);
        jdbcConnection = newJDBCConnection;
    }
}

Basically, we have a loop around our transaction. Normally that loop is left immediately. But in case we receive a recoverable exception, we get a new connection, obtain the Logical Transaction ID (LTXID) from the dead connection, and ask the database server for the transaction outcome for that LTXID. If the commit went through successfully, we’re done; otherwise, we resubmit our transaction.

Now while this demonstrates how to do it, we do not want to clutter our code like this everywhere, do we? And fortunately, with Java 8, we don’t have to!

In Java 8, we have Functional Interfaces. Formally, a functional interface is an interface with exactly one explicitly declared abstract method. This abstract method can be implemented directly inline, using a lambda expression. That is, the lambda expression IS an implementation of the functional interface.

This allows us to separate transaction handling from the business methods, and get rid of the above loops. How?

On the one hand, this is how one business method could look:


private void updateSalaries(Connection conn) throws SQLException {
    String query = "select empno, sal from tg.emp";
    PreparedStatement stmt = conn.prepareStatement(query, ResultSet.TYPE_SCROLL_SENSITIVE, ResultSet.CONCUR_UPDATABLE);
    ResultSet rs = stmt.executeQuery();
    while (rs.next()) {
       int oldsal = rs.getInt("sal");
       int newsal = calculateNewValue(oldsal);
       rs.updateInt("sal", newsal);
       rs.updateRow();
    }
    rs.close();
    stmt.close();
}

On the other, here we have a functional interface:

@FunctionalInterface
public interface Transaction {
    public void execute(Connection connection) throws SQLException;
}

One implementation of a transaction can then be a lambda expression that wraps the updateSalaries business method, like this: conn -> updateSalaries(conn).

TGTransactionProcessor tp = new TGTransactionProcessor(url, appUser, appPasswd);
if (tp.process(conn -> updateSalaries(conn))) {
    logger.fine("Salaries updated.");
}


public boolean process(Transaction transaction) throws SQLException {

    boolean done = false;
    int tries = 0;
    Connection conn = getConnection();

    while (!done && tries <= MAXRETRIES) {

        try {

            transaction.execute(conn);
            conn.commit();
            conn.close();
            done = true;

        } catch (SQLRecoverableException e) {

            // close the dead connection, ignoring any error
            try {
                conn.close();
            } catch (Exception ex) {
            }
            // obtain the LTXID of the dead connection and ask the server
            // whether the in-doubt transaction was committed
            LogicalTransactionId ltxid = ((OracleConnection) conn).getLogicalTransactionId();
            Connection newconn = getConnection();
            setModule(newconn, moduleName);
            done = isLTxIdCommitted(ltxid, newconn);
            if (done) {
                logger.info("Failed transaction had already been committed.");
            } else {
                logger.info("Replay of transaction necessary.");
                tries++;
                conn = newconn;
            }

        }
    }
    return done;
}

So with Java 8 Functional Interfaces, we have an elegant way to separate business logic from transaction handling in general, and to implement TG in Java in particular.
So that’s the end of the “highlights of” section; for more information, just have a look at the slides 🙂.

Overhead of TDE tablespace encryption, is there?

Recently, for a customer, I conducted performance tests comparing TDE tablespace encryption to an unencrypted baseline. While I won’t present the results of those tests here, I will describe two test series I ran in my lab environment.
So you’re an application owner – or a cautious DBA who wants to avoid trouble 😉 – and you want to know: is there an overhead if you use TDE tablespace encryption? (I think there’s some material available on this, and it mostly goes in the “no difference” direction. But it’s always nice to test for yourself, and you really need an answer – yes or no? ;-))

For both test series, the design was the same: a 2×3 factorial with the factors

  • encryption: none, AES 128, AES 256, and
  • absence/presence of AES-NI kernel module (on suitable hardware)

So let’s see…

No there’s not!

The first test series consisted of runs of the widely used Swingbench Order Entry benchmark.
Measurement duration (excluding warm-up) per condition was 20 minutes, which was enough – in this isolated environment – to get reproducible results (as confirmed by sample).
The setup and environmental characteristics were the following:

  • Oracle 12.1.0.2 on OEL 7
  • Order Entry schema size 6G
  • SGA 3G
  • 2 (v)CPUs
  • Data files on ASM
  • 4 concurrent users

Let’s look at the results:

Swingbench OE TPS

At first, it even looks as though test runs using encryption had higher throughput. However, with standard deviations ranging between 25 and 30 for the individual Order Entry transaction types, this must be attributed to chance. (Also, it would be quite difficult to argue why throughput would actually increase with TDE … let alone be highest when using an algorithm that requires more rounds of encryption …)
So from this test series, we would conclude that there is no performance impact of TDE tablespace encryption. But then, at this ratio of schema size to buffer cache, blocks are mostly found in the cache:

  • Logical read (blocks) per transaction: 105.9
  • Physical read (blocks) per transaction: 2.6
  • Buffer Hit %: 97.53

Now, with TDE tablespace encryption, blocks are stored in the SGA unencrypted, which means that only a process that has to read a block from disk has to perform decryption.
Encryption, on the other hand, is taken care of by the database writer, and so happens in the background anyway. So in this setup, it’s not astonishing at all not to find throughput differences due to the absence/presence of encryption.

Yes there is…

A second type of test was run against the same database, on the same host. These were SQL Loader direct path loads of a 14G file, done serially. In this case, the server process has to read every line from the input file, encrypt the data, and write it to disk. This is what we see:

SQL Loader Direct Path Load times

Now suddenly we DO have a substantial impact of TDE. Loading the file and storing the data to disk now takes ~1.5 times as long as without encryption. And we see – which was to be expected, too – that this increase in elapsed time is mostly due to increased CPU consumption.

By the way, now that we have such a clear effect, do we see usage of AES-NI speeding up the load? No:

SQL Loader Direct Path Load, with and without AES-NI

This is surprising and probably something deserving further investigation, but not the topic of this post…

Conclusion

So what’s the point? Whether or not you will experience a negative impact on performance will depend on your workload. No simple yes-or-no here … (but of course, no mystique either – it just means we have to phrase our questions a little more precisely).