Tune the App, Not the SQL – DBA Sherlock’s Adventures in Hibernate/jOOQ Land

Last weekend at Trivadis Tech Event, in addition to my session on Oracle 12c Client High Availability for Java (Application Continuity, Transaction Guard … choose your degrees of freedom”), I gave a quite different talk called Tune the App, Not the SQL – DBA Sherlock’s Adventures in Hibernate/jOOQ Land”.

In a nutshell, what is it all about? Well, if you’re a DBA, you might sometimes (or often, or seldom – depends on your job ;-)) do what we call “SQL Tuning”. It might look like this: You get a call that the application “is slow”. Fortunately, you’re able to narrow that down quickly, and it’s just “the click on that button” that is slow. Fortunately, too, the application is well instrumented, and getting traces for specific tasks is not a problem. So you get a SQL trace file, run tkprof on it, and obtain the problematic SQLs in order of elapsed time. Then you analyse the “Top SQL”. You may find it should be rewritten, or that it would benefit from an index. Perhaps adding additional data structures might help, such as creating a materialized view. Perhaps you’re unlucky and just an ugly hint will fix it.

So you tell the developers “please rewrite that, like this”, or you create that index. And after, it’s waiting and hoping that performance will improve – noticeably, that is. Because tuning that single statement might – if you’re unlucky – not make that much of a difference.

There is, however, another approach to tuning application performance (always talking about the DB related part of it here, of course), and that has to do with the application logic and not single statements. There is an excellent demonstration of this in Stéphane Faroults very recommendable book “Refactoring SQL applications”, in fact, this is right at the beginning of the book and something that immediately “draws you in” (it was like that for me :-)).

Application logic affects what, when, and how much data is fetched. Of course, there are many aspects to this – for example, it just may not be possible to replace two DB calls by one simply because another service has to be queried in between. Also, there will have to be a tradeoff between performance and readability/maintainability of the code. But often there will be a choice. And you will see in the presentation it is not always that combining two queries into one results in better performance.

In fact, it all depends. So the first conclusion is the ubiquitous “don’t just assume, but test”.

There is another aspect to this, though. While Stéphane Faroult, in his test case, uses plain JDBC, I am using – and comparing, in a way – two commonly used frameworks: Hibernate and jOOQ. (For an “interdisciplinary introduction” to Hibernate, see my talk from previous Tech Event, Object Relational Mapping Tools – let’s talk to each other!. Quite a contrast, jOOQ is a lightweight, elegant wrapper framework providing type safety and near-100% control over the generated SQL.)

Now while for a DBA looking at a raw trace file, it will always be a challenge to “reverse engineer” application logic from the SQL sent to the database (even if plain JDBC or jOOQ are being used), the difficulties rise to a new dimension with Hibernate :-). In the talk, I am showing a DBA – who doesn’t need to be convinced any more about the importance of application logic – trying to make sense of the flow of statements in a trace file: the only information he’s got. And it turns out to be very difficult …

But as he’s not just “some DBA”, but Sherlock, he of course has his informants and doesn’t remain stuck with his DB-only view of the world – which brings me to one of my ceterum censeo’s, which is “DBAs and developers, let’s talk to each other” :-).

The slides are here.

Object Relational Mapping Tools – let’s talk to each other

Following up with the follow up of Trivadis Tech Event :-).
In addition to the five minute talk about Linux Tracing for Oracle, I presented on a quite different subject: “Object Relational Mapping Tools – let’s talk to each other”.
My original intention was to write 2 – 3 successive posts on the subject, but as always, time’s scarce and so I’ve decided to just upload the presentation, see here.

But allow me some words on the background. A database admin (though spare time coder, too ;-)), presenting on O/R Mappers? How could that be? At least then, these performance killers should really get pulled to pieces, right?
Well, no. They aren’t. Instead, my intentions have been the following:

  • Explain to a DBA audience why developers might want to use them.
  • Explain – to DBAs and developers both – what are probable difficulties, on a conceptual level (the infamous “Object-Relational Impedance Mismatch”)
  • Explain – again on a conceptual level, but with concrete examples, and to both audiences, too – what to watch out for when fetching data (i.e., doing SELECTs as opposed to data manipulation).

“What to watch out for” – this has actually been the talk’s main purpose. (I plan to follow up with load tests / “real empirical stuff” at a later time).
And how about the “let’s talk to each other”? You may have guessed already …

Book Review: Oracle SQL Tricks and Workarounds

Some time ago on oracle-l, someone asked for opinions regarding this book: Oracle SQL Tricks and Workarounds, by Zahar Hilkevich. As far as I know, no one knew the book (or felt like answering), and as I tend to think I know about the “books to read” in Oracle domains I’m interested in, I got curious. Plus, as a database administrator, I always have this bad conscience about not being fluent enough in SQL (What way round did it work again, connect by? How was the syntax for first_value again? … And I’m still grateful to another book – Advanced Oracle SQL Programming by Laurent Schneider – where I picked up such useful stuff as sys.odcivarchar2list…).
So I thought, I might as well get the book and write a review.

Unfortunately for me, as I do not like to criticize other people, this review will not be positive, and I was quite unsure should I really do it. Now that you read this, it’s clear how I decided, and that was for two reasons:
First, to offer people my opinion in case they’d like some buying advice, and second, more importantly, because what I think is not good about the book can be best illustrated by explaining how other approaches are better, and why, and how they do it, – which is probably the most interesting point here.

Now let me start with something positive: I do like the (intended) focus on creativity. (The question is, does the author’s method indeed result in more creative work).
Jumping right in, then, the first chapter is called “Terms and Definitions”, and here already things get strange – to put it politely. One terminological thing that bothers me may be a subjective reaction, possibly even related to my not being a native English or American speaker. The author explains how he had to choose a name for “the set of alternative solutions to a specific problem” and decided

“We call the family of distinct solutions for a given problem the workarounds for that problem.”

To me, workaround implies the existence of an obstacle, something I’ve had to work around… but well, this does not really matter, it is just a terminology thing.

Now the author has to come up with a criterion of what makes for a distinct solution, and here the decision does matter: Astonishingly, he chooses the execution plan (the plan hash value, to be precise):

“When we come up with a candidate solution, we examine its execution plan against the other solutions found so far. If it is different, we consider what we have found a new, distinct workaround; otherwise, we have only found a workaround equivalent to one of the other workarounds, which we also refer to as a synonymous workaround.”

Synonymous workaround (or solution)? In other words, things mean the same because they are executed the same way? This is a severe category error and must be incredibly confusing (and frustrating!) to SQL developers who lack any optimizer-related background knowledge. Say I rewrite my query completely, and still I get the same execution plan, – really? And then I execute it on another system, where a different Oracle version is running, and the plan is totally different? What’s the difference to trial and error? Nothing, as far as this book is concerned.

Let’s say I obtain a sizeable set of phrasings for my query, how do I decide which one is the best? In the predictive, explain plan style execution plans displayed in the book, all I may consider is the cost – may I rely on that? But choosing among different plans is not addressed by the author (guess I just execute and time the queries in my development environment ;-)). All that is addressed is how to enlarge my set of equivalent formulations.

Now, had the author just been content with showing different ways of writing SQL queries, not mixing in execution plans and pseudo-theoretic jargon, all would have been better or at least less confusing. (I also prefer – and am used to, from the “Oracle literature” in general – a much more modest style of writing, but I will not focus on that here). But given the author does show the explain plan for (just about) every rephrasing, at least he should have made some global comments about how these plans differ in general. But in the book, plans where a scan of some table is executed for every row of another table are not treated any different from plans where every table is scanned once – no ranking of plans, no comment, no nothing.

So we’re back to where I started, the “approach thing”. In a nutshell what this book does wrong is, in my opinion, to promote some sort of mechanical creativity while not providing insight. How do I get a query to run faster? I better start off with a lot of knowledge: How Oracle works, how the query optimizer works, what Oracle does with the query I’m passing in, how queries get transformed and why, how access paths are chosen and why… This is one kind of story (next to, e.g., the reverse engineering adventures) I always love reading on Jonathan Lewis’ blog: He knows what kind of execution plan, what access paths etc. he wants and then he rewrites and/or hints the query until he is there :-).

So optimizer and performance related knowledge is key, and for that, I totally recommend studying another book: Chris Antognini’s Troubleshooting Oracle Performance. Most readers will surely know the book, so I won’t add much more here, apart from it definitely being one of my “Oracle books to take on a lone island” if I were to choose 😉

Secondly, as we’re not talking about Oracle Performance in general, but SQL performance specifically, it certainly makes sense to work on specific techniques and methods (like analytic functions, set-based thinking, model clause etc) – and in general, to keep up with what’s going on, sql-wise, as new Oracle versions get out. In this domain, one book I liked and found instructive was Pro Oracle SQL.

Thirdly, a methodological approach may work, if it is based on criteria that really play a role in execution plans. Although I have no personal experience with it, it may be interesting to look into SQL Tuning, by Dan Tow, who bases his methods on relevant (single-table and join) selectivities.

Of course, this book list is not exhaustive, – the main point is, we have some wonderful, insight-providing books around in the Oracle Performance domain, there’s no need for trial-and-error if we want to be creative :-).

Monitoring execution plan changes

Times being a bit busy for me at the moment, unfortunately I have to postpone any “larger” projects, blogging-wise 😉 … but not wanting to lose the habit, I’ve thought I’d publish a little script I’ve written to track changes in execution plans.

The intention is to alert DBAs / developers to SQL statements that due to plan changes have experienced large variation in execution time.
When run, as I’d suggest, every day, the script quickly points out statements whose elapsed time at the most recent point of measurement is much different (in either direction) from elapsed times encountered in earlier snapshots that form part of the comparison moving window.

The moving window size, as well as the threshold (what kind of difference should be considered a big difference?) are parameterized. Different environments / applications will have different optimal values here, plus you will have to find a configuration where the amount of information (number of changed plans) can be handled by the DBAs 🙂

So, here’s the script.

set sqlblanklines on
set lines 800
set pages 1000

/* statements captured during last  days */
with samples as 
 (select *
  from dba_hist_sqlstat st
  join dba_hist_snapshot sn
  using (snap_id, instance_number) 
  where parsing_schema_name = '&schema_name'
  and module  'DBMS_SCHEDULER' -- no sql tuning task
  and executions_delta > 0
  and begin_interval_time between sysdate - '&num_days' and sysdate),

/* just statements that had at least 2 different plans during that time */
  sql_ids as 
   (select sql_id,
    count(distinct plan_hash_value) plancount
    from samples
    group by sql_id
    having count(distinct plan_hash_value) > 2),

/* per combination of sql_id and plan_hash_value, elapsed times per execution */
    plan_stats as 
     (select sql_id,
      count(snap_id) snap_count,
      max(end_interval_time) last_seen,
      sum(executions_delta) total_execs,
      sum(elapsed_time_delta) / sum(executions_delta) elapsed_per_exec_thisplan
      from sql_ids
      join samples
      using (sql_id)
      group by sql_id, plan_hash_value),

/* how much different is the elapsed time most recently encountered from other elapsed times in the measurement interval? */
      elapsed_time_diffs as 
       (select p.*,
        elapsed_per_exec_thisplan - first_value(elapsed_per_exec_thisplan)
          over(partition by sql_id order by last_seen desc) elapsed_per_exec_diff,
        (elapsed_per_exec_thisplan - first_value(elapsed_per_exec_thisplan)
          over(partition by sql_id order by last_seen desc)) / elapsed_per_exec_thisplan elapsed_per_exec_diff_ratio
        from plan_stats p),

/* consider just statements for which the difference is bigger than our configured threshold */
        impacted_sql_ids as 
         (select *
          from elapsed_time_diffs
          where abs(elapsed_per_exec_diff_ratio) > &threshold),

/* for those statements, get all required information */
          all_info as
           (select sql_id,
            round(elapsed_per_exec_thisplan / 1e6, 2) elapsed_per_exec_thisplan,
            round(elapsed_per_exec_diff / 1e6, 2) elapsed_per_exec_diff,
            round(100 * elapsed_per_exec_diff_ratio, 2) elapsed_per_exec_diff_pct,
              over(partition by sql_id), 2) * 100 max_abs_diff,
            round(max(elapsed_per_exec_diff_ratio) over(partition by sql_id), 2) * 100 max_diff,
            'select * from table(dbms_xplan.display_awr(sql_id=>''' || sql_id ||
            ''', plan_hash_value=>' || plan_hash_value || '));' xplan
            from elapsed_time_diffs
            where sql_id in (select sql_id from impacted_sql_ids))

/* format the output */
            select case sign(elapsed_per_exec_diff)
              when 0 then max_abs_diff ||
                case when max_abs_diff  max_diff then
                  '% worse than another plan found during the last 7 days'
                  '% better than the next best plan during the last 7 days'
             end status,
             to_char(last_seen, 'yy-mm-dd hh24:mi:ss') last_seen,
             to_char(a.elapsed_per_exec_thisplan, '999999.99') elapsed_per_exec_thisplan,
             to_char(a.elapsed_per_exec_diff, '999999.99') elapsed_per_exec_diff,
             to_char(a.elapsed_per_exec_diff_pct, '999999.99') elapsed_per_exec_diff_pct,
             from all_info a
             order by sql_id, last_seen desc;

And this is how it might look (“pivoting” the output to make it readable here, and showing one statement only):

status                       	9073% worse than another plan found during the last 7 days 
sql_id   			698s47bt259sk
snap_count 			5
last_seen       		14-09-22  06:30:07
elapsed_per_exec_thisplan	315.74 
elapsed_per_exec_diff 		.00
elapsed_per_exec_diff_pct  	.00
xplan       			select * from table (dbms_xplan.dislay_awr(sql_id=>'698s47bt259sk', plan_hash_value=>3533318143));
sql_id   			698s47bt259sk
snap_count 			5
last_seen       		14-09-22  06:30:07
elapsed_per_exec_thisplan	3.44 
elapsed_per_exec_diff 		-312.29
elapsed_per_exec_diff_pct  	-9072.77
xplan       			select * from table (dbms_xplan.dislay_awr(sql_id=>'698s47bt259sk', plan_hash_value=>3837578073));

Of course, the script would have to be adapted if you had to deal with statements using different literals on each execution (which was not the case for my usecase this time).

Speeding up a non-equijoin

This is just a small piece about an idea I’ve had to speed up the unfortunate “where tab1.col1 between tab2.col2 and tab2.col3” query.
This type of join is notoriously difficult because if you have a large amount of data in both tables, you will want Oracle to perform a hash join, which it can’t because we’re having a non-equijoin here.

Basically, the idea is to find a way to join the tables on a less precise identity condition, keeping false positives, and then filter out the rows that do not fulfill the real requirement. Instead of a sort merge join running endlessly, you want to end up with a hash join followed by filtering and sorting out duplicates.

While the idea is simple, the realization depends on finding such a “less precise condition” that will not lose any required rows – what we need is

f(tab1.col1) = f(tab2.col2) = f(tab2.col3)

and that might be difficult to find, or might necessitate a split into several subsets, each joined and filtered by itself, which again deteriorates the splendid performance we were hoping for. There might be, however, ways to find such a condition without bloating the query too much.

To demonstrate, I’ve built a test case analogous to the real case I’ve encountered, and I’ve populated the test tables in a way that did not make my life simple (this is to say, I did not “fake” the data to be easily handled by the idea, but generated random data that ended up having the same “problems” as encountered in the real world situation).

The test case

As in the real world example I’d encountered, the table containing the range definitions has two indexes, the “range start” column being the primary key, and the “range end” column having on it a unique index:

SQL> create table ranges (range_start number primary key, range_end number, label varchar2(10));
SQL> create unique index jointest.range_end_uk on jointest.ranges(range_end);
SQL> alter table ranges add constraint range_end_unique unique(range_end) using index jointest.range_end_uk;

SQL> create table cases (id number generated always as identity primary key, range_code number);

The amount of data inserted is a fraction of the real example’s size, adapting to the circumstances of the test being run on my laptop…

SQL> insert into ranges (range_start, range_end, label)
        with r (startnum, endnum, lbl) as
            (select 10000 startnum, 10000 + round(dbms_random.value(9000, 11000)) endnum,   dbms_random.string('a', 10) lbl from dual
             union all
             select endnum + 1, endnum + 1 + round(dbms_random.value(9000, 11000)), dbms_random.string('a', 10) from r
        where startnum < 1000000000)
        select * from r;
SQL> exec dbms_stats.gather_table_stats(user, 'ranges');

SQL> insert into cases (range_code, flag1, flag2)
        select round(dbms_random.value(10000, 1000000000)) from dual 
        connect by level < 100000;
SQL> exec dbms_stats.gather_table_stats(user, 'cases');

SQL> select table_name, num_rows from user_tables;

------------------------------ ----------
CASES				    99999
RANGES				   100021

SQL> select table_name, column_name, num_distinct, histogram from user_tab_columns order by 1,2;

------------------------------ ------------------------------ ------------ ---------------
CASES			       ID				     99999 NONE
CASES			       RANGE_CODE			     99999 NONE
RANGES			       LABEL				    100021 NONE
RANGES			       RANGE_END			    100021 NONE
RANGES			       RANGE_START			    100021 NONE

The original query

Now first, for the original query.

SQL> select c.id, r.label from cases c join ranges r
  on (c.range_code between r.range_start and r.range_end);

For an output of 99999 rows, this query runs forever (40 mins on my laptop), doing a merge join between ranges (accessed via the primary key index) and cases on c.range_code>=r.range_start, while the join result is filtered on c.range_code<=r.range_end.

SQL> select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

SQL_ID	54nqyafhsw8fp, child number 0
select c.id, r.label from jointest.cases c join jointest.ranges r on
(c.range_code between r.range_start and r.range_end)

Plan hash value: 3569056763

| Id  | Operation		     | Name	   | Starts | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
|   0 | SELECT STATEMENT	     |		   |	  1 |	     |	     |	     | 34778 (100)|	     |	99999 |00:40:10.12 |   14854 |	 1500 |       |       | 	 |
|   1 |  MERGE JOIN		     |		   |	  1 |	2500M|	  86G|	     | 34778  (97)| 00:00:02 |	99999 |00:40:10.12 |   14854 |	 1500 |       |       | 	 |
|   2 |   TABLE ACCESS BY INDEX ROWID| RANGES	   |	  1 |	 100K|	2441K|	     |	 627   (1)| 00:00:01 |	  100K|00:00:04.38 |   13975 |	  625 |       |       | 	 |
|   3 |    INDEX FULL SCAN	     | SYS_C009905 |	  1 |	 100K|	     |	     |	 214   (1)| 00:00:01 |	  100K|00:00:01.26 |	6929 |	  214 |       |       | 	 |
|*  4 |   FILTER		     |		   |	100K|	     |	     |	     |		  |	     |	99999 |00:40:05.43 |	 879 |	  875 |       |       | 	 |
|*  5 |    SORT JOIN		     |		   |	100K|  99999 |	1171K|	3944K|	 699   (2)| 00:00:01 |	 4990M|00:25:59.14 |	 879 |	  875 |  3525K|   816K| 3133K (0)|
|   6 |     TABLE ACCESS FULL	     | CASES	   |	  1 |  99999 |	1171K|	     |	 240   (1)| 00:00:01 |	99999 |00:00:00.37 |	 879 |	  875 |       |       | 	 |

Predicate Information (identified by operation id):

   4 - filter("C"."RANGE_CODE"="R"."RANGE_START")

The workaround

So, I am looking for some function to apply to cases.range_code that will yield the same result when applied to range_start and range_end in the ranges table. Unfortunately, a simple


(lower numbers will work “even less”, while higher ones will render the hash join less effective) would lose rows:

SQL> select * from      
       (select range_start, range_end,
        trunc(range_start/100000) range_start_truncated,
        trunc(range_end/100000) range_end_truncated
        from ranges) 
        where range_start_truncated != range_end_truncated;

----------- ---------- --------------------- -------------------
903798841  903809794			9037		    9038
903899431  903908527			9038		    9039
903991927  904002717			9039		    9040
904095228  904104978			9040		    9041

9999 rows selected.

Still, the situation is not too bad: range_start and range_end, when divided by 100000 and truncated, differ by at most 1:

SQL> select * from      
       (select range_start, range_end,
        trunc(range_start/100000) range_start_truncated,
        trunc(range_end/100000) range_end_truncated
        from jointest.ranges) 
        where abs(range_start_truncated - range_end_truncated) > 1;

No rows selected.

which means I can simply divide up the hash join in two parts – trunc(range_code/100000) has to match either trunc(range_start/100000) or trunc(range_end/100000) – , union all the result set and look for distinct rows that match the original condition:

SQL> with pre_filter as
       (select c.id, c.range_code, r.range_start, r.range_end
        from cases c join ranges r
        on trunc(c.range_code/100000) = trunc(r.range_start/100000)
        union all
        select c.id, c.range_code, r.range_start, r.range_end
        from cases c join ranges r on
        trunc(c.range_code/100000) = trunc(r.range_end/100000))
          select distinct * from pre_filter
          where range_code between range_start and range_end;

This way, the query runs in one and a half seconds:

| Id  | Operation	      | Name   | Starts | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| E-Time	 | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
|   0 | SELECT STATEMENT      |        |      1 |	 |	 |	 |   374K(100)| 	 |  99999 |00:00:01.60 |    2670 |	 |	 |	    |
|   1 |  HASH UNIQUE	      |        |      1 |     50M|  1240M|  1915M|   374K  (2)| 00:00:15 |  99999 |00:00:01.60 |    2670 |  8288K|  3198K| 6597K (0)|
|   2 |   VIEW		      |        |      1 |     50M|  1240M|	 |  2570  (53)| 00:00:01 |    189K|00:00:01.37 |    2670 |	 |	 |	    |
|   3 |    UNION-ALL	      |        |      1 |	 |	 |	 |	      | 	 |    189K|00:00:01.30 |    2670 |	 |	 |	    |
|*  4 |     HASH JOIN	      |        |      1 |     25M|   620M|  2344K|  1285  (53)| 00:00:01 |  94838 |00:00:00.58 |    1335 |  6824K|  1791K|    9M (0)|
|   5 |      TABLE ACCESS FULL| CASES  |      1 |  99999 |  1171K|	 |   240   (1)| 00:00:01 |  99999 |00:00:00.03 |     879 |	 |	 |	    |
|   6 |      TABLE ACCESS FULL| RANGES |      1 |    100K|  1367K|	 |   137   (1)| 00:00:01 |    100K|00:00:00.03 |     456 |	 |	 |	    |
|*  7 |     HASH JOIN	      |        |      1 |     25M|   620M|  2344K|  1285  (53)| 00:00:01 |  95020 |00:00:00.54 |    1335 |  6824K|  1791K|    9M (0)|
|   8 |      TABLE ACCESS FULL| CASES  |      1 |  99999 |  1171K|	 |   240   (1)| 00:00:01 |  99999 |00:00:00.03 |     879 |	 |	 |	    |
|   9 |      TABLE ACCESS FULL| RANGES |      1 |    100K|  1367K|	 |   137   (1)| 00:00:01 |    100K|00:00:00.03 |     456 |	 |	 |	    |

Predicate Information (identified by operation id):

   4 - access(TRUNC("C"."RANGE_CODE"/100000)=TRUNC("R"."RANGE_START"/100000))

Of course, the data might make things even more complicated in other cases. Still, even a union of several hash joins (where you carefully select the groups so no rows are lost) might run considerably faster than the sort merge join the optimizer has to do otherwise.