SQL trace (event 10046) – how much overhead is there, really?

With the granularity and precision offered by dbms_monitor in Oracle 11g – you can trace by client identifier, by service/module/action (if your application is instrumented that well), or even by sql_id – one might start to wonder whether tracing has become an option in circumstances where it definitely was not one before (e.g., on a busy production system). After all, tracing can now be done in a very targeted, specific, limited way… Still, even in such cases, it would be best to have a clear idea of how much overhead to expect. So instead of speculating, I decided to run a little experiment, intended to be sufficiently realistic.
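To make those targeting options concrete, here is a sketch of what such selective tracing could look like. The client identifier, service/module/action names, and the sql_id below are made-up placeholders, not values from this experiment:

```sql
begin
  -- trace all sessions tagged with a given client identifier
  dbms_monitor.client_id_trace_enable(client_id => 'BATCH_USER',
                                      waits     => true,
                                      binds     => false);

  -- trace a specific service/module/action combination
  dbms_monitor.serv_mod_act_trace_enable(service_name => 'MYSERVICE',
                                         module_name  => 'MYMODULE',
                                         action_name  => 'MYACTION',
                                         waits        => true,
                                         binds        => false);
end;
/

-- 11g event syntax: trace a single statement wherever it runs, by sql_id
alter system set events 'sql_trace [sql:<sql_id>]';
```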

This experiment draws heavily on Alexander Anokhin’s impressive, in-depth series on rowsource statistics (see part 1, http://alexanderanokhin.wordpress.com/2012/12/24/timing-query-execution-statistics-rowsource-statistics-part-1-how-it-works/ and part 2, http://alexanderanokhin.wordpress.com/2012/12/24/timing-rowsource-statistics-part-2-overhead-and-inconsistent-time/), and while I’ll refer to his blog for the facts and inner workings, the purpose here is to mimic a possible “real workload” and see how tracing affects performance.

First, let me quickly sum up some principal facts presented by Alexander – facts that will determine my pre-experiment assumptions.

Background

1. Enabling rowsource statistics: sql trace, gather_plan_statistics, and statistics_level=all

All of the above cause rowsource statistics (mainly, the time spent executing a rowsource, the number of rows returned, and statistics like the number of consistent and current gets and physical reads/writes) to be gathered. They differ, however, in how often the current timestamp is obtained from the OS, and thus in how accurate the resulting execution time information is.

  • statistics_level=all:
    If you set this parameter, a timer call will be made on every execution of the “statistics-snapping” function pair qerstSnapStats()/qerstUpdateStats() – provided you left the underscore parameter _rowsource_statistics_sampfreq at its default of 128. If you changed it, the parameter’s setting will be used instead.
  • hint gather_plan_statistics:
    If you use the hint gather_plan_statistics, _rowsource_statistics_sampfreq will be used, whatever its value.
  • sql trace (event 10046):
    The same: _rowsource_statistics_sampfreq will be used, whatever its value.
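As an aside, the rowsource statistics gathered this way are exactly what dbms_xplan can report back afterwards. A minimal sketch (any query will do; the A-Rows/A-Time columns in the output come from the statistics just collected):

```sql
select /*+ gather_plan_statistics */ count(*)
from all_objects;

-- display actual rowsource statistics for the last execution of this cursor
select *
from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));
```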

These facts are interesting, in the context of my current question, because they suggest that if I actually decide to do some tracing, I still have options – I might be able to control the overhead by modifying _rowsource_statistics_sampfreq.
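To see where I’m starting from, the current setting of the underscore parameter can be checked with the usual fixed-table query (this requires access to the x$ tables, i.e., connecting as SYS; the query itself is the standard idiom, nothing specific to this experiment):

```sql
select p.ksppinm  parameter,
       v.ksppstvl value
from   x$ksppi  p,
       x$ksppcv v
where  p.indx = v.indx
and    p.ksppinm = '_rowsource_statistics_sampfreq';
```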

2. Overhead of timer calls and the VDSO

The actual overhead of timer calls – on Linux, using the gettimeofday function – varies with the configuration of the VDSO (Virtual Dynamic Shared Object); for background information see e.g. https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-General_System_Tuning-gettimeofday_speedup.html.
Citing the RedHat documentation, the VDSO

“is a shared library that allows application in user space to perform some kernel actions without as much overhead as a system call. […] Enabling the VDSO instructs the kernel to use its definition of the symbols in the VDSO, rather than the ones found in any user-space shared libraries, particularly the glibc. […] When enabled, the VDSO overrides the glibc definition of gettimeofday with it’s own. This removes the overhead of a system call, as the call is made direct to the kernel memory, rather than going through the glibc.”

Configuration options – in /proc/sys/kernel/vsyscall64 – are:

  • 0: uses the regular system call; highest overhead, μs resolution
  • 1: uses the VDSO definition; lower overhead, but still μs resolution – this is the default if you don’t modify the parameter
  • 2: uses the VDSO definition; lowest overhead, ms resolution

The importance for me here is twofold: First – sweet hope – perhaps the default value of 1 already makes timer calls sufficiently cheap? And second, how much improvement can I get by setting the parameter to 2? (Although I doubt that ms resolution is enough for each and every real-world problem.)

3. Interdependencies

Thirdly, as Alexander demonstrates, the actual flow of execution determines the frequency of timer calls. For example, in a hash join, the hash table is built and probed during row fetch, which means there has to be a timer call before and after every row fetch (see the last paragraphs of http://alexanderanokhin.wordpress.com/2012/12/24/timing-query-execution-statistics-rowsource-statistics-part-1-how-it-works/).
For practical purposes, this means I expect the detrimental effects of tracing to be most pronounced when, e.g., hash joins are involved.

The experiment

So, my assumptions before the experiment were:

  1. Elapsed time will be higher when tracing is on (naturally).
  2. When tracing is on, elapsed time will correlate with the setting of _rowsource_statistics_sampfreq.
  3. Elapsed time will be lower with a VDSO parameter of 1 than of 0, and possibly lower still with a setting of 2.
  4. The kind of operation will make a difference. To keep this simple, I’ve decided to test two different types of workload: an OLTP-like workload, where lookup goes through indexes and returns a rather small number of rows, and a data-warehouse-like (DWH) workload, where data is retrieved by full table scan, many rows are returned, and joins are mostly hash joins. For my setup, I expect any effects between different settings of _rowsource_statistics_sampfreq to be more pronounced with the DWH workload and less visible with the OLTP workload. In general, I also expect the difference between tracing on and off to be more substantial with the DWH workload.

The setup

First, the test tables and the measurement table. The indexes were created to allow for a simple distinction between the OLTP and DWH workloads (selective predicates facilitating index access vs. non-selective predicates necessitating full table scans):

create table t1 as select * from all_objects;
create table t2 as select * from all_objects;

create index t1_i1 on t1(owner);
create index t1_i2 on t1(object_type);
create index t2_i1 on t2(owner);
create index t2_i2 on t2(object_type);

exec dbms_stats.gather_table_stats(user, 't1');
exec dbms_stats.gather_table_stats(user, 't2');

create table tracetest.elapsed (runid number, proc varchar2(30), tracing varchar2(3), sampfreq number, vsyscall64 number, elapsed number);
alter table tracetest.elapsed add primary key(runid, proc, tracing, sampfreq, vsyscall64);

The OLTP workload performs selective queries and uses the indexes created on object_type and owner (in the case of the join query, cursor 3, doing a nested loop join):

create or replace procedure tracetest.tracetest_highsel_small as

  cursor c1 is
    select * from tracetest.t1 where object_type = 'JOB'; -- 22 rows returned
  l1 c1%rowtype;

  cursor c2 is
    select * from tracetest.t1 where owner = 'DBSNMP'; -- 63 rows returned
  l2 c2%rowtype;

  cursor c3 is -- 10 rows returned
    select object_id id1,
           t1.object_type tp1,
           t2.object_type tp2,
           t1.created cr1,
           t2.created cr2
      from tracetest.t1
      join tracetest.t2
     using (object_id)
     where t1.object_type = 'CLUSTER'
       and t2.object_type = 'CLUSTER';
  l3 c3%rowtype;

begin

  for i in 1 .. 15000 loop

    open c1;
    open c2;
    open c3;

    loop
      fetch c1 into l1;
      exit when c1%notfound;
    end loop;

    loop
      fetch c2 into l2;
      exit when c2%notfound;
    end loop;

    loop
      fetch c3 into l3;
      exit when c3%notfound;
    end loop;

    close c1;
    close c2;
    close c3;

  end loop;

end;

The DWH workload asks for a large portion of the data and, consequently, performs full table scans and, for cursor 3, a hash join:

create or replace procedure tracetest.tracetest_lowsel_small as

  cursor c1 is
    select * from tracetest.t1 where object_type = 'SYNONYM'; -- 25652 rows returned
  l1 c1%rowtype;

  cursor c2 is
    select * from tracetest.t1 where owner = 'SYS'; -- 32095 rows returned
  l2 c2%rowtype;

  cursor c3 is -- 20788 rows returned
    select object_id id1,
           t1.object_type tp1,
           t2.object_type tp2,
           t1.created cr1,
           t2.created cr2
      from tracetest.t1
      join tracetest.t2
     using (object_id)
     where t1.object_type = 'JAVA CLASS'
       and t2.object_type = 'JAVA CLASS';
  l3 c3%rowtype;

begin

  for i in 1 .. 24 loop

    open c1;
    open c2;
    open c3;

    loop
      fetch c1 into l1;
      exit when c1%notfound;
    end loop;

    loop
      fetch c2 into l2;
      exit when c2%notfound;
    end loop;

    loop
      fetch c3 into l3;
      exit when c3%notfound;
    end loop;

    close c1;
    close c2;
    close c3;

  end loop;

end;

For both workloads, I created a driver procedure that loops through calls of the workload procedure – using different settings of _rowsource_statistics_sampfreq, with tracing turned on as well as off – and records the elapsed time:

create or replace procedure tracerun_highsel_small (p_vsyscall64 in number)
is
  l_time number;
begin

  for i in 1 .. 20 loop

    -- no tracing, _rowsource_statistics_sampfreq = 128
    dbms_monitor.session_trace_disable;
    execute immediate 'alter system set "_rowsource_statistics_sampfreq" = 128';
    l_time := dbms_utility.get_time;
    tracetest.tracetest_highsel_small;
    insert into tracetest.elapsed
    values (i, 'tracetest_highsel_small', 'off', 128, p_vsyscall64,
            dbms_utility.get_time - l_time);
    commit;

    -- tracing, _rowsource_statistics_sampfreq = 128
    execute immediate 'alter system set "_rowsource_statistics_sampfreq" = 128';
    dbms_monitor.session_trace_enable;
    l_time := dbms_utility.get_time;
    tracetest.tracetest_highsel_small;
    insert into tracetest.elapsed
    values (i, 'tracetest_highsel_small', 'on', 128, p_vsyscall64,
            dbms_utility.get_time - l_time);
    commit;
    dbms_monitor.session_trace_disable;

    -- tracing, _rowsource_statistics_sampfreq = 16
    execute immediate 'alter system set "_rowsource_statistics_sampfreq" = 16';
    dbms_monitor.session_trace_enable;
    l_time := dbms_utility.get_time;
    tracetest.tracetest_highsel_small;
    insert into tracetest.elapsed
    values (i, 'tracetest_highsel_small', 'on', 16, p_vsyscall64,
            dbms_utility.get_time - l_time);
    commit;
    dbms_monitor.session_trace_disable;

    -- tracing, _rowsource_statistics_sampfreq = 4
    execute immediate 'alter system set "_rowsource_statistics_sampfreq" = 4';
    dbms_monitor.session_trace_enable;
    l_time := dbms_utility.get_time;
    tracetest.tracetest_highsel_small;
    insert into tracetest.elapsed
    values (i, 'tracetest_highsel_small', 'on', 4, p_vsyscall64,
            dbms_utility.get_time - l_time);
    commit;
    dbms_monitor.session_trace_disable;

    -- tracing, _rowsource_statistics_sampfreq = 1
    execute immediate 'alter system set "_rowsource_statistics_sampfreq" = 1';
    dbms_monitor.session_trace_enable;
    l_time := dbms_utility.get_time;
    tracetest.tracetest_highsel_small;
    insert into tracetest.elapsed
    values (i, 'tracetest_highsel_small', 'on', 1, p_vsyscall64,
            dbms_utility.get_time - l_time);
    commit;
    dbms_monitor.session_trace_disable;

  end loop;

end;

This driver procedure was then called once for each setting of vsyscall64, passing in the current value so it would be recorded with the results.
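For completeness, the calls looked roughly like this. The vsyscall64 change itself happens at the OS level, outside the database; the parameter merely records the value for later grouping. I’m assuming here that the DWH driver is named analogously to the OLTP one shown above:

```sql
-- after echoing 0, 1 or 2 into /proc/sys/kernel/vsyscall64 as root,
-- run the drivers with the matching value, e.g. for vsyscall64 = 1:
exec tracerun_highsel_small(1)
exec tracerun_lowsel_small(1)
```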

The tests were performed on a Red Hat 6.4 system with 20 GB of memory and 2 CPUs. And the results were…
… a bit unexpected.

Results

First, the OLTP workload:

SQL> select proc, vsyscall64, tracing, sampfreq, round(avg(elapsed)) from tracetest.elapsed
2 where proc='tracetest_highsel_small'
3 group by proc, vsyscall64, tracing, sampfreq
4 order by 1,2,3,4 desc;

PROC                    VSYSCALL64 TRA SAMPFREQ ROUND(AVG(ELAPSED))
----------------------- ---------- --- -------- -------------------
tracetest_highsel_small   0    off   128   2907
tracetest_highsel_small   0    on    128   5349
tracetest_highsel_small   0    on    16    5415
tracetest_highsel_small   0    on    4     5385
tracetest_highsel_small   0    on    1     5402

tracetest_highsel_small   1    off   128   2770
tracetest_highsel_small   1    on    128   5269
tracetest_highsel_small   1    on    16    5199
tracetest_highsel_small   1    on    4     5214
tracetest_highsel_small   1    on    1     5241

tracetest_highsel_small   2    off   128   2746
tracetest_highsel_small   2    on    128   5248
tracetest_highsel_small   2    on    16    5212
tracetest_highsel_small   2    on    4     5250
tracetest_highsel_small   2    on    1     5221

Well, I was prepared to perform some statistical tests, but this hardly seems to make sense here…

As for my first assumption: yes, tracing visibly takes more time than non-traced execution – nearly twice as much (188% in my test, to be precise).

For the second assumption – the lower the setting of _rowsource_statistics_sampfreq, the higher the elapsed time – I don’t find any confirmation. Evidently, in my experiment, modifying _rowsource_statistics_sampfreq had no effect.

My third assumption regarding the setting of vsyscall64 is not really confirmed either:

select vsyscall64, round(avg(elapsed)) from tracetest.elapsed
where proc='tracetest_highsel_small'
group by vsyscall64
order by 1;

VSYSCALL64 ROUND(AVG(ELAPSED))
---------- -------------------
0          4891
1          4739
2          4735

It does look like a setting of 0 leads to a very slight increase in elapsed time, but the difference is minimal – with a setting of 1, we save about 3% of the time spent with a setting of 0.
Setting vsyscall64 to 2 does not seem to bring any further time savings.

The DWH workload

Let’s now double-check against the results from the DWH-like workload. Perhaps – in accordance with the fourth assumption – _rowsource_statistics_sampfreq will make a difference here (given that we’re returning so many more rows)?

SQL> set pages 10000
select proc, vsyscall64, tracing, sampfreq, round(avg(elapsed)) from tracetest.elapsed
2 where proc='tracetest_lowsel_small'
3 group by proc, vsyscall64, tracing, sampfreq
4 order by 1,2,3,4 desc;

PROC                    VSYSCALL64 TRA SAMPFREQ ROUND(AVG(ELAPSED))
----------------------- ---------- --- -------- -------------------
tracetest_lowsel_small    0    off   128   2898
tracetest_lowsel_small    0    on    128   5713
tracetest_lowsel_small    0    on    16    5735
tracetest_lowsel_small    0    on    4     5725
tracetest_lowsel_small    0    on    1     5744

tracetest_lowsel_small    1    off   128   2721
tracetest_lowsel_small    1    on    128   5575
tracetest_lowsel_small    1    on    16    5556
tracetest_lowsel_small    1    on    4     5546 
tracetest_lowsel_small    1    on    1     5551

tracetest_lowsel_small    2    off   128   2669
tracetest_lowsel_small    2    on    128   5490
tracetest_lowsel_small    2    on    16    5488
tracetest_lowsel_small    2    on    4     5489
tracetest_lowsel_small    2    on    1     5477

First, non-traced execution takes about 49% of the time of traced execution – a ratio similar to the OLTP workload’s, though even slightly less favorable to tracing.
Secondly – answering our question as to whether we’d see any effect of _rowsource_statistics_sampfreq with the changed workload – no, we don’t, even with the high number of rows returned by the queries involved.
Thirdly, regarding vsyscall64,

select vsyscall64, round(avg(elapsed)) from tracetest.elapsed
where proc='tracetest_lowsel_small'
group by vsyscall64
order by 1;

VSYSCALL64 ROUND(AVG(ELAPSED))
---------- -------------------
0          5163
1          4990
2          4923

we again see minimal differences – not really what was expected originally.
And lastly, it’s yet another no – the two workloads do not behave differently, as far as the kinds of measurements taken here are concerned.

Conclusion

So, what’s the conclusion? As regards the impact of tracing, I think an approximate statement can be made about its cost in real-world situations. I should also mention that variance was small, and that I additionally experimented with a test procedure that included – though to a lesser degree – insert, update and delete statements. With that “mixed” procedure, the difference between tracing and not tracing was much less visible – hardly visible, even.

That’s why I think the 1:2 ratio between non-traced and traced execution can be regarded as something like an upper bound for real-world situations (meaning that the difference will often be much less pronounced).

Other than that, two questions remain: why is there no visible effect of varying _rowsource_statistics_sampfreq, and none of varying vsyscall64 either?
I can conceive of at least three possible types of answer (which I will not investigate in this post, which is long enough already):

First, there could be something in the experiment’s setup that precludes getting the expected results. (Actually, if I knew that was the case, I’d be modifying and rerunning the tests right now :-).)

Then, there could be independent reasons for both “non-findings” – one having to do with Oracle’s sampling method, the other with the workings of the OS timer call.

Finally, there’s the possibility that it is not the frequency and implementation of timer calls that constitutes the main part of the tracing overhead, but something else, unaffected by either _rowsource_statistics_sampfreq or vsyscall64. This possibility looks intuitively plausible to me, because the overhead has to arise somewhere…
Of course, OS-level tracing should be able to help clarify this issue. Another time…


One thought on “SQL trace (event 10046) – how much overhead is there, really?”

  1. I think you’ll find that the I/O required to write the trace data to the file system is a dominant cost of having trace activated. Oracle is especially inefficient in that each line of trace data requires *two* write calls: one for the content, and then a separate one for the “\n” character at the end. Especially for SQL statements with lots of (e.g., hundreds of) placeholders, there is a lot of additional writing going on when you consider that there are several lines of output for each placeholder with the binds=>true tracing activated.
