Direct Path, Nologging, CTAS, and GTT: a comparison of undo and redo generated

To log or log not

Well of course, as an Oracle DBA, I was conscious of the usual “nologging question” – how to weigh the impact on your recovery procedures against the performance gain and space savings (redo log-wise) achieved by turning off logging on certain tables.
But honestly, when it came to nologging and direct path loads, I wouldn’t have been able to give any figures as to how big the difference in redo and undo generation really might be (let alone impact on performance).

So I’ve decided to set up a simple experiment, measuring undo and redo generation while varying the following factors (a sketch of the measurement query follows the list):

  • table logging vs. nologging
  • conventional vs. direct path load
  • conventional table vs. global temporary table
  • insert vs. CTAS
  • small vs. big table
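
For reference, here is a minimal sketch of how such per-statement figures can be obtained: snapshot the session’s redo and undo statistics before and after each load and take the deltas (the bookkeeping around the snapshots is omitted):

select sn.name, ms.value
from v$mystat ms
join v$statname sn on sn.statistic# = ms.statistic#
where sn.name in ('redo size', 'undo change vector size');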

While it would have been nice to compare elapsed times as well, guaranteeing reliable results would have required an extensive series of measurements (especially given the environment I used), which is why I deemed it more prudent not to report any performance-related data 🙂
Speaking of environment, the tests were conducted on an 11.2.0.3 2-node cluster composed of virtual machines running RedHat 5.

The rules

To keep it simple, I’ve restricted myself to the – at least from my point of view – rather common (for a data warehouse) situation of having the database running in archivelog mode, but not in force logging mode.
In this case, we expect the following (see http://docs.oracle.com/cd/E11882_01/server.112/e26088/statements_9014.htm#i2163698, as well as http://docs.oracle.com/cd/E11882_01/server.112/e25494/tables002.htm#i1106265); the statement forms being compared are sketched right after the list:

  • conventional insert generates maximal redo and undo, regardless of logging setting
  • direct path insert does NOT generate any undo for data (as opposed to metadata) changes. Redo for data changes is generated ONLY when the table is not set to nologging
  • a temporary table (being nologging “per se”) has redo generated just for the undo that is being created, not for the inserts – unless direct path load is used, in which case any undo and consequent redo generation is skipped
  • CTAS, being a direct path operation, does NOT generate any undo for data (as opposed to metadata) changes. Redo for data changes is generated ONLY when the table is not set to nologging. Compared to an insert that is direct path as well as nologging, metadata-related redo and undo generated are of course expected to be much bigger
  • the nologging setting should have a bigger effect with big tables than with small ones (as stated by the documentation linked above – but it should logically make sense too, the difference in amount of metadata between a big and a small table being, probably, much smaller than the difference in data size…)
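
To make the comparison concrete, these are the statement forms involved (a hedged sketch – table names are placeholders; for the “nologging” variants, the attribute would be set on the target table beforehand):

alter table target_tab nologging;                               -- per-table logging off

insert into target_tab select * from source_tab;                -- conventional load
insert /*+ append */ into target_tab select * from source_tab;  -- direct path load

create table target_ctas nologging as select * from source_tab; -- CTAS: direct path by design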

The figures

In order to see the effect of table size, I have performed two sets of tests, the first with a rather smallish table (inexplicably, quite precisely of dba_objects size ;-), ~ 60000 rows), the second of size ~ 8 million rows.

First, the results for the small table. More out of curiosity than out of expectation of big differences, I’ve included the table size in blocks, as determined by dbms_stats. (In the tables below, the nologging column refers to the logging attribute of the target table; for the GTT, nologging is implied.)

Small table
statement                                        direct path  nologging  redo (bytes)  undo (bytes)  blocks
insert into <table> select * from                no           no            7,300,176       285,530    1000
insert into <table> select * from                no           yes           7,312,736       281,960    1000
insert /*+ append */ into <table> select * from  yes          no            7,365,808         8,804     909
insert /*+ append */ into <table> select * from  yes          yes              48,684         8,804     909
insert into <GTT> select * from                  no           yes             385,780       272,480     883
insert /*+ append */ into <GTT> select * from    yes          yes                 140            80     883
create table <table> as select * from            yes          no            7,392,692        17,964     909
create table <table> nologging as select * from  yes          yes              82,848        17,844     909

Now, what do the figures say?

Firstly, as to the four combinations of logging on/off and conventional vs. direct path load, we see exactly what we’d expect from the Oracle documentation: it is only when combining direct path and nologging that redo is affected at all. This effect, however, is quite impressive: redo is reduced by a factor of about 150 (7,365,808 / 48,684 ≈ 151)!

How does insert append compare to CTAS? When performed logging, CTAS in fact generates an amount of redo similar to the corresponding insert append. Done nologging, its redo amounts to nearly double that of insert append (82,848 vs. 48,684 bytes) – the same holds for undo (17,844 vs. 8,804 bytes). Not surprising, actually, when we take into account that metadata changes always have to be logged – and CTAS, of course, includes the metadata for table creation…

So, given insert append, can it get any better? Well yes, it can, with a GTT. Now, while CTAS is always direct path and can be logging as well as nologging, GTTs are always nologging and can optionally be direct path. And in contrast to inserts into a normal table, with a GTT we do see an effect of nologging alone: a GTT, even when populated by conventional insert, has redo reduced by a factor of about 20. But in combination with direct path load, the effect is, well, quite awesome: redo (as well as undo) vanishes almost completely!
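
For illustration, a minimal GTT sketch (names are placeholders; on commit preserve rows is assumed so that the loaded rows survive the commit):

create global temporary table stage_gtt
on commit preserve rows
as select * from source_tab where 1 = 0;  -- copies the structure only

insert /*+ append */ into stage_gtt select * from source_tab;  -- direct path load
commit;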

So, I guess what this means is that you need rather compelling reasons not to use GTTs wherever they are appropriate. And fortunately, some formerly compelling reasons against them go away with 12c: session-level statistics, inserts on the standby … definitely something worth checking out!

Big table

Now, we still have the question of whether the savings in redo generation will be bigger with bigger tables. Let’s see.

statement                                        direct path  nologging  redo (bytes)  undo (bytes)  blocks
insert into <table> select * from                no           no          928,535,948    35,028,304  113117
insert into <table> select * from                no           yes         928,664,192    35,027,486  113117
insert /*+ append */ into <table> select * from  yes          no          937,730,612        76,952  113789
insert /*+ append */ into <table> select * from  yes          yes             772,916        76,644  113789
create table <table> as select * from            yes          no          938,164,896       157,080  113789
create table <table> nologging as select * from  yes          yes             962,696       121,376  113789
insert into <GTT> select * from                  no           yes          49,459,284    34,941,516  112702
insert /*+ append */ into <GTT> select * from    yes          yes                 420            80  112702

And they are: with the big table, redo is reduced not by a factor of ~150, but of ~1,200 (937,730,612 / 772,916 ≈ 1,213) … and with the GTT, we get 8 million rows populated for virtually no redo (420 bytes).

Now, that’s it for now. If you’re impressed by the effects of nologging and are in a situation where its advantages do count, it might be time to check out the exact consequences on recoverability (possibly starting out with this blog post: http://oraclecomplete.blogspot.de/2013/01/benefits-and-consequences-of-nologging.html).

SQL trace (event 10046) – how much overhead is there, really?

With the granularity and precision offered by dbms_monitor in Oracle 11g – you can trace by client identifier, by service/module/action (if your application is instrumented that well), or even by sql_id – one might start to wonder whether tracing isn’t an option in circumstances where it definitely was not an option before (e.g., a busy production system). After all, tracing can be done in a very targeted, specific, limited way… Still, even in such cases, it would be best to have a clear idea of what – how much overhead – to expect. So instead of speculating, I’ve decided to run a little experiment, intended to be sufficiently realistic.
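
For reference, targeted tracing can look like this (a hedged sketch – the service, module, action and client identifier values are placeholders):

begin
  -- trace one service/module/action combination only
  dbms_monitor.serv_mod_act_trace_enable(
    service_name => 'ORDERS',
    module_name  => 'checkout',
    action_name  => 'pay',
    waits        => true,
    binds        => false);
  -- or trace by client identifier
  dbms_monitor.client_id_trace_enable(
    client_id => 'batch_user',
    waits     => true,
    binds     => false);
end;

-- tracing a single statement by sql_id (11g event syntax):
alter system set events 'sql_trace[sql: <sql_id>]';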

This experiment draws deeply on Alexander Anokhin’s impressive, in-depth series on rowsource statistics (see part 1, http://alexanderanokhin.wordpress.com/2012/12/24/timing-query-execution-statistics-rowsource-statistics-part-1-how-it-works/ and part 2, http://alexanderanokhin.wordpress.com/2012/12/24/timing-rowsource-statistics-part-2-overhead-and-inconsistent-time/), and while I’ll refer to his blog for the facts and inner workings, the purpose here is to mimic a possible “real workload” and see how tracing affects performance.

First, let me quickly sum up some principal facts reported by Alexander – facts that determine my pre-experiment assumptions.

Background

1. Enabling rowsource statistics: sql trace, gather_plan_statistics, and statistics_level=all

All of the above cause rowsource statistics (mainly, the time spent executing a rowsource, the number of rows returned, and statistics like the number of consistent and current gets and physical reads/writes) to be gathered. They differ, however, in how often the current timestamp is obtained from the OS, and thus in how accurate the resulting execution time information is; one-liners for switching each of them on follow the list below.

  • statistics_level=all:
    If you set this parameter, a timer call will be made for every execution of the “statistics-snapping function pair” qerstSnapStats()/qerstUpdateStats() – provided you left the underscore parameter _rowsource_statistics_sampfreq at its default of 128. If you changed it, the parameter’s setting will be used instead.
  • hint gather_plan_statistics:
    If you use the hint gather_plan_statistics, _rowsource_statistics_sampfreq will be used, whatever its value.
  • sql trace (event 10046):
    The same: _rowsource_statistics_sampfreq will be used, whatever its value.
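
For concreteness, the three ways of switching on rowsource statistics (one-liners; the 10046 level 8 variant includes wait events):

alter session set statistics_level = all;

select /*+ gather_plan_statistics */ count(*) from t1;

alter session set events '10046 trace name context forever, level 8';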

These facts are interesting, in the context of my current question, because they suggest that if I actually decide to do some tracing, I still have options – I might be able to control the overhead by modifying _rowsource_statistics_sampfreq.

2. Overhead of timer calls and the VDSO

The actual overhead of timer calls – on Linux, using the gettimeofday function – varies with the configuration of the related VDSO (Virtual Dynamic Shared Object); for background information, see e.g. https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-General_System_Tuning-gettimeofday_speedup.html.
Citing the RedHat documentation, the VDSO

“is a shared library that allows applications in user space to perform some kernel actions without as much overhead as a system call. […] Enabling the VDSO instructs the kernel to use its definition of the symbols in the VDSO, rather than the ones found in any user-space shared libraries, particularly the glibc. […] When enabled, the VDSO overrides the glibc definition of gettimeofday with its own. This removes the overhead of a system call, as the call is made direct to the kernel memory, rather than going through the glibc.”

Configuration options – in /proc/sys/kernel/vsyscall64 – are:

  • 0: uses the regular system call – highest overhead, μs resolution
  • 1: uses the VDSO definition – lower overhead, but still μs resolution; this is the default if you don’t modify the parameter
  • 2: uses the VDSO definition – lowest overhead, ms resolution

The importance for me here is twofold: first – sweet hope – perhaps the default value of 1 already makes timer calls sufficiently cheap? And second, how much improvement can I get by setting the parameter to 2? (Although I doubt that ms resolution is enough for each and every real-world problem.)
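
As a side note, a crude sanity check of timer call cost can even be run from within the database – a sketch, under the assumption that systimestamp ends up in a gettimeofday call on Linux (worth verifying with strace):

-- set serveroutput on first
declare
  l_start number := dbms_utility.get_time;  -- centiseconds
  l_ts    timestamp with time zone;
begin
  for i in 1 .. 1000000 loop
    l_ts := systimestamp;  -- presumably one timer call per iteration
  end loop;
  dbms_output.put_line('elapsed (cs): ' || to_char(dbms_utility.get_time - l_start));
end;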

3. Interdependencies

Thirdly, as Alexander demonstrates, the actual flow of execution determines the frequency of timer calls. For example, in a hash join, the hash table is built and probed during row fetching, which means there has to be a timer call before and after every row fetch (see the last paragraphs of http://alexanderanokhin.wordpress.com/2012/12/24/timing-query-execution-statistics-rowsource-statistics-part-1-how-it-works/).
For practical purposes, this means that I expect detrimental effects of tracing to be most pronounced when e.g. hash joins are involved.

The experiment

So, my assumptions before the experiment were:

  1. Elapsed time will be higher when tracing is on (naturally)
  2. When tracing is on, elapsed time will correlate to the setting of _rowsource_statistics_sampfreq.
  3. Elapsed time will be less with a VDSO parameter of 1 than of 0, and possibly even less with a setting of 2.
  4. The kind of operation will make a difference. To keep things simple, I’ve decided to test two different types of workload: an OLTP-like workload, where lookup goes through indexes and a rather small number of rows is returned, and a data-warehouse-like workload, where data is retrieved by full table scan, many rows are returned, and joins are mostly hash joins. For my setup, I expect any effects between different settings of _rowsource_statistics_sampfreq to be more pronounced with the DWH workload and less visible with the OLTP workload. In general, I also expect the difference between tracing on and off to be more substantial with the DWH workload.

The setup

First, the test tables and the measurement table. The indexes were created to allow for a simple distinction between the OLTP and DWH workloads (selective predicates facilitating index access vs. non-selective predicates necessitating full table scans):

create table t1 as select * from all_objects;
create table t2 as select * from all_objects;

create index t1_i1 on t1(owner);
create index t1_i2 on t1(object_type);
create index t2_i1 on t2(owner);
create index t2_i2 on t2(object_type);

exec dbms_stats.gather_table_stats(user, 't1');
exec dbms_stats.gather_table_stats(user, 't2');

create table tracetest.elapsed (runid number, proc varchar2(30), tracing varchar2(3), sampfreq number, vsyscall64 number, elapsed number);
alter table tracetest.elapsed add primary key(runid, proc, tracing, sampfreq, vsyscall64);

The OLTP workload performs selective queries and uses the indexes created on object_type and owner (in the case of the join query, cursor 3, doing a nested loop join):

create or replace procedure tracetest.tracetest_highsel_small as

  cursor c1 is
    select * from tracetest.t1 where object_type = 'JOB'; -- 22 rows returned
  l1 c1%rowtype;

  cursor c2 is
    select * from tracetest.t1 where owner = 'DBSNMP'; -- 63 rows returned
  l2 c2%rowtype;

  cursor c3 is -- 10 rows returned
    select object_id id1,
           t1.object_type tp1,
           t2.object_type tp2,
           t1.created cr1,
           t2.created cr2
      from tracetest.t1
      join tracetest.t2 using (object_id)
     where t1.object_type = 'CLUSTER'
       and t2.object_type = 'CLUSTER';
  l3 c3%rowtype;

begin

  for i in 1 .. 15000 loop

    open c1;
    open c2;
    open c3;

    loop
      fetch c1 into l1;
      exit when c1%notfound;
    end loop;
    loop
      fetch c2 into l2;
      exit when c2%notfound;
    end loop;
    loop
      fetch c3 into l3;
      exit when c3%notfound;
    end loop;

    close c1;
    close c2;
    close c3;

  end loop;

end;

The DWH workload asks for a large portion of the data and consequently performs full table scans and, for cursor 3, a hash join:

create or replace procedure tracetest.tracetest_lowsel_small as

  cursor c1 is
    select * from tracetest.t1 where object_type = 'SYNONYM'; -- 25652 rows returned
  l1 c1%rowtype;

  cursor c2 is
    select * from tracetest.t1 where owner = 'SYS'; -- 32095 rows returned
  l2 c2%rowtype;

  cursor c3 is -- 20788 rows returned
    select object_id id1,
           t1.object_type tp1,
           t2.object_type tp2,
           t1.created cr1,
           t2.created cr2
      from tracetest.t1
      join tracetest.t2 using (object_id)
     where t1.object_type = 'JAVA CLASS'
       and t2.object_type = 'JAVA CLASS';
  l3 c3%rowtype;

begin

  for i in 1 .. 24 loop

    open c1;
    open c2;
    open c3;

    loop
      fetch c1 into l1;
      exit when c1%notfound;
    end loop;
    loop
      fetch c2 into l2;
      exit when c2%notfound;
    end loop;
    loop
      fetch c3 into l3;
      exit when c3%notfound;
    end loop;

    close c1;
    close c2;
    close c3;

  end loop;

end;

For both workloads, I created a driver procedure that loops through calls of the workload procedure, using different settings of _rowsource_statistics_sampfreq with tracing turned on as well as off, and records the elapsed time:

create or replace procedure tracerun_highsel_small (p_vsyscall64 in number)
is
  l_time number;
begin

  for i in 1 .. 20 loop

    -- no tracing, _rowsource_statistics_sampfreq = 128
    dbms_monitor.session_trace_disable;
    execute immediate 'alter system set "_rowsource_statistics_sampfreq" = 128';
    l_time := dbms_utility.get_time;
    tracetest.tracetest_highsel_small;
    insert into tracetest.elapsed
    values (i, 'tracetest_highsel_small', 'off', 128, p_vsyscall64,
            dbms_utility.get_time - l_time);
    commit;

    -- tracing, _rowsource_statistics_sampfreq = 128
    execute immediate 'alter system set "_rowsource_statistics_sampfreq" = 128';
    dbms_monitor.session_trace_enable;
    l_time := dbms_utility.get_time;
    tracetest.tracetest_highsel_small;
    insert into tracetest.elapsed
    values (i, 'tracetest_highsel_small', 'on', 128, p_vsyscall64,
            dbms_utility.get_time - l_time);
    commit;
    dbms_monitor.session_trace_disable;

    -- tracing, _rowsource_statistics_sampfreq = 16
    execute immediate 'alter system set "_rowsource_statistics_sampfreq" = 16';
    dbms_monitor.session_trace_enable;
    l_time := dbms_utility.get_time;
    tracetest.tracetest_highsel_small;
    insert into tracetest.elapsed
    values (i, 'tracetest_highsel_small', 'on', 16, p_vsyscall64,
            dbms_utility.get_time - l_time);
    commit;
    dbms_monitor.session_trace_disable;

    -- tracing, _rowsource_statistics_sampfreq = 4
    execute immediate 'alter system set "_rowsource_statistics_sampfreq" = 4';
    dbms_monitor.session_trace_enable;
    l_time := dbms_utility.get_time;
    tracetest.tracetest_highsel_small;
    insert into tracetest.elapsed
    values (i, 'tracetest_highsel_small', 'on', 4, p_vsyscall64,
            dbms_utility.get_time - l_time);
    commit;
    dbms_monitor.session_trace_disable;

    -- tracing, _rowsource_statistics_sampfreq = 1
    execute immediate 'alter system set "_rowsource_statistics_sampfreq" = 1';
    dbms_monitor.session_trace_enable;
    l_time := dbms_utility.get_time;
    tracetest.tracetest_highsel_small;
    insert into tracetest.elapsed
    values (i, 'tracetest_highsel_small', 'on', 1, p_vsyscall64,
            dbms_utility.get_time - l_time);
    commit;
    dbms_monitor.session_trace_disable;

  end loop;

end;

This driver procedure was then called once for each setting of vsyscall64, passing in the value that was active at the OS level.
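
One round of measurements then looked like this (a sketch – assuming /proc/sys/kernel/vsyscall64 had just been set to the corresponding value at the OS level):

begin
  tracerun_highsel_small(p_vsyscall64 => 1);
end;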

The tests were performed on a RedHat 6.4 system with 20G of memory and 2 CPUs. And the results were…
… a bit unexpected.

Results

First, the OLTP workload:

SQL> select proc, vsyscall64, tracing, sampfreq, round(avg(elapsed)) from tracetest.elapsed
  2  where proc='tracetest_highsel_small'
  3  group by proc, vsyscall64, tracing, sampfreq
  4  order by 1,2,3,4 desc;

PROC                     VSYSCALL64  TRA  SAMPFREQ  ROUND(AVG(ELAPSED))
-----------------------  ----------  ---  --------  -------------------
tracetest_highsel_small           0  off       128                 2907
tracetest_highsel_small           0  on        128                 5349
tracetest_highsel_small           0  on         16                 5415
tracetest_highsel_small           0  on          4                 5385
tracetest_highsel_small           0  on          1                 5402

tracetest_highsel_small           1  off       128                 2770
tracetest_highsel_small           1  on        128                 5269
tracetest_highsel_small           1  on         16                 5199
tracetest_highsel_small           1  on          4                 5214
tracetest_highsel_small           1  on          1                 5241

tracetest_highsel_small           2  off       128                 2746
tracetest_highsel_small           2  on        128                 5248
tracetest_highsel_small           2  on         16                 5212
tracetest_highsel_small           2  on          4                 5250
tracetest_highsel_small           2  on          1                 5221

Well, I was prepared to perform some statistical tests, but this hardly seems to make sense here…

As for my first assumption: visibly, traced execution does take more time than non-traced – nearly twice as much (188% of the untraced elapsed time, in my test).
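
(That figure can be computed straight from the measurement table – a quick sketch:)

select proc,
       round(avg(case when tracing = 'on'  then elapsed end) /
             avg(case when tracing = 'off' then elapsed end) * 100) pct_traced_vs_untraced
from tracetest.elapsed
group by proc;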

As for the second assumption – the lower the setting of _rowsource_statistics_sampfreq, the higher the elapsed time – I don’t find any confirmation. Evidently, in my experiment, modifying _rowsource_statistics_sampfreq had no effect.

My third assumption, regarding the setting of vsyscall64, is not really confirmed either:

select vsyscall64, round(avg(elapsed)) from tracetest.elapsed
where proc='tracetest_highsel_small'
group by vsyscall64
order by 1;

VSYSCALL64 ROUND(AVG(ELAPSED))
---------- -------------------
         0                4891
         1                4739
         2                4735

It does look like a setting of 0 leads to a very slight increase in elapsed time, but the difference is minimal – with a setting of 1, we save about 3% compared to a setting of 0.
Setting vsyscall64 to 2 does not seem to yield any further savings.

The DWH workload

Let’s now double-check against the results from the DWH-like workload. Perhaps – in accordance with the fourth assumption – _rowsource_statistics_sampfreq will make a difference here (given that so many more rows are returned)?

SQL> set pages 10000
SQL> select proc, vsyscall64, tracing, sampfreq, round(avg(elapsed)) from tracetest.elapsed
  2  where proc='tracetest_lowsel_small'
  3  group by proc, vsyscall64, tracing, sampfreq
  4  order by 1,2,3,4 desc;

PROC                     VSYSCALL64  TRA  SAMPFREQ  ROUND(AVG(ELAPSED))
-----------------------  ----------  ---  --------  -------------------
tracetest_lowsel_small            0  off       128                 2898
tracetest_lowsel_small            0  on        128                 5713
tracetest_lowsel_small            0  on         16                 5735
tracetest_lowsel_small            0  on          4                 5725
tracetest_lowsel_small            0  on          1                 5744

tracetest_lowsel_small            1  off       128                 2721
tracetest_lowsel_small            1  on        128                 5575
tracetest_lowsel_small            1  on         16                 5556
tracetest_lowsel_small            1  on          4                 5546
tracetest_lowsel_small            1  on          1                 5551

tracetest_lowsel_small            2  off       128                 2669
tracetest_lowsel_small            2  on        128                 5490
tracetest_lowsel_small            2  on         16                 5488
tracetest_lowsel_small            2  on          4                 5489
tracetest_lowsel_small            2  on          1                 5477

First, non-traced execution takes 49% of the traced time – a ratio similar to the OLTP workload’s, though even slightly less favorable to tracing.
Secondly – answering our question as to whether we’d see any effect of _rowsource_statistics_sampfreq with the changed workload – no, we don’t, even with the much higher number of rows returned by the queries involved.
Thirdly, regarding vsyscall64,

select vsyscall64, round(avg(elapsed)) from tracetest.elapsed
where proc='tracetest_lowsel_small'
group by vsyscall64
order by 1;

VSYSCALL64 ROUND(AVG(ELAPSED))
---------- -------------------
         0                5163
         1                4990
         2                4923

we again see only minimal differences – not really what was expected originally.
And lastly, it’s yet another no: the two workloads do not behave differently, as far as the kinds of measurements taken here are concerned.

Conclusion

So, what’s the conclusion? As regards the impact of tracing, I think an approximate statement can be made about its cost in real-world situations. I should also mention that variance was small, and that I additionally experimented with a test procedure that included – though to a lesser degree – insert, update and delete statements. With that “mixed” procedure, the difference between tracing and not tracing was much less visible – hardly visible, even.

That’s why I think it can be stated that the 2:1 ratio between tracing and not tracing is something like an upper bound for real-world situations (meaning that the difference will often be much less pronounced).

Other than that, two questions remain: why is there no visible effect of varying _rowsource_statistics_sampfreq, and none of varying vsyscall64?
I can conceive of at least three possible types of answer (which I will not investigate in this post – it is long enough already):

First, there could be something in the experiment’s setup that precludes getting the expected results. (Actually, if I knew that was the case, I’d be modifying and rerunning the tests right now :-).)

Then, there could be independent reasons for both “non-findings”: one having to do with Oracle’s sampling method, the other with the workings of the OS timer call.

Finally, there’s also the possibility that it is not the frequency and implementation of timer calls that constitutes the main part of the tracing overhead, but something else, unaffected by either _rowsource_statistics_sampfreq or vsyscall64. This possibility looks intuitively plausible to me because the overhead has to arise somewhere…
Of course, OS-level tracing should be able to help clarify this issue. Another time…