Mining SQL Trace Files Using External Tables Part 2: Resource Profiling

Extending the timeline

Last time in Mining SQL Trace Files Using External Tables Part 1: Getting the Timeline, we saw how to extract essential information from a 10046 trace file and establish the order in which SQL statements are issued. Today, we will turn our attention to the most prominent application of trace file parsing, namely, the derivation of resource profiles (leaving more specialized applications for the future).

For that purpose, we have to extend the timeline information to EXEC and FETCH calls as well as waits, so we can map the respective trace file lines to the correct sql_id. This is done analogously to the last post’s simple timeline:

create or replace view timeline_extended_raw as
 select "LINENO","CALL_TYPE","SQLID","CRSR","TIM","CALL_DEPTH","CLOSE_TYPE" from
  (
   select lineno, 'OPEN' call_type, sqlid, crsr, tim, call_depth, null close_type from cursor
   union all
   select lineno, 'PARSE', null, crsr, tim, call_depth, null from call where call_type='PARSE'
   union all
   select lineno, 'EXEC', null, crsr, tim, call_depth, null from call where call_type='EXEC'
   union all
   select lineno, 'FETCH', null, crsr, tim, call_depth, null from call where call_type='FETCH'
   union all
   select lineno, 'CLOSE', null, crsr, tim, call_depth, type from cursor_close
   union all
   select lineno, 'WAIT', null, crsr, tim, null, null from wait
  )
 order by lineno;

create or replace view timeline_extended as
 select
  lineno,
  call_type,
  case when sqlid is not null then sqlid else lag(sqlid) ignore nulls over (partition by crsr order by lineno) end sqlid,
  crsr,
  tim,
  call_depth,
  close_type
 from timeline_extended_raw order by lineno;
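The `lag(sqlid) ignore nulls ... partition by crsr` construct simply carries the last seen sqlid forward within each cursor, so that EXEC, FETCH, CLOSE and WAIT lines inherit the sqlid from the preceding cursor line. In case the mechanics are unclear, here is a minimal Python sketch of the same forward-fill logic, using made-up sample rows (not from the actual trace file):

```python
# Forward-fill the last non-null sqlid within each cursor (crsr), mimicking
# case when sqlid is not null then sqlid
#      else lag(sqlid) ignore nulls over (partition by crsr order by lineno) end
def fill_sqlid(rows):
    # rows: list of (lineno, crsr, sqlid) already sorted by lineno
    last_seen = {}          # crsr -> last non-null sqlid
    filled = []
    for lineno, crsr, sqlid in rows:
        if sqlid is not None:
            last_seen[crsr] = sqlid
        filled.append((lineno, crsr, last_seen.get(crsr)))
    return filled

# Hypothetical trace lines: the OPEN line carries the sqlid,
# subsequent lines on the same cursor have none.
rows = [(1, '#140', 'abc123'), (2, '#140', None),
        (3, '#141', 'def456'), (4, '#140', None), (5, '#141', None)]
```

Each cursor keeps its own "last seen" value, which is exactly what the `partition by crsr` clause provides in the SQL version.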

With this missing piece of infrastructure in place, let’s start with the session’s overall resource profile.

Overall resource profile

The example trace file is the same one used in the first post. In the following resource profile, I am excluding idle events, splitting up CPU usage by call type and, out of the vast range of statistical measures, displaying just the average, median and maximum. But as the query operates on the raw data, you are of course free to display exactly the information you are interested in (e.g., not excluding idle wait events, or making that decision dependent on the actual values).

col percent_wait for 99.99
select
 cast(event as varchar2(32)) event,
 cast(wait_class as varchar2(20)) wait_class,
 time_waited_total,
 num_waits,
 round(100 * ratio_to_report(time_waited_total) over (), 2) percent_wait,
 round(avg_wait) avg_wait,
 round(median_wait) median_wait, round(max_wait) max_wait from
  (
   select
    event, wait_class, sum(elapsed) time_waited_total, count(elapsed) num_waits,
    avg(elapsed) avg_wait, median(elapsed) median_wait, max(elapsed) max_wait
    from wait w join v$event_name e on (w.event=e.name)
     where e.wait_class <> 'Idle'
    group by event, wait_class
    union all
    select 
    call_type, 'CPU', sum(cpu), count(cpu), avg(cpu), median(cpu), max(cpu) from call
    group by call_type
  )
order by time_waited_total desc;

EVENT				 WAIT_CLASS	      TIME_WAITED_TOTAL  NUM_WAITS PERCENT_WAIT       AVG_WAIT	  MEDIAN_WAIT	MAX_WAIT
-------------------------------- -------------------- ----------------- ---------- ------------ ---------- ----------- ----------
db file sequential read 	 User I/O			3270429        156	  65.95 	 20964		 9395	  169189
EXEC				 CPU				1039000        680	  20.95 	  1528		    0	  509000
PARSE				 CPU				 307000        143	   6.19 	  2147		    0	  139000
FETCH				 CPU				 143000       1609	   2.88 	    89		    0	   18000
row cache lock			 Concurrency			  66660        197	   1.34 	   338		  300	    2230
gc cr grant 2-way		 Cluster			  64633 	84	   1.30 	   769		  584	    5995
library cache lock		 Concurrency			  29250 	44	    .59 	   665		  561	    4255
library cache pin		 Concurrency			  17864 	43	    .36 	   415		  349	    1625
db file scattered read		 User I/O			  11566 	 2	    .23 	  5783		 5783	   10537
enq: RC - Result Cache: Contenti Application			   2140 	 4	    .04 	   535		  580	     638
enq: TM - contention		 Application			   1606 	 2	    .03 	   803		  803	    1062
PX Deq: Slave Session Stats	 Other				   1319 	 2	    .03 	   660		  660	    1204
gc cr multi block request	 Cluster			   1168 	 1	    .02 	  1168		 1168	    1168
reliable message		 Other				    844 	 1	    .02 	   844		  844	     844
enq: PS - contention		 Other				    627 	 1	    .01 	   627		  627	     627
KJC: Wait for msg sends to compl Other				    622 	46	    .01 	    14		   10	      40
IPC send completion sync	 Other				    499 	 2	    .01 	   250		  250	     332
PX Deq: Signal ACK EXT		 Other				    473 	 2	    .01 	   237		  237	     405
PX Deq: reap credit		 Other				    327 	23	    .01 	    14		   10	      62
SQL*Net message to client	 Network			    214 	31	    .00 	     7		    7	      15
Disk file operations I/O	 User I/O			      7 	 1	    .00 	     7		    7	       7

Noticing that for the most important contributor to response time, db file sequential read, mean and median differ substantially (20964 vs. 9395 microseconds), we can easily get an impression of the distribution without even leaving sqlplus:

col plot for a100
select range, min(elapsed) low_value, max(elapsed) high_value,
       max(elapsed) - min(elapsed) width,
       -- 156 is the total number of db file sequential read waits
       round(156/(max(elapsed) - min(elapsed)), 3) height,
       rpad('*', 100 * 156/(max(elapsed) - min(elapsed)), '*') plot
from
 (select elapsed, ntile(10) over (order by elapsed) range from wait
  where event='db file sequential read')
group by range
order by range;
from
(select elapsed, ntile(10) over (order by elapsed) range from wait where event='db file sequential read')
group by range
order by range;


 RANGE  LOW_VALUE HIGH_VALUE      WIDTH	HEIGHT PLOT
------ ---------- ---------- ---------- ------ ----------------------------------------------------------
 1	  408	     672	264	  .591 ********************************************************
 2	  701	     979	278	  .561 ********************************************************
 3	 1025	    1673	648	  .241 ************************
 4	 1851	    6587       4736	  .033 ***
 5	 6619	    9563       2944	  .053 *****
 6	 9678	   15456       5778	  .027 **
 7	15491	   20224       4733	  .033 ***
 8	20279	   31984      11705	  .013 *
 9	33681	   59401      25720	  .006
10	63073	  169189     106116	  .001

Of course, deciding on the number of bins for the histogram may need a little experimentation, as may choosing the multiplier (here, 100) used to scale the plot. Nicer graphs can always be generated with other tools, but here I’m focusing on what can be done quickly in sqlplus. And if we’re interested in quick statistical information on the distribution of a wait event, there’s a nice convenience function (or rather, procedure) in dbms_stat_funcs: dbms_stat_funcs.summary.
Reckoning that I will use it a lot, I create a wrapper around it:

create or replace procedure wrap_stats_summary (p_owner in varchar2, p_tabname in varchar2, p_colname in varchar2)
as
  l_summary dbms_stat_funcs.SummaryType;
begin
   dbms_stat_funcs.summary(p_owner, p_tabname, p_colname, 3, l_summary);
   dbms_output.put_line ('Statistics summary for ' || p_owner || '.' || p_tabname || '.' || p_colname || ':');
   dbms_output.put_line (rpad('Count: ', 20, ' ')           || l_summary.count);
   dbms_output.put_line (rpad('Min: ', 20, ' ')             || l_summary.min);
   dbms_output.put_line (rpad('Max: ', 20, ' ')             || l_summary.max);
   dbms_output.put_line (rpad('Range: ', 20, ' ')           || l_summary.range);
   dbms_output.put_line (rpad('Mean: ', 20, ' ')            || round(l_summary.mean));
   dbms_output.put_line (rpad('Mode Count: ', 20, ' ')      || l_summary.cmode.count);
   dbms_output.put_line (rpad('Mode: ', 20, ' ')            || l_summary.cmode(1));
   dbms_output.put_line (rpad('Variance: ', 20, ' ')        || round(l_summary.variance));
   dbms_output.put_line (rpad('Stddev: ', 20, ' ')          || round(l_summary.stddev));
   dbms_output.put_line (rpad('5th percentile: ', 20, ' ')  || l_summary.quantile_5);
   dbms_output.put_line (rpad('25th percentile: ', 20, ' ') || l_summary.quantile_25);
   dbms_output.put_line (rpad('Median: ', 20, ' ')          || l_summary.median);
   dbms_output.put_line (rpad('75th percentile: ', 20, ' ') || l_summary.quantile_75);
   dbms_output.put_line (rpad('95th percentile: ', 20, ' ') || l_summary.quantile_95);
   dbms_output.put_line (rpad('Extreme Count: ', 20, ' ')   || l_summary.extreme_values.count);
   dbms_output.put_line (rpad('Top 5: ', 20, ' ')           || l_summary.top_5_values(1) || ','
                                                            || l_summary.top_5_values(2) || ','
                                                            || l_summary.top_5_values(3) || ',' 
                                                            || l_summary.top_5_values(4) || ',' 
                                                            || l_summary.top_5_values(5));
   dbms_output.put_line (rpad('Bottom 5: ', 20, ' ')        || l_summary.bottom_5_values(5) || ','
                                                            || l_summary.bottom_5_values(4) || ',' 
                                                            || l_summary.bottom_5_values(3) || ','
                                                            || l_summary.bottom_5_values(2) || ',' 
                                                            || l_summary.bottom_5_values(1));
end;
/

Calling the wrapper with a view created for db file sequential read events (again, creating the views could be automated) and the column we’re interested in yields

set serveroutput on
create view v_db_file_sequential_read as select * from wait where event='db file sequential read';
exec wrap_stats_summary('myuser','v_db_file_sequential_read','elapsed');

Statistics summary for myuser.v_db_file_sequential_read.elapsed:
Count:		    156
Min:		    408
Max:		    169189
Range:		    168781
Mean:		    20964
Mode Count:	    1
Mode:		    1370
Variance:	    921834830
Stddev: 	    30362
5th percentile:     546.5
25th percentile:    1181.25
Median: 	    9395
75th percentile:    23279.25
95th percentile:    84211.75
Extreme Count:	    4
Top 5:		    169189,163650,125610,124584,99328
Bottom 5:	    408,413,438,466,506

Should we, looking at the plot, be tempted to think we might have an exponential distribution here, why not do a quick check using dbms_stat_funcs.exponential_dist_fit? Unfortunately, the package documentation does not say whether it is acceptable to pass a lambda value calculated from the sample mean – as far as I know, different significance thresholds should in principle apply depending on whether lambda was given a priori or derived from the sample mean. (Actually, as there is no parameter to specify how the lambda passed in came about, we have no choice anyway.)

set serveroutput on
declare
  lambda number := 1/20964;
  mu     number := 0;
  sig    number;
begin
 dbms_stat_funcs.exponential_dist_fit('myuser', 'v_db_file_sequential_read', 'elapsed', 'KOLMOGOROV_SMIRNOV', lambda, mu, sig);
 dbms_output.put_line('Lambda  : ' || lambda);
 dbms_output.put_line('Sig     : ' || to_char(sig,'9.999'));
end;
/
D value : .2309900895441184871772937767930610144644
Lambda	: .00004770082045411181072314443808433505056287
Sig	:   .000
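For cross-checking outside the database, the one-sample Kolmogorov–Smirnov statistic against an exponential distribution with lambda taken from the sample mean is easy to compute by hand. A minimal Python sketch, fed with made-up wait times rather than the trace data:

```python
import math
import random

# One-sample K-S statistic against an exponential distribution whose rate
# lambda is estimated from the sample mean (as in the PL/SQL call above).
def ks_exponential(samples):
    xs = sorted(samples)
    n = len(xs)
    lam = 1.0 / (sum(xs) / n)            # lambda = 1 / sample mean
    d = 0.0
    for i, x in enumerate(xs):
        cdf = 1.0 - math.exp(-lam * x)   # exponential CDF at x
        # the empirical CDF steps from i/n to (i+1)/n at x
        d = max(d, abs(cdf - i / n), abs((i + 1) / n - cdf))
    return d

# Made-up microsecond wait times drawn from an actual exponential
# distribution, so D should come out small here.
random.seed(42)
sample = [random.expovariate(1 / 20000) for _ in range(200)]
```

Note that this only yields the D statistic; turning it into a significance level still runs into the caveat above about lambda being estimated rather than given a priori.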

So, evidently we do not have an exponential distribution here (sorry for the spoiler – but the point here is to show what’s possible with the external table setup, not to draw conclusions from a minimal data set ;-)).
So let’s go on with the main topic of this post – resource profiling.

Overall session profile – drilldowns

With the data extracted from the trace file, it is easy to pick a wait event and drill down into its parameters. E.g., for db file sequential read, we can find the files and blocks with the most reads, or else – as the object information itself is included in the trace file – go for the objects directly:

select cast(owner as varchar2(30)) owner, cast(object_name as varchar2(30)) name,
       sum(elapsed) elapsed
from wait w join dba_objects o on (w.obj=o.object_id)
where event='db file sequential read'
group by owner, object_name
order by 3 desc
fetch next 5 rows only;

OWNER			       NAME				 ELAPSED
------------------------------ ------------------------------ ----------
SYS			       CDEF$				  323772
SYS			       HIST_HEAD$			  322557
SYS			       I_HH_OBJ#_INTCOL#		  235055
SYS			       TAB$				  206864
SYS			       I_DEFERRED_STG1			  193338

(Don’t be surprised that these are internal objects only – in the example trace session I was querying dba_objects directly instead of a table “created as select”.)

Profiling by SQL statement

Whereas in some cases the overall resource profile might be the thing we’re most interested in, in others it will be single SQL statements (most probably, it will be both). Again, the range of possible queries is limited only by the information provided in the trace file. As an example, here I’m sorting by elapsed time, grouping not by sql_id but by rollup(sql_id, call_type), so that for the most time-consuming statements I can directly see where the time went, call-wise. For the only level 0 statement in the trace file, for example, I see that time was mostly spent in the EXEC phase:

with elapsed as
 (select t.sqlid, c.call_type, sum(c.elapsed) elapsed, count(c.elapsed) call_count
  from call c join timeline_extended t using (lineno)
  group by rollup(t.sqlid, c.call_type)
 ),
 sqlids as 
  (select sqlid, sql_text, count(1) occurrence_count, min(call_depth) min_call_depth
   from cursor group by sqlid, sql_text
  )
  select elapsed.*, cast(substr(sqlids.sql_text,1,30) as varchar2(30)) sql_text, sqlids.occurrence_count, sqlids.min_call_depth 
  from elapsed join sqlids on (elapsed.sqlid=sqlids.sqlid)
  order by elapsed desc
  fetch next 10 rows only;

SQLID	      CALL_    ELAPSED CALL_COUNT SQL_TEXT			 OCCURRENCE_COUNT MIN_CALL_DEPTH
------------- ----- ---------- ---------- ------------------------------ ---------------- --------------
5dqx2k1st1uxa	       4602081	       30 select owner, object_name from		3	       0
5dqx2k1st1uxa EXEC     4566966		3 select owner, object_name from		3	       0
4b4wp0a8dvkf0	       2783682		3 SELECT executions, end_of_fetc		1	       1
4b4wp0a8dvkf0 EXEC     2747671		1 SELECT executions, end_of_fetc		1	       1
64hu4m7arrfdh	       2036951		1 SELECT * FROM NO_OBJECT_LINK("		1	       2
64hu4m7arrfdh PARSE    2036951		1 SELECT * FROM NO_OBJECT_LINK("		1	       2
616m6uhpa2usu		785493	       55 select i.obj#,i.ts#,i.file#,i.		1	       2
96g93hntrzjtr		641889	      539 select /*+ rule */ bucket_cnt,		1	       3
616m6uhpa2usu FETCH	636347	       37 select i.obj#,i.ts#,i.file#,i.		1	       2
96g93hntrzjtr FETCH	581688	      269 select /*+ rule */ bucket_cnt,		1	       3

I’ve also displayed the number of times a statement was executed (occurrence_count), to put the elapsed time in context, and the (minimal) call depth, so I can immediately spot the level 0 statements.
Now of course, we still need the tkprof-like output:

select c.call_type "call", count(c.elapsed) "count", sum(c.cpu) "cpu", sum(c.elapsed) "elapsed",
 sum(c.phys_reads) "disk", sum(c.cons_reads) "query", sum(c.cur_reads) "current", sum(c.num_rows) "rows",
 sum(c.misses) "misses"
from call c join timeline_extended t using (lineno)
where t.sqlid='5dqx2k1st1uxa'
group by t.sqlid, c.call_type
order by decode(c.call_type, 'PARSE', 1, 'EXEC', 2, 'FETCH', 3);

call	   count	cpu    elapsed	     disk      query	current       rows     misses
----- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
PARSE	       3       2000	  4721		0	   0	      0 	 0	    1
EXEC	       3     511000    4566966	      157	2532	      0 	 0	    1
FETCH	      24      12000	 30394		6	 123	      0        298	    0

select event "event", count(1) "times waited", max(elapsed) "max wait", sum(elapsed) "total waited"
from wait w join timeline_extended t using (lineno)
where t.sqlid='5dqx2k1st1uxa'
group by event
order by 3 desc;

event				times waited	max wait   total waited
---------------------------------------- ------------ ---------- ------------
SQL*Net message from client		24	11510302     27972579
db file sequential read 		 4	   29541	35812
gc cr multi block request		 1	    1168	 1168
row cache lock				 6	    1094	 3865
db file scattered read			 1	    1029	 1029
library cache lock			12	     927	 6261
gc cr grant 2-way			 4	     733	 2670
library cache pin			12	     555	 3432
KJC: Wait for msg sends to complete      7	      27	   96
SQL*Net message to client	        24	      15	  154

That’s it for classical resource profiling with the external table trace profiler method :-). Next time, we’ll look more into the options provided by this specific method, and then more interesting trace files will be used, I promise 😉
