SQL trace files, think time, and outliers

Idle wait events and think time

When creating a resource profile for a session or a group of sessions, one usually has to decide how to handle the so-called idle wait events, first and foremost, SQL*Net message from client.

One possibility would be to simply remove (physically or logically) those lines from the trace, thereby avoiding the question of whether some wait should be classified as client (= application) response time, or as think time.
Another would be be to define a threshold and exclude only those lines with wait times above the threshold. In that case, of course, I have to know what makes for a sensible threshold. I could base that threshold on network latencies but then, I might miss information about how the application behaves – what, from the database server’s point of view, is the response time of the application?

So here I want to try another approach, which is applying common outlier detection methods and find out how far that gets me ;-). My test material, this time, will be “real life”, non-artificial trace files, my tools will be the mining infrastructure based on external tables described in the latest three blog posts (starting with Mining SQL Trace Files Using External Tables Part 1: Getting the Timeline and, for ease of use, uploaded on github, and the outlier detection methods used will be four common ones described in http://d-scholarship.pitt.edu/7948/1/Seo.pdf.

Outlier detection methods

The four methods chosen were selected for being both commonly used as well as easily applicable. None of them are based on hypothesis testing as this would require knowledge of the underlying distribution.

Standard deviation method

One commonly used method is classifying as outliers those measurements whose distance from the mean is more than 2 or 3 standard deviations. We may already suspect that a method based on the normality assumption might not be the most effective here, but we’ll have a look anyway, be it only for establishing a baseline for the other methods.

Boxplot

As per Tukey’s boxplot method, values are seen as outliers when either below the first quartile minus 1.5* (3*, respectively) the interquartile range or above the third quartile plus 1.5 * (3*, respectively) the interquartile range (i.e., x < Q1 – 1.5[3]*IQR or x > Q3 + 1.5[3]*IQR).

MADe

Using the MADe method, values are classified as outliers when there distance from the median is more than 2 times (3 times, respectively) the median of absolute differences from the median, scaled by 1.483 (this scaled median of absolute differences is the so-called MADe).
I.e., x < median – 2[3] MADe or x > median – 2[3] MADe.

Median method

Using the median method, values are classified as outliers when they are below/above the median by more than n*IQR, when n is adjustable and chosen as 2.3 in the above mentioned paper.

Outlier detection methods applied

Inspecting the data

First, for all four trace files, let me indicate some summary statistics for the SQL*Net message from client event and present the data graphically.
All trace files have about 50.000 lines for this wait event, with minimum values of ~ 0.35-0.55 milliseconds, a median value of ~ 0.55-0.75 ms, an average of ~1.5-1.8 ms and a maximum ranging from 2.2 to 12 seconds approximately.

SQL> select min(elapsed) min, median(elapsed) median, trunc(avg(elapsed)) avg, trunc(stddev(elapsed)) stddev, max(elapsed) max, count(elapsed) cnt from wait where event='SQL*Net message from client';

-- file 1
       MIN     MEDIAN	     AVG     STDDEV	   MAX	      CNT
---------- ---------- ---------- ---------- ---------- ----------
       368	  541	    1606      33961    3005346	    53124

-- file 2
       MIN     MEDIAN	     AVG     STDDEV	   MAX	      CNT
---------- ---------- ---------- ---------- ---------- ----------
       523	  683	    1843      33805    2269024	    51485

-- file 3
       MIN     MEDIAN	     AVG     STDDEV	   MAX	      CNT
---------- ---------- ---------- ---------- ---------- ----------
       339	  530	    1530      68874   11794066	    47152

-- file 4
       MIN     MEDIAN	     AVG     STDDEV	   MAX	      CNT
---------- ---------- ---------- ---------- ---------- ----------
       547	  770	    1482      27795    3306050	    52608

The following diagrams should help visualize how the data is distributed. I’ve chosen a bucket size of 250 microseconds/0.25 milliseconds, which after some experimentation seemed to work best.
(This means that the numbers on the x axis have to be multiplied by 0.25 to give the elapsed time in microseconds).

SQLNET message from client wait elapsed time distribution

SQLNET message from client wait elapsed time distribution

SQLNET message from client wait elapsed time distribution

SQLNET message from client wait elapsed time distributionFor all methods, I’m interested in the upper threshold only. Now let’s see what the different methods yield.

Standard deviation method
with sd_data as
  (select avg(elapsed) avg, stddev(elapsed) stddev from wait where event='SQL*Net message from client'),
  bounds as 
    (select avg + 2*stddev upper_2sd, avg + 3*stddev upper_3sd from sd_data),
    candidates_2sd as
      (select elapsed from wait w join bounds b on (w.event='SQL*Net message from client' and w.elapsed >  b.upper_2sd)),
      candidates_3sd as
        (select elapsed from wait w join bounds b on (w.event='SQL*Net message from client' and w.elapsed >  b.upper_3sd))
        select '   2sd' "method", count(elapsed) "num_outliers", round(min(elapsed)/1000, 1) "smallest_outlier_ms", round(max(elapsed)/1000, 1) "max_value_ms" from candidates_2sd
        union all 
        select '   3sd', count(elapsed), round(min(elapsed)/1000, 1), round(max(elapsed)/1000, 1) from candidates_3sd;


-- file 1
method num_outliers smallest_outlier_ms  max_value_ms
------ ------------ ------------------- -------------
   2sd		 64		   75.2        3005.3
   3sd		 62		  106.6        3005.3

-- file 2
method num_outliers smallest_outlier_ms max_value_ms
------ ------------ ------------------- ------------
   2sd		 88		   70.5 	2269
   3sd		 84		  116.3 	2269

-- file 3
method num_outliers smallest_outlier_ms max_value_ms
------ ------------ ------------------- ------------
   2sd		 31		  145.9      11794.1
   3sd		 26		  221.7      11794.1

-- file 4
method num_outliers smallest_outlier_ms max_value_ms
------ ------------ ------------------- ------------
   2sd		 64		   62.4       3306.1
   3sd		 55		     93       3306.1

Not surprisingly (given the non-normality of the data), the standard deviation method does not yield useful results. We’ll certainly want to classify as think time values well below 100 ms.

Boxplot
with iqr_data as 
  (select percentile_cont(0.25) within group (order by elapsed) q1, percentile_cont(0.75) within group (order by elapsed) q3, abs(percentile_cont(0.25) within group (order by elapsed) - percentile_cont(0.75) within group (order by elapsed)) iqr from wait where event='SQL*Net message from client'),
  bounds as 
    (select q3 + 1.5 * iqr lower_bound, q3 + 3 * iqr upper_bound from iqr_data),
      candidates_inner as
          (select elapsed from wait w join bounds b on (w.event='SQL*Net message from client' and w.elapsed >  b.lower_bound)),
          candidates_outer as
            (select elapsed from wait w join bounds b on (w.event='SQL*Net message from client' and w.elapsed >  b.upper_bound))
            select 'outside inner bound' "criterion", count(elapsed) "num_outliers", round(min(elapsed)/1000, 1) "smallest_outlier_ms", round(max(elapsed)/1000, 1) "max_value_ms" from candidates_inner
            union all
            select 'outside outer bound', count(elapsed), round(min(elapsed)/1000, 1), round(max(elapsed)/1000, 1) from candidates_outer
        ;
   

-- file 1
criterion	    num_outliers smallest_outlier_ms max_value_ms
------------------- ------------ ------------------- ------------
outside inner bound	    5370		  .9	   3005.3
outside outer bound	    3983		 1.1	   3005.3

-- file 2
criterion	    num_outliers smallest_outlier_ms max_value_ms
------------------- ------------ ------------------- ------------
outside inner bound	    3869		   1	     2269
outside outer bound	    2264		 1.2	     2269

-- file 3
criterion	    num_outliers smallest_outlier_ms max_value_ms
------------------- ------------ ------------------- ------------
outside inner bound	    2026		  .8	  11794.1
outside outer bound	     809		  .9	  11794.1

-- file 4
criterion	    num_outliers smallest_outlier_ms max_value_ms
------------------- ------------ ------------------- ------------
outside inner bound	    3836		 1.1	   3306.1
outside outer bound	    1418		 1.3	   3306.1

This looks better! If we choose the outer bound, waits have to exceed 1 ms to be classified as think time. However, looking at the diagrams, I’m not sure if this is not too low. Let’s compare with median-based methods.

MADe
with median as
  (select median(elapsed) median from wait where event='SQL*Net message from client'),
  abs_differences as
    (select abs(elapsed - median) absdiff from wait, median),
    mad_e as 
      (select median(absdiff) mad_e from abs_differences),
        bounds as 
          (select median + 2*mad_e upper_2mad_e, median + 3*mad_e upper_3mad_e from median, mad_e),
          candidates_2mad_e as
            (select elapsed from wait w join bounds b on (w.event='SQL*Net message from client' and w.elapsed >  b.upper_2mad_e)),
            candidates_3mad_e as
              (select elapsed from wait w join bounds b on (w.event='SQL*Net message from client' and w.elapsed >  b.upper_3mad_e))
                select '2mad_e' "method", count(elapsed) "num_outliers", round(min(elapsed)/1000, 1) "smallest_outlier_ms", round(max(elapsed)/1000, 1) "max_value_ms" from candidates_2mad_e
                union all 
                select '3mad_e', count(elapsed), round(min(elapsed)/1000, 1), round(max(elapsed)/1000, 1) from candidates_3mad_e;


-- file 1
method	num_outliers smallest_outlier_ms max_value_ms
------- ------------ ------------------- ------------
2mad_e   	2740		     1.6       3005.3
3mad_e  	1523		     2.2       3005.3

-- file 2
method num_outliers smallest_outlier_ms max_value_ms
------ ------------ ------------------- ------------
2mad_e		287		      2 	2269
3mad_e		208		    2.7 	2269

-- file 3
method num_outliers smallest_outlier_ms max_value_ms
------ ------------ ------------------- ------------
2mad_e		236		    1.6      11794.1
3mad_e		142		    2.1      11794.1

-- file 4
method num_outliers smallest_outlier_ms max_value_ms
------ ------------ ------------------- ------------
2mad_e		487		    2.3       3306.1
3mad_e		380		    3.1       3306.1

Now this – again choosing the outer bound, so we end up at a threshold of ~ 2-2.5 ms, looks much better to me. Perhaps it is still a little low, from visual inspection?

Median method
with iqr_data as 
  (select median(elapsed) median, percentile_cont(0.25) within group (order by elapsed) q1, percentile_cont(0.75) within group (order by elapsed) q3, abs(percentile_cont(0.25) within group (order by elapsed) - percentile_cont(0.75) within group (order by elapsed)) iqr from wait where event='SQL*Net message from client'),
  bounds as 
    (select median + 2.3 * iqr bound from iqr_data),
      candidates as
          (select elapsed from wait w join bounds b on (w.event='SQL*Net message from client' and w.elapsed >  b.bound))
            select 'outside bound' "criterion", count(elapsed) "num_outliers", round(min(elapsed)/1000, 1) "smallest_outlier_ms", round(max(elapsed)/1000, 1) "max_value_ms" from candidates;


-- file 1   
criterion     num_outliers smallest_outlier_ms max_value_ms
------------- ------------ ------------------- ------------
outside bound	      5065		    .9	     3005.3

-- file 2
criterion     num_outliers smallest_outlier_ms max_value_ms
------------- ------------ ------------------- ------------
outside bound	      3575		     1	       2269

-- file 3
criterion     num_outliers smallest_outlier_ms max_value_ms
------------- ------------ ------------------- ------------
outside bound	      1729		    .8	    11794.1

-- file 4
criterion     num_outliers smallest_outlier_ms max_value_ms
------------- ------------ ------------------- ------------
outside bound	      3316		   1.1	     3306.1

The median method, at least for the trace files given, seems to yield results very similar to boxplot. Same as with boxplot, it might be too strict though.

Conclusion

Of course, four trace files, taken from the same application and under similar conditions, do not make for a representative sample. However, the idea here was to demonstrate the approach, i.e., looking at think time from an outlier detection perspective. For the given sample, the MADe method seems to yield a plausible classification. However, what classification makes sense is, in the end, a question of (at least) two things – the application we’re looking at (a real application? or end users, sitting at their terminals?) and what we want to do with the data (diagnose the database only or the client, too).

I guess it’s the latter case when this approach will be the most useful.

.

Advertisements

2 thoughts on “SQL trace files, think time, and outliers

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s