### Idle wait events and think time

When creating a resource profile for a session or a group of sessions, one usually has to decide how to handle the so-called idle wait events, first and foremost *SQL\*Net message from client*.

One possibility would be to simply remove (physically or logically) those lines from the trace, thereby avoiding the question of whether some wait should be classified as client (= application) response time, or as think time.

Another would be to define a threshold and exclude only those lines with wait times above it. In that case, of course, I have to know what makes for a sensible threshold. I could base it on network latencies, but then I might miss information about how the application behaves – what, from the database server’s point of view, is the response time of the application?

So here I want to try another approach: applying common outlier detection methods and finding out how far that gets me ;-). My test material, this time, will be “real life”, non-artificial trace files; my tools will be the mining infrastructure based on external tables described in the last three blog posts (starting with Mining SQL Trace Files Using External Tables Part 1: Getting the Timeline) and, for ease of use, uploaded to GitHub; and the outlier detection methods used will be four common ones described in http://d-scholarship.pitt.edu/7948/1/Seo.pdf.

### Outlier detection methods

The four methods were chosen for being both commonly used and easily applicable. None of them is based on hypothesis testing, as that would require knowledge of the underlying distribution.

##### Standard deviation method

One commonly used method is classifying as outliers those measurements whose distance from the mean is more than 2 or 3 standard deviations. We may already suspect that a method based on the normality assumption might not be the most effective here, but we’ll have a look anyway, be it only for establishing a baseline for the other methods.
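Outside of SQL, the rule is easy to state; here is a minimal sketch in Python (the sample data is made up for illustration):

```python
def sd_outliers(xs, k=2):
    """Return the values more than k standard deviations from the mean."""
    n = len(xs)
    mean = sum(xs) / n
    # population standard deviation, for simplicity
    sd = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [x for x in xs if abs(x - mean) > k * sd]

# a heavy-tailed sample: many short waits, one long one
waits = [500] * 20 + [100000]
print(sd_outliers(waits, k=2))  # only the extreme value is flagged
```

Note how a single extreme value inflates both the mean and the standard deviation, which is exactly why this method struggles with skewed data.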

##### Boxplot

As per Tukey’s boxplot method, values are seen as outliers when either below the first quartile minus 1.5* (3*, respectively) the interquartile range or above the third quartile plus 1.5 * (3*, respectively) the interquartile range (i.e., x < Q1 – 1.5[3]*IQR or x > Q3 + 1.5[3]*IQR).
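The two fences can be sketched in Python like this (quartiles interpolated the way SQL’s `percentile_cont` does; sample data made up):

```python
def percentile_cont(xs, p):
    """Linear-interpolation percentile, like SQL's percentile_cont."""
    s = sorted(xs)
    i = p * (len(s) - 1)
    lo, frac = int(i), i - int(i)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + frac * (s[hi] - s[lo])

def tukey_upper_fences(xs):
    """Upper inner (1.5 * IQR) and outer (3 * IQR) Tukey fences."""
    q1 = percentile_cont(xs, 0.25)
    q3 = percentile_cont(xs, 0.75)
    iqr = q3 - q1
    return q3 + 1.5 * iqr, q3 + 3 * iqr

print(tukey_upper_fences(list(range(1, 9))))  # → (11.5, 16.75)
```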

##### MAD_{e}

Using the MAD_{e} method, values are classified as outliers when their distance from the median is more than 2 times (3 times, respectively) the median of absolute differences from the median, scaled by 1.483 (this scaled median of absolute differences is the so-called MAD_{e}).

I.e., x < median – 2[3] MAD_{e} or x > median + 2[3] MAD_{e}.
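In Python, the MAD_{e} bounds look like this (a sketch; sample data made up):

```python
def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def mad_e_bounds(xs, k=2):
    """Bounds median -/+ k * MAD_e, where MAD_e is the median of
    absolute differences from the median, scaled by 1.483."""
    med = median(xs)
    mad_e = 1.483 * median([abs(x - med) for x in xs])
    return med - k * mad_e, med + k * mad_e

# median 4, MAD 2, MAD_e 2.966 → bounds roughly (-1.93, 9.93)
print(mad_e_bounds([1, 2, 3, 4, 5, 6, 7], k=2))
```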

##### Median method

Using the median method, values are classified as outliers when they are below/above the median by more than n*IQR, where n is adjustable and was chosen as 2.3 in the above-mentioned paper.
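A Python sketch of the median method’s upper bound, using the standard library’s `statistics` module (sample data made up):

```python
import statistics

def median_method_upper(xs, n=2.3):
    """Upper bound: median + n * IQR, with n = 2.3 as in the paper."""
    med = statistics.median(xs)
    # method='inclusive' interpolates like SQL's percentile_cont
    q1, _, q3 = statistics.quantiles(xs, n=4, method='inclusive')
    return med + n * (q3 - q1)

# median 4.5, IQR 3.5 → upper bound 4.5 + 2.3 * 3.5 ≈ 12.55
print(median_method_upper(list(range(1, 9))))
```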

### Outlier detection methods applied

##### Inspecting the data

First, for all four trace files, let me show some summary statistics for the *SQL\*Net message from client* event and present the data graphically.

All trace files have about 50,000 lines for this wait event, with minimum values of ~0.35-0.55 ms, medians of ~0.55-0.75 ms, averages of ~1.5-1.8 ms, and maxima ranging from approximately 2.2 to 12 seconds.

```
SQL> select min(elapsed) min, median(elapsed) median,
            trunc(avg(elapsed)) avg, trunc(stddev(elapsed)) stddev,
            max(elapsed) max, count(elapsed) cnt
     from wait
     where event='SQL*Net message from client';

-- file 1
       MIN     MEDIAN        AVG     STDDEV        MAX        CNT
---------- ---------- ---------- ---------- ---------- ----------
       368        541       1606      33961    3005346      53124

-- file 2
       MIN     MEDIAN        AVG     STDDEV        MAX        CNT
---------- ---------- ---------- ---------- ---------- ----------
       523        683       1843      33805    2269024      51485

-- file 3
       MIN     MEDIAN        AVG     STDDEV        MAX        CNT
---------- ---------- ---------- ---------- ---------- ----------
       339        530       1530      68874   11794066      47152

-- file 4
       MIN     MEDIAN        AVG     STDDEV        MAX        CNT
---------- ---------- ---------- ---------- ---------- ----------
       547        770       1482      27795    3306050      52608
```

The following diagrams should help visualize how the data is distributed. I’ve chosen a bucket size of 250 microseconds/0.25 milliseconds, which after some experimentation seemed to work best.

(This means that the numbers on the x axis have to be multiplied by 0.25 to give the elapsed time in milliseconds.)
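For reference, the bucketing itself can be sketched like this (Python; elapsed times are assumed to be in microseconds, as in the trace files):

```python
from collections import Counter

BUCKET_US = 250  # 250 microseconds = 0.25 ms per bucket

def histogram(elapsed_us):
    """Bucket n holds waits in [n * 250, (n + 1) * 250) microseconds."""
    return Counter(e // BUCKET_US for e in elapsed_us)

# e.g. waits of 100, 300, 600 and 700 us fall into buckets 0, 1, 2, 2
print(histogram([100, 300, 600, 700]))
```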

For all methods, I’m interested in the upper threshold only. Now let’s see what the different methods yield.

##### Standard deviation method

```sql
with sd_data as (
  select avg(elapsed) avg, stddev(elapsed) stddev
  from wait
  where event = 'SQL*Net message from client'
),
bounds as (
  select avg + 2 * stddev upper_2sd, avg + 3 * stddev upper_3sd
  from sd_data
),
candidates_2sd as (
  select elapsed from wait w
  join bounds b on (w.event = 'SQL*Net message from client'
                    and w.elapsed > b.upper_2sd)
),
candidates_3sd as (
  select elapsed from wait w
  join bounds b on (w.event = 'SQL*Net message from client'
                    and w.elapsed > b.upper_3sd)
)
select ' 2sd' "method", count(elapsed) "num_outliers",
       round(min(elapsed)/1000, 1) "smallest_outlier_ms",
       round(max(elapsed)/1000, 1) "max_value_ms"
from candidates_2sd
union all
select ' 3sd', count(elapsed),
       round(min(elapsed)/1000, 1), round(max(elapsed)/1000, 1)
from candidates_3sd;
```

```
-- file 1
method num_outliers smallest_outlier_ms max_value_ms
------ ------------ ------------------- ------------
   2sd           64                75.2       3005.3
   3sd           62               106.6       3005.3

-- file 2
method num_outliers smallest_outlier_ms max_value_ms
------ ------------ ------------------- ------------
   2sd           88                70.5         2269
   3sd           84               116.3         2269

-- file 3
method num_outliers smallest_outlier_ms max_value_ms
------ ------------ ------------------- ------------
   2sd           31               145.9      11794.1
   3sd           26               221.7      11794.1

-- file 4
method num_outliers smallest_outlier_ms max_value_ms
------ ------------ ------------------- ------------
   2sd           64                62.4       3306.1
   3sd           55                  93       3306.1
```

Not surprisingly (given the non-normality of the data), the standard deviation method does not yield useful results. We’ll certainly want to classify as think time values well below 100 ms.

##### Boxplot

```sql
with iqr_data as (
  select percentile_cont(0.25) within group (order by elapsed) q1,
         percentile_cont(0.75) within group (order by elapsed) q3,
         abs(percentile_cont(0.25) within group (order by elapsed)
             - percentile_cont(0.75) within group (order by elapsed)) iqr
  from wait
  where event = 'SQL*Net message from client'
),
bounds as (
  select q3 + 1.5 * iqr lower_bound, q3 + 3 * iqr upper_bound
  from iqr_data
),
candidates_inner as (
  select elapsed from wait w
  join bounds b on (w.event = 'SQL*Net message from client'
                    and w.elapsed > b.lower_bound)
),
candidates_outer as (
  select elapsed from wait w
  join bounds b on (w.event = 'SQL*Net message from client'
                    and w.elapsed > b.upper_bound)
)
select 'outside inner bound' "criterion", count(elapsed) "num_outliers",
       round(min(elapsed)/1000, 1) "smallest_outlier_ms",
       round(max(elapsed)/1000, 1) "max_value_ms"
from candidates_inner
union all
select 'outside outer bound', count(elapsed),
       round(min(elapsed)/1000, 1), round(max(elapsed)/1000, 1)
from candidates_outer;
```

```
-- file 1
criterion           num_outliers smallest_outlier_ms max_value_ms
------------------- ------------ ------------------- ------------
outside inner bound         5370                  .9       3005.3
outside outer bound         3983                 1.1       3005.3

-- file 2
criterion           num_outliers smallest_outlier_ms max_value_ms
------------------- ------------ ------------------- ------------
outside inner bound         3869                   1         2269
outside outer bound         2264                 1.2         2269

-- file 3
criterion           num_outliers smallest_outlier_ms max_value_ms
------------------- ------------ ------------------- ------------
outside inner bound         2026                  .8      11794.1
outside outer bound          809                  .9      11794.1

-- file 4
criterion           num_outliers smallest_outlier_ms max_value_ms
------------------- ------------ ------------------- ------------
outside inner bound         3836                 1.1       3306.1
outside outer bound         1418                 1.3       3306.1
```

This looks better! If we choose the outer bound, waits have to exceed ~1 ms to be classified as think time. However, looking at the diagrams, I’m not sure this isn’t too low. Let’s compare with the median-based methods.

##### MAD_{e}

```sql
with median as (
  select median(elapsed) median
  from wait
  where event = 'SQL*Net message from client'
),
abs_differences as (
  select abs(elapsed - median) absdiff from wait, median
),
mad_e as (
  select median(absdiff) mad_e from abs_differences
),
bounds as (
  select median + 2 * mad_e upper_2mad_e,
         median + 3 * mad_e upper_3mad_e
  from median, mad_e
),
candidates_2mad_e as (
  select elapsed from wait w
  join bounds b on (w.event = 'SQL*Net message from client'
                    and w.elapsed > b.upper_2mad_e)
),
candidates_3mad_e as (
  select elapsed from wait w
  join bounds b on (w.event = 'SQL*Net message from client'
                    and w.elapsed > b.upper_3mad_e)
)
select '2mad_e' "method", count(elapsed) "num_outliers",
       round(min(elapsed)/1000, 1) "smallest_outlier_ms",
       round(max(elapsed)/1000, 1) "max_value_ms"
from candidates_2mad_e
union all
select '3mad_e', count(elapsed),
       round(min(elapsed)/1000, 1), round(max(elapsed)/1000, 1)
from candidates_3mad_e;
```

```
-- file 1
method num_outliers smallest_outlier_ms max_value_ms
------ ------------ ------------------- ------------
2mad_e         2740                 1.6       3005.3
3mad_e         1523                 2.2       3005.3

-- file 2
method num_outliers smallest_outlier_ms max_value_ms
------ ------------ ------------------- ------------
2mad_e          287                   2         2269
3mad_e          208                 2.7         2269

-- file 3
method num_outliers smallest_outlier_ms max_value_ms
------ ------------ ------------------- ------------
2mad_e          236                 1.6      11794.1
3mad_e          142                 2.1      11794.1

-- file 4
method num_outliers smallest_outlier_ms max_value_ms
------ ------------ ------------------- ------------
2mad_e          487                 2.3       3306.1
3mad_e          380                 3.1       3306.1
```

Now this – again choosing the outer bound, so that we end up at a threshold of ~2-2.5 ms – looks much better to me. Perhaps, from visual inspection, it is still a little low?

##### Median method

```sql
with iqr_data as (
  select median(elapsed) median,
         percentile_cont(0.25) within group (order by elapsed) q1,
         percentile_cont(0.75) within group (order by elapsed) q3,
         abs(percentile_cont(0.25) within group (order by elapsed)
             - percentile_cont(0.75) within group (order by elapsed)) iqr
  from wait
  where event = 'SQL*Net message from client'
),
bounds as (
  select median + 2.3 * iqr bound from iqr_data
),
candidates as (
  select elapsed from wait w
  join bounds b on (w.event = 'SQL*Net message from client'
                    and w.elapsed > b.bound)
)
select 'outside bound' "criterion", count(elapsed) "num_outliers",
       round(min(elapsed)/1000, 1) "smallest_outlier_ms",
       round(max(elapsed)/1000, 1) "max_value_ms"
from candidates;
```

```
-- file 1
criterion     num_outliers smallest_outlier_ms max_value_ms
------------- ------------ ------------------- ------------
outside bound         5065                  .9       3005.3

-- file 2
criterion     num_outliers smallest_outlier_ms max_value_ms
------------- ------------ ------------------- ------------
outside bound         3575                   1         2269

-- file 3
criterion     num_outliers smallest_outlier_ms max_value_ms
------------- ------------ ------------------- ------------
outside bound         1729                  .8      11794.1

-- file 4
criterion     num_outliers smallest_outlier_ms max_value_ms
------------- ------------ ------------------- ------------
outside bound         3316                 1.1       3306.1
```

The median method, at least for the trace files given, seems to yield results very similar to boxplot’s. As with boxplot, it might be too strict though.

### Conclusion

Of course, four trace files, taken from the same application and under similar conditions, do not make for a representative sample. The idea here, however, was to demonstrate the approach: looking at think time from an outlier detection perspective. For the given sample, the MAD_{e} method seems to yield a plausible classification. Which classification makes sense is, in the end, a question of (at least) two things: the application we’re looking at (a real application, or end users sitting at their terminals?) and what we want to do with the data (diagnose the database only, or the client, too).

I guess it’s in the latter case that this approach will be most useful.
