Wanted: Your Take on the Cy Bernetic Optimizer (CBO)

This is not really a technical blog post – it’s an invitation to discuss something. The “something” is what Tony Hasler calls “The Strategic Direction for the CBO”.
A short time ago I started reading his book “Expert Oracle SQL” – very much worth reading, and very entertaining, too – where without much ado he says, regarding that direction:

Despite the concerns about predictability that customers started raising from the outset and continue to raise today, the strategic direction of the CBO continues to be based on adaptability. Features like dynamic sampling and bind variable peeking started to appear. When 11g arrived it became very clear that the CBO team was looking towards ever more adaptable and ever less predictable behaviors. The signals came in the form of Adaptive Cursor Sharing in 11gR1 and Cardinality Feedback in 11gR2. […] As if this wasn’t bad enough, 12cR1 introduced Adaptive Execution Plans. This feature might result in the execution plan of a statement changing in mid-flight! We don’t want these features!

(See below for a longer quotation, providing more context).

I found this very interesting – normally what you see is people getting annoyed with / frustrated by / furious over

  • the bugs accompanying these new features,
  • those execution plan changes that are (1) for the worse and (2) get noticed by users (or monitoring systems),

and then the new features get disabled, at varying granularity, after a varying amount of analysis/troubleshooting.
(As of today, what percentage of systems upgraded to 12c might be running with optimizer_adaptive_features turned off?)
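For the record, “turned off” here typically means the 12.1 umbrella parameter – a one-line sketch, assuming a 12.1 system where a single switch still covers both adaptive plans and adaptive statistics:

    -- sketch, 12.1: the umbrella switch many sites reach for after an upgrade
    ALTER SYSTEM SET optimizer_adaptive_features = FALSE SCOPE = BOTH;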

But this is not about bugs, or even about how well it works – it’s about the “principle”. Assume these features worked fine and performance changed for the better almost all the time – how would you like such optimizer self-tuning features then?

Here is the Tony Hasler citation again, the complete paragraph this time:

One of the key things about the old Rule-Based Optimizer (RBO) was that the same statement resulted in the same execution plan every time. The CBO, on the other hand, is designed to be adaptable. The CBO tries to get the optimal execution plan each and every time, and if the object statistics suggest that the data has changed then the CBO will quite happily consider a new execution plan.
Despite the concerns about predictability that customers started raising from the outset and continue to raise today, the strategic direction of the CBO continues to be based on adaptability. Features like dynamic sampling and bind variable peeking started to appear. When 11g arrived it became very clear that the CBO team was looking towards ever more adaptable and ever less predictable behaviors. The signals came in the form of Adaptive Cursor Sharing in 11gR1 and Cardinality Feedback in 11gR2. Both Adaptive Cursor Sharing and Cardinality Feedback can result in the execution plan selected for the first execution of a statement being “upgraded” for the second and subsequent executions, even with no change to object statistics! Usually, these changes are for the better, but there are no guarantees.
As if this wasn’t bad enough, 12cR1 introduced Adaptive Execution Plans. This feature might result in the execution plan of a statement changing in mid-flight! We don’t want these features!
– On a test system we don’t want these features because they might mask problems that will impact our production service.
– On a production system we don’t want these features because the execution plans that we have tested are satisfactory, whatever the CBO thinks; we don’t want to risk using an untested plan that might cause performance to degrade.

With the above argument, convincing though it sounds, I have two concerns, which in the end are closely interrelated.

First, the “real world” one. There are, of course, many real worlds, but in how many of them can teams of software developers and DBAs, collaborating closely and after thorough, institutionalized testing on the integration systems, say: our new code introduced these new execution plans, they are good (or even the best possible), we can safely release the code into production?
One real world I know is where developers develop, and DBAs are busy patching, upgrading databases, working around necessary OS upgrades, providing test systems, and ex- and importing data – only dealing with SQL performance if and when problems arise.

There is this section in the book about a Connor McDonald presentation culminating in:

Of course the fictitious execution plan change was caused by an explicitly run job to gather statistics. No authorization, no impact analysis, no testing, nothing. Connor asked the perfectly reasonable question as to why IT organizations routinely precipitate untested execution plan changes on their database systems while at the same time placing most other types of changes under the highest scrutiny. It seemed to Connor that the world had gone mad. I agree!

(I told you the book was fun to read.) Now of course the world is mad, but it’s easy to understand the human and organizational factors behind this particular madness. Colors, wordings, and menus on a website are more tangible than execution plans, for one thing … But that’s another discussion.

So, practically speaking, I imagine the idea is to ease the DBA’s job, because that job is complex anyway and getting more complex – on the infrastructure side – all the time (think of the whole Big Data integration topic coming up).

That leads to the second point. If the intention is to ease the DBA’s job rather than make it more difficult by introducing problems, then adaptivity – or machine learning – sounds like a good idea. With supervised learning, I the DBA have analyzed a statement, know the best solution, and tell the optimizer how to run it. With unsupervised learning, the optimizer finds out by itself.
If we agree that a self-learning optimizer may be a good thing, the question is: how good are its methods/algorithms? Not in the sense of “are they buggy”, but are they, in principle, sound techniques that, once the bugs have been removed, fulfill their purpose?

Many of the features concerned, and mentioned by Tony Hasler, are based on feedback loops. (That’s the reason for the “cybernetic” in the post’s title.) Being based on feedback loops can mean one of two things here (a small query sketch after the list shows where both forms surface):

  • The feedback occurs after execution is complete. This means there was at least one bad execution. (In practice there may have been more.) This is the case for Adaptive Cursor Sharing and Cardinality Feedback.
  • The feedback occurs mid-execution, as with 12c Adaptive Plans.
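To make the two forms tangible, here is a minimal sketch of where they surface in V$SQL (column availability varies by release – the last column exists only from 12c on):

    -- sketch: flags in V$SQL behind the two feedback forms (11.2 / 12.1 names)
    SELECT sql_id,
           child_number,
           is_bind_sensitive,          -- Adaptive Cursor Sharing: candidate for bind-aware sharing
           is_bind_aware,              -- Adaptive Cursor Sharing: feedback applied after bad execution(s)
           use_feedback_stats,         -- Cardinality (Statistics) Feedback: re-optimize on the next execution
           is_resolved_adaptive_plan   -- 12c Adaptive Plans: feedback resolved during the first execution
    FROM   v$sql
    WHERE  sql_id = '&sql_id';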

With the second form, in principle (again, bugs aside), I don’t have an issue. Join cardinality estimation is intrinsically difficult (just read up on Jonathan Lewis’ blog for that) – now you can either try to find a better generic, one-size-fits-all-and-handles-every-case algorithm, or you approach the question from another side, with a control mechanism like the STATISTICS_COLLECTOR rowsource.
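On 12c, that control mechanism can be made visible by asking DBMS_XPLAN for the full adaptive plan, inactive branches included – a sketch, assuming the statement of interest is still in the cursor cache:

    -- sketch, 12c: display the complete adaptive plan, including the
    -- STATISTICS_COLLECTOR step and the inactive (not chosen) plan lines
    SELECT *
    FROM   TABLE(DBMS_XPLAN.DISPLAY_CURSOR(
                   sql_id          => '&sql_id',
                   cursor_child_no => NULL,
                   format          => 'TYPICAL +ADAPTIVE'));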

As for the first form, you (here meaning the software vendor, i.e., Oracle) could do several things to make it better.

  • Re-engineer the underlying algorithm so that fewer failures are needed before the good plan is found.
  • Persist the results, so we don’t start again from zero after the information (the cursor) is flushed from the shared pool.
  • Ideally, not even a single failure would be needed – which amounts to converting it to the second form.

Now, improving the algorithms is, I guess, what happens between releases (I don’t have a test case available, but I’d assume it could be shown for, e.g., Adaptive Cursor Sharing between 11.1 and 12.1). For persisting results, look at the evolution from 11g cardinality feedback to the 12.1 inclusion of that information (now renamed “statistics feedback”) in SQL plan directives. As for converting to an in-flight algorithm, that may or may not be feasible – and even where it is feasible, it may still not be efficient.
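The persisted form of that feedback can be inspected in the dictionary – a sketch against the 12.1 views (the exact TYPE and REASON values vary between releases and patch levels):

    -- sketch, 12.1: SQL plan directives recorded for a given table
    SELECT d.directive_id,
           o.owner,
           o.object_name,
           o.subobject_name AS column_name,    -- filled for column-level entries
           d.type,                              -- e.g. DYNAMIC_SAMPLING
           d.state,
           d.reason                             -- e.g. SINGLE TABLE CARDINALITY MISESTIMATE
    FROM   dba_sql_plan_directives d
           JOIN dba_sql_plan_dir_objects o
             ON o.directive_id = d.directive_id
    WHERE  o.owner       = USER
    AND    o.object_name = '&table_name'
    ORDER  BY d.directive_id;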

Summing up, the direction seems to be “unsupervised learning, using mid-execution feedback / zero-failure adaptive behavior where possible”. This does not look too bad to me. (By the way, I’m not even sure that, in its best (in-flight) form, this goes against testability as Tony argues: 12c adaptive join methods use a deterministic inflection point calculation that should allow for reproducible tests given identical data.)

Did you notice the word “strategic” in the above quotation? Further down in the same chapter, Tony writes, with respect to SQL Plan Management:

Although the CBO strategy is all about adaptability, Oracle, like any other company, has to do something to address the demands of its largest and most vocal customers. Here are the tactical plan stability measures that Oracle has taken to address customer concerns about predictability.

So there we have the strategy vs. tactics opposition. But do we necessarily need to see it like this? In the view I’m suggesting, of supervised vs. unsupervised learning, SQL Plan Management is not fundamentally in contradiction with the adaptive features as long as it runs “unattended” – which, with the 12c SPM Evolve Advisor Task, would be the case.
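For illustration, a sketch of what “unattended” could look like on 12c: the automatic evolve task (SYS_AUTO_SPM_EVOLVE_TASK) runs in the maintenance window, and whether it merely reports or also accepts verified better plans is a task parameter (defaults depend on release and licensing):

    -- sketch, 12c: allow the automatic SPM evolve task to accept verified better plans
    BEGIN
      DBMS_SPM.SET_EVOLVE_TASK_PARAMETER(
        task_name => 'SYS_AUTO_SPM_EVOLVE_TASK',
        parameter => 'ACCEPT_PLANS',
        value     => 'TRUE');
    END;
    /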

One digression before I wrap it up. Tony has another very interesting paragraph, about Graham Wood on frequency of statistics gathering:

In 2012 the Real World Performance team from Oracle did a round-the-world tour to try to dispel some myths about Oracle performance and to educate people about performance management. I attended the Edinburgh seminar and found it very stimulating. One of the most interesting suggestions came from Graham Wood. Graham’s observation was that many of Oracle’s customers gather statistics far too frequently. I want to make it clear that Graham did not say that gathering statistics was only required to prevent execution plan changes, and he most definitely didn’t liken gathering statistics to a game of Russian Roulette! However, Graham’s recommendation does suggest, quite correctly, that there is some risk to gathering statistics and that by doing it less often you can reduce that risk considerably.

The flip side – the risk of gathering statistics less often – concerns DBA_TAB_COL_STATISTICS.LOW_VALUE and DBA_TAB_COL_STATISTICS.HIGH_VALUE: the further off they are, the more wrong the cardinality estimates become (imagine having gathered statistics at 10 P.M. and then loading fresh, time-based data during the night – again, read up on Jonathan Lewis’ blog for that topic). Here I must say this suggestion – don’t gather statistics too often – does not seem to me in line with the “general direction”. If I’m to trust Oracle to achieve good results with the “unsupervised learning” approach, I want to be able to assume that this issue, too, is handled automatically – which would mean Oracle looking into how HIGH_VALUE and LOW_VALUE are used in cardinality estimation (introduce something like dynamic sampling here, in specific cases?).
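To make the LOW_VALUE/HIGH_VALUE point concrete, here is a minimal sketch of the classic out-of-range situation (table and column names are made up for illustration):

    -- sketch (hypothetical table): stale HIGH_VALUE vs. fresh, time-based data
    CREATE TABLE sales_feed (sale_date DATE, amount NUMBER);

    -- "yesterday's" data; statistics gathered at 10 P.M.
    INSERT INTO sales_feed
      SELECT TRUNC(SYSDATE) - 1 + (ROWNUM - 1) / 10000, ROWNUM
      FROM   dual CONNECT BY LEVEL <= 10000;
    COMMIT;
    EXEC DBMS_STATS.GATHER_TABLE_STATS(user, 'SALES_FEED')
    -- DBA_TAB_COL_STATISTICS.HIGH_VALUE for SALE_DATE now ends yesterday evening

    -- fresh data loaded during the night, no new statistics
    INSERT INTO sales_feed
      SELECT TRUNC(SYSDATE) + (ROWNUM - 1) / 10000, ROWNUM
      FROM   dual CONNECT BY LEVEL <= 10000;
    COMMIT;

    -- the morning query asks only for data beyond the stale HIGH_VALUE: the optimizer
    -- scales its estimate down (how far depends on version and distance out of range),
    -- while 10,000 matching rows are actually there
    EXPLAIN PLAN FOR SELECT * FROM sales_feed WHERE sale_date >= TRUNC(SYSDATE);
    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);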

So, that was my take on the topic. What do you think? Do you agree with Tony that “we don’t want that”, or do you see value in the “unsupervised learning” approach? Please feel free to leave a comment and share your opinion!


9 thoughts on “Wanted: Your Take on the Cy Bernetic Optimizer (CBO)”

  1. Hi,

    It’s Connor here, so I figured I’d chip in 🙂

    My point about changing stats was the contradiction in organisations that might have 25 levels of change control if you want to change a piece of metadata in your application (eg, add a new status code to a reference table). Countless meetings will be held about impact analysis, risk, backout etc etc etc… even though the very reason reference tables often exist is to allow easy addition of new values 🙂

    Yet the same organisation will happily gather stats *every night* or even more frequently – which *potentially* could change the execution plan of any or every SQL in the application. No meetings, no change control, no backout planning etc…

    The lack of use of “set_…stats” and “lock_…stats” is a failing of many sites – who simply plunder away at their servers with full “gather_…stats” calls over and over. E.g., I don’t need to scan a billion-row customer table to know that the number of distinct values in the gender column is probably 2 🙂

    So my argument has always been – statistics are *part* of your application, not external to it. You should have some thought and discipline when it comes to statistics. For example, I think histograms are very good… but I think the *auto* histogram concept is potentially problematic (ie, the database adds histograms based on the SQL behaviour it sees). The risk there is we don’t know which SQLs come from an app, or which ones come from (say) a DBA just running a sample query – why would we want to optimize the sample query at the expense of the app’s queries? The optimizer cannot tell the difference.

    But the reality is – we (ie, the database) serve two masters – some databases have the luxury of being carefully looked after by a DBA, whereas other databases are installed, created, and no-one ever looks at them again until a year down the track when someone wishes they’d done a backup 😦 So the strategies we use must cater to both – the automation and adaptability can cater to the latter “unattended” database, whilst a diligent DBA can take steps to rein the automation back in on those databases where they want more control.

    Nice post.

    Cheers,
    Connor

    • Hi Connor,

      thanks for commenting!

      >> we (ie, the database) serve two masters

      yeah that’s right (well…cum grano salis, regarding the backup 😉 … that might hopefully be automated too ;-)) … perhaps I’m emphasizing the “DBA administering tens/hundreds of databases” view here because I felt many views expressed elsewhere are based on the opposite (“your application”)

  2. Hi Sigrid

    Tony knows that I don’t always agree with him. For example, I strongly disagree with him when he says “We don’t want these features!”. IMO the point is to understand why a feature was implemented (let’s face it, Oracle developers are not introducing features that they expect to be useless), and how to control/configure it.

    Some systems need adaptive optimization (by writing “adaptive” I don’t refer to Oracle’s 12c features), especially when data, SQL statements and user inputs change steadily. In fact, in such an environment, it is simply not possible to have an optimizer that is able to generate an efficient plan for every hard parse. Hence, the optimizer has to adapt… IMO the optimizer group at Oracle also recognized that fact.

    That said, the adaptive features are not necessary in every environment. We have to be able to control them.

    In summary, no feature is always bad or always good. Our job is to know how and when to take advantage of every one of them.

    Just my 2 cents,
    Chris

  3. Hi Sigrid,
    I learned at OOW15 that in 12c the SPM Evolve Advisor Task has been moved outside of the Tuning Advisor, so it can be used without the Tuning Pack.
    I think we should really give SQL Plan Baselines a try in all environments that need plan stability. ISVs should release their applications with accepted SQL Plan Baselines.
    Regards,
    Franck.

  4. Nice post.

    My 2 cents:
    https://orastory.wordpress.com/2015/05/01/strategies-for-minimising-sql-execution-plan-instability/

    I see the need for all abilities/features.

    But most application owners I know want stability and predictability above all else.

    The reason being that, for example, one single plan “flip” can cause a priority incident, a breach of SLA, a need to stand up and represent the app in a management meeting, a red mark against that app and its management/developers in an end of year review.

    However, just how much adaptive or feedback functionality you have to turn off to get anywhere near that sort of stability is significant.

    So often we “notice” the adaptive features because they have caused a plan flip for the worse – particularly the cardinality feedback feature – but there are also many times where they have kicked in for the better (or with neutral effect).

    Normally, I vote for going with the defaults until it causes you a problem. Then if there’s a problem you have to react and decide whether you’re going to change something with a scope that matches the current problem or widen it out.

    • Hi Dom,

      many thanks for your comment! Thanks too for the link to your post, that is a very thorough and thoughtful overview.

      >> So often we “notice” the adaptive features because they have caused a plan flip for the worse – particularly the cardinality feedback feature – but there are also many times where they have kicked in for the better (or with neutral effect).

      This is why I’d like to play a bit devil’s advocate here – well perhaps I just like playing devil’s advocate, too 😉

  5. I am uncomfortable with plans changing on the fly after the code has been running in production for some time.
    Even without the 11g and 12c features, we used to see plan flips in earlier versions – some of them simply from updated statistics, some of them outliers. However, those were, to some measure, foreseeable.
    The adaptive behaviours in recent versions … how foreseeable are plan changes there, I wonder?
