Speeding up a non-equijoin

This is just a small piece about an idea I’ve had to speed up the unfortunate “where tab1.col1 between tab2.col2 and tab2.col3” query.
This type of join is notoriously difficult: with a large amount of data in both tables, you would want Oracle to perform a hash join, but it can’t, because we have a non-equijoin here.

Basically, the idea is to find a way to join the tables on a less precise identity condition, keeping false positives, and then filter out the rows that do not fulfill the real requirement. Instead of a sort merge join running endlessly, you want to end up with a hash join followed by filtering and sorting out duplicates.

While the idea is simple, the realization depends on finding such a “less precise condition” that will not lose any required rows – what we need is

f(tab1.col1) = f(tab2.col2) = f(tab2.col3)

and such a function might be difficult to find, or might necessitate a split into several subsets, each joined and filtered by itself – which in turn erodes the splendid performance we were hoping for. There might, however, be ways to find such a condition without bloating the query too much.
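To make the mechanics concrete before diving into SQL, here is a minimal sketch of the idea in Python (the function, names, and bucket width are all invented for illustration): join on a coarse bucket – an equality condition a hash join can use – keeping the false positives, then apply the exact condition and deduplicate.

```python
# Sketch of the "less precise join" idea: join on a coarse bucket
# (an equality a hash join can use), then filter out false positives
# with the exact BETWEEN condition and remove duplicates.

BUCKET = 100_000  # bucket width; an assumption for this illustration

def bucket(n):
    return n // BUCKET

def range_join(cases, ranges):
    """cases: list of (id, code); ranges: list of (start, end, label)."""
    # Build a hash table on the coarse key, like a hash join would.
    by_bucket = {}
    for start, end, label in ranges:
        # A range may span two buckets, so register it under both
        # bucket(start) and bucket(end) -- this mirrors the UNION ALL
        # of two hash joins later in the article.
        for b in {bucket(start), bucket(end)}:
            by_bucket.setdefault(b, []).append((start, end, label))
    result = set()  # plays the role of SELECT DISTINCT
    for id_, code in cases:
        for start, end, label in by_bucket.get(bucket(code), []):
            if start <= code <= end:  # the exact filter
                result.add((id_, label))
    return result
```

This only stays correct as long as no range spans more than two buckets – exactly the property the article verifies against the test data below.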

To demonstrate, I’ve built a test case analogous to the real case I encountered, and I’ve populated the test tables in a way that did not make my life simple (that is to say, I did not “fake” the data to be easily handled by the idea, but generated random data that turned out to have the same “problems” as the real-world situation).

The test case

As in the real-world example I’d encountered, the table containing the range definitions has two indexes: the “range start” column is the primary key, and the “range end” column carries a unique index:

SQL> create table ranges (range_start number primary key, range_end number, label varchar2(10));
SQL> create unique index range_end_uk on ranges(range_end);
SQL> alter table ranges add constraint range_end_unique unique(range_end) using index range_end_uk;

SQL> create table cases (id number generated always as identity primary key, range_code number);

The amount of data inserted is a fraction of the real example’s size, adapted to the circumstances of the test being run on my laptop…

SQL> insert into ranges (range_start, range_end, label)
        with r (startnum, endnum, lbl) as
            (select 10000 startnum, 10000 + round(dbms_random.value(9000, 11000)) endnum,   dbms_random.string('a', 10) lbl from dual
             union all
             select endnum + 1, endnum + 1 + round(dbms_random.value(9000, 11000)), dbms_random.string('a', 10) from r
        where startnum < 1000000000)
        select * from r;
SQL> exec dbms_stats.gather_table_stats(user, 'ranges');

SQL> insert into cases (range_code)
        select round(dbms_random.value(10000, 1000000000)) from dual 
        connect by level < 100000;
SQL> exec dbms_stats.gather_table_stats(user, 'cases');

SQL> select table_name, num_rows from user_tables;

TABLE_NAME			 NUM_ROWS
------------------------------ ----------
CASES				    99999
RANGES				   100021

SQL> select table_name, column_name, num_distinct, histogram from user_tab_columns order by 1,2;

TABLE_NAME		       COLUMN_NAME		      NUM_DISTINCT HISTOGRAM
------------------------------ ------------------------------ ------------ ---------------
CASES			       ID				     99999 NONE
CASES			       RANGE_CODE			     99999 NONE
RANGES			       LABEL				    100021 NONE
RANGES			       RANGE_END			    100021 NONE
RANGES			       RANGE_START			    100021 NONE

The original query

Now first, for the original query.

SQL> select c.id, r.label from cases c join ranges r
  on (c.range_code between r.range_start and r.range_end);

For an output of 99999 rows, this query runs forever (40 minutes on my laptop), doing a merge join between ranges (accessed via the primary key index) and cases on c.range_code >= r.range_start, while the join result is filtered on c.range_code <= r.range_end. The plan below shows where the time goes: because of the inequality join condition, the SORT JOIN at step 5 is started 100K times and delivers nearly 5 billion rows in total – for every range, all case rows from the first matching one onwards are read again.

SQL> select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID	54nqyafhsw8fp, child number 0
-------------------------------------
select c.id, r.label from jointest.cases c join jointest.ranges r on
(c.range_code between r.range_start and r.range_end)

Plan hash value: 3569056763

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation		     | Name	   | Starts | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT	     |		   |	  1 |	     |	     |	     | 34778 (100)|	     |	99999 |00:40:10.12 |   14854 |	 1500 |       |       | 	 |
|   1 |  MERGE JOIN		     |		   |	  1 |	2500M|	  86G|	     | 34778  (97)| 00:00:02 |	99999 |00:40:10.12 |   14854 |	 1500 |       |       | 	 |
|   2 |   TABLE ACCESS BY INDEX ROWID| RANGES	   |	  1 |	 100K|	2441K|	     |	 627   (1)| 00:00:01 |	  100K|00:00:04.38 |   13975 |	  625 |       |       | 	 |
|   3 |    INDEX FULL SCAN	     | SYS_C009905 |	  1 |	 100K|	     |	     |	 214   (1)| 00:00:01 |	  100K|00:00:01.26 |	6929 |	  214 |       |       | 	 |
|*  4 |   FILTER		     |		   |	100K|	     |	     |	     |		  |	     |	99999 |00:40:05.43 |	 879 |	  875 |       |       | 	 |
|*  5 |    SORT JOIN		     |		   |	100K|  99999 |	1171K|	3944K|	 699   (2)| 00:00:01 |	 4990M|00:25:59.14 |	 879 |	  875 |  3525K|   816K| 3133K (0)|
|   6 |     TABLE ACCESS FULL	     | CASES	   |	  1 |  99999 |	1171K|	     |	 240   (1)| 00:00:01 |	99999 |00:00:00.37 |	 879 |	  875 |       |       | 	 |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------

   4 - filter("C"."RANGE_CODE"<="R"."RANGE_END")
   5 - access("C"."RANGE_CODE">="R"."RANGE_START")
       filter("C"."RANGE_CODE">="R"."RANGE_START")

The workaround

So, I am looking for some function to apply to cases.range_code that will yield the same result when applied to range_start and range_end in the ranges table. Unfortunately, a simple

trunc(range_code/100000)

(a smaller divisor would make even more ranges straddle a bucket boundary, while a larger one would render the hash join less effective) would lose rows:

SQL> select * from      
       (select range_start, range_end,
        trunc(range_start/100000) range_start_truncated,
        trunc(range_end/100000) range_end_truncated
        from ranges) 
        where range_start_truncated != range_end_truncated;


RANGE_START  RANGE_END RANGE_START_TRUNCATED RANGE_END_TRUNCATED
----------- ---------- --------------------- -------------------
...
903798841  903809794			9037		    9038
903899431  903908527			9038		    9039
903991927  904002717			9039		    9040
904095228  904104978			9040		    9041

9999 rows selected.

Still, the situation is not too bad: range_start and range_end, when divided by 100000 and truncated, differ by at most 1:

SQL> select * from      
       (select range_start, range_end,
        trunc(range_start/100000) range_start_truncated,
        trunc(range_end/100000) range_end_truncated
        from ranges) 
        where abs(range_start_truncated - range_end_truncated) > 1;

No rows selected.

which means I can simply split the hash join into two parts – trunc(range_code/100000) has to match either trunc(range_start/100000) or trunc(range_end/100000) –, union all the result sets, and keep the distinct rows that satisfy the original condition:

SQL> with pre_filter as
       (select c.id, c.range_code, r.range_start, r.range_end
        from cases c join ranges r
        on trunc(c.range_code/100000) = trunc(r.range_start/100000)
        union all
        select c.id, c.range_code, r.range_start, r.range_end
        from cases c join ranges r on
        trunc(c.range_code/100000) = trunc(r.range_end/100000))
          select distinct * from pre_filter
          where range_code between range_start and range_end;

This way, the query runs in one and a half seconds:

-------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation	      | Name   | Starts | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| E-Time	 | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |        |      1 |	 |	 |	 |   374K(100)| 	 |  99999 |00:00:01.60 |    2670 |	 |	 |	    |
|   1 |  HASH UNIQUE	      |        |      1 |     50M|  1240M|  1915M|   374K  (2)| 00:00:15 |  99999 |00:00:01.60 |    2670 |  8288K|  3198K| 6597K (0)|
|   2 |   VIEW		      |        |      1 |     50M|  1240M|	 |  2570  (53)| 00:00:01 |    189K|00:00:01.37 |    2670 |	 |	 |	    |
|   3 |    UNION-ALL	      |        |      1 |	 |	 |	 |	      | 	 |    189K|00:00:01.30 |    2670 |	 |	 |	    |
|*  4 |     HASH JOIN	      |        |      1 |     25M|   620M|  2344K|  1285  (53)| 00:00:01 |  94838 |00:00:00.58 |    1335 |  6824K|  1791K|    9M (0)|
|   5 |      TABLE ACCESS FULL| CASES  |      1 |  99999 |  1171K|	 |   240   (1)| 00:00:01 |  99999 |00:00:00.03 |     879 |	 |	 |	    |
|   6 |      TABLE ACCESS FULL| RANGES |      1 |    100K|  1367K|	 |   137   (1)| 00:00:01 |    100K|00:00:00.03 |     456 |	 |	 |	    |
|*  7 |     HASH JOIN	      |        |      1 |     25M|   620M|  2344K|  1285  (53)| 00:00:01 |  95020 |00:00:00.54 |    1335 |  6824K|  1791K|    9M (0)|
|   8 |      TABLE ACCESS FULL| CASES  |      1 |  99999 |  1171K|	 |   240   (1)| 00:00:01 |  99999 |00:00:00.03 |     879 |	 |	 |	    |
|   9 |      TABLE ACCESS FULL| RANGES |      1 |    100K|  1367K|	 |   137   (1)| 00:00:01 |    100K|00:00:00.03 |     456 |	 |	 |	    |
-------------------------------------------------------------------------------------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access(TRUNC("C"."RANGE_CODE"/100000)=TRUNC("R"."RANGE_START"/100000))
       filter(("C"."RANGE_CODE">="R"."RANGE_START" AND "C"."RANGE_CODE"<="R"."RANGE_END"))
   7 - access(TRUNC("C"."RANGE_CODE"/100000)=TRUNC("R"."RANGE_END"/100000))
       filter(("C"."RANGE_CODE">="R"."RANGE_START" AND "C"."RANGE_CODE"<="R"."RANGE_END"))

Of course, the data might make things even more complicated in other cases. Still, even a union of several hash joins (with the groups carefully chosen so that no rows are lost) might run considerably faster than the sort merge join the optimizer would have to do otherwise.
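For completeness: if a range could span more than two buckets, the same trick still works, provided each range is registered under every bucket it covers before the equality join. A hypothetical sketch in Python (bucket width and all names invented, not part of the original test case):

```python
# Generalization of the bucket-join idea: a range spanning any number
# of buckets is registered under each bucket it intersects, so the
# equality join on the bucket number still finds every match.

BUCKET = 100_000  # bucket width, chosen for illustration

def covered_buckets(start, end):
    """All bucket numbers the closed interval [start, end] intersects."""
    return range(start // BUCKET, end // BUCKET + 1)

def range_join_general(cases, ranges):
    """cases: list of (id, code); ranges: list of (start, end, label)."""
    by_bucket = {}
    for start, end, label in ranges:
        for b in covered_buckets(start, end):
            by_bucket.setdefault(b, []).append((start, end, label))
    # Equality lookup on the bucket, exact filter, duplicates removed.
    return {(id_, label)
            for id_, code in cases
            for start, end, label in by_bucket.get(code // BUCKET, [])
            if start <= code <= end}
```

The trade-off is the same one the article mentions: the wider the ranges relative to the bucket size, the more copies of each range end up in the hash table, so the bucket width has to be chosen against the actual data.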