Join Optimization in DML Generation

In this section:

Setting Join Strategy

Nested Loop Processing

Joining Columns of Unequal Length

Turning Join Optimization ON

WHERE Predicate Cloning

This section describes how joins can be optimized for DML Generation using SQL Translation and Hub Servers. This is particularly important when joining heterogeneous data sources across servers.

Top of page

Setting Join Strategy

SQL Translation offers two types of join strategy in DML Generation: nested loop and sort/ merge. To set a preferred join strategy, issue the following command in the Hub Server profile.

Top of page

Syntax: How to Set Preferred Join Strategy

SQL EDA SET JOINTYPE {NESTEDLOOP|SORTMERGE}

where:

NESTEDLOOP: Selects all rows from the outer table that match screening conditions, then uses those rows to obtain qualified data from the inner tables. DML Generation uses this strategy only for an equijoin on joinable columns.; Nested Loop is very effective when the screening conditions limit the number of rows returned from a file. It is not recommended otherwise.; NESTEDLOOP is the default for JOINTYPE if this setting is deleted from the server profile.
SORTMERGE: Retrieves data from each table and then sorts, merges, and aggregates at the end of the process. When SORTMERGE is specified, DML Generation always uses a sort/merge strategy to retrieve data. By default, the Hub Server contains an entry in the server profile setting JOINTYPE to SORTMERGE.; Sort/merge is used automatically if a query fails nested loop conditions. A Hub Server uses sort/merge, even if nested loop is the set strategy, if the server determines that one or more join conditions are for non-joinable columns. A non-joinable column is one in a non-relational data source without a declared index or key.; Sort/merge works effectively with large volumes of data. A sort/merge always fetches all data from the outer table to disk, then joins them locally.

Top of page

Nested Loop Processing

If JOINTYPE is not set or is set to NESTEDLOOP, DML Generation performs a set of optimization functions:

Determines the order of the tables in the FROM clause.

Determines which columns in the WHERE clause are joinable and at what cost.

Determines whether the formats of the potential join columns are compatible.

DML Generation uses the following methodology:

Assigns a relative cost to each join candidate to find the least expensive join structure.

Examines each WHERE clause predicate.

If it finds an equijoin (that is, WHERE clauses of the form fld1 = fld2, where fld1 and fld2 are columns from two of the FROM clause tables), DML Generation determines if either fld1 or fld2 is joinable.

Examines the join columns to ensure that they have compatible formats.
Creates a matrix of joinability containing information about the costs of joining from table to table. Each join candidate is assigned a cost, as follows:

COST=1

An equijoin to a joinable column (the to column is joinable and the columns have compatible formats).

COST=16

An equijoin to a non-joinable column (the to column is not joinable or the columns do not have compatible formats).

COST=256

A non-equijoin or unrestricted Cartesian product (there is no such relationship present in the WHERE clause).

After it builds the matrix, DML Generation tries the following join strategies:

Single-path join based on the FROM clause left-to-right order. The tables in the FROM clause are examined to determine if the first table is joinable to the second, the second to the third, and so forth. This is done by examining the matrix for each pair in FROM order. If each pair has a cost of 1, then this strategy is taken, resulting in the creation of a single-path nested loop structure.

Multi-path join based on the FROM clause left-to-right order. DML Generation examines the matrix using the leftmost table as the outermost table. It then determines if it can join the FROM tables in such a way that each TO table is to the right of the FROM table and the cost of each is 1. This results in the creation of a multi-path nested loop structure.

Composite join structure. DML Generation sorts the costs and chooses those with a cost of 1 over those with a higher cost. It builds a structure that either:

Is composed completely of equijoins based on joinable columns.

Requires the creation of intermediate hold tables to accomplish an equijoin on non-joinable columns.

Requires the creation of intermediate hold tables to accomplish a Cartesian product-based join.

DML Generation tries to respect the order of tables in the FROM clause. Therefore, optimal performance is achieved by making the most restricted table the outer table-that is, first in the FROM clause.

In any case, the least expensive join based on these rules is constructed and the rows returned are the same.

Top of page

Joining Columns of Unequal Length

All data adapters allow joins of columns of unequal length. With this capability, DML Generation assigns a cost of 1 to alphabetic joins of unequal length and between short-packed (8-byte) and long-packed (16-byte) columns.

Top of page

Reference: Joinable Columns

DML Generation considers the following columns to be joinable:

Host Column

Target Column

A (any length)

A (any length)

short P

short P

Long P

long P

short P

long P

Long P

short P

I

I

D

D

F

F

new date

new date

old date (I)

old date (I)

new date

old date (I)

old date (I)

new date

Note: Old dates of alpha format are not supported. If you have such a column, it should be defined as alpha.

Some data adapters for legacy files allow a generic join, which violates relational join rules by including extra columns that do not strictly match the select list. For those data adapters, DML Generation reapplies the screening criteria after the records are retrieved to ensure strict relational compliance.

For all signed numeric columns, the handling of different values for the + and - is performed by the data adapter, not DML Generation. All values corresponding to + and - should be retrievable in a join.

Top of page

Turning Join Optimization ON

Join optimization is controlled by the SET JOINOPT setting in the server profile. It is turned ON by default. For more information on SET JOINOPT, see the Server Administration manual.

Top of page

WHERE Predicate Cloning

Example:

Cloning the WHERE Clause Predicates Using Nested Loop

Cloning the WHERE Clause Predicates Using Sort/Merge

DML Generation also clones WHERE clause conditions, where appropriate, for optimal join performance. This optimization is performed for both nested loop and sort/merge joins when there is explicit value selection criteria on one of the columns.

DML Generation builds a table of equivalent columns, which have the same value for all rows in the answer set. The equivalencies are found by searching the WHERE clause for predicates that equate one database column to another (for example, A = B). Equijoins always fit this pattern. Next, all WHERE clause predicates on columns in the equivalence table are duplicated, and the column name in the predicate is replaced with its equivalent column.

Top of page

Example: Cloning the WHERE Clause Predicates Using Nested Loop

The following example illustrates an optimization procedure using a nested loop strategy. In this example:

Table A is 100,000 rows, 100 of which refer to Smith (as a foreign key).

Table B is 100 rows; one has Smith as its key.

For this SELECT statement

SELECT * FROM A,B WHERE B.key=A.key AND B.key='SMITH'

DML Generation clones the selection criteria for 'SMITH' by also applying it to A.key. Logically, the SELECT is transformed as follows:

SELECT * FROM A,B WHERE B.key=A.key AND B.key='SMITH' AND A.key='SMITH'

The subsequent processing involves the following operations:

Selecting rows from the outer table A by applying the generated WHERE clause criteria to retrieve the first 'SMITH' row.

Selecting rows from the inner table B by supplying the from column value of 'SMITH' and the WHERE clause criteria (also 'SMITH') to retrieve one row.
Repeating steps 1 and 2 a total of 99 more times.

The outer table SELECT returns 100 rows in one DBMS event. The inner SELECT is executed 100 times and returns one row for 100 DBMS events. The total is 200 reads, 101 DBMS events, and 100 rows returned for the query.

If DML Generation did not clone the WHERE predicate, the outer SELECT would return 100,000 rows, 99,900 of which would be rejected once the inner SELECT was applied using key values not equal to 'SMITH', which would return no rows. The total would have been 100,001 reads, 100,001 DBMS events, and 100 rows returned for the query.

Thus, with WHERE predicate cloning, maximum efficiency is reached: 200 reads instead of 100,001. WHERE clause cloning provides join optimization, and makes this operation almost as efficient, in this case, as the second example.

Top of page

Example: Cloning the WHERE Clause Predicates Using Sort/Merge

The following example illustrates an optimization procedure using a sort/merge strategy. In this example:

Table A is 100,000 rows, 100 of which refer to Smith (as a foreign key).

Table B is 100 rows; one has Smith as its key.

For this SELECT statement

SELECT * FROM A,B WHERE A.key=B.key AND B.key='SMITH'

DML Generation clones the selection criteria for 'SMITH' by applying it to A.key. Logically, the SELECT is transformed as follows:

SELECT * FROM A,B WHERE B.key=A.key AND B.key='SMITH' AND A.key='SMITH'

The subsequent processing involves the following operations:

Selecting rows from table A by applying the generated WHERE clause criteria to retrieve all 100 'SMITH' rows.

Selecting rows from table B by applying the original WHERE clause criteria to retrieve one row.
Merging the results from steps 1 and 2.

The first SELECT would return 100 rows for one DBMS event. The second would return one row for one DBMS event. The total is 101 reads, two DBMS events, and 100 rows returned for the query.

If DML Generation did not clone the WHERE predicate, there would have been 100,001 reads.

In each case, this transformation yields a query semantically equivalent to the original because A.key and B.key must be equal at all times to satisfy the preexisting equijoin criteria. In any retrieval scenario, the number of rows returned from table A will be reduced by the added predicate.

iWay Software

COST=1	An equijoin to a joinable column (the to column is joinable and the columns have compatible formats).
COST=16	An equijoin to a non-joinable column (the to column is not joinable or the columns do not have compatible formats).
COST=256	A non-equijoin or unrestricted Cartesian product (there is no such relationship present in the WHERE clause).

Host Column	Target Column
A (any length)	A (any length)
short P	short P
Long P	long P
short P	long P
Long P	short P
I	I
D	D
F	F
new date	new date
old date (I)	old date (I)
new date	old date (I)
old date (I)	new date