Optimizing Joins

This discussion applies to joins invoked either by the FOCUS dynamic JOIN command (see Advanced Reporting Techniques) or by the embedded join facility (see Multi-Table Structures).

In this section:

RDBMS and FOCUS Join Management

Join Optimization Logic

Optimization of Joins Between Heterogeneous Data Sources

Example:

Optimizing Joins Between Heterogeneous Data Sources

For information about using the JOIN command to construct an outer join and about controlling outer join optimization, see Advanced Reporting Techniques.

To explain how optimization logic is applied to joins, first there is a discussion of how the RDBMS and FOCUS differ in their management of joins.

Top of page

RDBMS and FOCUS Join Management

When the RDBMS manages a join, it effectively first generates a Cartesian product of the tables, then applies any screening conditions to the resulting rows (including any join conditions), then applies the SELECT list, and, finally, calculates column function values and expressions.

In contrast, when FOCUS manages a join, it first reads segment instances from the top to the bottom of the file structure.

Because FOCUS views RDBMS tables as segments, in very rare situations the RDBMS could return more instances of a row than FOCUS would if it were processing similar data from a FOCUS or sequential data source. This difference is called the multiplicative effect.

FOCUS implements a dynamic or embedded join between RDBMS tables (segments) based on a set of pairs of "from/to" fields. The "to" fields are called the foreign key.

The KEYS attribute in the Access File defines the number of primary key fields (columns) in a segment; KEYS = n indicates that the first n fields in the segment comprise the primary key. If the foreign key fields for a child segment or cross-referenced file do not completely cover its primary key, multiple rows in the child segment (table) may correspond to a single row in the parent segment. Therefore, the RDBMS-generated Cartesian product may contain multiple instances of a single parent segment row; the parent segment is multiplied. By definition, all ancestors of the multiplied segment are also multiplied.

A unique segment is defined as having exactly one instance corresponding to an instance of its parent segment. If its foreign key fields do not cover its primary key, this is not necessarily the case, even though the join was specified as unique. A segment in this situation is called misjoined.

Even if a unique segment is misjoined, however, the data adapter does not consider it as causing its parent's multiplication. The data adapter processes a "unique segment misjoined" condition separately from the multiplicative effect; it issues a warning message, and, depending on the current optimization setting, may or may not disable optimization.

In most cases, the data adapter can prevent the multiplicative effect, as described in Optimizing Joins and Optimizing Aggregation.

Top of page

Join Optimization Logic

When you invoke optimization, the data adapter attempts to build a single SQL statement. In order for the data adapter create one SQL statement that incorporates an RDBMS join on primary and foreign keys:

The tables must share a single retrieval path in the joined structure; multiple paths are not permitted.

Except for the lowest level segment referenced in the request, primary keys must exist (KEYS > 0 in the Access File) for all segments referenced in a report request that includes:

Aggregation operators such as SUM or COUNT, when the OPTIMIZATION setting is ON or FOCUS.

Multiple FST. or LST. operators on a segment that could return more than one record per sort break, regardless of the OPTIMIZATION setting.

SUM or COUNT operations at any level other than the lowest level in the join structure cause the RDBMS to duplicate data. In effect, each host row is replicated for each associated row in the cross-referenced table; this duplication of data is known as the multiplicative effect.

When all segments other than the lowest level segment in the request have primary keys, the data adapter passes such joins to the RDBMS with an SQL ORDER BY clause on all columns of each segment's primary key, in top to bottom order. The resulting sort on the returned answer set enables the data adapter to eliminate duplicate rows before passing the data to FOCUS. This technique is called data adapter-managed native join optimization. When data-adapter-managed native join optimization cannot be used, SUM or COUNT operations at any level other than the lowest level in the join structure cause optimization to be disabled.

In some cases, data adapter-managed joins may be less efficient than FOCUS-managed joins from prior releases. To invoke a FOCUS-managed join, set OPTIMIZATION to OFF.

When the data adapter manages join optimization, it does not optimize aggregation or sorting, as described in Optimizing Sorts and Optimizing Aggregation. This behavior is consistent with that of prior releases when the join was not passed. Expressions based on a single segment in the structure, however, are passed (see Optimizing Record Selection and Projection).

For DB2, tables must reside at the same location (subsystem).

The Master Files for the tables must not include any OCCURS segments.

You must not join more than 15 tables or views.

For conditional joins, the join expressions must be optimizable, as described in Optimizing DEFINE Fields.

If these conditions are satisfied, the data adapter generates one SQL SELECT statement that joins all the referenced tables in a single request.

If you set data adapter optimization OFF, or if any condition is not satisfied, the data adapter generates an individual SQL statement for each table; FOCUS performs the required processing on the returned answer sets.

Note:

A TABLE request that references a dynamically joined structure generates SQL join predicates for all segments in the subtree that starts from the root segment. The data adapter does not necessarily generate these predicates for multi-table Master Files (see Multi-Table Structures); in a multi-table structure the subtree effectively begins with the highest referenced segment. This effect may cause identical TABLE requests to produce different reports when run against a dynamic join structure and a multi-table Master File that represent the same tree structure.

When the data adapter does not pass a join to the RDBMS, the order in which you join the files becomes critical. To minimize the number of cursors opened, order the files from left to right by increasing number of records in either the file (for non-RDBMS files) or the returned answer set (for RDBMS tables).

Join order should not matter in an optimized join; however, performance may degrade when you join more than three tables.

Top of page

Optimization of Joins Between Heterogeneous Data Sources

The data adapter can pass a single SELECT statement for all tables referenced in a TABLE request to the RDBMS when all active segments from the RDBMS comprise a contiguous single-path subtree. The data adapter can optimize the join between relational tables even when the retrieval path includes segments from different types of data sources (for example, several DB2 segments and an IMS segment).

Single path means no non-unique RDBMS segment can have an RDBMS parent and a non- unique sibling (RDBMS or non-RDBMS). Active segments are those containing fields referenced in the request or fields involved in join operations. Contiguous segments are connected RDBMS segments with a common target RDBMS. There may be segments of different file types above or below the contiguous segments. Segments occupying intermediate positions in a file hierarchy between explicitly active segments are themselves also implicitly active.

Note:

If the request specifies a sort operation, the data adapter transfers the sort operation to the RDBMS only if the root of the SQL subtree is the topmost active segment in the data source. Additionally, all BY fields must be contained within the segments of the SQL subtree. If the multiplicative effect is detected and the data adapter-managed native join is invoked, the data adapter passes an SQL ORDER BY clause on all columns of each segment's primary key, in top-to-bottom order. The resulting sort on the returned answer set enables the data adapter to eliminate duplicate rows before passing the data to FOCUS.

FOCUS and the RDBMS use slightly different sorting/aggregation algorithms. Therefore, when the RDBMS processes a sort request, the sequence of report rows within a given sort key value may vary slightly from the sequence that would be produced by FOCUS. While both report results are equally correct, if such differences are unacceptable you can set OPTIMIZATION OFF. FOCUS will then handle all aggregation, sorts and join operations required to produce the report.

Top of page

Example: Optimizing Joins Between Heterogeneous Data Sources

The following example illustrates the SQL request passed to DB2 as the result of a dynamic join between two DB2 tables and an IDMS record. For information about JOIN syntax, see Advanced Reporting Techniques:

 JOIN EMP_ID IN EMPINFO TO WHO IN FUNDTRAN AS JOIN1
 JOIN EMP_ID IN EMPINFO TO EMP_ID IN IDMSFILE AS JOIN2

With optimization enabled, the data adapter produces one SQL SELECT statement that joins the two DB2 tables. The RDBMS processes the join:

>       SELECT T1.EID,T2.BN FROM "USER1"."EMPINFO" T1,
    "USER1"."FUNDTRAN" T2 WHERE (T2.EID = T1.EID) FOR FETCH ONLY;

For each DB2 joined row returned, the data adapter uses the join field as input to an IDMS OBTAIN record command utilizing an IDMS index or a Calc key.

With optimization disabled, the data adapter generates a separate SQL SELECT statement for each DB2 table:

      SELECT T1.EID FROM "USER1"."EMPINFO" T1 FOR FETCH ONLY;

After a row is fetched from table1, the join field is used as input to the second select:

      SELECT T2.BN FROM "USER1"."FUNDTRAN" T2 WHERE (T2.EID = ?) FOR
     FETCH ONLY;

For each DB2 joined row returned, the data adapter uses the join field as input to an IDMS OBTAIN record command using an IDMS index or a Calc key.

Information Builders