11.6 The SQL Inside Your PL/SQL

Now that we've explored calling PL/SQL from SQL, let's turn the tables and explore the use of SQL inside your PL/SQL code. SQL is great at manipulating large sets of data, but there are situations where you need to work with data at the row level. PL/SQL, with its looping and cursor control capabilities, gives you the flexibility to work at the set level using SQL or at the row level using cursors. However, many PL/SQL programmers forgo the power of SQL and do everything at the row level, even when doing so is unnecessary and time-consuming.
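To make the distinction concrete, here is a minimal sketch contrasting the two approaches; the orders table, its amount and status columns, and the 1000 threshold are all hypothetical:

    -- Row-level approach: fetch each row, decide, and issue one UPDATE at a time
    BEGIN
       FOR r IN (SELECT order_id, amount FROM orders) LOOP
          IF r.amount > 1000 THEN
             UPDATE orders
                SET status = 'PRIORITY'
              WHERE order_id = r.order_id;
          END IF;
       END LOOP;
    END;
    /

    -- Set-level approach: a single statement does the same work on the server
    UPDATE orders
       SET status = 'PRIORITY'
     WHERE amount > 1000;

The second form hands the entire set to the SQL engine, which can scan and modify the rows without a context switch between PL/SQL and SQL for every row.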

As an analogy, imagine that you are working at a warehouse, and a large shipment of parts arrives on the loading dock. Your job is to separate the shipment by part type and distribute the pieces to different areas of the warehouse. To make your job easier, the warehouse owner has procured the best forklift money can buy. There are two possible strategies to employ:

  • Pick up one box at a time, determine the type, and drive it to the appropriate destination.

  • Spend some time analyzing the situation, determine that every box on a pallet is of the same type, and drive entire pallets to the appropriate destination.

Although this analogy might be overly simplistic, it does serve to illustrate the difference between set operations and row operations. Allowing the Oracle server to manipulate large sets in a single operation can often yield a performance improvement of several orders of magnitude over manipulating one row at a time, especially on systems with multiple CPUs.

When a procedural language is used for database access (whether it is PL/SQL, C with OCI calls, or Java using JDBC), there is a tendency to employ the first strategy. Perhaps programmers are accustomed to coding at a low level of granularity when using a procedural language, and this spills over into their data access logic. This situation is especially prevalent in systems that need to process and load large amounts of data from external files, such as data warehouse load utilities.

Imagine that you are charged with building an infrastructure to accept files from multiple OLTP systems, perform various data cleaning operations, and aggregate the data into a data warehouse. Using PL/SQL (or C, Java, C++, COBOL, etc.), you could build functionality that does the following (a sketch in PL/SQL appears after the list):

  1. Opens a given file.

  2. Reads a line, verifies/cleans the data, and updates the appropriate row of the appropriate fact table in the data warehouse.

  3. Repeats step 2 until the file is exhausted.

  4. Closes the file.
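In PL/SQL, that row-by-row loader might look something like the following sketch. The DATA_DIR directory object, the feed.dat file name, and the clean_and_apply_line procedure (which would hold the verification and update logic of step 2) are all hypothetical:

    DECLARE
       feed   UTL_FILE.FILE_TYPE;
       line   VARCHAR2(4000);
    BEGIN
       feed := UTL_FILE.FOPEN('DATA_DIR', 'feed.dat', 'r');   -- step 1
       LOOP
          BEGIN
             UTL_FILE.GET_LINE(feed, line);                   -- step 2
          EXCEPTION
             WHEN NO_DATA_FOUND THEN EXIT;                    -- step 3
          END;
          clean_and_apply_line(line);  -- hypothetical: verify/clean the line
                                       -- and update the matching fact row
       END LOOP;
       UTL_FILE.FCLOSE(feed);                                 -- step 4
    END;
    /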

Although this approach might work for small files, it is not uncommon for large warehouses to receive feeds containing hundreds of thousands or millions of items. Even if your code is extremely efficient, processing a million-line file could take several hours.

Here's an alternative strategy that employs the power of the Oracle server to make quick work of large data feeds (the SQL for steps 2 through 5 is sketched after the list):

  1. Create a staging table for each unique data feed file format.

  2. At the start of the load process, truncate the staging tables.

  3. Use SQL*Loader with the direct path option to quickly load the data file into the appropriate staging table.

  4. Update all rows of the staging table to clean, verify, and transform data, marking rows as invalid if they fail verification. Perform the operation in parallel if possible.

  5. Update the appropriate fact table using a subquery against the staging table. Again, perform in parallel if possible.
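The SQL for steps 2 through 5 might look something like the sketch below. The sales_stage and sales_fact tables, their columns, the control file, and the login are hypothetical, and the PARALLEL hints take full effect on the updates only once parallel DML is enabled for the session:

    -- Step 2: empty the staging table
    TRUNCATE TABLE sales_stage;

    -- Step 3: load the feed with the direct path option (run from the shell):
    --    sqlldr userid=scott/tiger control=sales_stage.ctl direct=true

    -- Step 4: clean and verify every row in a single (parallel) statement
    ALTER SESSION ENABLE PARALLEL DML;

    UPDATE /*+ PARALLEL(s, 4) */ sales_stage s
       SET s.valid_flag = CASE
                             WHEN s.qty > 0 AND s.product_id IS NOT NULL
                             THEN 'Y'
                             ELSE 'N'
                          END;

    -- Step 5: update the fact table using a subquery against the staging table
    UPDATE /*+ PARALLEL(f, 4) */ sales_fact f
       SET f.total_qty = f.total_qty +
              (SELECT SUM(s.qty)
                 FROM sales_stage s
                WHERE s.valid_flag = 'Y'
                  AND s.product_id = f.product_id)
     WHERE EXISTS (SELECT NULL
                     FROM sales_stage s
                    WHERE s.valid_flag = 'Y'
                      AND s.product_id = f.product_id);

Because each statement operates on the entire staging table at once, the server is free to divide the work among multiple parallel processes.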

For this strategy to succeed, you need adequate disk space and sufficiently large rollback and temporary tablespaces. With adequate resources and properly constructed SQL statements, however, this strategy can yield a 10X improvement over the row-by-row approach.

So what role should PL/SQL play in such a scenario? In this case, PL/SQL would be an excellent vehicle for executing steps 4 and 5 of the previous list. Although the stored procedures might each contain only a single UPDATE statement, the SQL is likely to be complex and may contain optimizer hints and other advanced features. Therefore, it is advisable to isolate the SQL from the rest of the application so that it can be independently monitored and tuned.
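For example, step 4 could be wrapped in a trivial stored procedure (again using the hypothetical sales_stage table), giving the statement a fixed home where it can be re-hinted and tuned without disturbing the calling application:

    CREATE OR REPLACE PROCEDURE clean_sales_stage IS
    BEGIN
       -- The single UPDATE below is the entire procedure; isolating it
       -- here means it can be rewritten or re-hinted without touching
       -- any caller
       UPDATE /*+ PARALLEL(s, 4) */ sales_stage s
          SET s.valid_flag = CASE
                                WHEN s.qty > 0 AND s.product_id IS NOT NULL
                                THEN 'Y'
                                ELSE 'N'
                             END;
       COMMIT;
    END clean_sales_stage;
    /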

In general, when dealing with complex logic involving large data sets, it is advantageous to think in terms of data sets rather than programming steps. In other words, ask yourself where your data is, where it needs to move to, and what needs to happen to it along the way, rather than thinking about what must happen to each piece of data in order to satisfy the business requirements. If you follow this strategy, you will find yourself writing substantial, efficient SQL statements that employ PL/SQL where appropriate, rather than complex PL/SQL routines that employ SQL when needed. You will also give the server the opportunity to split large workloads into multiple pieces that run in parallel, which can greatly improve performance.