Method 1: Transform, Then Load

Let me start by mapping out the most efficient transform-then-load implementation options (shown in Figure 6-4).

Figure 6-4. "Transform Then Load" Implementation Options


Yes, there are implementation options other than those presented in Figure 6-4. But these methods represent the most efficient techniques available, period. Let's examine each of these five scenarios in more detail.

Scenario 1

If you only need simple data transformations and you're using Oracle 8i, then your best and most obvious choice is to simply use SQL Loader. Of course, you'll have to write the correct control file to implement your simple transformation logic, but otherwise, you can just run SQL Loader in parallel with direct mode load. There are, of course, two cases you may have to handle: many small, identical input files or one large input file.
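For reference, a control file for this kind of simple transformation might look something like the sketch below. The target table, column names, filter, and NULLIF rule are illustrative assumptions, not taken from a specific schema; also keep in mind that with 8i direct path loads the control file transformations are generally limited to the simpler features (record filtering, NULLIF handling, and datatype/format conversions) rather than arbitrary SQL expressions.

-- xxx.ctl: a minimal, illustrative control file
load data
infile 'xxx.inp'                 -- overridden by data= on the command line
append
into table xxx
when (c1 != blanks)              -- skip records with no key value
fields terminated by ','
( c1,
  c2,
  c3,
  c4 nullif c4 = 'N/A',          -- treat 'N/A' as a null
  c5,
  c6 )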

The case of many small, identical input files is a natural fit for running SQL Loader in parallel with direct loads. It's really just a simple scripting exercise. Let's assume we're on a UNIX server, have 200 input files whose names start with xxx.inp, and that we want to execute 10 concurrent SQL Loader processes at a time until all 200 files are loaded. The UNIX shell script to accomplish this would be:

#!/bin/sh
# Load many small files by running N concurrent SQL Loader direct-path sessions.
# Note: the userid= connect string is omitted from the sqlldr call for brevity.
degree=10
inp_name=xxx.inp
ctl_name=xxx.ctl
file_count=`ls ${inp_name}* 2>/dev/null | wc -l`
if [ $file_count -gt 0 ]
then
  rm -f file_list*
  ls ${inp_name}* > file_list
  # ceiling of file_count/degree filenames per sub-list
  split_count=`expr \( $file_count + $degree - 1 \) / $degree`
  split -$split_count file_list file_list_
  for list in `ls file_list_*`
  do
    # one background thread per sub-list, loading its files serially
    ( cat $list | while read file
      do
        if [ -s $file ]
        then
          sqlldr data=$file control=$ctl_name direct=true parallel=true
        fi
      done ) &
  done
  wait
fi

How does this script work? First, the ls command writes the names of all the input files it finds (200 in our example, though the script works for whatever number of files it finds) to the file file_list. Second, the split command breaks that complete list into degree sub-lists, each containing roughly N/degree filenames. So now we have 10 sub-lists, each containing 20 filenames. Then, for each sub-list, the script spawns a background process that runs SQL Loader against each input file contained in that sub-list. Finally, the wait command pauses until all the parallel threads complete.

One large input file, on the other hand, is a bit more complicated to handle. It is most desirable to logically split the large file N ways, but preferably without increasing the total I/O. For example, if we simply ran the UNIX split command on a large file to break it into smaller files, and then ran SQL Loader in parallel on those smaller files, we'd triple our I/O! Why? A split is essentially a copy (i.e., a read and a write of each record), and we would then have to re-read the copied records. That's three times the I/O, and as any good DBA knows, I/O is the mortal enemy. Plus, we'd have to wait for the split to complete before starting any SQL Loader processing, so we also would not be minimizing our inter-process waits. It does not get any worse than that: triple the I/O and fully maximized inter-process waits!

To avoid these performance issues, we will make extensive use of pipes. While it may sound like I'm splitting hairs here and that pipes are really no better than the split, they are. First, pipes operate entirely in memory, so no additional I/O is incurred. Second, pipes reduce inter-process waits because they connect processes asynchronously. So, pipes are truly the way to go.
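To make the contrast concrete, here is a minimal sketch (separate from the production script below, with illustrative file names) of the two approaches. The split version writes every record to disk and then re-reads it; the named pipe version streams the same records through memory while the consumer runs concurrently. As before, the sqlldr connect string is omitted.

# Naive approach: split copies every record to disk, then SQL Loader re-reads it
split -l 100000 yyy.inp part_
sqlldr data=part_aa control=yyy.ctl direct=true parallel=true

# Pipe approach: no intermediate file; producer and consumer run concurrently
mkfifo my_pipe
cat yyy.inp > my_pipe &
sqlldr data=my_pipe control=yyy.ctl direct=true parallel=true
rm -f my_pipe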

Let's assume that we're on a UNIX server, have one large input file whose name is yyy.inp, and that we want to execute 10 concurrent SQL Loader processes on that single file. The UNIX shell script to accomplish this would be:

#!/bin/sh
# Load one large file by fanning its records out to N concurrent
# SQL Loader direct-path sessions via named pipes (no extra I/O).
# Note: the userid= connect string is omitted from the sqlldr call for brevity.
degree=10
inp_name=yyy.inp
ctl_name=yyy.ctl
tmp_name=/tmp/`basename $0`.tmp
if [ -s $inp_name ]
then
  # set up one awk filter and one SQL Loader per degree of parallelism
  i=0
  while [ $i -lt $degree ]
  do
    mkfifo pipe_a$i
    mkfifo pipe_b$i
    awk 'NR % '$degree' == '$i' {print $0}' < pipe_a$i > pipe_b$i &
    sqlldr data=pipe_b$i control=$ctl_name direct=true parallel=true &
    i=`expr $i + 1`
  done

  # build the command string that cats the file once and tees it to every pipe
  i=0
  j=`expr $degree - 1`
  x="cat $inp_name"
  while [ $i -lt $degree ]
  do
    if [ $i -eq $j ]
    then
      x="$x > pipe_a$i"
    else
      x="$x | tee pipe_a$i"
    fi
    i=`expr $i + 1`
  done

  # write the command string to a temporary file and execute it
  echo $x > $tmp_name
  sh $tmp_name
  rm -f $tmp_name

  # clean up the named pipes
  i=0
  while [ $i -lt $degree ]
  do
    rm -f pipe_a$i
    rm -f pipe_b$i
    i=`expr $i + 1`
  done
fi

How does this script work? For each desired degree of parallelism, the following occurs: First, two named pipes are created, pipe_a and pipe_b. Second, an awk record-filter process is started in the background, reading from pipe_a and writing to pipe_b. These awk filters let each parallel thread process just the subset of records whose record number modulo N matches that thread's index (0 through N-1). Third, a SQL Loader process is started in the background, reading from pipe_b. So, our overall parallel process flow looks something like:

pipe_a0 -> awk mod 0 filter -> pipe_b0 -> SQL Loader
pipe_a1 -> awk mod 1 filter -> pipe_b1 -> SQL Loader
...
pipe_a(N-1) -> awk mod N-1 filter -> pipe_b(N-1) -> SQL Loader

Next, the script builds a command string that initiates the entire parallel processing architecture. What we want to do is cat the large file once and simultaneously feed that data to all the parallel process threads, which we do via unnamed pipes, the tee command, and output file redirection. Hence, the constructed command string will be of the form:

cat $inp_name | tee pipe_a0 | tee pipe_a1 | ... | tee pipe_a(N-2) > pipe_a(N-1)

Finally, we write that command string to a temporary file and execute it with sh. The reason we write it to a temporary file first, rather than executing the string in place, is that the pipe and redirect characters in the command string can cause substitution problems on some UNIX flavors.

I'll leave it as an exercise for the reader to combine these two scripts for the case of many medium-sized files, loading in parallel both across and within input files. Even if you decide not to try this exercise, think about it for a moment. Neither of the above scripts is trivial, and combining them would take some effort. Remember this fact, because in later sections, when I say that new Oracle 9i features automate all the parallel implementation headaches for you, I mean as compared to writing and combining scripts like these. In Oracle 9i, you merely provide parallel DML and SQL hints to accomplish the exact same thing. That's why Oracle 9i is the way to go for data loading.

Scenario 2

If you only need simple data transformations and you're using Oracle 9i, then your best and most obvious choice is to use 9i's new external table mechanism. What exactly are external tables? Oracle defines them as the metadata necessary to describe an external flat file such that Oracle can provide read-only access to that data as if it were a regular database table. Thus, Oracle abstracts away the implementation while providing the full expressive power of SQL, including parallel queries. Imagine, SQL for flat files!

As before, there are two cases you may have to handle: many small, identical input files or one large input file. This time, we'll look at these cases in reverse order since the one large input file case is now so simple as to be laughable.

Let's assume that once again, we're on a UNIX server, have one large input file whose name is yyy.inp, and that we want to execute 10 concurrent loading processes into our regular database table from that single external table (i.e., the input file). The Oracle 9i SQL code to accomplish this would be:

create directory inp_dir as '/home/oracle/input_files';

create table yyy (c1 number, c2 number, c3 number,
                  c4 number, c5 number, c6 number)
organization external (
    type oracle_loader
    default directory inp_dir
    access parameters (
        fields terminated by ','
    )
    location ('yyy.inp')
)
parallel 10;

alter session enable parallel dml;
insert /*+ parallel(fact,10) append */ into fact
    select /*+ parallel(yyy,10) full(yyy) */ * from yyy;
commit;

How easy is that? All we had to do was give Oracle a few simple hints to get all that parallel processing done for us. Note that, aside from the directory path, this code would work just the same on a Windows NT server (or any other OS, for that matter).
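For example, only the create directory statement would need to change on Windows; the path below is, of course, just an illustrative one.

create directory inp_dir as 'C:\oracle\input_files';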

Look back at the previous scenario's one-large-input-file case: it took nearly 50 lines of complex scripting code to achieve the same results as this insert ... select with hints. I rest my case.

The case of many small, identical input files, on the other hand, is slightly more complicated to handle. But as before, it's really just a simple scripting exercise. Let's assume we're on a UNIX server, have 200 input files whose names start with xxx.inp, and that we want to execute 10 concurrent loading processes until all 200 files have been loaded into our regular database table through a single external table.

First, we must update the previous example's external table definition so that it processes 10 files at a time (i.e., we now list 10 files under location instead of one). This causes Oracle to process all of those files at once (per operation). That's all there is to it.

create directory inp_dir as '/home/oracle/input_files';

create table xxx (c1 number, c2 number, c3 number,
                  c4 number, c5 number, c6 number)
organization external (
    type oracle_loader
    default directory inp_dir
    access parameters (
        fields terminated by ','
    )
    location ('file_0',
              'file_1',
              'file_2',
              'file_3',
              'file_4',
              'file_5',
              'file_6',
              'file_7',
              'file_8',
              'file_9')
)
parallel 10;

The UNIX shell script to process through all the input files would simply be:

#!/bin/sh
# Load many small files through the 10-location external table, 10 files per batch.
# Run from the directory that inp_dir points at (/home/oracle/input_files).
degree=10
inp_name=xxx.inp
file_count=`ls ${inp_name}* 2>/dev/null | wc -l`
if [ $file_count -gt 0 ]
then
  rm -f file_list*
  ls ${inp_name}* > file_list
  # one sub-list per batch, each containing degree filenames
  split_count=$degree
  split -$split_count file_list file_list_
  for list in `ls file_list_*`
  do
    # point the external table's location files at this batch of input files
    i=0
    while read file
    do
      rm -f file_$i
      ln -s $file file_$i
      i=`expr $i + 1`
    done < $list
    # pad unused location slots with empty files (the last batch may be short)
    while [ $i -lt $degree ]
    do
      rm -f file_$i
      touch file_$i
      i=`expr $i + 1`
    done
    # a real script would supply a connect string (e.g., sqlplus -s user/pass)
    sqlplus <<EOF
        alter session enable parallel dml;
        insert /*+ parallel(fact,10) append */ into fact
            select /*+ parallel(xxx,10) full(xxx) */ * from xxx;
EOF
  done
fi

How does this script work? First, the ls command writes the names of all the input files it finds (200 in our example, though the script works for whatever number of files it finds) to the file file_list. Second, the split command breaks that complete list into sub-lists, each containing degree filenames; so now we have 20 sub-lists, each containing 10 filenames. Then, for each sub-list, the script creates a soft link from each input file in that sub-list to one of the external table's location filenames (i.e., file_0 through file_9), padding any unused slots with empty files. Finally, it runs a SQL Plus session to load that batch. The process repeats until all the sub-lists, and thus all the files, have been loaded.

Scenario 3

If you need more complex data transformations, are only doing inserts (i.e., no updates), and you're using Oracle 8i, then your best choice is to use SQL Loader with row-level pre-insert triggers. Of course, you'll have to write the correct control file logic and trigger code to implement your data transformations, but otherwise, you just run SQL Loader in parallel without direct mode load (so the trigger can fire). In many respects, this scenario is very much like the first. In fact, you can use almost the exact same shell scripts, with the only modification being to change the lines calling SQL Loader from direct=true to direct=false; that's it.

Of course, you also need to create a row-level pre-insert trigger on the target table to perform your data transformation logic. An example would be:

create or replace trigger xxx_trg
before insert on xxx
referencing old as old new as new
for each row
declare
  av integer;
begin

  /* Obtain adjustment via lookup */
  begin
    select  adj_value
      into  av
      from  lookup_table
      where adj_lookup = :new.c2;
  exception
    when others then
      av := 0;
  end;

  /* Calculate final expression value */
  :new.c6 := nvl(:new.c4,0) + nvl(:new.c5,0) - av;

end;
/

In this example, we're calculating a column as an expression based on other columns from that same row and then applying adjustments to the calculated value based on some lookup table. Of course, our complex transformation logic is limited only by our PL/SQL coding ability. You cannot do this kind of stuff with just plain, old SQL Loader; you need the complete and expressive power of PL/SQL to accomplish such complex data transformations.

The reason this method works only for inserts (and not updates) is Oracle error ORA-04091: "table schema.table is mutating, trigger/function may not see it." This error means that a trigger (or a user-defined PL/SQL function referenced in the statement) attempted to look at (or modify) a table that was in the middle of being modified by the statement that fired it.
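To see the restriction in action, here is a minimal, purely illustrative sketch (the table, columns, and trigger name are assumptions) of the kind of row-level update trigger that runs into ORA-04091: as soon as the trigger's logic reads the very table being updated, the statement fails.

create or replace trigger xxx_upd_trg
before update on xxx
for each row
declare
  v_total number;
begin
  /* This query reads the table that is currently being modified... */
  select sum(c4) into v_total from xxx;
  :new.c6 := nvl(:new.c4,0) + v_total;
end;
/

/* ...so any UPDATE against xxx now fails with:
   ORA-04091: table <owner>.XXX is mutating, trigger/function may not see it */
update xxx set c4 = c4 + 1;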

Of course, the performance penalty for running SQL Loader with direct = false is somewhat noticeable. But, there really is no other way around this.

Scenario 4

If you need more complex data transformations, are doing upserts (i.e., updates that insert when a record is not found), and you're using Oracle 8i, then your best choice is to use Pro-C programs. Of course, you'll have to write the C code with embedded SQL to implement your data transformations. In many respects, this scenario is very much like both Scenarios 1 and 3. In fact, you can use almost the exact same shell scripts, with the only modification being to change the lines calling SQL Loader to instead call your Pro-C program. If that program is written to open files passed in as arguments, then the call would be of the form:

program_name file_name

But if the Pro-C program is instead just written to read from stdio, then the form would be:

cat file_name | program_name

Now, you might ask, why Pro-C instead of just using PL/SQL? That's a fair question. PL/SQL is a great language for doing database internal programming; plus now, with its supplied UTL_FILE package, PL/SQL can also operate on flat files. And, SQL Plus is a lightweight, command-line program that we could easily embed within our UNIX shell scripts (as we did in the multi-file case of Scenario 2) to execute our PL/SQL code. So again, why not just use PL/SQL?

Remember that my goal is to show you the most optimally efficient implementation, which is Pro-C. But for many shops, the answer will be to go with PL/SQL. The rationale is often that developers are more comfortable with PL/SQL (at least more so than with Pro-C and its associated makefiles). Plus, some UNIX vendors no longer provide a free C compiler (and Oracle currently only supports the GNU-C compiler on Linux). If that describes your shop, then by all means stick with PL/SQL.
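For completeness, here is a hedged sketch of what that PL/SQL alternative might look like, reading the flat file with UTL_FILE and applying the same lookup-based adjustment as the trigger example. The directory path, file name, record layout, and the little field-parsing helper are all assumptions for illustration; on 8i, the directory must also be listed in the UTL_FILE_DIR initialization parameter.

declare
  fid   utl_file.file_type;
  rec   varchar2(4000);
  v_c1  number; v_c2 number; v_c3 number;
  v_c4  number; v_c5 number; v_av number;

  /* return the n-th comma-separated field of a record as a number */
  function nth (p_rec varchar2, p_n pls_integer) return number is
    l_rec varchar2(4001) := p_rec || ',';
    l_pos pls_integer := 1;
    l_end pls_integer;
  begin
    for i in 2 .. p_n loop
      l_pos := instr(l_rec, ',', l_pos) + 1;
    end loop;
    l_end := instr(l_rec, ',', l_pos);
    return to_number(substr(l_rec, l_pos, l_end - l_pos));
  end;
begin
  fid := utl_file.fopen('/home/oracle/input_files', 'xxx.inp', 'r');
  loop
    begin
      utl_file.get_line(fid, rec);
    exception
      when no_data_found then exit;   -- end of file reached
    end;
    v_c1 := nth(rec, 1);  v_c2 := nth(rec, 2);  v_c3 := nth(rec, 3);
    v_c4 := nth(rec, 4);  v_c5 := nth(rec, 5);
    /* Obtain adjustment via lookup */
    begin
      select adj_value into v_av
        from lookup_table
       where adj_lookup = v_c2;
    exception
      when no_data_found then v_av := 0;
    end;
    insert into xxx (c1, c2, c3, c4, c5, c6)
    values (v_c1, v_c2, v_c3, v_c4, v_c5,
            nvl(v_c4,0) + nvl(v_c5,0) - v_av);
  end loop;
  utl_file.fclose(fid);
  commit;
end;
/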

PL/SQL lacks one key programming construct that Pro-C provides: Dynamic SQL Method #2, prepare and execute. This programming technique can shave about 15-20% off data loading program runtimes, so in many cases, it's worth the extra cost. How does this technique work? Remember, every time Oracle processes a command, it must parse, bind, execute, and fetch. With Dynamic SQL Method #2, you prepare the statement once, outside of your loop processing, and then execute it repeatedly inside the loop without Oracle having to re-parse or re-bind it.

As we did in Scenario 3, let's once again calculate a column as an expression based on other columns from that same row and then apply adjustments to the calculated value based on some lookup table. So, your Pro-C program would look like:

int main(int argc, char *argv[]) {
  ...

  strcpy(sql_command,"UPDATE xxx \
    SET c4 = nvl(c4,0) + nvl(:h_c4:i_c4,0), \
        C5 = nvl(c5,0) + nvl(:h_c5:i_c5,0), \
        C6 = nvl(c4,0) + nvl(:h_c4:i_c4,0) + \
             nvl(c5,0) + nvl(:h_c5:i_c5,0) ? :h_av:i_av \
    WHERE c1 = :h_c1:i_c1 \
      AND c2 = :h_c2:i_c2 \
      AND c3 = :h_c3:i_c3;"
  EXEC SQL PREPARE update_command FROM :sql_command;

  strcpy(sql_command,"INSERT INTO xxx VALUES \
    (:h_c1:i_c1,:h_c2:i_c2,:h_c3:i_c3,
     :h_c4:i_c4,:h_c5:i_c5, \
     nvl(:h_c4:i_c4,0) + nvl(:h_c5:i_c5,0) - :h_av:i_av)");
  EXEC SQL PREPARE insert_command FROM :sql_command;

  /* Process data file records */
  while (fgets (rec, sizeof rec, fid) != NULL) {
    ...

    /* Obtain adjustment via lookup */
    EXEC SQL SELECT adj_value
               INTO :h_av:i_av
               FROM lookup_table
              WHERE adj_lookup = :h_c2:i_c2;
    if (sqlca.sqlcode == 1403) {
      h_av = 0;
      i_av = 0;
    }
    else if (sqlca.sqlcode != 0) {
      ...
    }

    /* First - try to update existing record */
    /* one USING host variable per placeholder occurrence, in order */
    EXEC SQL EXECUTE update_command
        USING :h_c4:i_c4, :h_c5:i_c5,
              :h_c4:i_c4, :h_c5:i_c5, :h_av:i_av,
              :h_c1:i_c1, :h_c2:i_c2, :h_c3:i_c3;

    /* Second - if update fails because record
                not found, then insert record*/
    if (sqlca.sqlcode == 1403) {
      EXEC SQL EXECUTE insert_command
        USING :h_c1:i_c1, :h_c2:i_c2, :h_c3:i_c3,
              :h_c4:i_c4, :h_c5:i_c5,
              :h_c4:i_c4, :h_c5:i_c5, :h_av:i_av;
    }
    else if (sqlca.sqlcode != 0) {
      ...
    }

    ...
  }

  ...
}

Scenario 5

If you need more complex data transformations, you're possibly doing upserts (i.e., updates that insert when a record is not found), and you're using Oracle 9i, then your best choice is to use 9i's new external table mechanism and table functions, plus the new MERGE command if you're doing upserts. But let me state that while these new features represent the most efficient ways to load data, they are far from the most obvious and easy ways to go. To effectively utilize these features, you as the DBA should assume a leadership and mentoring role for the developers. In short, these features leverage the database as the ETL engine, and as such, they begin to blur the distinction between ETL developer and DBA.

If you're doing upserts, 9i's new MERGE command is simply a new DML command that encapsulates an UPDATE and an INSERT into a single statement processed by a single call to the database. Your developers will love this new syntax, as it's exactly what they've been coding as separate, related DML commands with intelligent error handling. Now it is a single command and a single network request sent to the database server. So you get the best of both worlds: it is easier to code and runs faster, too.
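For contrast, here is a hedged sketch of the pre-9i idiom that MERGE replaces: try the UPDATE first, then fall back to an INSERT when no row was touched. The column names follow this chapter's fact table examples, and the sample values are purely illustrative.

declare
  /* illustrative key and measure values only */
  v_c1 number := 1;  v_c2 number := 2;  v_c3 number := 3;
  v_c4 number := 10; v_c5 number := 20; v_av number := 0;
begin
  update fact
     set c4 = nvl(c4,0) + nvl(v_c4,0),
         c5 = nvl(c5,0) + nvl(v_c5,0),
         c6 = nvl(c4,0) + nvl(v_c4,0) + nvl(c5,0) + nvl(v_c5,0) - v_av
   where c1 = v_c1 and c2 = v_c2 and c3 = v_c3;
  if sql%rowcount = 0 then
    /* no matching row was found, so insert instead */
    insert into fact (c1, c2, c3, c4, c5, c6)
    values (v_c1, v_c2, v_c3, v_c4, v_c5,
            nvl(v_c4,0) + nvl(v_c5,0) - v_av);
  end if;
  commit;
end;
/

The MERGE statement at the end of this scenario collapses that update-then-insert logic into one command.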

As for table functions, I like Oracle's definition: "A table function is defined as a function written in PL/SQL, Java, or C that can produce a set of rows as output and can take a set of rows as input." In essence, table functions sit between your source external table and final target table as ETL parallel piping mechanisms. Thus, Oracle now supports, via SQL, all the advanced parallel and pipelined ETL capabilities previously only available via shell scripts, Pro-C programs, and Oracle utilities.

Implementing Scenario 5 very closely resembles Scenario 2, but with two very minor exceptions. First, we can use either the INSERT or MERGE command. Second, we'll select our input data from a table function that is written against our external table. That's it. Let's dig deeper.

As before, there are two cases you may have to handle: many small, identical input files or one large input file. I'm only going to show the large file case. The reader should be able to very easily combine the example below with the one from Scenario 2 to produce the case for many small files.

Let's assume that once again we're on a UNIX server, have one large input file whose name is yyy.inp, and that we want to execute 10 concurrent loading processes into our regular database table from a single external table (i.e., the input file). And as with both Scenarios 3 and 4, we will calculate a column as an expression based on other columns from that same row and then apply adjustments to the calculated value based on some lookup table. Plus, like Scenario 4, we'll perform upserts. Whew. The Oracle 9i SQL code to accomplish this would be:

create directory inp_dir as '/home/oracle/input_files';

create table yyy (c1 number, c2 number, c3 number,
                  c4 number, c5 number, c6 number)
organization external (
    type oracle_loader
    default directory inp_dir
    access parameters (
        fields terminated by ','
    )
    location ('yyy.inp')
)
parallel 10;

create or replace type trx_obj is object (
  c1 number,
  c2 number,
  c3 number,
  c4 number,
  c5 number,
  c6 number,
  av integer
);
/
create or replace type trx_tab is table of trx_obj;
/

create or replace package trx
as
  /* ref cursor type describing rows coming from the external table */
  type yyy_cur is ref cursor return yyy%rowtype;

  /* pipelined, parallel-enabled transformation function returning
     rows of the schema-level collection type trx_tab */
  function go (p yyy_cur)
    return trx_tab
    PIPELINED
    PARALLEL_ENABLE(PARTITION p BY ANY);
end;
/

create or replace package body trx
as
  function go (p yyy_cur)
    return trx_tab
    PIPELINED
    PARALLEL_ENABLE(PARTITION p BY ANY)
  is
    inp_rec yyy%rowtype;
    out_rec trx_obj := trx_obj(null,null,null,null,null,null,null);
  begin
    loop
      fetch p into inp_rec;
      exit when p%notfound;
      out_rec.c1 := inp_rec.c1;
      out_rec.c2 := inp_rec.c2;
      out_rec.c3 := inp_rec.c3;
      out_rec.c4 := inp_rec.c4;
      out_rec.c5 := inp_rec.c5;
      out_rec.c6 := inp_rec.c6;
      /* Obtain adjustment via lookup */
      begin
        select  adj_value
          into  out_rec.av
          from  lookup_table
          where adj_lookup = inp_rec.c2;
      exception
        when others then
          out_rec.av := 0;
      end;
      pipe row(out_rec);
    end loop;
    return;
  end;
end;
/

alter session enable parallel dml;
merge /*+ parallel(fact,10) append */
    into fact f
    using TABLE(trx.go(
        CURSOR(select /*+ parallel(yyy,10) full(yyy) */ *
               from yyy ))) y
    on (f.c1 = y.c1 and
        f.c2 = y.c2 and
        f.c3 = y.c3)
    when matched then
        update set
           f.c4 = nvl(f.c4,0) + nvl(y.c4,0),
           f.c5 = nvl(f.c5,0) + nvl(y.c5,0),
           f.c6 = nvl(f.c4,0) + nvl(y.c4,0) +
                  nvl(f.c5,0) + nvl(y.c5,0) - y.av
    when not matched then
        insert values (y.c1, y.c2, y.c3,
                       y.c4, y.c5,
                       nvl(y.c4,0) + nvl(y.c5,0) - y.av);
commit;

We've now reached the point where the Oracle implementation is arguably as complex as the most involved shell scripting solution of any of the other scenarios. That's why the data warehouse DBA should be involved with ETL efforts when adopting this technique. Using the Oracle server as your ETL transformation engine like this is by far the fastest implementation choice there is. So, it's worth the extra effort to master this method.