Pre-parsers

Notes on pre-parsing data for Oracle data loads

Sometimes data simply isn't in a form that is easy to load into an Oracle database, i.e. column form. It would be great if everybody exchanged data in a simple CSV form with a single file-to-table mapping. Sadly that isn't the case, and sometimes you have to do a little work to get it into a form that's usable. A recent benchmark highlighted this issue very well. The customer provided the data in compressed CSV form (so far so good), but the data was held in key-value pairs (not so good). They also provided us with a mapping file that describes how it all fits together.

Typically the approach many people would take is to develop some form of program that parses all of the data, writes it to a staging area and then loads all of it in one go into the target database. I make no criticism of this approach since it works well, as long as it's not time critical; it's by far the simplest method. However, I'm a big fan of taking advantage of what's already available, and one of the most underused and powerful features of the Oracle database is the pre-parser. It enables you to pipeline various operations so they all run as quickly as possible. So, going back to my benchmark, we used this approach to load data into our target database. It consisted of four steps:
  • Read the data off the filesystem as efficiently as possible and write it to stdout
  • Read from stdin, unzip the data and write it to stdout
  • Read from stdin into a Java program that does the key-value mapping and error detection/correction, writing the output to stdout
  • Read from stdin into SQL*Loader
I will point out at this stage that I'm not really using Oracle's pre-parser yet; I'm just using good old "pipes", but why this is important will become clearer later. This approach gave us a great deal of flexibility and simplified the code we had to write. It operates in some respects like a serialised map-reduce flow, but I'll come back to that another day and explain how it can be integrated directly into a massively parallel approach. It's also possible to get Java to read the zipped file natively; having said that, I offloaded that step to the OS so I could switch compression formats when needed.
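If you did want Java to handle the decompression itself, wrapping System.in in a GZIPInputStream is all it takes. The sketch below isn't the benchmark code; it assumes the incoming stream is gzip compressed and uses a made-up class name:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

// Minimal sketch: let the JVM do the gunzip step instead of the OS pipeline.
// Assumes the data arriving on stdin is gzip compressed.
public class GzipStdInReader {
    public static void main(String[] args) throws IOException {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(System.in)), 1024 * 1024)) {
            String line;
            while ((line = br.readLine()) != null) {
                // hand each decompressed line on to the key-value parsing logic
                System.out.println(line);
            }
        }
    }
}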

The Java program simply reads from stdin and writes to stdout. To handle the key-value pairs the program just has to read the mapping file in and then split and parse the values from stdin. The data is then written to stdout in a well-known order (a sketch of the mapping-file handling follows the extract below).

Java extract from my program... 


            // Read stdin in large buffered chunks; each line is one key=value record.
            BufferedReader br = new BufferedReader(new InputStreamReader(System.in), OneMB);
            String line = null;
            HashMap<String, String> keyValuePairs = null;
            MyTokenizer mt = null;
            while ((line = br.readLine()) != null) {
                keyValuePairs = new HashMap<String, String>(200);
                mt = new MyTokenizer(line, delimitor);
                // Split each token on the first "=" into a key and a value.
                for (String token : mt) {
                    int loc = token.indexOf("=");
                    if (loc != -1) {
                        String i = token.substring(0, loc);
                        String s = token.substring(loc + 1, token.length());
                        keyValuePairs.put(i, s);
                    }
                }
                // Write the values out in the well-known column order.
                StringBuffer outRec = new StringBuffer(1000);
                outRec.append(checkForNull(keyValuePairs.get("uniqueID"), "")).append(seperator);
                // mapping logic similar to above repeats
                System.out.println(outRec.toString());
            }

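The extract above hard-codes the column order; the mapping file is what actually defines it. The original format of that mapping file isn't shown here, so the following is only a minimal sketch that assumes one key name per line, listed in target-column order, and uses hypothetical names (ColumnMapping, toRecord):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;

// Sketch only: the real mapping-file format wasn't shown above, so assume one
// key name per line, listed in the order the target columns expect.
public class ColumnMapping {

    // e.g. mapping.txt contains: uniqueID, firstName, lastName, ... one per line
    static List<String> load(String mappingFile) throws IOException {
        return Files.readAllLines(Paths.get(mappingFile));
    }

    // Emit the parsed key-value pairs in the well-known column order,
    // substituting an empty string when a key is missing from the record.
    static String toRecord(Map<String, String> kv, List<String> columnOrder, String separator) {
        StringBuilder out = new StringBuilder(1000);
        for (String key : columnOrder) {
            String value = kv.get(key);
            out.append(value == null ? "" : value).append(separator);
        }
        return out.toString();
    }
}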
All that was needed for SQL*Loader to process the output was a control file that understood the order of the columns and any additional formatting.

One of the additional benefits is that we can load the data via direct path and implement other features such as a multi-table insert. The Java pre-parser enables you to add all of the additional formatting needed to make this a trivial process.
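As an illustration, the kind of formatting I mean can be as simple as emitting a discriminator column that a later multi-table insert can route on. The sketch below is hypothetical; the key names and labels aren't from the benchmark code:

import java.util.Map;

// Sketch only: derive a record-type discriminator during pre-parsing so a
// later multi-table insert can route each row. Key names are hypothetical.
public class RecordTyper {
    static String recordType(Map<String, String> kv) {
        if (kv.containsKey("accountID")) {
            return "ACCOUNT";
        } else if (kv.containsKey("customerID")) {
            return "CUSTOMER";
        }
        return "UNKNOWN";
    }
}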

The following diagram illustrates the process.

This equates to a Unix/Linux command such as:

/bin/dd if=myverybigfile.txt bs=1024k status=noxfer 2>/dev/null | /bin/gunzip -c | java -classpath /home/oracle/loader.jar com.dom.KeyValueParserStdIn | sqlldr bench/bench control=kv.ctl data=\"-\" direct=TRUE

NOTE: one thing you may have noticed is that I'm using dd to do 1MB I/Os. This is just an efficiency measure and works well on structures such as DBFS; you could skip this part of the operation if needed.

Which brings us on to external tables and the pre-parser.


External Tables and pre-parsers 

As I mentioned earlier, I like to take advantage of functionality that's already available, and one of those features in the Oracle database is external tables. I don't intend to go into much detail as to why you should use external tables, other than to say that they do much of the heavy lifting for you and provide a seamless interface between the filesystem and the database. They effectively make files look like tables.

This means it's trivial to implement parallelism for our pre-parser. We don't need to worry about how to handle the files or how to schedule everything; external tables take care of all of that for you. In our benchmark we used them in the following way.




Our previous pipeline remains the same except that we no longer need SQL*Loader; it's all managed by the table definition itself. So we end up with something similar to the following table definition (I've abbreviated it quite substantially and highlighted the important bits):

create table staging_ext_mydata_jan01
(    uniqueid NUMBER,
..
-- Lots of columns
..
 )
ORGANIZATION EXTERNAL
(
   TYPE oracle_loader
   DEFAULT DIRECTORY load_dir_jan01
   ACCESS PARAMETERS
   (
      RECORDS DELIMITED BY NEWLINE
      PREPROCESSOR exec_dir:'external_tab.sh'
      BADFILE log_dir: 'external.bad'
      LOGFILE log_dir: 'external.log'
      FIELDS TERMINATED BY '|'
      OPTIONALLY ENCLOSED BY '"' AND '"'
   ( uniqueid char(100),
..
-- Lots of definitions
..
)
)
   LOCATION ('data1.txt', 'data2.txt', ... , 'data100.txt')
)
REJECT LIMIT UNLIMITED;

One of the things to note is that I've wrapped the pipelined pre-processor in a shell script, which looks like this:

#!/bin/bash
/bin/dd if=$1 bs=1024k status=noxfer 2>/dev/null | /bin/gunzip -c | java -classpath /home/oracle/loader.jar com.dom.KeyValueParserStdIn

The important part of this script is the parameter ($1) that is passed to the shell script. This is the file name that the external table wants to process. 

A simple select statement against my "staging_ext_mydata_jan01" table unzips and parses the data, converting it into usable columns. What's more, if I issue the statement in parallel, Oracle takes care of creating the processes for me and making sure everything is scheduled in an orderly fashion.

To finish the load we simply used a multi-table insert to put the data into the correct tables in an efficient fashion. Using this approach we were able to read zipped files, parse them and insert them into our three target tables at over 1.5 million source records/sec.
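For completeness, the final step might look something like the following from JDBC. This is only a sketch: the connection string, target tables, columns and the rec_type discriminator are placeholders (the real statement handled far more columns), and it assumes an Oracle JDBC driver is on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch only: selecting from the external table drives the preprocessor
// pipeline, and the conditional INSERT ALL routes each parsed row to the
// right target table. All names and connection details are placeholders.
public class MultiTableLoad {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/orcl", "bench", "bench");
             Statement stmt = con.createStatement()) {

            // allow the insert itself to run in parallel
            stmt.execute("ALTER SESSION ENABLE PARALLEL DML");

            stmt.executeUpdate(
                "INSERT /*+ APPEND PARALLEL */ ALL " +
                "  WHEN rec_type = 'ACCOUNT'  THEN INTO accounts  (uniqueid) VALUES (uniqueid) " +
                "  WHEN rec_type = 'CUSTOMER' THEN INTO customers (uniqueid) VALUES (uniqueid) " +
                "SELECT /*+ PARALLEL(s, 8) */ uniqueid, rec_type " +
                "FROM staging_ext_mydata_jan01 s");
        }
    }
}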