Import BLOB and CLOB columns from Oracle into HDFS using Sqoop


Importing data into HDFS (Hadoop Distributed File System) from the various supported RDBMSs (relational database management systems) using Sqoop was one of the first ingestion patterns the tech community adopted, and it remains a critical tool in data ingestion pipelines today. Sqoop (SQL <=> Hadoop) is an extremely stable and reliable tool for data import as well as export, and almost every organization in the big data space uses it in one way or another. As with any technology, there are complex use cases with Sqoop ingestion for which guidance is scarce. On top of that, as the versions of the involved tech stack change, backward compatibility cannot always be maintained, which means previously working code might break after an upgrade. One such issue I faced was importing BLOB and CLOB columns from Oracle into HDFS using Sqoop.

Let me discuss in detail the options I tried, the errors they reported, and the final solution that let me successfully import BLOB and CLOB columns from Oracle into HDFS using Sqoop. Assume the table schema in the Oracle database is:

CREATE TABLE TEST (
  ID NUMBER(10,0) NOT NULL ENABLE,
  CREATED_DATE DATE NOT NULL ENABLE,
  BLOB_COLUMN BLOB NOT NULL ENABLE,
  CLOB_COLUMN CLOB NOT NULL ENABLE
);


If you want to check for yourself what the corresponding Hive data types for the Oracle columns are, you can let Sqoop create the table in Hive for you. The first option I tried was explicitly mapping the BLOB and CLOB columns to the Java String type and letting Java handle the conversion; the sqoop query in this case looks like:

sqoop import \
--connect <connection string for Oracle database> \
--username <user id having access to the Oracle schema> --password <password> \
--map-column-java BLOB_COLUMN=String,CLOB_COLUMN=String \
--delete-target-dir --target-dir /tmp/test \
-m 5 --split-by ID \
--query 'select ID, CREATED_DATE, BLOB_COLUMN, CLOB_COLUMN from TEST WHERE $CONDITIONS'


Strangely, and contrary to certain blog posts, it did not work; the job failed with an error stating that the getString method is not implemented for the BLOB accessor class in the Oracle JDBC driver:

Error: java.io.IOException: SQLException in nextKeyValue
        at org.apache.sqoop.mapreduce.db.DBRecordReader.nextKeyValue(DBRecordReader.java:277)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:556)
        at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
        at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.sql.SQLException: Invalid column type: getString not implemented for class oracle.jdbc.driver.T4CBlobAccessor
        at oracle.jdbc.driver.Accessor.unimpl(Accessor.java:412)
        at oracle.jdbc.driver.BlobAccessor.getString(BlobAccessor.java:335)
        at oracle.jdbc.driver.OracleResultSetImpl.getString(OracleResultSetImpl.java:1283)
        at org.apache.sqoop.lib.JdbcWritableBridge.readString(JdbcWritableBridge.java:71)
        at com.cloudera.sqoop.lib.JdbcWritableBridge.readString(JdbcWritableBridge.java:61)
        at QueryResult.readFields(QueryResult.java:186)
        at org.apache.sqoop.mapreduce.db.DBRecordReader.nextKeyValue(DBRecordReader.java:244)
        ... 12 more


Although Hive itself is written in Java, I thought of giving Hive's String type a try as well by changing --map-column-java to --map-column-hive. Also, as you might be aware, if you use any of the arguments below in the sqoop query, then you must also specify --hive-import along with a handful of other options:

--hive-home
--hive-overwrite
--create-hive-table
--hive-table
--hive-partition-key
--hive-partition-value
--map-column-hive


The sqoop import query looks like:

sqoop import \
--connect <connection string for Oracle database> \
--username <user id having access to the Oracle schema> --password <password> \
--map-column-hive BLOB_COLUMN=String,CLOB_COLUMN=String \
--create-hive-table --hive-import --hive-database default --hive-table test --hive-overwrite \
--delete-target-dir --target-dir /tmp/test \
-m 5 --split-by ID \
--query 'select ID, CREATED_DATE, BLOB_COLUMN, CLOB_COLUMN from TEST WHERE $CONDITIONS'
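Once this import completes, you can check for yourself which data types Hive assigned to the auto-created table. A minimal HiveQL check, assuming the default database and the test table name used in the command above:

-- Inspect the column types Hive assigned during --hive-import
DESCRIBE default.test;

-- Or look at the full generated DDL
SHOW CREATE TABLE default.test;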


And without any doubt, it worked like a charm, which means Hive handled the BLOB and CLOB data types. So our issue is resolved!!! Great. But wait a minute: do you see any problem with this approach? There are in fact two:

  1. We have imported a binary-formatted column as a String in Hive, which is not a correct representation of the data.
  2. We have imported the CLOB as plain text, which might contain the field and line delimiters/terminators as part of the value itself.

So we need to fix these issues. One way is to import the data as-is and overlay a schema on top of it. In this case, the sqoop import is as simple as:

sqoop import \
--connect <connection string for Oracle database> \
--username <user id having access to the Oracle schema> --password <password> \
--delete-target-dir --target-dir /tmp/test \
-m 5 --split-by ID \
--query 'select ID, CREATED_DATE, BLOB_COLUMN, CLOB_COLUMN from TEST WHERE $CONDITIONS'


If you are worried about the auto-mapped data types of the BLOB and CLOB columns, you can refer to the Java file generated by the sqoop import, which contains:

setters.put("BLOB_COLUMN", new FieldSetterCommand() {
      @Override
      public void setField(Object value) {
        BLOB_COLUMN= (com.cloudera.sqoop.lib.BlobRef)value;
      }
    });
setters.put("CLOB_COLUMN", new FieldSetterCommand() {
      @Override
      public void setField(Object value) {
        CLOB_COLUMN= (com.cloudera.sqoop.lib.ClobRef)value;
      }
    });

You will see com.cloudera.sqoop.lib.* on CDH platforms; otherwise you might see org.apache.sqoop.lib.*.

You can then map com.cloudera.sqoop.lib.BlobRef and com.cloudera.sqoop.lib.ClobRef to the Hive Binary type in the table schema.
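For illustration, here is a minimal sketch of such an overlay table in Hive. The table name test_overlay is hypothetical, and the sketch assumes Sqoop's default text output with comma field delimiters and the /tmp/test target directory used above; adjust the delimiters and types if your import settings differ.

-- Hypothetical external table overlaid on the plain-text import in /tmp/test
CREATE EXTERNAL TABLE test_overlay (
  id           BIGINT,
  created_date STRING,
  blob_column  BINARY,
  clob_column  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/tmp/test';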

The simpler and better option is to import the data in a format like Parquet, which avoids data corruption due to field/record delimiters appearing within the values, and to explicitly define the data types of these LOB columns. The sqoop query to be used is:

sqoop import \
--connect <connection string for Oracle database> \
--username <user id having access to the Oracle schema> --password <password> \
--map-column-hive BLOB_COLUMN=Binary --map-column-java CLOB_COLUMN=String --as-parquetfile \
--create-hive-table --hive-import --hive-database default --hive-table test --hive-overwrite \
--delete-target-dir --target-dir /tmp/test \
-m 5 --split-by ID \
--query 'select ID, CREATED_DATE, BLOB_COLUMN, CLOB_COLUMN from TEST WHERE $CONDITIONS'


If you look carefully, the BLOB column has been mapped to the Hive Binary data type and the CLOB column has been mapped to the Java String type.
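As a final sanity check, you can verify the resulting schema and data from Hive. A quick check, assuming the default database and the test table name used in the command above:

-- Verify that BLOB_COLUMN now shows as binary and CLOB_COLUMN as string
DESCRIBE default.test;

-- Spot-check that the LOB values survived the import
SELECT id, LENGTH(clob_column) AS clob_length FROM default.test LIMIT 5;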

You might run into several other types of errors during the sqoop import process; I have highlighted a couple of them in my next blog.

I hope this gives you some pointers for resolving issues related to importing BLOB or CLOB columns from an RDBMS. Do let me know if you found it useful or have any concerns.
