Every big data engineer has to work with Apache Sqoop at some point, since it is one of the bridges for data connectivity between an RDBMS and the Hadoop environment. A number of issues can produce Sqoop errors during import or export. Based on my working experience, I am listing a few Sqoop errors below along with their possible root causes and corresponding resolutions.
OutOfMemoryError
The error might look like this:
    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.lang.reflect.Array.newArray(Native Method)
        at java.lang.reflect.Array.newInstance(Array.java:70)
        at oracle.jdbc.driver.BufferCache.get(BufferCache.java:226)
        at oracle.jdbc.driver.PhysicalConnection.getCharBuffer(PhysicalConnection.java:7600)
        at oracle.jdbc.driver.OracleStatement.prepareAccessors(OracleStatement.java:991)
        at oracle.jdbc.driver.T4CTTIdcb.receiveCommon(T4CTTIdcb.java:273)
        at oracle.jdbc.driver.T4CTTIdcb.receive(T4CTTIdcb.java:144)
        at oracle.jdbc.driver.T4C8Oall.readDCB(T4C8Oall.java:806)
        at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:355)
        at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:191)
        at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:523)
        at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:207)
        at oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:863)
        at oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:1153)
        at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1275)
        at oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:3576)
        at oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:3620)
        at oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:1491)
        at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:777)
        at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:786)
        at org.apache.sqoop.manager.SqlManager.getColumnInfoForRawQuery(SqlManager.java:289)
        at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:260)
        at org.apache.sqoop.manager.SqlManager.getColumnTypesForQuery(SqlManager.java:253)
        at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:336)
        at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1858)
        at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1658)
        at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:106)
        at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:494)
        at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:621)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
To resolve this exception, you might have to tweak the following settings:
- mapreduce.map.memory.mb: Upper limit of the memory allocated by YARN to the mapper container
- mapreduce.map.java.opts: Maximum heap size for the JVM process running inside the mapper container. This memory is a subset of the memory allocated to the mapper (defined by mapreduce.map.memory.mb), so it should be a little less than that overall upper limit. Tweak both of these settings together, increasing them gradually until the issue is resolved.
- --fetch-size: On smaller clusters there is generally much less liberty to give mappers more memory; in that case the best bet is to control the fetch size from the RDBMS and lower it gradually in decrements of roughly 10%. If you have a wide table (200 or more columns) or very large column values, it is difficult to fit a huge number of records in memory as part of a single batch. Reducing the fetch-size value reduces the batch size, which means fewer records are sent to Sqoop by the RDBMS in one batch. You need to strike a balance between resource consumption and performance, which takes a little trial with these values (as well as the number of mappers); a combined example is sketched after this list.
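As a rough, hedged sketch only: the connection string, table name, and the specific numbers below are placeholders rather than recommendations, and should be tuned gradually for your own cluster and table width.

    # Hypothetical values for illustration only: raise the container and JVM limits together
    # (keeping the heap roughly 10% below the container limit) and lower the JDBC fetch size.
    sqoop import \
      -D mapreduce.map.memory.mb=4096 \
      -D mapreduce.map.java.opts=-Xmx3686m \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username myuser -P \
      --table WIDE_TABLE \
      --fetch-size 500 \
      --num-mappers 4 \
      --target-dir /data/wide_table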
Unknown Format Error
    ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.UnknownFormatException: Unknown format for serde:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    org.kitesdk.data.UnknownFormatException: Unknown format for serde:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
        at org.kitesdk.data.spi.hive.HiveUtils.descriptorForTable(HiveUtils.java:111)
        at org.kitesdk.data.spi.hive.HiveAbstractMetadataProvider.load(HiveAbstractMetadataProvider.java:104)
        at org.kitesdk.data.spi.filesystem.FileSystemDatasetRepository.load(FileSystemDatasetRepository.java:197)
        at org.kitesdk.data.Datasets.load(Datasets.java:108)
        at org.kitesdk.data.Datasets.load(Datasets.java:165)
        at org.kitesdk.data.Datasets.load(Datasets.java:187)
        at org.apache.sqoop.mapreduce.ParquetJob.configureImportJob(ParquetJob.java:123)
        at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:130)
        at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:267)
        at org.apache.sqoop.manager.SqlManager.importQuery(SqlManager.java:748)
        at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:515)
        at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:621)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
        at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
This error occurs when you try to load data in one format (Avro/Parquet) into a table that has already been created with a different format. The simplest scenario: you first import the data as plain text without explicitly specifying a file format (--as-parquetfile or --as-avrodatafile), and the next time you try to impose a new format on the same table, Sqoop complains about the (unknown) format of the already created table. The best way out is to drop the table before executing the Sqoop import query, as sketched below; this should resolve such errors.
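A hedged sketch of that sequence, with the database, table, and connection details as placeholders:

    # Hypothetical names throughout: drop the previously created (text-format) Hive table,
    # then re-import with the desired Parquet format so the formats no longer conflict.
    hive -e "DROP TABLE IF EXISTS mydb.customers"

    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username myuser -P \
      --table CUSTOMERS \
      --hive-import \
      --hive-database mydb \
      --hive-table customers \
      --as-parquetfile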
Hive Doesn’t support SQL Type
    ERROR tool.ImportTool: Import failed: java.io.IOException: Hive does not support the SQL type for column BLOB_COLUMN
        at org.apache.sqoop.hive.TableDefWriter.getCreateTableStmt(TableDefWriter.java:181)
        at org.apache.sqoop.hive.HiveImport.importTable(HiveImport.java:189)
        at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:530)
        at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:621)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
        at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
This error appears after the data has been fetched into HDFS, when, as part of the --hive-import step, the data and its columns are mapped to the Hive table but Hive cannot determine the correct type for BLOB columns coming from the RDBMS. The correct way to resolve this error is to explicitly map the incoming BLOB column to the Hive Binary data type using --map-column-hive (for the error above, --map-column-hive BLOB_COLUMN=Binary). For more details on the correct way to do this, refer to my earlier blog here.
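A hedged sketch, with the connection string, table, and column names as placeholders:

    # Hypothetical example: declare the BLOB column as Hive's binary type so the
    # CREATE TABLE statement generated by --hive-import succeeds.
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username myuser -P \
      --table DOCUMENTS \
      --hive-import \
      --hive-table documents \
      --map-column-hive BLOB_COLUMN=Binary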
Cannot convert SQL Type
ERROR tool.ImportTool: Import failed: Cannot convert SQL type 2005
This Sqoop error comes up when a file format has been specified without correctly casting the source data type to its corresponding type in the target system (e.g. Hive). SQL type 2005 is java.sql.Types.CLOB, so the simplest example is a CLOB column in the incoming data that is to be stored in Parquet format. Always remember to cast BLOBs and CLOBs to their corresponding types before using any file format: BLOBs have to be mapped to the Binary data type via --map-column-hive, and CLOBs have to be mapped to the String data type via --map-column-java. For more information, refer to my previous blog here.
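A hedged sketch for the CLOB case, with the connection string, table, and column names as placeholders:

    # Hypothetical example: cast the CLOB column to a Java String so it can be
    # written out in Parquet format without the SQL type 2005 conversion error.
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username myuser -P \
      --table ARTICLES \
      --map-column-java CLOB_COLUMN=String \
      --as-parquetfile \
      --target-dir /data/articles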
No ResultSet method for Java type
    ERROR orm.ClassWriter: No ResultSet method for Java type Binary
    ERROR tool.ImportTool: Import failed: No ResultSet method for Java type Binary
This is again an issue of incorrectly mapping source columns to their corresponding data types in the target system. A BLOB column mapped to the Java type Binary instead of the Hive binary type throws this error; use --map-column-hive BLOB_COLUMN=Binary to correctly import BLOB data types.
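As a minimal contrast (the column name is a placeholder and the rest of the import command is elided), the difference is which side of the mapping the flag declares:

    # Fails during Sqoop code generation, since Binary is not a Java type its
    # generated ResultSet accessors know about:
    #   --map-column-java BLOB_COLUMN=Binary
    # Works, because the type is declared on the Hive side instead:
    #   --map-column-hive BLOB_COLUMN=Binary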
I hope this helps you save some time while troubleshooting. Happy Sqooping!!!