Thursday, January 5, 2012

Errors that occurred in Hadoop and its sub-projects

1. Oozie job failed:

Error message: ERROR is considered as FAILED for SLA

Cause 1: The job is not able to find the Hadoop namenode (master) or the jobtracker machine.
Suppose Oozie, the Hadoop namenode (master), and the jobtracker are running on one machine, while the datanode and tasktracker are running on another machine.

Your job.properties file contains the following lines:
        nameNode=hdfs://localhost:9000
        jobTracker=localhost:9001
   
In the above case, an FS action will work fine because no map-reduce operation is performed for an FS action. But if you run a map-reduce action, the tasks running on the tasktracker machine will look for the namenode on their own localhost, because job.properties points to localhost:9000.

Solution: Use the IP (or a resolvable hostname) of the namenode and jobtracker machines in the job.properties file instead of localhost.
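For example, if the namenode and jobtracker run on 192.168.1.10 (a hypothetical address; substitute your own), job.properties would become:
        nameNode=hdfs://192.168.1.10:9000
        jobTracker=192.168.1.10:9001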
   
Cause 2: Oozie is not able to find the MySQL server.
Suppose you are using MySQL as the metastore for Hive.
The Hive hive-default.xml file contains the following lines:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
       
Solution: Use the IP of the MySQL machine instead of localhost in the ConnectionURL.
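For example, if MySQL runs on 192.168.1.20 (again a hypothetical address), the property would become:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://192.168.1.20:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>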


2. ZooKeeper server not running:
Error message: Could not find my address: zk-serevr1 in list of ZooKeeper quorum servers
   
Causes :
HBase tries to start a ZK server on some machine but that machine isn't able to find itself in the hbase.zookeeper.quorum configuration. This is a name lookup problem. 

Solution:   
Use the hostname presented in the error message instead of the value you used (zk-server1). If you have a DNS server, you can set hbase.zookeeper.dns.interface and hbase.zookeeper.dns.nameserver in hbase-site.xml to make sure it resolves to the correct FQDN.
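A minimal hbase-site.xml sketch for these two settings (the interface name eth0 and the nameserver address are assumptions; use the values for your own network):
<property>
<name>hbase.zookeeper.dns.interface</name>
<value>eth0</value>
</property>
<property>
<name>hbase.zookeeper.dns.nameserver</name>
<value>192.168.1.1</value>
</property>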

3. Hadoop-datanode job failed or datanode not running: java.io.IOException: File ../mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
   
Cause 1: No datanode is running. Make sure at least one datanode is up; you can check as shown below.
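A quick way to check (run as the HDFS user; both commands are part of the standard Hadoop/JDK tooling of this era):
        # Lists live and dead datanodes as seen by the namenode
        hadoop dfsadmin -report
        # On the datanode machine, verify a DataNode process is listed
        jps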

Cause 2: The namespaceIDs of the master and slave machines are not the same.
If you see the error java.io.IOException: Incompatible namespaceIDs in the logs of a datanode, chances are you are affected by bug HADOOP-1212 (well, I've been affected by it at least).
           
Solution:
If the namespaceIDs of the master and slave machines are not the same, replace the namespaceID on the slave machine with the master's namespaceID (a sketch follows the list below):
- dfs/name/current/VERSION on the master contains the namenode's namespaceID
- dfs/data/current/VERSION on the slave contains the datanode's namespaceID
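A minimal sketch of the fix (the dfs/name and dfs/data locations depend on dfs.name.dir and dfs.data.dir in your configuration; the /app/hadoop/tmp prefix below is an assumption):
        # On the master, note the namespaceID
        grep namespaceID /app/hadoop/tmp/dfs/name/current/VERSION
        # On the slave: stop the datanode, edit this VERSION file so its
        # namespaceID matches the master's value, then restart the datanode
        grep namespaceID /app/hadoop/tmp/dfs/data/current/VERSION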
        
Cause 3: A datanode instance is running out of disk space.
Solution: Free up some space.

Cause 4: You may also get this message due to permissions. Perhaps the JobTracker cannot create jobtracker.info on startup.

4. Sqoop export command failed:
Error message:
attempt_201101151840_1006_m_000001_0, Status : FAILED
java.util.NoSuchElementException
at java.util.AbstractList$Itr.next(AbstractList.java:350)
at impressions_by_zip.__loadFromFields(impressions_by_zip.java:159)
at impressions_by_zip.parse(impressions_by_zip.java:108)

   
Cause: The given field separator is not valid.
Solution: Specify the correct field delimiter in the sqoop export command.
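A sketch of the relevant flag (the connection URL, credentials, table, and export directory are hypothetical; the delimiter must match how the data files were actually written):
        sqoop export --connect jdbc:mysql://192.168.1.20:3306/reports \
          --username dbuser \
          --table impressions_by_zip \
          --export-dir /user/hdfs/impressions_by_zip \
          --input-fields-terminated-by ','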

5. HBase regionserver not running:

Error message: 2012-01-02 13:48:49,973 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: Master rejected startup because clock is out of sync
org.apache.hadoop.hbase.ClockOutOfSyncException: org.apache.hadoop.hbase.ClockOutOfSyncException: Server hadoop-datanode2,60020,1325492317440 has been rejected; Reported time is too far out of sync with master.  Time difference of 206141ms > max allowed of 30000ms

Solution: The clocks of the regionservers are not in sync with the master machine. Synchronize the clocks of the HBase master and regionserver machines, for example with NTP.
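One quick way to sync, assuming ntpdate is installed and the public pool.ntp.org servers are reachable (use your own NTP server if you have one):
        # Run on the master and on every regionserver
        sudo ntpdate pool.ntp.org
For a lasting fix, run the ntpd service on every node so the clocks stay synchronized.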

5 comments:

Anonymous said...

/usr/local/hadoop/hadoop-0.20.203.0/bin/start-dfs.sh
starting namenode, logging to /usr/local/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-hduser-namenode-hadoop-ThinkCentre-A51.out
localhost: starting datanode, logging to /usr/local/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-hduser-datanode-hadoop-ThinkCentre-A51.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-hduser-secondarynamenode-hadoop-ThinkCentre-A51.out
hduser@hadoop-ThinkCentre-A51:~$ jps
12799 SecondaryNameNode
12837 Jps



Please help us to correct this error.

Ankit Jain said...

Please share the namenode and datanode logs.

Anonymous said...

ACTION[0000017-120306175650364-oozie-oozi-W@mr-node] checking action, external ID [job_201203062014_0014] status [RUNNING]
2012-03-07 22:06:23,420 INFO CallbackServlet:525 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000017-120306175650364-oozie-oozi-W] ACTION[0000017-120306175650364-oozie-oozi-W@mr-node] callback for action [0000017-120306175650364-oozie-oozi-W@mr-node]
2012-03-07 22:06:23,577 INFO MapReduceActionExecutor:525 - USER[hdfs] GROUP[users] TOKEN[] APP[map-reduce-wf] JOB[0000017-120306175650364-oozie-oozi-W] ACTION[0000017-120306175650364-oozie-oozi-W@mr-node] action completed, external ID [job_201203062014_0014]
2012-03-07 22:06:23,625 WARN MapReduceActionExecutor:528 - USER[hdfs] GROUP[users] TOKEN[] APP[map-reduce-wf] JOB[0000017-120306175650364-oozie-oozi-W] ACTION[0000017-120306175650364-oozie-oozi-W@mr-node] LauncherMapper died, check Hadoop log for job [192.168.1.123:8021:job_201203062014_0014]
2012-03-07 22:06:23,803 INFO ActionEndCommand:525 - USER[hdfs] GROUP[users] TOKEN[] APP[map-reduce-wf] JOB[0000017-120306175650364-oozie-oozi-W] ACTION[0000017-120306175650364-oozie-oozi-W@mr-node] ERROR is considered as FAILED for SLA

Hi, I am Avinash. I installed Oozie using the tarball and ran an Oozie job as the hdfs user, and I got the error:
ERROR is considered as FAILED for SLA

Can you help me?

Ankit Jain said...

Hi Avinash,

Which Oozie action did you run? Look into the Hadoop jobtracker log; you may get some clue there.

Renata Ghisloti Duarte de Souza said...

Nice post. Thanks!
