Thursday, July 22, 2010

Hadoop cluster setup

Hadoop setup
Important Directories
One of the basic tasks involved in setting up a Hadoop cluster is deciding where the various Hadoop-related directories will live. Where they go is up to you; in some cases the default locations are inadvisable and should be changed. This section identifies these directories.
HADOOP_LOG_DIR
  Description: Output location for log files from the daemons
  Default location: ${HADOOP_HOME}/logs
  Suggested location: /var/log/hadoop

hadoop.tmp.dir
  Description: A base for other temporary directories
  Default location: /tmp/hadoop-${user.name}
  Suggested location: /tmp/hadoop

dfs.name.dir
  Description: Where the NameNode metadata should be stored
  Default location: ${hadoop.tmp.dir}/dfs/name
  Suggested location: /home/hadoop/dfs/name

dfs.data.dir
  Description: Where DataNodes store their blocks
  Default location: ${hadoop.tmp.dir}/dfs/data
  Suggested location: /home/hadoop/dfs/data

mapred.system.dir
  Description: The in-HDFS path to shared MapReduce system files
  Default location: ${hadoop.tmp.dir}/mapred/system
  Suggested location: /hadoop/mapred/system
This table is not exhaustive; several other directories are listed in conf/hadoop-defaults.xml. The remaining directories, however, are initialized by default to reside under hadoop.tmp.dir, and are unlikely to be a concern.
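As a rough sketch of how the suggested locations above might be declared (property names as in the list; exactly which file they go in depends on your release, e.g. a single conf/hadoop-site.xml on older versions, or conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml on 0.20 and later):

<configuration>
  <!-- Base for other temporary directories -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop</value>
  </property>
  <!-- Where each DataNode keeps its blocks -->
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/dfs/data</value>
  </property>
  <!-- In-HDFS path for shared MapReduce system files -->
  <property>
    <name>mapred.system.dir</name>
    <value>/hadoop/mapred/system</value>
  </property>
</configuration>

HADOOP_LOG_DIR is an environment variable rather than an XML property; it is set in conf/hadoop-env.sh, e.g. export HADOOP_LOG_DIR=/var/log/hadoop.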
It is critically important in a real cluster that dfs.name.dir and dfs.data.dir be moved out of hadoop.tmp.dir. A real cluster should never treat these directories as temporary, as they are where all persistent HDFS data resides. Production clusters should list two paths for dfs.name.dir, on two different physical file systems, to ensure that cluster metadata is preserved in the event of hardware failure.
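dfs.name.dir accepts a comma-separated list of directories, and the NameNode writes its metadata to every one of them. A minimal sketch, assuming a second file system mounted at the hypothetical path /mnt/namenode-backup (this could just as well be an NFS mount on another machine):

<property>
  <name>dfs.name.dir</name>
  <!-- NameNode metadata is replicated to both directories;
       /mnt/namenode-backup is a placeholder for a second physical file system -->
  <value>/home/hadoop/dfs/name,/mnt/namenode-backup/dfs/name</value>
</property>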

Tuesday, July 20, 2010

Hadoop installation instructions from IBM

Hadoop installation instructions part 1
Hadoop installation instructions part 2