
Hadoop 1 Master & 2 Slaves Setup

Why is Hadoop important for handling Big Data?

Hadoop provides excellent support for managing big data, enabling the processing of large data sets in a distributed computing environment. It is designed to scale from a single server to thousands of machines, each contributing computation and storage. Its distributed file system provides rapid data transfer between nodes and allows the system to continue operating when a node fails, which minimizes the risk of catastrophic failure even if a significant number of nodes go down. This makes Hadoop very valuable for large-scale businesses.

Hadoop installation scenario on 3 Ubuntu machines:

UB1 is the master node; UB2 and UB3 are the slave nodes.

Steps:

  1. We will install Hadoop on the master node UB1

  2. Hadoop is written in Java, so we will install Java first:

      • sudo add-apt-repository ppa:webupd8team/java

      • sudo apt-get update

      • sudo apt-get install default-jdk

      • sudo apt-get install oracle-java8-installer

  • The last command installs Java at "/usr/lib/jvm/java-8-oracle". To check that the installation was OK, run:

      • java -version

  • Create a "hadoop" group and an "hduser" system user:

      • sudo addgroup hadoop

      • sudo adduser --ingroup hadoop hduser

  • Install SSH for securely accessing one machine from another (used by Hadoop to access the slave nodes):

      • sudo apt-get install openssh-server

  • Configure SSH. Log in as hduser:

      • sudo su hduser

  • Generate SSH key for hduser:

      • ssh-keygen -t rsa -P ""

  • Copy id_rsa.pub to authorized keys from hduser:

      • cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
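
  • You can optionally verify that passwordless SSH works at this point; for example, the following should log you in without asking for a password:

      • ssh localhost

      • exit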

  • Add "hduser" to sudoers:

      • sudo adduser hduser sudo

  • Hadoop doesn't work on IPv6, so IPv6 must be disabled:

      • sudo apt install gksu

      • sudo apt install gedit

      • sudoedit /etc/sysctl.conf

    • Add the following settings to the file:

# disable ipv6

net.ipv6.conf.all.disable_ipv6 = 1

net.ipv6.conf.default.disable_ipv6 = 1

net.ipv6.conf.lo.disable_ipv6 = 1

      • CTRL+X, then confirm with Y to save. Apply the settings with "sudo sysctl -p" or reboot.
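
  • You can check that IPv6 is really disabled; for example, the following should print 1:

      • cat /proc/sys/net/ipv6/conf/all/disable_ipv6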

  • Change to the Hadoop installation parent directory:

      • cd /usr/local/

  • Download Hadoop:

      • sudo wget https://dist.apache.org/repos/dist/release/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz

  • Extract the Hadoop archive:

      • sudo tar -xzvf hadoop-2.7.3.tar.gz

  • Move hadoop-2.7.3 to hadoop folder:

      • sudo mv hadoop-2.7.3 /usr/local/hadoop

  • Assign ownership of this folder to Hadoop user hduser:

      • sudo chown hduser:hadoop -R /usr/local/hadoop

  • Create Hadoop temp dirs for namenode and datanode:

      • sudo mkdir -p /usr/local/hadoop_tmp/hdfs/namenode

      • sudo mkdir -p /usr/local/hadoop_tmp/hdfs/datanode

  • Assign ownership of this Hadoop temp folder to Hadoop user:

      • sudo chown hduser:hadoop -R /usr/local/hadoop_tmp/

  • Check the JAVA_HOME path (the path listed by this command, without the trailing "/jre/bin/java", is the value used below):

      • update-alternatives --config java

  • Edit the Hadoop configuration files. Start with hduser's ".bashrc" file:

      • sudoedit .bashrc

      • Add the following to it:

# -- HADOOP ENVIRONMENT VARIABLES START -- #

export JAVA_HOME=/usr/lib/jvm/java-8-oracle

export HADOOP_HOME=/usr/local/hadoop

export PATH=$PATH:$HADOOP_HOME/bin

export PATH=$PATH:$HADOOP_HOME/sbin

export HADOOP_MAPRED_HOME=$HADOOP_HOME

export HADOOP_COMMON_HOME=$HADOOP_HOME

export HADOOP_HDFS_HOME=$HADOOP_HOME

export YARN_HOME=$HADOOP_HOME

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

# -- HADOOP ENVIRONMENT VARIABLES END -- #
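
  • Reload the profile so the new variables take effect; for example:

      • source ~/.bashrc

      • echo $HADOOP_HOME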

  • Edit "hadoop-env.sh":

      • cd /usr/local/hadoop/etc/hadoop

      • sudoedit hadoop-env.sh

      • Add to the above file:

        export JAVA_HOME=/usr/lib/jvm/java-8-oracle

  • Edit "core-site.xml":

      • cd /usr/local/hadoop/etc/hadoop

      • sudoedit core-site.xml

      • Add the following inside the <configuration> element:

        <property>
          <name>fs.default.name</name>
          <value>hdfs://UB1:9000</value>
        </property>

  • Edit "hdfs-site.xml":

    • cd /usr/local/hadoop/etc/hadoop

    • sudoedit hdfs-site.xml

    • Add the following inside the <configuration> element:

        <property>
          <name>dfs.replication</name>
          <value>1</value>
        </property>
        <property>
          <name>dfs.namenode.name.dir</name>
          <value>file:/usr/local/hadoop_tmp/hdfs/namenode</value>
        </property>
        <property>
          <name>dfs.datanode.data.dir</name>
          <value>file:/usr/local/hadoop_tmp/hdfs/datanode</value>
        </property>

  • Edit "yarn-site.xml":

    • cd /usr/local/hadoop/etc/hadoop

    • sudoedit yarn-site.xml

    • Add the following inside the <configuration> element:

        <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
        </property>
        <property>
          <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
          <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>

  • Copy the mapred-site.xml.template file to mapred-site.xml:

      • cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

  • Edit "mapred-site.xml":

    • cd /usr/local/hadoop/etc/hadoop

    • sudoedit mapred-site.xml

    • Add the following inside the <configuration> element:

        <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
        </property>

  • Reboot the machine, open a terminal again as hduser, and format the namenode:

      • cd /usr/local/hadoop/etc/hadoop

      • hdfs namenode -format
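
      • If the format succeeds, the output typically ends with a message saying that the namenode storage directory (here /usr/local/hadoop_tmp/hdfs/namenode) has been successfully formatted.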

  • Start all Hadoop daemons:

      • cd /usr/local/hadoop/

      • start-dfs.sh

      • start-yarn.sh

  • Verify hadoop daemons:

      • jps
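
      • On this single-node setup, jps should typically list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager (plus Jps itself).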

  • Now we will extend the setup to the slave nodes.

  • Add all host names to the /etc/hosts file on all machines (master and slave nodes). You can find each machine's IP using the ifconfig command.

    • on UB1 / then on UB2 / then on UB3:

      • sudo vim /etc/hosts

      • if vim is not installed, install it using:

          • sudo apt-get update

          • sudo apt-get install vim

      • Add into above file:

10.0.3.15 UB1

10.0.3.16 UB2

10.0.3.17 UB3
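
      • The IP addresses above are examples; use the ones reported by ifconfig on your machines. You can then check name resolution from each machine, for example:

          • ping -c 1 UB2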

  • Create the hadoop group and the hduser user on all slave PCs:

      • sudo addgroup hadoop

      • sudo adduser --ingroup hadoop hduser

      • sudo usermod -a -G sudo hduser (or edit "/etc/sudoers" with visudo and add the line "hduser ALL=(ALL:ALL) ALL")

  • Install rsync, used to copy the Hadoop installation to all PCs:

      • sudo apt-get install rsync

      • sudo reboot

  • Edit core-site.xml on master PC:

      • cd /usr/local/hadoop/etc/hadoop

      • sudo vim core-site.xml

      • make sure fs.default.name points to hdfs://UB1:9000 (replace localhost with UB1 if present)

  • Edit hdfs-site.xml on the master and change the replication factor from 1 to 3

  • Edit yarn-site.xml on the master and point all ResourceManager addresses to the master node:

        <property>
          <name>yarn.resourcemanager.resource-tracker.address</name>
          <value>UB1:8025</value>
        </property>
        <property>
          <name>yarn.resourcemanager.scheduler.address</name>
          <value>UB1:8035</value>
        </property>
        <property>
          <name>yarn.resourcemanager.address</name>
          <value>UB1:8050</value>
        </property>

  • Edit mapred-site.xml on the master and add a new entry:

        <property>
          <name>mapreduce.job.tracker</name>
          <value>UB1:5431</value>
        </property>

  • Edit the masters file on the master node:

      • cd /usr/local/hadoop/etc/hadoop

      • sudo vim masters

      • add line:

        ## Add name of master nodes

        UB1

  • Update the slaves file on the master:

    • cd /usr/local/hadoop/etc/hadoop

    • sudo vim slaves

        ## Add name of slave nodes

        UB2

        UB3

  • Use rsync on the master to copy the Hadoop installation to the slaves. First prepare each slave PC:

      • Install SSH:

          • sudo apt-get install openssh-server

      • Generate an SSH key for hduser:

          • ssh-keygen -t rsa -P ""

      • Copy id_rsa.pub to the authorized keys of hduser:

          • cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

      • Disable IPv6 as above

      • Create the Hadoop folder and assign its ownership to hduser:

          • cd /usr/local

          • sudo mkdir hadoop

          • sudo chown hduser:hadoop -R /usr/local/hadoop

  • Then run rsync on the master to push the Hadoop installation to each slave:

      • sudo rsync -avxP /usr/local/hadoop/ hduser@UB2:/usr/local/hadoop/

      • sudo rsync -avxP /usr/local/hadoop/ hduser@UB3:/usr/local/hadoop/

  • On master:

      • sudo rm -rf /usr/local/hadoop_tmp/

      • sudo mkdir -p /usr/local/hadoop_tmp/

      • sudo mkdir -p /usr/local/hadoop_tmp/hdfs/namenode

      • sudo chown hduser:hadoop -R /usr/local/hadoop_tmp/

  • On each slave node:

      • sudo rm -rf /usr/local/hadoop_tmp/

      • sudo mkdir -p /usr/local/hadoop_tmp/

      • sudo mkdir -p /usr/local/hadoop_tmp/hdfs/datanode

      • sudo chown hduser:hadoop -R /usr/local/hadoop_tmp/

  • Execute on master:

      • ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@UB2

      • ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@UB3

      • If there are errors at any step, make sure Java is installed on each node and re-run the above commands
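
      • You can also verify passwordless SSH from the master to each slave; for example, the following should run without asking for a password:

          • ssh hduser@UB2 hostname

          • ssh hduser@UB3 hostname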

  • Execute on master:

      • cd /usr/local/hadoop/

      • hdfs namenode -format

      • start-dfs.sh

      • start-yarn.sh

      • jps

  • Execute on each slave:

      • jps
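
      • With this configuration, jps on the master should typically show NameNode, SecondaryNameNode and ResourceManager, while jps on each slave should show DataNode and NodeManager.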

  • Test the cluster:

  • In order to configure WebHDFS, add the following to hdfs-site.xml (inside the <configuration> element):

        <property>
           <name>dfs.webhdfs.enabled</name>
           <value>true</value>
        </property>
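
  • After restarting HDFS you can check WebHDFS with a simple REST call; for example, the following lists the root of the file system (UB1 and the default namenode HTTP port 50070 are assumed here):

      • curl -i "http://UB1:50070/webhdfs/v1/?op=LISTSTATUS"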
  • Copy a local folder to HDFS (see the example below):
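
      • For example, assuming a local folder /home/hduser/input (the path here is just an example), it can be copied to HDFS and listed with:

          • hdfs dfs -mkdir -p /user/hduser

          • hdfs dfs -put /home/hduser/input /user/hduser/input

          • hdfs dfs -ls /user/hduser/input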