Why is Hadoop important in handling Big Data?
Hadoop provides excellent big data management capabilities and supports the processing of large data sets in a distributed computing environment. It is designed to scale from a single server to thousands of machines, each contributing computation and storage. Its distributed file system enables rapid data transfer between nodes and lets the system continue operating uninterrupted when a node fails, which minimizes the risk of catastrophic failure even if a significant number of nodes go out of action. This makes Hadoop very valuable for large-scale businesses.
Hadoop installation scenario on 3 Ubuntu machines:
ub1 is the master node and ub2 and ub3 are the slave nodes.
Steps:
-
We will install Hadoop on the master node ub1.
-
Hadoop is based on the Java framework, so we will install Java first:
-
sudo add-apt-repository ppa:webupd8team/java
-
sudo apt-get update
-
sudo apt-get install default-jdk
-
sudo apt-get install oracle-java8-installer
-
The last command installs Java at "/usr/lib/jvm/java-8-oracle". To check that the installation went OK, use the next command:
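For example, to print the installed Java version:
java -version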
-
Create a hadoop group and an "hduser" system user:
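Typical commands for this step (group and user names as above):
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser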
-
Install SSH for secure access from one machine to another (used by Hadoop to access the slave nodes):
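For example:
sudo apt-get install openssh-server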
-
Configure SSH. Log in as hduser:
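For example:
su - hduser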
-
Generate SSH key for hduser:
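A typical command (empty passphrase so Hadoop can log in to the nodes non-interactively):
ssh-keygen -t rsa -P ""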
-
Append id_rsa.pub to the authorized keys of hduser:
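For example:
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys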
-
Add "hduser" to sudoers:
-
Hadoop does not work on IPv6, so IPv6 must be disabled. Add the following lines to "/etc/sysctl.conf":
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
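Then reload the settings:
sudo sysctl -p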
-
Locate hadoop installation parent directory:
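For example, assuming Hadoop will live under /usr/local (as in the environment variables set below):
cd /usr/local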
-
Download Hadoop:
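For example (the exact mirror URL may differ):
sudo wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz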
-
Extract Hadoop sources:
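For example:
sudo tar xvzf hadoop-2.7.3.tar.gz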
-
Move hadoop-2.7.3 to hadoop folder:
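For example:
sudo mv hadoop-2.7.3 /usr/local/hadoop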
-
Assign ownership of this folder to Hadoop user hduser:
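For example:
sudo chown -R hduser:hadoop /usr/local/hadoop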
-
Create Hadoop temp dirs for namenode and datanode:
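These are the directories referenced later in "hdfs-site.xml":
sudo mkdir -p /usr/local/hadoop_tmp/hdfs/namenode
sudo mkdir -p /usr/local/hadoop_tmp/hdfs/datanode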
-
Assign ownership of this Hadoop temp folder to Hadoop user:
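For example:
sudo chown -R hduser:hadoop /usr/local/hadoop_tmp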
-
Check JAVA_HOME path:
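One way to check it (the JAVA_HOME value is the JVM directory at the start of the printed path, here /usr/lib/jvm/java-8-oracle):
readlink -f /usr/bin/java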
-
Edit the Hadoop configuration files. First, edit the ".bashrc" file:
-
sudoedit .bashrc
-
add into it:
# -- HADOOP ENVIRONMENT VARIABLES START -- #
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
# -- HADOOP ENVIRONMENT VARIABLES END -- #
-
Edit "hadoop-env.sh":
-
Edit "core-site.xml":
-
cd /usr/local/hadoop/etc/hadoop
-
sudoedit core-site.xml
-
Add into the above file, between the <configuration> tags:
<property>
<name>fs.default.name</name>
<value>hdfs://UB1:9000</value>
</property>
-
Edit "hdfs-site.xml":
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_tmp/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_tmp/hdfs/datanode</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
10.0.3.15 UB1
10.0.3.16 UB2
10.0.3.17 UB3
-
Create the hadoop group and the hduser user on all slave PCs (same commands as on the master).
-
Install rsync for copying the Hadoop installation to all PCs:
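For example:
sudo apt-get install rsync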
-
Edit core-site.xml on master PC:
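For reference, and assuming the same master hostname as above, fs.default.name should point to the master node:
<property>
<name>fs.default.name</name>
<value>hdfs://UB1:9000</value>
</property>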
-
Edit hdfs-site.xml on master and change the replication factor from 1 to 3:
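That is:
<property>
<name>dfs.replication</name>
<value>3</value>
</property>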
-
Edit "yarn-site.xml" on master (all ResourceManager addresses must point to the master node, UB1):
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>UB1:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>UB1:8035</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>UB1:8050</value>
</property>
-
Edit "mapred-site.xml" on master:
<property>
<name>mapreduce.job.tracker</name>
<value>UB1:5431</value>
</property>
Edit the "slaves" file on master and add the names of the slave nodes:
UB2
UB3
-
Use rsync on master:
-
First install SSH on each slave PC, then copy the Hadoop folder to the slaves:
-
sudo rsync -avxP /usr/local/hadoop/ hduser@UB2:/usr/local/hadoop/
-
sudo rsync -avxP /usr/local/hadoop/ hduser@UB3:/usr/local/hadoop/
-
On master:
-
sudo rm -rf /usr/local/hadoop_tmp/
-
sudo mkdir -p /usr/local/hadoop_tmp/
-
sudo mkdir -p /usr/local/hadoop_tmp/hdfs/namenode
-
sudo chown hduser:hadoop -R /usr/local/hadoop_tmp/
-
On each slave node:
-
sudo rm -rf /usr/local/hadoop_tmp/
-
sudo mkdir -p /usr/local/hadoop_tmp/
-
sudo mkdir -p /usr/local/hadoop_tmp/hdfs/datanode
-
sudo chown hduser:hadoop -R /usr/local/hadoop_tmp/
-
Execute on master:
-
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@UB2
-
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@UB3
-
If there are errors at any step, first install Java on each node and re-execute the above commands.
-
Execute on master:
-
cd /usr/local/hadoop/
-
hdfs namenode -format
-
start-dfs.sh
-
start-yarn.sh
-
jps
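With this setup, jps on the master should typically list NameNode, SecondaryNameNode and ResourceManager (plus Jps itself).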
-
Execute on each slave:
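For example, to verify that the daemons started on a slave (it should typically show DataNode and NodeManager):
jps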
-
Test. Enable WebHDFS by adding the following property to "hdfs-site.xml":
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
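Once everything is up, the cluster can be checked, for example, with:
hdfs dfsadmin -report
and, assuming the default Hadoop 2.x ports, by browsing the NameNode web UI at http://UB1:50070 and the ResourceManager web UI at http://UB1:8088.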