## Summary of the Hadoop cluster building process

This article summarizes the process of building a Hadoop cluster. It covers release notes, an introduction to Hadoop clusters, server preparation, network environment preparation, server system settings, and JDK environment installation. Let's go through it together.

1、 Release notes

Hadoop distributions are divided into the open source community version and commercial versions. The community version is the one maintained by the Apache Software Foundation and is the official release line. Commercial versions are built by third-party companies on top of the community version, with some modifications plus integration and compatibility testing of the various service components. Well-known examples are Cloudera's CDH, MapR, and Hortonworks.

What we will study later is a commercial version: Cloudera's CDH. Unless otherwise specified, any version mentioned refers to CDH. Hadoop's versioning is unusual because several branches are developed in parallel. Broadly, there are three major series: 1.x, 2.x, and 3.x. Hadoop 1.0 consists of the distributed file system HDFS and the offline computing framework MapReduce.

Hadoop 2.0 consists of an HDFS that supports horizontal scaling of the NameNode, the resource management system YARN, and the offline computing framework MapReduce running on YARN. Compared with Hadoop 1.0, Hadoop 2.0 is more powerful, has better scalability and performance, and supports a variety of computing frameworks. Hadoop 3.0 adds a series of enhancements over Hadoop 2.0. It has now stabilized, but the upgrade and integration of the whole ecosystem is not yet complete, so its suitability for commercial use is still open to question. The cluster building process described here uses the most stable version of the 2.x series: CDH 2.6.0-CDH14.0.

2、 Introduction to Hadoop clusters

Strictly speaking, a Hadoop cluster consists of two clusters: the HDFS cluster and the YARN cluster. The two are logically separate but are usually deployed together on the same physical machines. The HDFS cluster is responsible for storing massive amounts of data, and its main roles are NameNode, DataNode, and SecondaryNameNode. The YARN cluster is responsible for resource scheduling when computing over massive amounts of data, and its main roles are ResourceManager and NodeManager.

So what is MapReduce? It is a distributed computing programming framework, delivered as an application development package. Users develop programs according to its programming specification; the packaged program then runs on the HDFS cluster and is subject to resource scheduling and management by the YARN cluster.

Hadoop can be deployed in three ways: Standalone mode, Pseudo-Distributed mode, and Cluster mode. The first two are deployed on a single machine. In standalone mode, a single machine runs a single Java process; it is mainly used for debugging. Pseudo-distributed mode also runs on a single machine, hosting the HDFS NameNode and DataNode and the YARN ResourceManager and NodeManager, but each is started as a separate Java process; it is also mainly used for debugging.

Cluster mode is mainly used for production deployment. N hosts together make up one Hadoop cluster, and the master and slave nodes are deployed on different machines. We use 3 nodes as an example, with roles assigned as follows:

node-1: NameNode, DataNode, ResourceManager

node-2: DataNode, NodeManager, SecondaryNameNode

node-3: DataNode, NodeManager

3、 Server preparation

In this example, VMware Workstation Pro is used to create the virtual servers on which the Hadoop cluster is built. The software and versions used are as follows:

VMware Workstation Pro 12.0

CentOS 6.9 64-bit

4、 Network environment preparation

Use NAT networking. If you installed the desktop version of CentOS, the network can be configured through the graphical interface after installation. If you installed the minimal version, configure it by editing the ifcfg-eth* configuration file. Pay attention to BOOTPROTO, GATEWAY, and NETMASK.
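
As a rough illustration of the three settings named above, a static-address `ifcfg-eth0` for the first node might look like the sketch below. The address 192.168.33.101 matches the host mapping used later in this article. The gateway and DNS values are assumptions (VMware NAT networks commonly place the gateway at `.2`) and should be replaced with the values shown for your own NAT network.

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth0 (sketch; gateway and DNS are assumptions)
DEVICE=eth0
TYPE=Ethernet
# Bring the interface up at boot
ONBOOT=yes
# Use a fixed address instead of DHCP
BOOTPROTO=static
# node-1; use .102 and .103 on the other two nodes
IPADDR=192.168.33.101
NETMASK=255.255.255.0
# Assumed VMware NAT gateway; check your own virtual network settings
GATEWAY=192.168.33.2
# Assumed; the NAT gateway usually also answers DNS queries
DNS1=192.168.33.2
```

After editing the file, `service network restart` applies the change on CentOS 6.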

5、 Server system settings

Synchronize time

# Synchronize the time of each machine in the cluster by setting it manually

date -s "2019-03-03 03:03:03"

# Install the ntpdate tool

yum install ntpdate

# Synchronize time over the network

ntpdate cn.pool.ntp.org
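
To check whether the clocks actually agree, one option is to compare the epoch seconds reported by each node. The helper below is a minimal sketch: `max_skew` is a hypothetical function name, not a standard tool, and it only computes the spread between the largest and smallest timestamp it is given.

```shell
# max_skew T1 T2 ...
# Prints the difference in seconds between the largest and smallest
# epoch timestamp passed as arguments. Hypothetical helper for illustration.
max_skew() {
  min=$1
  max=$1
  for t in "$@"; do
    if [ "$t" -lt "$min" ]; then min=$t; fi
    if [ "$t" -gt "$max" ]; then max=$t; fi
  done
  echo $((max - min))
}
```

Once the passwordless SSH login described later in this section is in place, it could be called as `max_skew $(for h in node-1 node-2 node-3; do ssh "$h" date +%s; done)`. Because the nodes are queried one after another, a result of a second or two may reflect SSH latency rather than real clock drift.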

Set the hostname

vi /etc/sysconfig/network

NETWORKING=yes

HOSTNAME=node-1

Configure the IP-to-hostname mapping

vi /etc/hosts

192.168.33.101 node-1

192.168.33.102 node-2

192.168.33.103 node-3
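
For clusters with more nodes, typing these entries by hand becomes error-prone. The sketch below generates them from a network prefix, a starting host number, and a node count. `gen_hosts` is a hypothetical helper written for this article, and it assumes the `node-N` naming scheme and consecutive addresses shown above.

```shell
#!/bin/sh
# gen_hosts PREFIX FIRST COUNT
# Prints COUNT lines of the form "PREFIX.<octet> node-<n>", where the
# last octet starts at FIRST. Hypothetical helper, not a standard tool.
gen_hosts() {
  prefix=$1
  first=$2
  count=$3
  n=1
  while [ "$n" -le "$count" ]; do
    printf '%s.%d node-%d\n' "$prefix" "$((first + n - 1))" "$n"
    n=$((n + 1))
  done
}

# Preview the three entries used in this article.
gen_hosts 192.168.33 101 3
```

After checking the preview, the same output can be appended as root on every machine with `gen_hosts 192.168.33 101 3 >> /etc/hosts`.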

Configure passwordless SSH login

# Generate the SSH key pair for passwordless login

ssh-keygen -t rsa (press Enter at each prompt to accept the defaults)

Running this command generates two files: id_rsa (the private key) and id_rsa.pub (the public key).

Copy the public key to the target machine to enable passwordless login

ssh-copy-id node-2
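
The command above covers a single target. A minimal sketch for repeating it across several hosts follows. `copy_key_to_all` and the `COPY_CMD` override are hypothetical names introduced here; `COPY_CMD=echo` only previews which hosts would be contacted.

```shell
# copy_key_to_all HOST...
# Runs ssh-copy-id once per host and stops at the first failure.
# COPY_CMD is a hypothetical override: set COPY_CMD=echo for a dry run.
copy_key_to_all() {
  for host in "$@"; do
    ${COPY_CMD:-ssh-copy-id} "$host" || return 1
  done
}
```

Run on node-1, `copy_key_to_all node-1 node-2 node-3` would prompt for each host's password once. Including node-1 itself is an assumption that scripts started on node-1 may also open SSH connections to the local host.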

Configure the firewall

# View firewall status

service iptables status

# Stop the firewall

service iptables stop

# Check whether the firewall is enabled at boot

chkconfig iptables --list

# Disable the firewall at boot

chkconfig iptables off

6、 JDK environment installation

# Upload the JDK installation package

jdk-8u65-linux-x64.tar.gz

# Extract the installation package (under /export/servers, to match JAVA_HOME below)

tar zxvf jdk-8u65-linux-x64.tar.gz

# Configure environment variables in /etc/profile

export JAVA_HOME=/export/servers/jdk1.8.0_65

export PATH=$PATH:$JAVA_HOME/bin

export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

# Refresh configuration

source /etc/profile
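
A quick way to confirm the result is to check that `JAVA_HOME` points at a directory containing an executable `bin/java`. The function below is a minimal sketch of such a check. `check_jdk` is a hypothetical name, and it verifies only that the file exists and is executable, not that the JDK version is correct.

```shell
# check_jdk [DIR]
# Reports whether DIR (default: $JAVA_HOME) contains an executable bin/java.
# Hypothetical helper for illustration; returns non-zero when the check fails.
check_jdk() {
  home=${1:-${JAVA_HOME:-}}
  if [ -n "$home" ] && [ -x "$home/bin/java" ]; then
    echo "OK: $home/bin/java"
  else
    echo "MISSING: no executable java under '$home'"
    return 1
  fi
}
```

Calling `check_jdk` with no arguments after `source /etc/profile` tests the configured path; running `java -version` afterwards shows which JDK is actually picked up from `PATH`.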

That concludes this summary of the Hadoop cluster building process.

