Deploying big data components on Ubuntu (Hadoop, ZooKeeper, Hive, Spark)

Published: 2020-07-12 | Category: Ops

Deploying the open-source Apache Hadoop stack on Ubuntu 20.04

Environment

Internal IP      NAT IP          Services
192.168.100.11   172.16.20.197   ZooKeeper, ZKFailoverController, NameNode, DataNode, NodeManager, ResourceManager, JournalNode, Hive, Spark
192.168.100.12   172.16.20.198   ZooKeeper, ZKFailoverController, NameNode, DataNode, NodeManager, ResourceManager, JournalNode, Hive
192.168.100.13   172.16.20.199   ZooKeeper, ZKFailoverController, NameNode, DataNode, NodeManager, ResourceManager, JournalNode, Hive

Preparation

Install OpenJDK 8 and psmisc. psmisc is needed later for NameNode HA (fencing), so install it while preparing the environment.

$ sudo apt update && sudo apt install openjdk-8-jdk psmisc

Configure the environment variables:

$ sudo su -c 'cat >/etc/profile.d/java.sh<<EOF
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=\${JAVA_HOME}/bin/:\${PATH}
EOF'

Create the service users

$ sudo useradd -r -m -s /bin/bash hadoop
$ sudo useradd -r -m -s /bin/bash zookeeper

Log in as the hadoop user and set up SSH keys and passwordless login. NameNode HA needs this later, so every NameNode node must be able to reach the others without a password.

$ ssh-keygen -t rsa -m PEM
$ for x in 1 2 3; do ssh-copy-id hadoop-${x}; done
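
To confirm passwordless login works, a quick loop like this should print each remote hostname without prompting for a password (a sanity-check sketch, run as the hadoop user; it assumes the hadoop-1/2/3 names resolve as configured below):

# every line should print a hostname with no password prompt
$ for x in 1 2 3; do ssh -o BatchMode=yes hadoop-${x} hostname; done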

Set the hostname (adjust to the actual server):

$ hostnamectl set-hostname hadoop-1

Add the following entries to the local hosts file:

192.168.100.11 hadoop-1 zookeeper1
192.168.100.12 hadoop-2 zookeeper2
192.168.100.13 hadoop-3 zookeeper3
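
If you prefer to append these entries from the shell rather than editing the file by hand, a heredoc like the following (run on every node) does the same thing; the IPs and names are the ones from the table above:

# append the cluster name resolution entries to /etc/hosts
$ sudo tee -a /etc/hosts >/dev/null <<'EOF'
192.168.100.11 hadoop-1 zookeeper1
192.168.100.12 hadoop-2 zookeeper2
192.168.100.13 hadoop-3 zookeeper3
EOF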

Deploy ZooKeeper

Perform the following steps on all three ZooKeeper servers.
Copy the ZooKeeper distribution to the installation directory:

$ sudo cp -a apache-zookeeper-3.7.0-bin/ /opt/

Create the directories the service needs:

$ sudo mkdir -p /data/zookeeper/{data,logs}

Edit the ZooKeeper configuration file:

$ cp  /opt/apache-zookeeper-3.7.0-bin/conf/{zoo_sample.cfg,zoo.cfg}
$ vim /opt/apache-zookeeper-3.7.0-bin/conf/zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper/data
dataLogDir=/data/zookeeper/logs
clientPort=2181
maxClientCnxns=2000
server.1=zookeeper1:2888:3888
server.2=zookeeper2:2888:3888
server.3=zookeeper3:2888:3888

Create the myid file. The id must be unique within the ZooKeeper cluster and lie between 1 and 255, and it has to match the server.N id in the main configuration file. Run the following on the three nodes respectively:

vagrant@hadoop-1:~$ sudo sh -c 'echo 1 > /data/zookeeper/data/myid'
vagrant@hadoop-2:~$ sudo sh -c 'echo 2 > /data/zookeeper/data/myid'
vagrant@hadoop-3:~$ sudo sh -c 'echo 3 > /data/zookeeper/data/myid'
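
To double-check the ids, a loop like this (a sketch that reuses the passwordless SSH set up earlier) prints each node's myid; the values should be 1, 2 and 3:

# print hostname and myid for every ZooKeeper node
$ for x in 1 2 3; do printf '%s: ' "hadoop-${x}"; ssh hadoop-${x} cat /data/zookeeper/data/myid; done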

Configure the ZooKeeper environment variables:

$ sudo su -c 'cat >/etc/profile.d/zookeeper.sh <<EOF
export ZOOKEEPER_HOME=/opt/zookeeper
export PATH=\${ZOOKEEPER_HOME}/bin:\${PATH}
EOF'
$ source /etc/profile

Create a symlink for ZooKeeper and fix the directory ownership:

$ sudo ln -s /opt/apache-zookeeper-3.7.0-bin /opt/zookeeper
$ sudo chown -R zookeeper:zookeeper /opt/apache-zookeeper-3.7.0-bin/ /opt/zookeeper /data/zookeeper/

Start the service on all three nodes together; on startup they connect to each other according to the configuration and elect a leader.

$ sudo su - zookeeper
$ zkServer.sh start

Check the status. The output below shows that hadoop-2 is the leader and the other two nodes are followers.

vagrant@hadoop-1:/vagrant$ zkServer.sh status
/usr/bin/java
ZooKeeper JMX enabled by default
Using config: /opt/apache-zookeeper-3.7.0-bin/bin/../conf/zoo.cfg
Client port found: 2181. Client address: localhost. Client SSL: false.
Mode: follower

vagrant@hadoop-2:~$ zkServer.sh status
/usr/bin/java
ZooKeeper JMX enabled by default
Using config: /opt/apache-zookeeper-3.7.0-bin/bin/../conf/zoo.cfg
Client port found: 2181. Client address: localhost. Client SSL: false.
Mode: leader

vagrant@hadoop-3:~$ zkServer.sh status
/usr/bin/java
ZooKeeper JMX enabled by default
Using config: /opt/apache-zookeeper-3.7.0-bin/bin/../conf/zoo.cfg
Client port found: 2181. Client address: localhost. Client SSL: false.
Mode: follower
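
Beyond zkServer.sh status, a simple read/write round trip through zkCli.sh confirms the ensemble actually serves requests. This is only a smoke-test sketch; the /smoketest znode name is arbitrary:

# create, read and delete a throwaway znode against the ensemble
$ zkCli.sh -server zookeeper1:2181 create /smoketest "hello"
$ zkCli.sh -server zookeeper1:2181 get /smoketest
$ zkCli.sh -server zookeeper1:2181 delete /smoketest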

Deploy Hadoop

Copy the Hadoop distribution to the installation directory; do this on all three nodes:

$ sudo cp -a /vagrant/hadoop-3.2.2 /opt/
$ sudo ln -s /opt/hadoop-3.2.2/ /opt/hadoop

Create the data directory and fix the ownership:

$ sudo mkdir -p /data/dfs01/hadoop
$ sudo chown -R hadoop.hadoop /opt/hadoop-3.2.2/ /opt/hadoop /data/dfs01/hadoop

Configure the environment variables:

$ sudo su -c 'cat >/etc/profile.d/hadoop.sh<<EOF
export HADOOP_HOME=/opt/hadoop
export PATH=\${HADOOP_HOME}/bin:\${HADOOP_HOME}/sbin:\${PATH}
EOF'
$ source /etc/profile

Edit hadoop-env.sh: set JAVA_HOME to an absolute path and set the service users to the custom user (hadoop in this setup).

$ cat >> /opt/hadoop/etc/hadoop/hadoop-env.sh <<EOF
export JAVA_HOME=${JAVA_HOME}
export HADOOP_PID_DIR=${HADOOP_HOME}/tmp/pids
export HDFS_NAMENODE_USER=hadoop
export HDFS_DATANODE_USER=hadoop
export HDFS_JOURNALNODE_USER=hadoop
export HDFS_ZKFC_USER=hadoop
EOF

Edit yarn-env.sh and set the YARN service users to the custom user (hadoop in this setup):

$ cat >> /opt/hadoop/etc/hadoop/yarn-env.sh <<EOF
export YARN_REGISTRYDNS_SECURE_USER=hadoop
export YARN_RESOURCEMANAGER_USER=hadoop
export YARN_NODEMANAGER_USER=hadoop
EOF

Switch to the hadoop user; the following steps must be performed as hadoop.

$ sudo su - hadoop

Edit the Hadoop configuration

All files can be edited on a single machine and then synced to the other hosts with scp (a sketch follows after this section).

The files to edit are core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, and workers.

All of the <property> blocks below go inside the <configuration> element of the respective file.
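
Once the files are edited on hadoop-1, a loop along these lines pushes them to the other two nodes (a sketch that assumes the same /opt/hadoop layout everywhere and the passwordless SSH configured earlier):

# sync the edited configuration files from hadoop-1 to hadoop-2 and hadoop-3
$ cd /opt/hadoop/etc/hadoop
$ for x in 2 3; do scp core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml workers hadoop-${x}:/opt/hadoop/etc/hadoop/; done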

core-site.xml configuration

<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>

<property>
<name>hadoop.tmp.dir</name>
<value>/data/dfs01/hadoop/tmp</value>
</property>

<property>
<name>io.file.buffer.size</name>
<value>4096</value>
</property>

<property>
<name>ha.zookeeper.quorum</name>
<value>zookeeper1:2181,zookeeper2:2181,zookeeper3:2181</value>
</property>

  • fs.defaultFS: the address of the HDFS NameNode; with HA it must point to the nameservice name defined in hdfs-site.xml
  • hadoop.tmp.dir: the base directory for files Hadoop creates at runtime, and the parent of the other temporary directories
  • ha.zookeeper.quorum: the list of ZooKeeper addresses used by the ZKFailoverController for automatic failover
  • io.file.buffer.size: the buffer size used for sequence files; the stream buffer here is 4 KB

See the official documentation for more configuration options.

hdfs-site.xml configuration

   <property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>

<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2,nn3</value>
</property>

<property>
<name>dfs.namenode.http-bind-host</name>
<value>0.0.0.0</value>
</property>

<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>hadoop-1:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>hadoop-2:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn3</name>
<value>hadoop-3:8020</value>
</property>

<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>hadoop-1:9870</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>hadoop-2:9870</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn3</name>
<value>hadoop-3:9870</value>
</property>

<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
</property>

<property>
<name>dfs.namenode.name.dir</name>
<value>file://${hadoop.tmp.dir}/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file://${hadoop.tmp.dir}/dfs/data</value>
</property>

<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://hadoop-1:8485;hadoop-2:8485;hadoop-3:8485/mycluster</value>
</property>

<property>
<name>dfs.journalnode.edits.dir</name>
<value>/data/dfs01/hadoop/tmp/dfs/journal</value>
</property>

<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>

<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/hadoop/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>30000</value>
</property>

  • dfs.nameservices: the nameservice name; all NameNode nodes are grouped under the nameservice mycluster
  • dfs.replication: the number of block replicas kept on DataNodes; the default is 3
  • dfs.blocksize: the HDFS block size; the default of 128 MB (134217728 bytes) is used here, while large filesystems often use 256 MB
  • dfs.namenode.rpc-address: the RPC address of each NameNode
  • dfs.namenode.http-address: the HTTP status page address of each NameNode
  • dfs.namenode.name.dir: the directory that stores the NameNode name table (fsimage)
  • dfs.datanode.data.dir: the directory that stores DataNode blocks
  • dfs.namenode.shared.edits.dir: the shared storage directory used by the NameNodes in an HA cluster; the active NameNode writes to it and the standby NameNodes read from it to keep the namespace in sync
  • dfs.journalnode.edits.dir: the directory where JournalNodes store their edit files
  • dfs.ha.automatic-failover.enabled: whether automatic failover is enabled
  • dfs.ha.fencing.methods: fencing guards against split-brain; after a standby is promoted to active, Hadoop also tries to SSH to the old active node and kill its NameNode process, and sshfence is that SSH-based method (a quick check of its prerequisites is sketched below)
  • dfs.ha.fencing.ssh.private-key-files: the location of the SSH private key used for fencing (the private key, not the .pub file)

See the official documentation for more configuration options.
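
Since sshfence depends on the hadoop user's SSH key and on the fuser command from the psmisc package installed earlier, a quick check like this on each NameNode host catches missing prerequisites before a real failover (just a sketch):

# fuser must exist, and hadoop must reach every NameNode host over SSH without a password
$ which fuser
$ for x in 1 2 3; do ssh -o BatchMode=yes hadoop-${x} true && echo "ssh to hadoop-${x} ok"; done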

mapred-site.xml configuration

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

<property>
<name>mapreduce.jobhistory.address</name>
<value>0.0.0.0:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>0.0.0.0:19888</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
</property>

See the official documentation for more configuration options.

yarn-site.xml configuration
YARN also relies on ZooKeeper-based election for ResourceManager HA; see the official documentation for details.

 <property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>cluster1</value>
</property>

<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>

<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2,rm3</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>hadoop-1</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>hadoop-2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm3</name>
<value>hadoop-3</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm1</name>
<value>hadoop-1:8088</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm2</name>
<value>hadoop-2:8088</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm3</name>
<value>hadoop-3:8088</value>
</property>
<property>
<name>hadoop.zk.address</name>
<value>zookeeper1:2181,zookeeper2:2181,zookeeper3:2181</value>
</property>

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>

See the official documentation for more configuration options.

Configure workers

This file specifies the worker nodes on which the DataNodes run.

$ cat > /opt/hadoop/etc/hadoop/workers <<EOF
hadoop-1
hadoop-2
hadoop-3
EOF

Start the Hadoop services

Run the following steps in order, and make sure to do so as the hadoop user.

For the first start, format the ZK data first; this creates the HA znodes for the cluster in ZooKeeper. It only needs to be run once, on any node:

$ hdfs zkfc -formatZK

Check that the format succeeded: there should be a hadoop-ha znode containing the cluster name we configured.

$ zkCli.sh
[zk: localhost:2181(CONNECTED) 1] ls /hadoop-ha
[mycluster]

Start the JournalNodes

Start the JournalNode on every node:

$ hdfs --daemon start journalnode
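
You can confirm the JournalNode came up on each node with jps and by checking the edits port from hdfs-site.xml (8485); this is just a quick check:

# the JournalNode JVM should be running and listening on port 8485
$ jps | grep -i journalnode
$ ss -ntl | grep 8485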

Run the format on one of the NameNode nodes; hadoop-1 is used as the example:

$ hdfs namenode -format

Start the NameNode on hadoop-1:

$ hdfs --daemon start namenode

Sync the NameNode metadata from hadoop-1 to the other NameNode nodes; run the following on hadoop-2 and hadoop-3:

$ hdfs namenode -bootstrapStandby

Start the NameNode on hadoop-2 and hadoop-3:

$ hdfs --daemon start namenode

Check the state of all NameNodes with the hdfs command; at this point they should all be standby.

hadoop@hadoop-1:/opt/hadoop/etc/hadoop$ hdfs haadmin -getServiceState nn1
standby
hadoop@hadoop-1:/opt/hadoop/etc/hadoop$ hdfs haadmin -getServiceState nn2
standby
hadoop@hadoop-1:/opt/hadoop/etc/hadoop$ hdfs haadmin -getServiceState nn3
standby

Start all remaining services:

$ start-all.sh

Use hdfs to check whether the cluster now has an active NameNode:

hadoop@hadoop-1:/opt/hadoop/etc/hadoop$ hdfs haadmin -getServiceState nn1
standby
hadoop@hadoop-1:/opt/hadoop/etc/hadoop$ hdfs haadmin -getServiceState nn2
active
hadoop@hadoop-1:/opt/hadoop/etc/hadoop$ hdfs haadmin -getServiceState nn3
standby
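
start-all.sh also brings up YARN, and since yarn-site.xml enables ResourceManager HA, their states can be checked the same way; exactly one should report active:

# check the ResourceManager HA states
$ for r in rm1 rm2 rm3; do printf '%s: ' "${r}"; yarn rmadmin -getServiceState ${r}; done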

Verify that Hadoop works

HDFS directory creation test:

$ hadoop fs -mkdir  /test
$ hadoop fs -ls /
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2021-12-28 16:18 /test

YARN functional test:

$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar pi 10 100
... (part of the output omitted)
Job Finished in 14.916 seconds
Estimated value of Pi is 3.14800000000000000000

Verify MapReduce

Upload a test file:

$ echo "hello hadoop" > wordtest
$ hadoop fs -put wordtest /wordtest

Run the MapReduce test:

$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar wordcount /wordtest /result

Check the result:

$ hadoop fs -cat /result/part-r-00000
hadoop 1
hello 1

Verify NameNode HA

Test whether automatic failover completes.

First find the NameNode node that is currently active:

$ hdfs haadmin -getServiceState nn2
active

Log in to the active host and stop its NameNode service:

$ hdfs --daemon stop namenode

Then check whether another node's NameNode has automatically switched to active; if it has, automatic failover is working:

$ hdfs haadmin -getServiceState nn3
active
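
After the test, bring the stopped NameNode back; it should rejoin the cluster as a standby (shown here for hadoop-2/nn2, the node stopped in this example):

# on the node whose NameNode was stopped
$ hdfs --daemon start namenode
# the restarted NameNode should report standby while the new active one keeps serving
$ hdfs haadmin -getServiceState nn2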

Deploy Hive

Install MySQL first; the version in the Ubuntu 20.04 repositories is MySQL 8.0.

$ sudo apt install mysql-server
$ sudo systemctl enable mysql --now

Change bind-address in the MySQL configuration to 0.0.0.0, otherwise other servers cannot connect; you can also bind it to a specific IP.

$ cat /etc/mysql/mysql.conf.d/mysqld.cnf  | egrep -v '^#|^$'
[mysqld]
user = mysql
bind-address = 0.0.0.0
mysqlx-bind-address = 0.0.0.0
key_buffer_size = 16M
myisam-recover-options = BACKUP
log_error = /var/log/mysql/error.log
max_binlog_size = 100M

Restart MySQL:

$ sudo systemctl restart mysql

Create a database account:

$ sudo mysql
mysql> CREATE USER 'admin'@'%' IDENTIFIED BY 'zj2018';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'admin'@'%';
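
The hive-site.xml further below connects as a user named hive rather than this admin account. If you want a dedicated metastore account, something like the following creates it; the hive user name, the zj2018 password and the hive database name are assumptions chosen to match the configuration below, so adjust them to your environment:

$ sudo mysql
mysql> CREATE DATABASE IF NOT EXISTS hive;
mysql> CREATE USER 'hive'@'%' IDENTIFIED BY 'zj2018';
mysql> GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%';
mysql> FLUSH PRIVILEGES;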

Unpack the Hive package into the installation directory:

$ sudo tar zxf /vagrant/apache-hive-3.1.2-bin.tar.gz -C /opt/
$ sudo ln -s /opt/apache-hive-3.1.2-bin /opt/hive

Configure the environment variables:

$ sudo su -c 'cat > /etc/profile.d/hive.sh <<EOF
export HIVE_HOME=/opt/hive
export PATH=\${HIVE_HOME}/bin:\${PATH}
EOF'

$ source /etc/profile

Create the Hive data directory and fix its ownership; switch to the hadoop user for the remaining steps.

$ sudo mkdir /data/dfs01/hive -p
$ sudo chown -R hadoop. /opt/hive /opt/apache-hive-3.1.2-bin/ /data/dfs01/hive/
$ sudo su - hadoop

Edit the configuration

Edit hive-env.sh:

$ cat > /opt/hive/conf/hive-env.sh <<EOF
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/opt/hadoop
export HIVE_CONF_DIR=/opt/hive/conf
EOF

Add hive-site.xml; adjust the JDBC connection URL, user name, and password to match the MySQL instance and account created above:

$ cat >/opt/hive/conf/hive-site.xml<<'EOF'
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://mysql.iwellmass.com:3306/hive?createDatabaseIfNotExist=true</value>
</property>

<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>

<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>

<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>zj2018</value>
</property>

<property>
<name>hive.exec.local.scratchdir</name>
<value>/data/dfs01/hive</value>
</property>

<property>
<name>hive.downloaded.resources.dir</name>
<value>/tmp/${hive.session.id}_resources</value>
</property>

<property>
<name>hive.querylog.location</name>
<value>/data/dfs01/hive/querylog</value>
</property>

<property>
<name>hive.server2.logging.operation.log.location</name>
<value>/data/dfs01/hive/operation_logs</value>
</property>

<property>
<name>hive.server2.enable.doAs</name>
<value>FALSE</value>
</property>
</configuration>
EOF

Start Hive and verify it

Download the MySQL JDBC driver jar and place it in Hive's lib directory; note that the driver version must match your MySQL version.

$ wget https://cdn.mysql.com//Downloads/Connector-J/mysql-connector-java-8.0.27.tar.gz
$ tar -zxvf mysql-connector-java-8.0.27.tar.gz
$ cp mysql-connector-java-8.0.27/mysql-connector-java-8.0.27.jar /opt/hive/lib

Replace the guava jar in Hive's lib directory with the version shipped with Hadoop; if you skip this, the database initialization below will fail with an error.

$ mv /opt/hive/lib/{guava-19.0.jar,guava-19.0.jar.bak}
$ cp /opt/hadoop/share/hadoop/common/lib/guava-27.0-jre.jar /opt/hive/lib/

Initialize the metastore database:

$ schematool -dbType mysql -initSchema
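
If the initialization succeeded, schematool can report the installed schema version as a quick sanity check:

# prints the Hive distribution and metastore schema versions
$ schematool -dbType mysql -info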

Test the Hive interactive CLI:

$ hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apache-hive-3.1.2-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-3.2.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = c6bc087d-f131-4f16-ac32-64399b9d6899

Logging initialized using configuration in jar:file:/opt/apache-hive-3.1.2-bin/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
Hive Session ID = 2e30f456-7d17-42f4-8f30-35f097fb457f
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> show databases;
OK
default
Time taken: 0.381 seconds, Fetched: 1 row(s)

Beeline client configuration

Hive ships with both HiveServer and HiveServer2. Both allow clients to connect from many programming languages, but HiveServer cannot handle concurrent requests from multiple clients, which is why HiveServer2 was created. HiveServer2 (HS2) lets remote clients submit requests to Hive and retrieve results, with support for concurrent multi-client access and authentication. HS2 is a single process made up of several services, including a Thrift-based Hive service (TCP or HTTP) and a Jetty web server for the web UI.
HiveServer2 has its own CLI tool, Beeline, a JDBC client based on SQLLine. Since HiveServer2 is now the focus of Hive development, the project recommends Beeline over the Hive CLI. The following covers the Beeline setup.

Edit the Hadoop cluster's core-site.xml and add the following inside the <configuration> element, then sync the file to all machines. The user named in the property keys must be the OS user that runs HiveServer2 (hadoop in this setup).

<property>
<name>hadoop.proxyuser.hadoop.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.groups</name>
<value>*</value>
</property>

This step is needed because Hadoop 2.0 introduced an impersonation (proxy user) mechanism: Hadoop does not allow an upper-layer system such as Hive to pass the real end user straight through to the Hadoop layer. Instead, the real user is handed to a designated proxy account, which performs the operations on Hadoop, so arbitrary clients cannot act on Hadoop at will. Without this configuration, later connections may fail with an AuthorizationException.
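
If HDFS and YARN are already running when you add these properties, the proxy-user settings can be reloaded without a full restart (run as the hadoop user); otherwise simply restart the daemons:

# reload the proxy-user (superuser group) configuration on the running daemons
$ hdfs dfsadmin -refreshSuperUserGroupsConfiguration
$ yarn rmadmin -refreshSuperUserGroupsConfiguration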

Manage the hive-server2 and hive metastore services with systemd

$ sudo su -c 'cat > /etc/systemd/system/hive-meta.service <<EOF
[Unit]
Description=Hive metastore
After=network.target

[Service]
User=hadoop
Group=hadoop
ExecStart=/opt/hive/bin/hive --service metastore

[Install]
WantedBy=multi-user.target
EOF'
$ sudo su -c 'cat > /etc/systemd/system/hive-server2.service <<EOF
[Unit]
Description=hive-server2
After=network.target

[Service]
User=hadoop
Group=hadoop
ExecStart=/opt/hive/bin/hive --service hiveserver2

[Install]
WantedBy=multi-user.target
EOF'

Enable both services at boot and start them:

$ sudo systemctl enable --now hive-meta.service  hive-server2.service

Check that the following ports are listening. hive-server2 (port 10000) can be slow to start; as long as the service has not failed, just wait a little longer.

$ netstat -ntpl | egrep '10000|9083'

Test the connection with beeline:

$ beeline -u jdbc:hive2://hadoop-1:10000
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apache-hive-3.1.2-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-3.2.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://hadoop-1:10000
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.2 by Apache Hive

0: jdbc:hive2://hadoop-1:10000> show databases;
INFO : Compiling command(queryId=hadoop_20211229004522_296cb14e-d962-4a48-945c-1ad25a85afa6): show databases
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:database_name, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=hadoop_20211229004522_296cb14e-d962-4a48-945c-1ad25a85afa6); Time taken: 0.501 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hadoop_20211229004522_296cb14e-d962-4a48-945c-1ad25a85afa6): show databases
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hadoop_20211229004522_296cb14e-d962-4a48-945c-1ad25a85afa6); Time taken: 0.035 seconds
INFO : OK
INFO : Concurrency mode is disabled, not creating a lock manager
+----------------+
| database_name |
+----------------+
| default |
+----------------+
1 row selected (0.764 seconds)

Hive deployment is complete.

Author: WGY
Original link: http://geeklive.cn/2020/07/12/ubuntu20-install-hadoop/undefined/ubuntu20-install-hadoop/
License: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit the source when republishing.