feat: add reports

jackfiled 2024-07-06 22:01:54 +08:00
commit 356464662a
100 changed files with 1516 additions and 0 deletions

View File

@ -0,0 +1,420 @@
# Comprehensive Practice of Real-Time Big Data Analytics
## Experiment Objectives
### Local Mode Deployment
- Install `Flink`
- Learn to start `Flink` with its scripts
- Test with the word-count example bundled with `Flink`
### Standalone Mode Deployment
Start the `Flink` processes in standalone mode.
### Yarn Mode Deployment
- Complete the configuration for `Flink on Yarn` mode
- Start a `Flink` cluster on `Yarn`
- Submit jobs in the form of jar files
### Kafka and Flink
- Install `Kafka`
- Write code locally that reads data from `Kafka` and package it as a jar
- Upload the jar to the `Flink` cluster and run it
## Experiment Procedure
### Local Mode Deployment
This part only requires starting `Flink` on a single machine, so the experiment is carried out directly on the local computer.
First, make sure a Java 1.8 `JDK` is installed locally.
![image-20240605185225014](大数据实时数据分析综合实践/image-20240605185225014.png)
Download `flink` from the `apache` website, extract it locally, and set the environment variables:
```shell
export FLINK_HOME=$(pwd)
export PATH=$FLINK_HOME/bin:$PATH
```
Once configured, start the `flink` service.
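A minimal sketch of this step, assuming the distribution extracted above and the standard Flink script layout:

```shell
# Start a local JobManager and TaskManager
start-cluster.sh
# jps should now show StandaloneSessionClusterEntrypoint and TaskManagerRunner
jps
```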
![image-20240605190010357](大数据实时数据分析综合实践/image-20240605190010357.png)
Opening `http://localhost:8081` shows the `Flink` web UI.
![image-20240605190043147](大数据实时数据分析综合实践/image-20240605190043147.png)
Run the word-count example shipped with `Flink` in local mode:
![image-20240605190249013](大数据实时数据分析综合实践/image-20240605190249013.png)
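The command behind this screenshot is along the following lines (assuming the batch example bundled with the distribution):

```shell
# Submit the bundled WordCount example to the local cluster
flink run $FLINK_HOME/examples/batch/WordCount.jar
```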
The job that was just submitted and completed is also visible in the web UI at this point.
![image-20240605190322651](大数据实时数据分析综合实践/image-20240605190322651.png)
### Standalone Mode Deployment
Since the previous experiments were all completed with `Docker` containers, the same approach is used for deployment here. Note that the `Dockerfile` used is the Experiment 3 version, which does not contain the `spark`-related content from Experiment 4; in addition, to reduce the size of the resulting image, the `hbase` configuration was removed from the `Dockerfile`. The modified `Dockerfile` is as follows:
```dockerfile
FROM archlinux:latest
# Install necessary dependencies
RUN echo 'Server = https://mirrors.cernet.edu.cn/archlinux/$repo/os/$arch' > /etc/pacman.d/mirrorlist
RUN pacman -Sy --noconfirm openssh jdk8-openjdk which inetutils
# Setting JAVA_HOME env
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk
# Configuring SSH login
RUN echo 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCyyLt1bsAlCcadB2krSCDr0JP8SrF7EsUM+Qiv3m+V10gIBoCBFEh9iwpVN1UMioK8qdl9lm+LK22RW+IU6RjW+zyPB7ui3LlG0bk5H4g9v7uXH/+/ANfiJI2/2+Q4gOQAsRR+7kOpGemeKnFGJMgxnndSCpgYI4Is9ydAFzcQcGgxVB2mTGT6siufJb77tWKxrVzGn60ktdRxfwqct+2Nt88GTGw7eGJfMQADX1fVt9490M3G3x2Kw9KweXr2m+qr1yCRAlt3WyNHoNOXVhrF41/YgwGe0sGJd+kXBAdM2nh2xa0ZZPUGFkAp4MIWBDbycleRCeLUpCHFB0bt2D82BhF9luCeTXtpLyDym1+PS+OLZ3NDcvztBaH8trsgH+RkUc2Bojo1J4W9NoiEWsHGlaziWgF6L3z1vgesDPboxd0ol6EhKVX+QjxA9XE79IT4GidHxDwqonJz/dHXwjilqqmI4TEHndVWhJN0GV47a63+YCK02VAZ2mOA3aw/7LE= ricardo@magicbook-14' >> /root/.ssh/authorized_keys
COPY id_big_data /root/.ssh/id_rsa
RUN echo 'Host *' >> /etc/ssh/ssh_config && echo ' StrictHostKeyChecking no' >> /etc/ssh/ssh_config
# Install Hadoop
ADD hadoop-3.3.6.tar.gz /opt/
RUN mv /opt/hadoop-3.3.6 /opt/hadoop && \
chmod -R 777 /opt/hadoop
# Configure Hadoop
ENV HADOOP_HOME=/opt/hadoop
RUN echo "slave1" >> $HADOOP_HOME/etc/hadoop/workers
RUN echo "slave2" >> $HADOOP_HOME/etc/hadoop/workers
RUN echo "slave3" >> $HADOOP_HOME/etc/hadoop/workers
RUN mkdir $HADOOP_HOME/tmp
ENV HADOOP_TMP_DIR=$HADOOP_HOME/tmp
RUN mkdir $HADOOP_HOME/namenode
RUN mkdir $HADOOP_HOME/datanode
ENV HADOOP_CONFIG_HOME=$HADOOP_HOME/etc/hadoop
ENV PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
ENV HADOOP_CLASSPATH=$HADOOP_HOME/share/hadoop/tools/lib/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_CLASSPATH
ENV HDFS_NAMENODE_USER="root"
ENV HDFS_DATANODE_USER="root"
ENV HDFS_SECONDARYNAMENODE_USER="root"
ENV YARN_RESOURCEMANAGER_USER="root"
ENV YARN_NODEMANAGER_USER="root"
COPY hadoop_config/* $HADOOP_HOME/etc/hadoop/
RUN sed -i '1i export JAVA_HOME=/usr/lib/jvm/java-8-openjdk' $HADOOP_HOME/etc/hadoop/hadoop-env.sh
# Install zookeeper
ADD apache-zookeeper-3.9.2-bin.tar.gz /opt/
RUN mv /opt/apache-zookeeper-3.9.2-bin /opt/zookeeper && \
chmod -R 777 /opt/zookeeper
# Configure zookeeper
ENV ZOOKEEPER_HOME=/opt/zookeeper
ENV PATH=$ZOOKEEPER_HOME/bin:$PATH
RUN mkdir $ZOOKEEPER_HOME/tmp
COPY zookeeper_config/* $ZOOKEEPER_HOME/conf/
# Install flink
ADD flink-1.13.6-bin-scala_2.11.tgz /opt/
RUN mv /opt/flink-1.13.6 /opt/flink && \
chmod -R 777 /opt/flink
# Add hadoop library
ADD commons-cli-1.4.jar /opt/flink/lib/
ADD flink-shaded-hadoop-3-uber-3.1.1.7.2.1.0-327-9.0.jar /opt/flink/lib/
# Configure flink
ENV FLINK_HOME=/opt/flink
ENV PATH=$FLINK_HOME/bin:$PATH
COPY flink_conf/* $FLINK_HOME/conf/
COPY run.sh /run.sh
CMD [ "/run.sh" ]
```
The `flink` configuration files are set according to the lab manual.
Start the `flink` cluster on the `master` node with the same command as in local mode, and use `jps` to check the processes that are running:
![image-20240605193709092](大数据实时数据分析综合实践/image-20240605193709092.png)
Enter a worker node to check the processes started there:
![image-20240605193835225](大数据实时数据分析综合实践/image-20240605193835225.png)
Open the web UI:
![image-20240605194006528](大数据实时数据分析综合实践/image-20240605194006528.png)
The 4 running nodes and their corresponding 4 `slot`s are visible.
Enter the `master` container again and run the bundled word-count test program.
![image-20240605194334009](大数据实时数据分析综合实践/image-20240605194334009.png)
The just-completed job is also visible in the web UI.
![image-20240605194418208](大数据实时数据分析综合实践/image-20240605194418208.png)
### Yarn Mode Deployment
First, modify the `yarn-site.xml` configuration file:
```xml
<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/opt/hadoop/tmp/nm-local-dir</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.am.max-attempts</name>
<value>4</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/hdfs:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/share/hadoop/yarn:/opt/hadoop/share/hadoop/yarn/lib/*:/opt/hadoop/share/hadoop/yarn/*</value>
</property>
</configuration>
```
Then add the following to the `flink` configuration file:
```yaml
high-availability: zookeeper
high-availability.storageDir: hdfs://master/flink_yarn_ha
high-availability.zookeeper.path.root: /flink-yarn
high-availability.zookeeper.quorum: master:2181,slave1:2181,slave2:2181,slave3:2181
yarn.application-attempts: 10
```
Then modify the `masters` configuration file of `flink`:
rcj-2021211180-node1:8081
rcj-2021211180-node2:8081
Although the `Dockerfile` itself was not modified, the configuration files used to build the containers were, so the image has to be rebuilt and the container cluster started again.
First start `hadoop` and `zookeeper` in the containers as usual.
Then create a directory in the `hdfs` file system; this directory will be used by `flink`.
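These preparation steps amount to roughly the following (a sketch; the directory name matches `high-availability.storageDir` above):

```shell
# On the master node: bring up HDFS and YARN
start-dfs.sh && start-yarn.sh
# On every node: start zookeeper
zkServer.sh start
# Create the directory that Flink's HA setup will use
hdfs dfs -mkdir -p /flink_yarn_ha
```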
![image-20240605203622160](大数据实时数据分析综合实践/image-20240605203622160.png)
Start `flink` on `yarn`:
![image-20240605203818772](大数据实时数据分析综合实践/image-20240605203818772.png)
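The YARN session behind this screenshot was started with something along these lines (the memory and slot figures are assumptions, not the exact values used):

```shell
# Start a detached Flink session on YARN
yarn-session.sh -jm 1024m -tm 1024m -s 1 -d
```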
The session also shows up as an application in `yarn`:
![image-20240605203934891](大数据实时数据分析综合实践/image-20240605203934891.png)
The `flink` monitoring UI before any job has been run:
![image-20240605204109999](大数据实时数据分析综合实践/image-20240605204109999.png)
Then run the word-count program bundled with `flink`.
![image-20240605205011650](大数据实时数据分析综合实践/image-20240605205011650.png)
The output of the word-count program is as follows.
![image-20240605204242830](大数据实时数据分析综合实践/image-20240605204242830.png)
The monitoring UI after executing the example word-count job:
![image-20240605204333544](大数据实时数据分析综合实践/image-20240605204333544.png)
The lab manual asks for a word-count program that reads its data from `hdfs` and writes its output back to `hdfs`. First create the program's input file in the `hdfs` file system and fill it with some text, then run it with the following command.
```shell
flink run WordCount.jar -input hdfs://master:8020/flink_wordcount/input.txt -output hdfs://master:8020/flink_wordcount/output.txt
```
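Before that run, the input file was staged in `hdfs`; a sketch of this step (the local file name is an assumption):

```shell
hdfs dfs -mkdir -p /flink_wordcount
hdfs dfs -put input.txt /flink_wordcount/input.txt
```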
![image-20240605205713427](大数据实时数据分析综合实践/image-20240605205713427.png)
After the program finishes, query its output in the `hdfs` file system.
![image-20240605205752687](大数据实时数据分析综合实践/image-20240605205752687.png)
A newly completed job also appears in the monitoring UI:
![image-20240605205844398](大数据实时数据分析综合实践/image-20240605205844398.png)
### Kafka and Flink
`Kafka` is again installed via `Docker`; the modified `dockerfile` is as follows:
```dockerfile
FROM archlinux:latest
# Install necessary dependencies
RUN echo 'Server = https://mirrors.cernet.edu.cn/archlinux/$repo/os/$arch' > /etc/pacman.d/mirrorlist
RUN pacman -Sy --noconfirm openssh jdk8-openjdk which inetutils
# Setting JAVA_HOME env
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk
# Configuring SSH login
RUN echo 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCyyLt1bsAlCcadB2krSCDr0JP8SrF7EsUM+Qiv3m+V10gIBoCBFEh9iwpVN1UMioK8qdl9lm+LK22RW+IU6RjW+zyPB7ui3LlG0bk5H4g9v7uXH/+/ANfiJI2/2+Q4gOQAsRR+7kOpGemeKnFGJMgxnndSCpgYI4Is9ydAFzcQcGgxVB2mTGT6siufJb77tWKxrVzGn60ktdRxfwqct+2Nt88GTGw7eGJfMQADX1fVt9490M3G3x2Kw9KweXr2m+qr1yCRAlt3WyNHoNOXVhrF41/YgwGe0sGJd+kXBAdM2nh2xa0ZZPUGFkAp4MIWBDbycleRCeLUpCHFB0bt2D82BhF9luCeTXtpLyDym1+PS+OLZ3NDcvztBaH8trsgH+RkUc2Bojo1J4W9NoiEWsHGlaziWgF6L3z1vgesDPboxd0ol6EhKVX+QjxA9XE79IT4GidHxDwqonJz/dHXwjilqqmI4TEHndVWhJN0GV47a63+YCK02VAZ2mOA3aw/7LE= ricardo@magicbook-14' >> /root/.ssh/authorized_keys
COPY id_big_data /root/.ssh/id_rsa
RUN echo 'Host *' >> /etc/ssh/ssh_config && echo ' StrictHostKeyChecking no' >> /etc/ssh/ssh_config
# Install Hadoop
ADD hadoop-3.3.6.tar.gz /opt/
RUN mv /opt/hadoop-3.3.6 /opt/hadoop && \
chmod -R 777 /opt/hadoop
# Configure Hadoop
ENV HADOOP_HOME=/opt/hadoop
RUN echo "slave1" >> $HADOOP_HOME/etc/hadoop/workers
RUN echo "slave2" >> $HADOOP_HOME/etc/hadoop/workers
RUN echo "slave3" >> $HADOOP_HOME/etc/hadoop/workers
RUN mkdir $HADOOP_HOME/tmp
ENV HADOOP_TMP_DIR=$HADOOP_HOME/tmp
RUN mkdir $HADOOP_HOME/namenode
RUN mkdir $HADOOP_HOME/datanode
ENV HADOOP_CONFIG_HOME=$HADOOP_HOME/etc/hadoop
ENV PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
ENV HADOOP_CLASSPATH=$HADOOP_HOME/share/hadoop/tools/lib/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_CLASSPATH
ENV HDFS_NAMENODE_USER="root"
ENV HDFS_DATANODE_USER="root"
ENV HDFS_SECONDARYNAMENODE_USER="root"
ENV YARN_RESOURCEMANAGER_USER="root"
ENV YARN_NODEMANAGER_USER="root"
COPY hadoop_config/* $HADOOP_HOME/etc/hadoop/
RUN sed -i '1i export JAVA_HOME=/usr/lib/jvm/java-8-openjdk' $HADOOP_HOME/etc/hadoop/hadoop-env.sh
# Install zookeeper
ADD apache-zookeeper-3.9.2-bin.tar.gz /opt/
RUN mv /opt/apache-zookeeper-3.9.2-bin /opt/zookeeper && \
chmod -R 777 /opt/zookeeper
# Configure zookeeper
ENV ZOOKEEPER_HOME=/opt/zookeeper
ENV PATH=$ZOOKEEPER_HOME/bin:$PATH
RUN mkdir $ZOOKEEPER_HOME/tmp
COPY zookeeper_config/* $ZOOKEEPER_HOME/conf/
# Install flink
ADD flink-1.13.6-bin-scala_2.11.tgz /opt/
RUN mv /opt/flink-1.13.6 /opt/flink && \
chmod -R 777 /opt/flink
# Add hadoop library
ADD commons-cli-1.4.jar /opt/flink/lib/
ADD flink-shaded-hadoop-3-uber-3.1.1.7.2.1.0-327-9.0.jar /opt/flink/lib/
# Configure flink
ENV FLINK_HOME=/opt/flink
ENV PATH=$FLINK_HOME/bin:$PATH
COPY flink_conf/* $FLINK_HOME/conf/
# Install kafka
ADD kafka_2.12-1.0.2.tgz /opt/
RUN mv /opt/kafka_2.12-1.0.2 /opt/kafka/ && \
chmod -R 777 /opt/kafka
# Configure kafka
ENV KAFKA_HOME=/opt/kafka
ENV PATH=$KAFKA_HOME/bin:$PATH
COPY run.sh /run.sh
CMD [ "/run.sh" ]
```
Rebuild the image and restart the cluster. After the restart, first start the `hdfs` system, then start `zookeeper` on each node.
![image-20240608093430593](大数据实时数据分析综合实践/image-20240608093430593.png)
Then try starting `kafka` on each node; the relevant `kafka` process can be found on every node.
![image-20240608093837789](大数据实时数据分析综合实践/image-20240608093837789.png)
![image-20240608094045724](大数据实时数据分析综合实践/image-20240608094045724.png)
![image-20240608094105014](大数据实时数据分析综合实践/image-20240608094105014.png)
![image-20240608094123108](大数据实时数据分析综合实践/image-20240608094123108.png)
After verifying that `kafka` starts successfully on every node, shut `kafka` down again on all of them.
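A sketch of the per-node check and shutdown, assuming the Kafka layout set up in the Dockerfile above:

```shell
# Start the broker in the background, check the process, then stop it again
kafka-server-start.sh -daemon $KAFKA_HOME/config/server.properties
jps
kafka-server-stop.sh
```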
The `zookeeper` bundled with `flink` is used in place of the original `zookeeper`; before starting it, first stop the original `zookeeper` running on the `master` node.
![image-20240608134552985](大数据实时数据分析综合实践/image-20240608134552985.png)
Start the `kafka` server on the master node:
![image-20240608134751480](大数据实时数据分析综合实践/image-20240608134751480-1717825672949-1.png)
Create a `kafka topic` named `test` with the corresponding command.
![image-20240608135424451](大数据实时数据分析综合实践/image-20240608135424451.png)
After the topic is created, open the `kafka` console producer, specifying the `test` topic created above.
![image-20240608135509295](大数据实时数据分析综合实践/image-20240608135509295.png)
![image-20240608135516378](大数据实时数据分析综合实践/image-20240608135516378.png)
Messages can now be produced by typing input.
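The producer session shown above was opened with a command along these lines (the broker address is an assumption):

```shell
kafka-console-producer.sh --broker-list master:9092 --topic test
```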
Package the corresponding program with `IDEA`, upload it into the `docker` container, and run it.
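A sketch of this step (the container and jar names are assumptions):

```shell
# Copy the packaged job into the master container and submit it to the Flink cluster
docker cp KafkaWordCount.jar master:/root/
docker exec -it master flink run /root/KafkaWordCount.jar
```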
![image-20240608142912073](大数据实时数据分析综合实践/image-20240608142912073.png)
Go back to the producer terminal opened above and type some text.
![image-20240608143028705](大数据实时数据分析综合实践/image-20240608143028705.png)
The program's corresponding count output can then be seen on the web UI.
![image-20240608143037390](大数据实时数据分析综合实践/image-20240608143037390.png)
## Problems Encountered
### Unable to Create a Topic
Creating a `kafka topic` with the command given in the lab manual produces an error:
```shell
./bin/kafka-topics.sh --create --bootstrap-server master:2181 --replication-factor 1 --partitions 1 --topic test
```
![image-20240608152602933](大数据实时数据分析综合实践/image-20240608152602933.png)
After searching online and checking the chat history of the course group, this was confirmed to be an issue with the `kafka` version used in the experiment; the command above needs to be changed to:
```shell
./bin/kafka-topics.sh --create --zookeeper master:2181 --replication-factor 1 --partitions 1 --topic test
```
### NoClassDefFoundError When Running the Packaged Java Program
Based on the error message and online references, adding the missing packages when configuring the `jar` packaging solves the problem:
![image-20240608153706724](大数据实时数据分析综合实践/image-20240608153706724.png)

View File

@ -0,0 +1,331 @@
# Big Data Technology Fundamentals: Experiment 3
## Experiment Objectives
Master the installation and use of `HBase` and `zookeeper`, and use `MapReduce` to bulk-load the data from an `HBase` table into `HDFS`. The experiment provides quick hands-on familiarity with how the `HBase` database is applied in distributed computing and with reading `HBase` data through the Java API.
## Experiment Procedure
### Docker Configuration
Since the lab manual for this experiment recommends configuring `hbase` with `docker`, the `Dockerfile` and `docker-compose.yml` used in the previous experiments were modified, following the manual, to support `hbase` and `zookeeper`.
The adjusted files are shown below.
```dockerfile
FROM archlinux:latest
# Install necessary dependencies
RUN echo 'Server = https://mirrors.tuna.tsinghua.edu.cn/archlinux/$repo/os/$arch' > /etc/pacman.d/mirrorlist
RUN echo 'Server = https://mirrors.ustc.edu.cn/archlinux/$repo/os/$arch' >> /etc/pacman.d/mirrorlist
RUN echo 'Server = https://mirrors.aliyun.com/archlinux/$repo/os/$arch' >> /etc/pacman.d/mirrorlist
RUN pacman -Sy --noconfirm openssh jdk8-openjdk which inetutils
# Setting JAVA_HOME env
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk
# Configuring SSH login
RUN echo 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCyyLt1bsAlCcadB2krSCDr0JP8SrF7EsUM+Qiv3m+V10gIBoCBFEh9iwpVN1UMioK8qdl9lm+LK22RW+IU6RjW+zyPB7ui3LlG0bk5H4g9v7uXH/+/ANfiJI2/2+Q4gOQAsRR+7kOpGemeKnFGJMgxnndSCpgYI4Is9ydAFzcQcGgxVB2mTGT6siufJb77tWKxrVzGn60ktdRxfwqct+2Nt88GTGw7eGJfMQADX1fVt9490M3G3x2Kw9KweXr2m+qr1yCRAlt3WyNHoNOXVhrF41/YgwGe0sGJd+kXBAdM2nh2xa0ZZPUGFkAp4MIWBDbycleRCeLUpCHFB0bt2D82BhF9luCeTXtpLyDym1+PS+OLZ3NDcvztBaH8trsgH+RkUc2Bojo1J4W9NoiEWsHGlaziWgF6L3z1vgesDPboxd0ol6EhKVX+QjxA9XE79IT4GidHxDwqonJz/dHXwjilqqmI4TEHndVWhJN0GV47a63+YCK02VAZ2mOA3aw/7LE= ricardo@magicbook-14' >> /root/.ssh/authorized_keys
COPY id_big_data /root/.ssh/id_rsa
RUN echo 'Host *' >> /etc/ssh/ssh_config && echo ' StrictHostKeyChecking no' >> /etc/ssh/ssh_config
# Install Hadoop
ADD hadoop-3.3.6.tar.gz /opt/
RUN mv /opt/hadoop-3.3.6 /opt/hadoop && \
chmod -R 777 /opt/hadoop
# Configure Hadoop
ENV HADOOP_HOME=/opt/hadoop
RUN echo "slave1" >> $HADOOP_HOME/etc/hadoop/workers
RUN echo "slave2" >> $HADOOP_HOME/etc/hadoop/workers
RUN echo "slave3" >> $HADOOP_HOME/etc/hadoop/workers
RUN mkdir $HADOOP_HOME/tmp
ENV HADOOP_TMP_DIR=$HADOOP_HOME/tmp
RUN mkdir $HADOOP_HOME/namenode
RUN mkdir $HADOOP_HOME/datanode
ENV HADOOP_CONFIG_HOME=$HADOOP_HOME/etc/hadoop
ENV PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
ENV HADOOP_CLASSPATH=$HADOOP_HOME/share/hadoop/tools/lib/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_CLASSPATH
ENV HDFS_NAMENODE_USER="root"
ENV HDFS_DATANODE_USER="root"
ENV HDFS_SECONDARYNAMENODE_USER="root"
ENV YARN_RESOURCEMANAGER_USER="root"
ENV YARN_NODEMANAGER_USER="root"
COPY hadoop_config/* $HADOOP_HOME/etc/hadoop/
RUN sed -i '1i export JAVA_HOME=/usr/lib/jvm/java-8-openjdk' $HADOOP_HOME/etc/hadoop/hadoop-env.sh
# Install zookeeper
ADD apache-zookeeper-3.9.2-bin.tar.gz /opt/
RUN mv /opt/apache-zookeeper-3.9.2-bin /opt/zookeeper && \
chmod -R 777 /opt/zookeeper
# Configure zookeeper
ENV ZOOKEEPER_HOME=/opt/zookeeper
ENV PATH=$ZOOKEEPER_HOME/bin:$PATH
RUN mkdir $ZOOKEEPER_HOME/tmp
COPY zookeeper_config/* $ZOOKEEPER_HOME/conf/
# Install hbase
ADD hbase-2.5.8-bin.tar.gz /opt/
RUN mv /opt/hbase-2.5.8 /opt/hbase && \
chmod -R 777 /opt/hbase
# Configure hbase
ENV HBASE_HOME=/opt/hbase
ENV PATH=$HBASE_HOME/bin:$HBASE_HOME/sbin:$PATH
COPY hbase_config/* $HBASE_HOME/conf/
RUN echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk" >> $HBASE_HOME/conf/hbase-env.sh
RUN echo "export HBASE_MANAGES_ZK=false" >> $HBASE_HOME/conf/hbase-env.sh
RUN echo "export HBASE_LIBRARY_PATH=/opt/hadoop/lib/native" >> $HBASE_HOME/conf/hbase-env.sh
RUN echo 'export HBASE_DISABLE_HADOOP_CLASSPATH_LOOKUP="true"' >> $HBASE_HOME/conf/hbase-env.sh
COPY run.sh /run.sh
ENTRYPOINT [ "/run.sh" ]
```
The `docker-compose` file used to start the cluster is as follows:
```yaml
version: '3.8'
services:
master:
hostname: rcj-2021211180-master
image: registry.cn-beijing.aliyuncs.com/jackfiled/hadoop-cluster
command:
- "1"
networks:
hadoop-network:
ipv4_address: 172.126.1.111
slave1:
hostname: rcj-2021211180-slave1
image: registry.cn-beijing.aliyuncs.com/jackfiled/hadoop-cluster
command:
- "2"
networks:
hadoop-network:
ipv4_address: 172.126.1.112
slave2:
hostname: rcj-2021211180-slave2
image: registry.cn-beijing.aliyuncs.com/jackfiled/hadoop-cluster
command:
- "3"
networks:
hadoop-network:
ipv4_address: 172.126.1.113
slave3:
hostname: rcj-2021211180-slave3
image: registry.cn-beijing.aliyuncs.com/jackfiled/hadoop-cluster
command:
- "4"
networks:
hadoop-network:
ipv4_address: 172.126.1.114
networks:
hadoop-network:
driver: bridge
ipam:
config:
- subnet: 172.126.1.0/24
```
With these changes, the cluster essentially starts with a single command and needs no further adjustment after startup.
### Starting the Containers
First run
```shell
docker compose up -d
```
to start all 4 containers needed for the experiment.
Then enter the `master` container and start `hdfs`. The startup steps are not repeated here; simply use
```shell
hdfs dfsadmin -report
```
to verify that the startup is correct.
![image-20240520160735453](实验三实验报告/image-20240520160735453.png)
The reported information confirms that every node started normally.
Next, start `zookeeper` on each node with `zkServer.sh start`, then verify with `zkServer.sh status`; here the `zookeeper` on my `master` node was elected as a `follower`.
![image-20240520160903389](实验三实验报告/image-20240520160903389.png)
Finally, start `hbase`, then use `jps` to check the number of Java processes in each container.
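A sketch of this final startup step, assuming the paths configured in the Dockerfile above:

```shell
# On the master node: start HBase (HMaster here, HRegionServer on the workers)
start-hbase.sh
jps
```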
First, the `master` node:
![image-20240520161006241](实验三实验报告/image-20240520161006241.png)
Then one of the worker nodes:
![image-20240520161027509](实验三实验报告/image-20240520161027509.png)
### HBase Practice
First use `hbase shell` to enter the interactive shell and run the following commands to insert sample data:
```sql
create '2021211180_rcj', 'cf1'
put '2021211180_rcj', '2021211180_rcj_001', 'cf1:keyword', 'Honor 20'
put '2021211180_rcj', '2021211180_rcj_002', 'cf1:keyword', 'Galaxy S21'
put '2021211180_rcj', '2021211180_rcj_003', 'cf1:keyword', 'Xiaomi 14'
```
Check the current contents of the table:
![image-20240520161549869](实验三实验报告/image-20240520161549869.png)
### Writing a Program to Read HBase
The code is written following the code given in the lab manual.
First comes the `MemberMapper` class. It does work similar to the `Mapper` class we wrote when using `MapReduce` in Experiment 2, except that the data it operates on is no longer files in `HDFS` but data in the `HBase` database. In the code below, we read every row of the table, concatenate its data into a string, and write it out to the file system.
```java
package org.rcj2021211180.inputSource;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import java.io.IOException;
public class MemberMapper extends TableMapper<Writable, Writable> {
public static final String FieldCommonSeparator = "\u0001";
@Override
protected void setup(Context context) throws IOException, InterruptedException {
}
@Override
protected void map(ImmutableBytesWritable row, Result columns, Context context) throws IOException, InterruptedException {
String key = new String(row.get());
Text keyValue = new Text(key);
try {
for (Cell cell : columns.listCells()) {
String value = Bytes.toStringBinary(cell.getValueArray());
String columnFamily = Bytes.toString(cell.getFamilyArray());
String columnQualifier = Bytes.toString(cell.getQualifierArray());
long timestamp = cell.getTimestamp();
Text valueValue = new Text(columnFamily + FieldCommonSeparator +
columnQualifier + FieldCommonSeparator +
value + FieldCommonSeparator + timestamp);
context.write(keyValue, valueValue);
}
} catch (Exception e) {
e.printStackTrace();
System.err.println("Error: " + e.getMessage());
}
}
}
```
Next is the program's main class `Main`. In the main class we set up and launch the `MapReduce` `Job`, which reads data from the specified table, processes it according to the logic of the `Mapper` class above, and writes the output to the file system.
```java
package org.rcj2021211180.inputSource;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import java.io.IOException;
public class Main {
private static final String tableName = "2021211180_rcj";
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration config = HBaseConfiguration.create();
// Create the scan object
Scan scan = new Scan();
scan.setBatch(0);
scan.setCaching(1000);
scan.setMaxVersions();
// Set the time range for the scan
scan.setTimeRange(System.currentTimeMillis() - 3 * 24 * 2600 * 1000L, System.currentTimeMillis());
// Set the column family and qualifier to scan
scan.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("keyword"));
config.setBoolean("mapred.map.tasks.speculative.execution", false);
config.setBoolean("mapred.reduce.tasks.speculative.execution", false);
Path tmpIndexPath = new Path("hdfs://master:8020/tmp/" + tableName);
FileSystem fs = FileSystem.get(config);
if (fs.exists(tmpIndexPath)) {
fs.delete(tmpIndexPath, true);
}
Job job = new Job(config, "MemberTest1");
job.setJarByClass(Main.class);
TableMapReduceUtil.initTableMapperJob(tableName, scan, MemberMapper.class, Text.class, Text.class, job);
job.setNumReduceTasks(0);
job.setOutputFormatClass(TextOutputFormat.class);
FileOutputFormat.setOutputPath(job, tmpIndexPath);
boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);
}
}
```
### Running the Program in Docker
Compile the code into a `jar` in IDEA, then use `docker cp` to copy the built `jar` into the container:
![image-20240520204246024](实验三实验报告/image-20240520204246024.png)
Run the job on the `master` node:
![image-20240520204314842](实验三实验报告/image-20240520204314842.png)
After the job completes, inspect the result with the `hdfs` tools:
![image-20240520204401651](实验三实验报告/image-20240520204401651.png)
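The commands behind these screenshots were along the following lines (the jar name is an assumption; the main class and output path come from the source code above):

```shell
# HBase classes need to be on the MapReduce classpath
export HADOOP_CLASSPATH=$(hbase classpath):$HADOOP_CLASSPATH
# Submit the map-only job and inspect its output directory on HDFS
hadoop jar BigData.jar org.rcj2021211180.inputSource.Main
hdfs dfs -cat /tmp/2021211180_rcj/part-m-*
```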
## Project Source Code
### Structure of the Source Directory
![image-20240522002134265](实验三实验报告/image-20240522002134265.png)
The extra items in the source directory are the source code of Experiments 1 and 2.
### Contents of the Package
![image-20240522002249443](实验三实验报告/image-20240522002249443.png)

View File

@ -0,0 +1,141 @@
# Big Data Experiment 2 Report
<center><div style='height:2mm;'></div><div style="font-family:华文楷体;font-size:14pt;">任昌骏 2021211180</div></center>
## Experiment Description
In this experiment, `IDEA` is used to build a big data project; a program is written in Java and run on the cluster to complete a word-count task. The `WordCount` project is first created and the program written locally; the program is then packaged and uploaded to the previously built `hadoop` cluster to run.
## Experiment Objectives
1. Learn how to build a big data project with `IDEA`
2. Become familiar with writing big data programs in Java
3. Understand how `MapReduce` works
4. Learn how to run `hadoop` programs on a cluster
## Experiment Environment
The `Hadoop` environment is a `Hadoop 2.7.7` setup built with `Docker`, with `archlinux` as the base system image. `hdfs dfsadmin -report` is used to make sure the cluster is running correctly.
![image-20240515162247831](实验二实验报告/image-20240515162247831.png)
The `JDK` version used is `1.8.0_412`, and the `IDEA` version is 2024.1.1.
## Experiment Steps
### Creating the Project and Writing the Code
Create a `Maven` project in `IDEA`, add the relevant dependencies, and create the `log4j.xml` configuration file. Create the main class of this experiment, `WordCount`, and write the corresponding logic.
```java
package top.rrricardo;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import java.io.IOException;
import java.util.StringTokenizer;
public class WordCount {
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private final Text word = new Text();
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
StringTokenizer tokenizer = new StringTokenizer(value.toString());
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private final IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable value : values) {
sum += value.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration configuration = new Configuration();
String[] otherArgs = new GenericOptionsParser(configuration, args).getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: WordCount <in> <out>");
System.exit(-1);
}
Job job = new Job(configuration, "Word Counter");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : -1);
}
}
```
The code above actually consists of three classes: the main class `WordCount` and its two inner classes `TokenizerMapper` and `IntSumReducer`. The main word-count logic lives in the two inner classes, which extend `Mapper` and `Reducer` respectively; these correspond to the two core phases of the `MapReduce` framework, which splits a task into many small tasks and then merges the results of those small tasks into the final result. Here, the `Map` phase identifies the individual words in a string and emits a count for each word, while the `Reducer` collects these results to obtain the word counts for the whole text. The main class parses the user's input and launches the corresponding job.
### Packaging and Uploading to the Cluster
First, add a new artifact (`Artifacts`) in `IDEA`, following the steps in the lab manual.
![image-20240515163730035](实验二实验报告/image-20240515163730035.png)
Then run the build command to package it into an actual `JAR`.
![image-20240515163818616](实验二实验报告/image-20240515163818616.png)
Use `docker cp` to upload the packaged `JAR` to the cluster.
![image-20240515163903962](实验二实验报告/image-20240515163903962.png)
Also upload a text document as the input for the word-count task in the next step; here we simply use our project's source file `WordCount.java`.
![image-20240515164013153](实验二实验报告/image-20240515164013153.png)
### Running the Word-Count Task
Enter the cluster's master node. Since both the input and output files of `MapReduce` must live in the `hdfs` file system, first use a command to upload the input text file to `hdfs`.
First create a `lab2` directory to keep things separate:
![image-20240515164259679](实验二实验报告/image-20240515164259679.png)
Then upload our input text file into that directory:
![image-20240515164356891](实验二实验报告/image-20240515164356891.png)
Run the word-count program, with `/lab2/WordCount.java` as the job's input file and `/lab2/output/` as its output directory.
![image-20240515164558443](实验二实验报告/image-20240515164558443.png)
Check the result of the job with a command:
![image-20240515164651664](实验二实验报告/image-20240515164651664.png)
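The screenshots in this section correspond roughly to the following commands (the jar name is an assumption):

```shell
hdfs dfs -mkdir /lab2
hdfs dfs -put WordCount.java /lab2/
hadoop jar WordCount.jar top.rrricardo.WordCount /lab2/WordCount.java /lab2/output/
hdfs dfs -cat /lab2/output/part-r-00000
```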
This shows the job ran successfully.

View File

@ -0,0 +1,418 @@
# Experiment 4 Report
## Experiment Objectives
- Understand the process of configuring the servers
- Become familiar with writing `Spark` programs in `Scala`
- Understand how `Spark RDD` works
- Learn how to run programs on a `Spark` cluster
- Learn how to read a database with `Spark SQL`
## Experiment Steps
### Installing Spark
Installation is again done directly with `docker`, with the installation steps written into the `Dockerfile`, so the modified `Dockerfile` is given first:
```dockerfile
FROM archlinux:latest
# Install necessary dependencies
RUN echo 'Server = https://mirrors.tuna.tsinghua.edu.cn/archlinux/$repo/os/$arch' > /etc/pacman.d/mirrorlist
RUN echo 'Server = https://mirrors.ustc.edu.cn/archlinux/$repo/os/$arch' >> /etc/pacman.d/mirrorlist
RUN echo 'Server = https://mirrors.aliyun.com/archlinux/$repo/os/$arch' >> /etc/pacman.d/mirrorlist
RUN pacman -Sy --noconfirm openssh jdk8-openjdk which inetutils
# Setting JAVA_HOME env
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk
# Configuring SSH login
RUN echo 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCyyLt1bsAlCcadB2krSCDr0JP8SrF7EsUM+Qiv3m+V10gIBoCBFEh9iwpVN1UMioK8qdl9lm+LK22RW+IU6RjW+zyPB7ui3LlG0bk5H4g9v7uXH/+/ANfiJI2/2+Q4gOQAsRR+7kOpGemeKnFGJMgxnndSCpgYI4Is9ydAFzcQcGgxVB2mTGT6siufJb77tWKxrVzGn60ktdRxfwqct+2Nt88GTGw7eGJfMQADX1fVt9490M3G3x2Kw9KweXr2m+qr1yCRAlt3WyNHoNOXVhrF41/YgwGe0sGJd+kXBAdM2nh2xa0ZZPUGFkAp4MIWBDbycleRCeLUpCHFB0bt2D82BhF9luCeTXtpLyDym1+PS+OLZ3NDcvztBaH8trsgH+RkUc2Bojo1J4W9NoiEWsHGlaziWgF6L3z1vgesDPboxd0ol6EhKVX+QjxA9XE79IT4GidHxDwqonJz/dHXwjilqqmI4TEHndVWhJN0GV47a63+YCK02VAZ2mOA3aw/7LE= ricardo@magicbook-14' >> /root/.ssh/authorized_keys
COPY id_big_data /root/.ssh/id_rsa
RUN echo 'Host *' >> /etc/ssh/ssh_config && echo ' StrictHostKeyChecking no' >> /etc/ssh/ssh_config
# Install Hadoop
ADD hadoop-3.3.6.tar.gz /opt/
RUN mv /opt/hadoop-3.3.6 /opt/hadoop && \
chmod -R 777 /opt/hadoop
# Configure Hadoop
ENV HADOOP_HOME=/opt/hadoop
RUN echo "slave1" >> $HADOOP_HOME/etc/hadoop/workers
RUN echo "slave2" >> $HADOOP_HOME/etc/hadoop/workers
RUN echo "slave3" >> $HADOOP_HOME/etc/hadoop/workers
RUN mkdir $HADOOP_HOME/tmp
ENV HADOOP_TMP_DIR=$HADOOP_HOME/tmp
RUN mkdir $HADOOP_HOME/namenode
RUN mkdir $HADOOP_HOME/datanode
ENV HADOOP_CONFIG_HOME=$HADOOP_HOME/etc/hadoop
ENV PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
ENV HADOOP_CLASSPATH=$HADOOP_HOME/share/hadoop/tools/lib/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_CLASSPATH
ENV HDFS_NAMENODE_USER="root"
ENV HDFS_DATANODE_USER="root"
ENV HDFS_SECONDARYNAMENODE_USER="root"
ENV YARN_RESOURCEMANAGER_USER="root"
ENV YARN_NODEMANAGER_USER="root"
COPY hadoop_config/* $HADOOP_HOME/etc/hadoop/
RUN sed -i '1i export JAVA_HOME=/usr/lib/jvm/java-8-openjdk' $HADOOP_HOME/etc/hadoop/hadoop-env.sh
# Install zookeeper
ADD apache-zookeeper-3.9.2-bin.tar.gz /opt/
RUN mv /opt/apache-zookeeper-3.9.2-bin /opt/zookeeper && \
chmod -R 777 /opt/zookeeper
# Configure zookeeper
ENV ZOOKEEPER_HOME=/opt/zookeeper
ENV PATH=$ZOOKEEPER_HOME/bin:$PATH
RUN mkdir $ZOOKEEPER_HOME/tmp
COPY zookeeper_config/* $ZOOKEEPER_HOME/conf/
# Install hbase
ADD hbase-2.5.8-bin.tar.gz /opt/
RUN mv /opt/hbase-2.5.8 /opt/hbase && \
chmod -R 777 /opt/hbase
# Configure hbase
ENV HBASE_HOME=/opt/hbase
ENV PATH=$HBASE_HOME/bin:$HBASE_HOME/sbin:$PATH
COPY hbase_config/* $HBASE_HOME/conf/
RUN echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk" >> $HBASE_HOME/conf/hbase-env.sh
RUN echo "export HBASE_MANAGES_ZK=false" >> $HBASE_HOME/conf/hbase-env.sh
RUN echo "export HBASE_LIBRARY_PATH=/opt/hadoop/lib/native" >> $HBASE_HOME/conf/hbase-env.sh
RUN echo 'export HBASE_DISABLE_HADOOP_CLASSPATH_LOOKUP="true"' >> $HBASE_HOME/conf/hbase-env.sh
# Install spark
ADD spark-3.5.1-bin-hadoop3-scala2.13.tgz /opt/
RUN mv /opt/spark-3.5.1-bin-hadoop3-scala2.13 /opt/spark && \
chmod -R 777 /opt/spark
# Configure spark
ENV SPARK_HOME=/opt/spark
ENV PATH=$SPARK_HOME/bin:$PATH
ENV HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
ENV YARN_CONF_DIR=/opt/hadoop/etc/hadoop
RUN mv /opt/spark/conf/spark-env.sh.template /opt/spark/conf/spark-env.sh && \
echo 'export SPARK_DIST_CLASSPATH=$(/opt/hadoop/bin/hadoop classpath)' >> /opt/spark/conf/spark-env.sh && \
touch /opt/spark/conf/workers && \
echo "master" >> /opt/spark/conf/workers && \
echo "slave1" >> /opt/spark/conf/workers && \
echo "slave2" >> /opt/spark/conf/workers && \
echo "slave3" >> /opt/spark/conf/workers
COPY run.sh /run.sh
```
Start the containers normally and bring up the `hadoop` cluster following the steps given in Experiment 1; first verify that the `hadoop` cluster started successfully.
```shell
yarn node -list
```
![image-20240526135317455](实验四实验报告/image-20240526135317455.png)
```shell
hdfs dfs -ls /
```
![image-20240526135337986](实验四实验报告/image-20240526135337986.png)
Then start the `spark` cluster and confirm that it started successfully.
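A sketch of this step, assuming the Spark layout from the Dockerfile above:

```shell
# Start the Spark master and the workers listed in conf/workers, then check the local processes
/opt/spark/sbin/start-all.sh
jps
```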
![image-20240526135424472](实验四实验报告/image-20240526135424472.png)
Then use `spark-shell` to verify that `spark` is available and working.
![image-20240526135656161](实验四实验报告/image-20240526135656161.png)
Being able to run the sample program in the interactive shell shows that `spark` was installed and started correctly.
### Writing a Program for the Word-Count Task
Create a `Scala` project that uses `Spark` as described in the lab manual, and write the word-count program in it. Then, following the manual, compile and package the program into a `jar`.
The program is as follows:
```scala
package top.rcj2021211180
import org.apache.spark.{SparkConf, SparkContext}
class ScalaWordCount {
}
object ScalaWordCount {
def main(args : Array[String]): Unit = {
val list = List("hello hi hi spark",
"hello spark hello hi sparksal",
"hello hi hi sparkstreaming",
"hello hi sparkgraphx")
val sparkConf = new SparkConf().setAppName("word-count").setMaster("yarn")
val sc = new SparkContext(sparkConf)
val lines = sc.parallelize(list)
val words = lines.flatMap((line: String) => {
line.split(" ")
})
val wordOne = words.map((word: String) => {
(word, 1)
})
val wordAndNum = wordOne.reduceByKey((count1: Int, count2: Int) => {
count1 + count2
})
val ret = wordAndNum.sortBy(kv => kv._2, false)
print(ret.collect().mkString(","))
ret.saveAsTextFile(path = "hdfs://master:8020/spark-test")
sc.stop()
}
}
```
Run it with the following command:
```shell
spark-submit --class top.rcj2021211180.ScalaWordCount --master yarn --num-executors 3 --driver-memory 1g --executor-memory 1g --executor-cores 1 BigData.jar
```
Check the result of the run:
![image-20240526152418929](实验四实验报告/image-20240526152418929.png)
### Writing a Standalone RDD Application for Data Deduplication
Following the lab manual, write the code below:
```scala
package top.rcj2021211180
import org.apache.spark.{SparkConf, SparkContext}
class ScalaDuplicateRemove {
}
object ScalaDuplicateRemove {
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf().setAppName("Scala Duplicate Remove").setMaster("local")
val sc = new SparkContext(sparkConf)
val basePath = "/duplicateFiles/"
val linesA = sc.textFile(basePath + "A.txt")
val linesB = sc.textFile(basePath + "B.txt")
val lines = linesA.union(linesB).distinct().sortBy(identity)
lines.saveAsTextFile(basePath + "C.txt")
sc.stop()
}
}
```
Package it and upload it to the cluster to run, in the same way as the previous packaging and run.
Test with the sample given in the lab manual: first upload the two given files `A.txt` and `B.txt` to the `HDFS` file system.
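Staging the sample files can be done roughly as follows (the local file names are assumptions); once the job has finished, its output lands under `/duplicateFiles/C.txt/`:

```shell
hdfs dfs -mkdir -p /duplicateFiles
hdfs dfs -put A.txt B.txt /duplicateFiles/
# After the job completes:
hdfs dfs -cat /duplicateFiles/C.txt/part-*
```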
![image-20240526160922326](实验四实验报告/image-20240526160922326.png)
Run the Spark program:
```shell
spark-submit --class top.rcj2021211180.ScalaDuplicateRemove --master yarn --num-executors 3 --driver-memory 1g --executor-memory 1g --executor-cores 1 BigData.jar
```
![image-20240526161121927](实验四实验报告/image-20240526161121927.png)
Check the result of the run:
![image-20240526161308849](实验四实验报告/image-20240526161308849.png)
### Reading and Writing a Database with Spark SQL
To let `spark` access the `Mysql` database, the `Mysql` `JDBC Connector` has to be added to `spark`, so the corresponding `jar` is added directly in the `Dockerfile`.
```dockerfile
# Add Mysql JDBC Connector
COPY mysql-connector-j-8.4.0.jar /opt/spark/jars/
```
Here `mysql` is started as its own container instead of being installed directly in the `master` container. The following `docker-compose.yml` file is used:
```yaml
version: '3.8'
services:
master:
hostname: rcj-2021211180-master
image: registry.cn-beijing.aliyuncs.com/jackfiled/hadoop-cluster
command:
- "/run.sh"
- "1"
networks:
hadoop-network:
ipv4_address: 172.126.1.111
slave1:
hostname: rcj-2021211180-slave1
image: registry.cn-beijing.aliyuncs.com/jackfiled/hadoop-cluster
command:
- "/run.sh"
- "2"
networks:
hadoop-network:
ipv4_address: 172.126.1.112
slave2:
hostname: rcj-2021211180-slave2
image: registry.cn-beijing.aliyuncs.com/jackfiled/hadoop-cluster
command:
- "/run.sh"
- "3"
networks:
hadoop-network:
ipv4_address: 172.126.1.113
slave3:
hostname: rcj-2021211180-slave3
image: registry.cn-beijing.aliyuncs.com/jackfiled/hadoop-cluster
command:
- "/run.sh"
- "4"
networks:
hadoop-network:
ipv4_address: 172.126.1.114
db:
image: mysql:8.0-debian
environment:
MYSQL_ROOT_PASSWORD: 12345678
networks:
hadoop-network:
networks:
hadoop-network:
driver: bridge
ipam:
config:
- subnet: 172.126.1.0/24
```
Restart the cluster.
After restarting the cluster with the added `mysql` container, first enter the `mysql` container to adjust the relevant database settings, create the database that `spark` will read and write, create the sample table, and insert two rows of sample data.
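A sketch of that preparation, run from the compose project directory (the table layout matches what the Scala code below expects; the exact DDL and sample values are assumptions):

```shell
docker compose exec db mysql -uroot -p12345678 -e "
  CREATE DATABASE spark;
  CREATE TABLE spark.student (id INT, name VARCHAR(32), gender CHAR(1), age INT);
  INSERT INTO spark.student VALUES (1, 'Alice', 'F', 23), (2, 'Bob', 'M', 24);"
```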
![image-20240526203703065](实验四实验报告/image-20240526203703065.png)
Enter the `Master` node and restart the `hadoop` cluster and the `spark` cluster.
After entering `spark-shell`, use the command from the lab manual to verify that `spark` can access the database:
```scala
val jdbcDP = spark.read.format("jdbc").
| option("url", "jdbc:mysql://db:3306/spark").
| option("driver", "com.mysql.cj.jdbc.Driver").
| option("dbtable", "student").
| option("user", "root").
| option("password", "12345678").
| load()
```
![image-20240526204428500](实验四实验报告/image-20240526204428500.png)
Note that once the database is started as a container, the address in the `JDBC` connection string has to be changed from `localhost` to the corresponding container's hostname `db`.
After verifying in `spark-shell` that the database can be read, write `scala` code that inserts more data into the database.
```scala
package top.rcj2021211180
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import java.util.Properties
class InsertStudent {
}
object InsertStudent {
def main(args : Array[String]): Unit = {
val spark = SparkSession.builder()
.appName("Insert Student")
.master("local")
.getOrCreate()
val sc = spark.sparkContext
val studentData = Array("3 Zhang M 26", "4 Liu M 27")
val studentRDD = sc.parallelize(studentData).map(_.split("\\s+"))
val scheme = StructType(List(
StructField("id", IntegerType, true),
StructField("name", StringType, true),
StructField("gender", StringType, true),
StructField("age", IntegerType, true)
))
val rowRDD = studentRDD.map(attr => Row(attr(0).toInt, attr(1), attr(2), attr(3).toInt))
val studentDF = spark.createDataFrame(rowRDD, scheme)
val jdbcUrl = "jdbc:mysql://db:3306/spark"
val connectionProperties = new Properties()
connectionProperties.put("user", "root")
connectionProperties.put("password", "12345678")
connectionProperties.put("driver", "com.mysql.cj.jdbc.Driver")
studentDF.write
.mode("append")
.jdbc(jdbcUrl, "spark.student", connectionProperties)
spark.stop()
}
}
```
Compile and package the program above, then upload it to the cluster and run it.
```shell
spark-submit --class top.rcj2021211180.InsertStudent --master yarn --num-executors 3 --driver-memory 1g --executor-memory 1g --executor-cores 1 BigData.jar
```
![image-20240526210634061](实验四实验报告/image-20240526210634061.png)
After the run, enter the database container and check the contents of the table:
![image-20240526210805499](实验四实验报告/image-20240526210805499.png)
## Bug List
### Unable to Start the Spark Worker on the Master Node
When starting the Spark workers with the `bash start-workers.sh` script, I found that it reported an error:
![image-20240526134241065](实验四实验报告/image-20240526134241065.png)
Running `jps` on the `master` node also showed that no `worker` had been started on that node.
![image-20240526134314946](实验四实验报告/image-20240526134314946.png)
Clearly the script failed when it tried to start the worker on the local node, but the remaining nodes of the `Spark` cluster still started correctly, and `spark-shell` could run computations normally.
Investigation showed that the contents of `spark/conf/workers` were wrong:
![image-20240526134655010](实验四实验报告/image-20240526134655010.png)
Fixing the command in the `Dockerfile` that writes this file resolved the problem.
After recreating the containers and starting the `spark` cluster again, the problem was gone.
![image-20240526135005780](实验四实验报告/image-20240526135005780.png)
### Failed to load class Error
When the packaged `jar` was uploaded to `spark` and run, it reported that the main class could not be found.
![image-20240526144735441](实验四实验报告/image-20240526144735441.png)
Unpacking the built `jar` showed that the `top.rcj2021211180.ScalaWordCount` class was indeed missing, which suggested a configuration error during compilation.
I recreated the project, ran a compilation before packaging, removed the `MANIFEST.MF` file after packaging as instructed by the lab manual, and uploaded the `jar` to the cluster again; this time the program ran without errors.

View File

@ -0,0 +1,206 @@
# Case Study Report: Identifying "Bonus Hunters" in the Financial Industry
## Experiment Objectives
Use China Mobile's Wutong big data platform to perform data mining based on machine learning algorithms and predict "bonus hunters" (羊毛党), thereby getting to know classification tasks and understanding how they work.
## Experiment Procedure
### Preparation
Log in to the Wutong Honghu big data hands-on platform and create the project needed for the experiment. When choosing a project template, select the general-purpose template, which makes data orchestration and data mining in the experiment convenient.
### Data Flow Orchestration
First, create the data flow for data processing.
![image-20240602193025602](金融行业羊毛党识别案例实践报告/image-20240602193025602.png)
After the data flow is created, arrange the operators on the data flow canvas. The orchestration mainly has to accomplish two things: extracting the training data and extracting the prediction data. It is divided into roughly three stages:
1. Extract the required data from the `HDFS` file system;
2. Process the extracted data, e.g. grouping and filtering, and split it into a training part and a prediction part;
3. Write the training data and the prediction data back to the `HDFS` file system for storage.
#### Stage 1: Data Extraction
![image-20240602222521946](金融行业羊毛党识别案例实践报告/image-20240602222521946.png)
In this stage, user information and user behavior are read from two files and joined into a single table keyed by the user's phone number. The output columns of the joined table are shown below:
![image-20240602222816004](金融行业羊毛党识别案例实践报告/image-20240602222816004.png)
#### Stage 2: Grouping and Filtering
![image-20240602222927716](金融行业羊毛党识别案例实践报告/image-20240602222927716.png)
This node splits the data into a training part and a prediction part: the training data comes from the January and February 2020 portion of the raw data, while the prediction data comes from the March and April 2020 portion. In this stage the data is also aggregated by phone number, computing the mean of the `chrg_amt` field and the sum of the `chgr_cnt` field, which are output together with the other fields for that phone number. For the training data, the user labels also need to be read from the file system.
After completing the steps above, the output columns of the last join operator in the training data flow are as shown below:
![image-20240602200516373](金融行业羊毛党识别案例实践报告/image-20240602200516373.png)
The output columns of the last join operator in the prediction data flow are as shown below:
![image-20240602223630319](金融行业羊毛党识别案例实践报告/image-20240602223630319.png)
#### Saving the Data
This stage saves the data extracted in the stages above as `csv` files in the `HDFS` file system.
The training data is saved at:
```
/srv/multi-tenant/midteant02/dev/user/p6_bupt/p6_bupt_102/etl_output/renchangjun_2021211180_train_data.csv
```
![image-20240602223827418](金融行业羊毛党识别案例实践报告/image-20240602223827418.png)
The prediction data is saved at:
```
/srv/multi-tenant/midteant02/dev/user/p6_bupt/p6_bupt_102/etl_output/renchangjun_2021211180_predict_data.csv
```
![image-20240602201404145](金融行业羊毛党识别案例实践报告/image-20240602201404145.png)
The final output columns of the prediction data are shown below:
![image-20240602201449404](金融行业羊毛党识别案例实践报告/image-20240602201449404.png)
### Online Debugging of the Data Flow
After the orchestration of the data flow is complete, open the online debugging page for the data flow.
![image-20240602201655899](金融行业羊毛党识别案例实践报告/image-20240602201655899.png)
Running the debug test shows that it succeeds.
![image-20240602201858001](金融行业羊毛党识别案例实践报告/image-20240602201858001.png)
Screenshots of the logs during the process are shown below.
First, the components start up:
![image-20240602202316839](金融行业羊毛党识别案例实践报告/image-20240602202316839.png)
Then the `Spark` job is submitted successfully:
![image-20240602202423160](金融行业羊毛党识别案例实践报告/image-20240602202423160.png)
Finally, all the data flow tasks complete:
![image-20240602202450489](金融行业羊毛党识别案例实践报告/image-20240602202450489.png)
### Data Mining
First create the interactive modeling instance needed for the data, selecting the `Pytorch-Tensorflow` engine when creating the model.
![image-20240602224304598](金融行业羊毛党识别案例实践报告/image-20240602224304598.png)
Then enter the `jupyter` notebook editing page and, following the lab manual, write code to train a binary classification model that can identify "bonus hunters" in the data.
```python
import pandas as pd  # advanced data structures and data analysis
from manas.dataset import mfile  # reading data files from HDFS
from sklearn.metrics import roc_auc_score, classification_report  # evaluating the model's predictions
from sklearn.model_selection import train_test_split  # splitting the training and test data sets
from xgboost import XGBClassifier  # building a classification model based on the XGBoost algorithm
train_path = "hdfs://cxq-ns1/srv/multi-tenant/midteant02/dev/user/p6_bupt/p6_bupt_102/etl_output/renchangjun_2021211180_train_data.csv"  # training data set location
file_r = mfile.mfile(train_path, "r")  # open the file
train_df = pd.read_csv(file_r, sep=',', header=0)  # read the data into a pandas DataFrame
file_r.close()  # close the stream once reading is done
predict_path = "hdfs://cxq-ns1/srv/multi-tenant/midteant02/dev/user/p6_bupt/p6_bupt_102/etl_output/renchangjun_2021211180_predict_data.csv"  # prediction data set location
file_r = mfile.mfile(predict_path, "r")  # open the file
predict_df = pd.read_csv(file_r, sep=',', header=0)  # read the data into a pandas DataFrame
file_r.close()  # close the stream once reading is done
train_df.fillna(0, inplace=True)  # fill missing values with 0
predict_df.fillna(0, inplace=True)  # fill missing values with 0
target_data = train_df[['phone', 'label']]  # extract the label data
del train_df['label']  # remove the label from the training data
train_data = train_df
train_data = train_data.set_index('phone')  # use phone as the DataFrame index
target_data = target_data.set_index('phone')  # use phone as the DataFrame index
predict_df = predict_df.set_index('phone')  # use phone as the DataFrame index
model = XGBClassifier(min_child_weight=1, max_depth=10, learning_rate=0.05, gamma=0.4, colsample_bytree=0.4)  # build the XGBoost binary classifier
x_train, x_test, y_train, y_test = train_test_split(train_data, target_data, test_size=0.3, random_state=42)  # randomly split training and test sets
model.fit(x_train, y_train)  # train the model
predict_l = model.predict_proba(x_test)[:, 1]  # predicted probabilities on the test set
auc = roc_auc_score(y_test, predict_l)  # compute the AUC
print('AUC is {}'.format(auc))  # print the AUC
pd.DataFrame({'feature': x_train.columns, 'importance': model.feature_importances_}).sort_values(ascending=False, by='importance')  # feature importances in descending order
# Function producing the model performance evaluation report
def get_threshold_report(y_predict, target_name):
    model_count = y_predict[target_name].value_counts().sort_index()[1]  # number of actual "bonus hunter" users in the test set
    y_test = y_predict[[target_name]]  # take out the label column on its own
    model_test = y_predict.copy()
    report = pd.DataFrame()  # initialize the report as a DataFrame
    for i in range(1, 20):  # evenly spaced probability thresholds from 0.05 to 0.95
        sep_pr = []  # record of this threshold's results
        sep_value = i * 0.05  # current threshold
        col_name = 'sep_.' + str(round(sep_value, 2))  # column name used to record the predicted labels
        model_test[col_name] = model_test['predict'].apply(lambda x: 1 if x > sep_value else 0)  # predicted labels from the probabilities and the threshold
        sep_pr.append(str(round(sep_value, 2)))  # record the threshold
        sep_pr.append(model_count)  # record the number of actual "bonus hunters"
        predict_model = model_test[col_name].value_counts().sort_index()
        # counts of each predicted label
        if predict_model.shape[0] == 1:  # only one predicted label present
            if predict_model.index.format() == '0':  # the only label is 0, i.e. no user is predicted as a "bonus hunter"
                sep_pr.append(0)  # record 0 predicted "bonus hunters"
            else:
                sep_pr.append(predict_model[0])  # the only label is 1, i.e. every user is predicted as a "bonus hunter"; record the first row of predict_model
        else:  # both labels present
            sep_pr.append(predict_model[1])  # record the number of predicted "bonus hunters", i.e. the second row of predict_model
        model_report = classification_report(y_test, model_test[col_name].values, digits=4)  # classification report from the true and predicted labels
        pr_str = ' '.join(model_report.split('\n')[3].split()).split(' ')
        # extract the three evaluation metrics from that line of the classification report
        sep_pr.append(pr_str[1])  # record the precision
        sep_pr.append(pr_str[2])  # record the recall
        sep_pr.append(pr_str[3])  # record the F1 score
        report = pd.concat([report, pd.DataFrame([sep_pr])], axis=0)  # append this threshold's evaluation to the report
    report.columns = ['threshold', 'actual', 'predict', 'precision', 'recall', 'f1-score']  # set the report column names
    report = report.reset_index(drop=True)  # reset the index, dropping the old one
    return report  # return the performance evaluation report
y_test['predict'] = predict_l  # add the predicted probability column
get_threshold_report(y_test, 'label')  # generate the performance evaluation report
# Produce the model's inference results
output = predict_df.copy()  # data to be predicted
output['predict_proba'] = model.predict_proba(output)[:, 1]  # predicted probabilities from the model
output['predict_label'] = output['predict_proba'].apply(lambda x: 1 if x > 0.25 else 0)  # split "bonus hunters" at the chosen probability threshold
df_output = output[['predict_proba', 'predict_label']].sort_values(ascending=False, by='predict_proba')  # suspected "bonus hunters" sorted by predicted probability, descending
df_output.to_csv('output.csv')  # save the prediction results
df_output  # display the prediction results
```
A screenshot of running the code above in the notebook is shown below:
![image-20240602224525815](金融行业羊毛党识别案例实践报告/image-20240602224525815.png)
The performance evaluation report obtained from training is shown below:
![image-20240602224609399](金融行业羊毛党识别案例实践报告/image-20240602224609399.png)
The report shows that the `f1` score reaches its maximum, around 79.51%, when the threshold lies between 0.2 and 0.3, so I chose the prediction threshold in the code from that range. The prediction results are shown below:
![image-20240602224724341](金融行业羊毛党识别案例实践报告/image-20240602224724341.png)
## Summary
In this experiment, through China Mobile's Wutong big data training platform, I experienced first-hand the important role that big data technology plays in real enterprise production, and I was impressed by how much it contributes to solving practical problems in companies. I also found that orchestrating data visually is a very efficient approach for beginners; compared with writing `Java` code to access `HDFS` directly, it greatly improves both efficiency and readability.
### Bugs Encountered
1. When saving the grouped and filtered data to the `HDFS` file system, I did not enable automatic export of the header row, so the field names could not be obtained during the later data mining step. The instructor answered this question quickly in the WeChat group, and it was soon resolved.
![image-20240602225315319](金融行业羊毛党识别案例实践报告/image-20240602225315319.png)
2. When filling in the `HDFS` address in the code, I did not keep the server name `cxq-ns1`, so the code could not reach the `HDFS` server. With the hint from the error message, this was also quickly resolved.
![image-20240602225421702](金融行业羊毛党识别案例实践报告/image-20240602225421702.png)
