I. Preparation
- Two hosts (physical or virtual machines running Ubuntu or CentOS 7, with root login allowed)
- Docker installed on every host
- Docker-Compose installed on every host
- The deployment machine has internet access, and all hosts can reach one another over the network
- The target machines have already pulled the FATE component images
How to install docker and docker-compose, and how to pull the FATE images, was covered in the previous article.
Here both machines are CentOS 7 virtual machines, referred to as machine A and machine B. Machine A serves as both the deployment machine and a target machine. Machine A's IP address is 192.168.16.129 and machine B's is 192.168.16.130; both are logged in as root.
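Before deploying, it can save time to confirm the prerequisites on each host. A minimal check script (the peer IP 192.168.16.130 is the one used in this walkthrough; adjust it for your environment):

```python
import shutil
import subprocess

def check_tool(name):
    """Report whether a required command-line tool is on PATH."""
    return f"{name}: installed" if shutil.which(name) else f"{name}: NOT installed"

for tool in ("docker", "docker-compose"):
    print(check_tool(tool))

# Reachability of the peer host (IP from this walkthrough).
peer = "192.168.16.130"
try:
    reachable = subprocess.run(
        ["ping", "-c", "1", "-W", "2", peer],
        capture_output=True,
    ).returncode == 0
except FileNotFoundError:  # ping itself is missing
    reachable = False
print(f"peer {peer}: " + ("reachable" if reachable else "unreachable"))
```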
II. Deployment
1. Generate the deployment files and deploy (on the deployment machine, i.e. machine A)
// download and unpack the KubeFATE v1.3.0 kubefate-docker-compose.tar.gz package
# curl -OL https://github.com/FederatedAI/KubeFATE/releases/download/v1.3.0/kubefate-docker-compose.tar.gz
# tar -xzf kubefate-docker-compose.tar.gz // unpack the archive
# cd docker-deploy/ // enter the docker-deploy directory
# vi parties.conf // edit the parties.conf configuration file
user=root
dir=/data/projects/fate
partylist=(10000 9999) // the IDs of the two clusters
partyiplist=(192.168.16.129 192.168.16.130) // the IPs of the two target machines
servingiplist=(192.168.16.129 192.168.16.130) // the IPs of the two target machines
exchangeip=
# bash generate_config.sh // generate the deployment files
# bash docker_deploy.sh all // run the script that deploys and starts the clusters
// you will be prompted for the target machines' root passwords several times
2. Verify the deployment
Verify on target machines A and B respectively.
# docker ps // cluster A (ID 10000)
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
6186cc50baa1 federatedai/serving-proxy:1.2.2-release "/bin/sh -c 'java -D…" 14 minutes ago Up 12 minutes 0.0.0.0:8059->8059/tcp, :::8059->8059/tcp, 0.0.0.0:8869->8869/tcp, :::8869->8869/tcp, 8879/tcp serving-10000_serving-proxy_1
870a3048336b federatedai/serving-server:1.2.2-release "/bin/sh -c 'java -c…" 14 minutes ago Up 12 minutes 0.0.0.0:8000->8000/tcp, :::8000->8000/tcp serving-10000_serving-server_1
9a594365a451 redis:5 "docker-entrypoint.s…" 14 minutes ago Up 12 minutes 6379/tcp serving-10000_redis_1
44a0df69d2b1 federatedai/egg:1.3.0-release "/bin/sh -c 'cd /dat…" 18 minutes ago Up 17 minutes 7778/tcp, 7888/tcp, 50000-60000/tcp confs-10000_egg_1
22fe1f5e1ec1 federatedai/federation:1.3.0-release "/bin/sh -c 'java -c…" 18 minutes ago Up 17 minutes 9394/tcp confs-10000_federation_1
f75f0405b4bc mysql:8 "docker-entrypoint.s…" 18 minutes ago Up 17 minutes 3306/tcp, 33060/tcp confs-10000_mysql_1
a503e90b1548 redis:5 "docker-entrypoint.s…" 18 minutes ago Up 17 minutes 6379/tcp confs-10000_redis_1
b09a08468ad3 federatedai/proxy:1.3.0-release "/bin/sh -c 'java -c…" 18 minutes ago Up 17 minutes 0.0.0.0:9370->9370/tcp, :::9370->9370/tcp confs-10000_proxy_1
# docker ps // cluster B (ID 9999)
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
27262d0be615 federatedai/roll:1.3.0-release "/bin/sh -c 'java -c…" 10 minutes ago Up 9 minutes 8011/tcp confs-9999_roll_1
e0b244d55562 federatedai/meta-service:1.3.0-release "/bin/sh -c 'java -c…" 11 minutes ago Up 10 minutes 8590/tcp confs-9999_meta-service_1
6e249db9451c federatedai/egg:1.3.0-release "/bin/sh -c 'cd /dat…" 12 minutes ago Up 10 minutes 7778/tcp, 7888/tcp, 50000-60000/tcp confs-9999_egg_1
8db5215d3998 mysql:8 "docker-entrypoint.s…" 12 minutes ago Up 11 minutes 3306/tcp, 33060/tcp confs-9999_mysql_1
d16f4c43fb05 federatedai/proxy:1.3.0-release "/bin/sh -c 'java -c…" 12 minutes ago Up 11 minutes 0.0.0.0:9370->9370/tcp, :::9370->9370/tcp confs-9999_proxy_1
b5062d978a12 federatedai/federation:1.3.0-release "/bin/sh -c 'java -c…" 12 minutes ago Up 11 minutes 9394/tcp confs-9999_federation_1
ad673a6e2c4a redis:5 "docker-entrypoint.s…" 12 minutes ago Up 11 minutes 6379/tcp confs-9999_redis_1
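With this many containers per party, a name-prefix count is a quick way to confirm that nothing failed to start. A small sketch (names and prefixes as in the `docker ps` output above):

```python
def count_party(names, party):
    """Count containers belonging to confs-<party>, given the output of
    `docker ps --format '{{.Names}}'` (one name per line)."""
    prefix = f"confs-{party}_"
    return sum(1 for name in names.splitlines() if name.startswith(prefix))

# On a live host it would be driven like this:
#   import subprocess
#   names = subprocess.check_output(
#       ["docker", "ps", "--format", "{{.Names}}"], text=True)
#   print(count_party(names, 10000))

sample = "confs-10000_egg_1\nconfs-10000_mysql_1\nserving-10000_redis_1"
print(count_party(sample, 10000))  # 2
```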
3. Verify connectivity
Run the following commands on the deployment machine (machine A):
# docker exec -it confs-10000_python_1 bash // enter the python container on the deployment machine
# cd /data/projects/fate/python/examples/toy_example // enter the test-script directory
# python run_toy_example.py 10000 9999 1 // run the test script; the final 1 selects cluster (multi-host) mode
If the test succeeds, it prints output like the following:
"2019-08-29 07:21:25,353 - secure_add_guest.py[line:96] - INFO: begin to init parameters of secure add example guest"
"2019-08-29 07:21:25,354 - secure_add_guest.py[line:99] - INFO: begin to make guest data"
"2019-08-29 07:21:26,225 - secure_add_guest.py[line:102] - INFO: split data into two random parts"
"2019-08-29 07:21:29,140 - secure_add_guest.py[line:105] - INFO: share one random part data to host"
"2019-08-29 07:21:29,237 - secure_add_guest.py[line:108] - INFO: get share of one random part data from host"
"2019-08-29 07:21:33,073 - secure_add_guest.py[line:111] - INFO: begin to get sum of guest and host"
"2019-08-29 07:21:33,920 - secure_add_guest.py[line:114] - INFO: receive host sum from guest"
"2019-08-29 07:21:34,118 - secure_add_guest.py[line:121] - INFO: success to calculate secure_sum, it is 2000.0000000000002"
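Under the hood, the toy job computes a secure sum with additive secret sharing: each side splits its total into random shares, so neither party ever sees the other's raw data. A minimal local sketch of the idea (illustrative only, not FATE's actual code; integers instead of the floats in the log, with 1000 ones per side so the sum is 2000):

```python
import random

MOD = 2**61 - 1  # a large modulus for additive sharing (illustrative)

def split(x):
    """Split integer x into two random additive shares mod MOD."""
    r = random.randrange(MOD)
    return r, (x - r) % MOD

guest_data = [1] * 1000  # guest's private values
host_data = [1] * 1000   # host's private values

# Each party keeps one share of its own sum and sends the other across.
g_keep, g_send = split(sum(guest_data))
h_keep, h_send = split(sum(host_data))

# Each side combines its kept share with the share it received; adding the
# two partial sums recovers the total without exposing either input.
guest_partial = (g_keep + h_send) % MOD
host_partial = (h_keep + g_send) % MOD
print((guest_partial + host_partial) % MOD)  # 2000
```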
With that, the FATE federated-learning environment across the two machines is up and running.
III. Verifying the Serving-Service
Run a simple training and inference test on the two deployed FATE clusters. The training data is the "breast" dataset, a small test dataset bundled with FATE under "examples/data". It is split into two parts, "breast_a" and "breast_b": the host party holds "breast_a" and the guest party holds "breast_b". Guest and host jointly train a logistic regression model on the dataset, and the finished model is then pushed to FATE Serving for online inference.
1. Upload the data
Perform the following steps on machine A:
# docker exec -it confs-10000_python_1 bash // enter the python container
# cd fate_flow // enter the fate_flow directory
# vi examples/upload_host.json // edit the upload configuration file
{
"file": "examples/data/breast_a.csv",
"head": 1,
"partition": 10,
"work_mode": 1,
"namespace": "fate_flow_test_breast",
"table_name": "breast"
}
// upload "breast_a.csv" into the system
# python fate_flow_client.py -f upload -c examples/upload_host.json
Perform the following steps on machine B:
# docker exec -it confs-9999_python_1 bash // enter the python container
# cd fate_flow // enter the fate_flow directory
# vi examples/upload_guest.json // edit the upload configuration file
{
"file": "examples/data/breast_b.csv",
"head": 1,
"partition": 10,
"work_mode": 1,
"namespace": "fate_flow_test_breast",
"table_name": "breast"
}
// upload "breast_b.csv" into the system
# python fate_flow_client.py -f upload -c examples/upload_guest.json
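The two upload configs differ only in the CSV path; the remaining fields control how fate_flow stores the table. A small helper to generate them (field meanings as annotated; `work_mode` 1 selects cluster mode, 0 standalone):

```python
import json

def upload_conf(csv_path):
    """Build an upload config like examples/upload_host.json above."""
    return {
        "file": csv_path,   # CSV to upload, relative to the container
        "head": 1,          # 1 = the first row is a header
        "partition": 10,    # number of storage partitions for the table
        "work_mode": 1,     # 1 = cluster mode, 0 = standalone
        "namespace": "fate_flow_test_breast",
        "table_name": "breast",
    }

print(json.dumps(upload_conf("examples/data/breast_b.csv"), indent=4))
```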
2. Run the training job
# vi examples/test_hetero_lr_job_conf.json // edit the training configuration file
{
"initiator": {
"role": "guest",
"party_id": 9999
},
"job_parameters": {
"work_mode": 1
},
"role": {
"guest": [9999],
"host": [10000],
"arbiter": [10000]
},
"role_parameters": {
"guest": {
"args": {
"data": {
"train_data": [{"name": "breast", "namespace": "fate_flow_test_breast"}]
}
},
"dataio_0":{
"with_label": [true],
"label_name": ["y"],
"label_type": ["int"],
"output_format": ["dense"]
}
},
"host": {
"args": {
"data": {
"train_data": [{"name": "breast", "namespace": "fate_flow_test_breast"}]
}
},
"dataio_0":{
"with_label": [false],
"output_format": ["dense"]
}
}
},
....
}
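An easy mistake in this conf is an initiator whose party_id is not actually listed under its declared role, which is likely to be rejected at submission. A quick sanity check to run before submitting (an illustrative helper, not part of FATE):

```python
def initiator_ok(conf):
    """True when the initiator's party_id appears under its declared role."""
    role = conf["initiator"]["role"]
    return conf["initiator"]["party_id"] in conf["role"].get(role, [])

# The role sections from the conf above:
conf = {
    "initiator": {"role": "guest", "party_id": 9999},
    "role": {"guest": [9999], "host": [10000], "arbiter": [10000]},
}
print(initiator_ok(conf))  # True
```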
// submit a job to train on the uploaded datasets
# python fate_flow_client.py -f submit_job -d examples/test_hetero_lr_job_dsl.json -c examples/test_hetero_lr_job_conf.json
// output
{
"data": {
"board_url": "http://fateboard:8080/index.html#/dashboard?job_id=2022041901241226828821&role=guest&party_id=9999",
"job_dsl_path": "/data/projects/fate/python/jobs/2022041901241226828821/job_dsl.json",
"job_runtime_conf_path": "/data/projects/fate/python/jobs/2022041901241226828821/job_runtime_conf.json",
"logs_directory": "/data/projects/fate/python/logs/2022041901241226828821",
"model_info": {
"model_id": "arbiter-10000#guest-9999#host-10000#model",
"model_version": "2022041901241226828821"
}
},
"jobId": "2022041901241226828821",
"retcode": 0,
"retmsg": "success"
}
// check training progress with the following command until every task reports success; the value after -j is the jobId from above
# python fate_flow_client.py -f query_task -j 2022041901241226828821 | grep f_status
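The grep prints one f_status line per task, and the job is finished when every line reads success. A tiny parser for that check, handy when scripting the polling (illustrative; line format as produced by the command above):

```python
def all_success(lines):
    """True when every f_status line reports 'success'."""
    statuses = [line.split(":", 1)[1].strip(' ",')
                for line in lines if "f_status" in line]
    return bool(statuses) and all(s == "success" for s in statuses)

sample = ['"f_status": "success",', '"f_status": "running",']
print(all_success(sample))  # False
```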
3. View the training results
Open 127.0.0.1:8080 in a browser to reach fateboard and inspect the job's training results visually.
IV. Removing the deployment
To remove the deployment, run the following command on the deployment machine to stop all FATE clusters:
# bash docker_deploy.sh --delete all
To completely remove FATE from the target machines, log in to each node and run:
# cd /data/projects/fate/confs-<id>/ // <id> is the cluster ID
# docker-compose down
# rm -rf ../confs-<id>/
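The two cleanup steps can be wrapped into one helper per party, which is convenient when a host runs both parties. A sketch assuming the default install directory from this guide:

```python
import os
import shutil
import subprocess

def teardown_party(party, base="/data/projects/fate"):
    """Stop one party's compose deployment and delete its files.
    Paths follow the defaults used in this guide."""
    path = os.path.join(base, f"confs-{party}")
    if not os.path.isdir(path):
        return f"no deployment at {path}"
    subprocess.run(["docker-compose", "down"], cwd=path, check=True)
    shutil.rmtree(path)
    return f"removed {path}"

# On machine A, which runs party 10000:
print(teardown_party(10000))
```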