PYSPARK_PYTHON=/data/anaconda3/bin/python3 \
/opt/spark/bin/spark-submit \
--master yarn \
--deploy-mode client \
--driver-memory 50g \
--driver-cores 20 \
--executor-memory 50g \
--num-executors 3 \
--executor-cores 20 \
--py-files $path/nlp.zip \
--files $path/nlp/jieba_lib/dict.txt,$path/nlp/jieba_lib/words_call/dict_call.txt,$path/nlp/jieba_lib/words_call/dict_surname.txt \
--name 'dataProcess' \
./dataProcess.py $input_path $output_name >>./log/dataProcess.log 2>>./log/dataProcess.err
Explanation of the submit options:

--py-files
: comma-separated list of .zip, .egg, and .py files. These are placed on the PYTHONPATH of the application; the option applies only to Python applications. Just import what you need where you need it:
from nlp.NewsFilter import strict_sensitive_data
--files
: comma-separated list of files. These are placed in the working directory of every executor process, so they can be referenced by bare file name:
jieba.set_dictionary("dict.txt")
for name in open('dict_call.txt', 'r', encoding='utf-8'):
    pass  # process each dictionary line here
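The loop body above is elided. One plausible sketch of it, loading the distributed dictionary into a set (the file name matches the `--files` list above; the helper name and the one-entry-per-line format are assumptions):

```python
def load_dict(path: str) -> set:
    """Load one dictionary entry per line into a set.

    On the cluster, files shipped with --files land in the executor
    working directory, so the bare file name resolves as the path.
    """
    words = set()
    with open(path, 'r', encoding='utf-8') as f:
        for name in f:
            word = name.strip()
            if word:            # skip blank lines
                words.add(word)
    return words
```

Usage: `call_words = load_dict('dict_call.txt')`.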
If a file shipped with --files still cannot be found by its bare name, cp the files in the shell before submitting: in client deploy mode, the driver first runs the program locally, where those relative paths do not resolve.
cp ./nlp/jieba_lib/dict.txt ./dict.txt
cp ./nlp/jieba_lib/words_call/dict_call.txt ./dict_call.txt
cp ./nlp/jieba_lib/words_call/dict_surname.txt ./dict_surname.txt
Reference:
https://blog.csdn.net/weixin_42649077/article/details/84976960