My environment: Windows 10 64-bit + Python 3.6
1. Download Hadoop: https://dist.apache.org/repos/dist/release/hadoop/common/hadoop-3.1.1/
2. Download Spark: https://archive.apache.org/dist/spark/spark-2.3.1/
3. Extract both archives to a directory of your choice.
4. Add a system variable SPARK_HOME: D:\tools\Tools\tool_packages\spark-2.3.1-bin-hadoop2.7
5. Edit the Path system variable and append:
D:\tools\Tools\tool_packages\hadoop-3.1.1\bin;
D:\tools\Tools\tool_packages\spark-2.3.1-bin-hadoop2.7\bin
6. Install py4j with pip: pip install py4j
7. Download and install the Java JDK, then add its bin directory to Path: D:\Program Files\Java\jdk1.8.0_51\bin
8. Copy the pyspark folder from the python directory of the Spark installation into Python's site-packages directory:
D:\python\...\Lib\site-packages
The three packages used above can also be downloaded from Baidu Netdisk: https://pan.baidu.com/s/1IYc2A1MzA0H3nbhqI_-mTA  password: r7j6
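After the steps above, you can sanity-check the setup from a fresh Python session before running Spark itself. This is just a sketch: it only reports whether the environment variables and modules from the steps above are visible, and check_env is a name made up here.

```python
import os
import importlib.util

def check_env():
    """Report which prerequisites from the setup steps are visible to Python."""
    status = {}
    # SPARK_HOME is set in step 4; HADOOP_HOME is not set by these steps,
    # so it may come back as None.
    for var in ('SPARK_HOME', 'HADOOP_HOME'):
        status[var] = os.environ.get(var)
    # py4j comes from step 6, pyspark from the copy in step 8.
    for mod in ('py4j', 'pyspark'):
        status[mod] = importlib.util.find_spec(mod) is not None
    return status

print(check_env())
```

If pyspark shows as not importable, recheck step 8; if SPARK_HOME is None, open a new shell so the updated variables take effect.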
9. Test:
test_spark.py
from pyspark import SparkContext, SparkConf

# conf = SparkConf().setMaster('local[*]').setAppName('First_App')
# sc = SparkContext(conf=conf)
sc = SparkContext('local')

# Two toy "documents", each a list of words
doc = sc.parallelize([['a', 'b', 'c'], ['e', 'f', 'g']])

# Collect the distinct words and give each one an integer id
words = doc.flatMap(lambda d: d).distinct().collect()
word_dict = {w: i for i, w in enumerate(words)}

# Broadcast the word -> id mapping so every worker shares one read-only copy
word_dict_b = sc.broadcast(word_dict)

def word_count_perdoc(d):
    """Count word occurrences in one document, keyed by word id."""
    counts = {}
    wd = word_dict_b.value
    for w in d:
        counts[wd[w]] = counts.get(wd[w], 0) + 1
    return counts

if __name__ == '__main__':
    print(doc.map(word_count_perdoc).collect())
    print('successful!')
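Before looking at the Spark run, the per-document counting logic can be checked in plain Python. This sketch hard-codes a first-seen word order to stand in for what distinct().collect() happened to return; Spark does not guarantee that ordering, so only the shape of the result is meaningful.

```python
# Plain-Python replay of the word-id / per-document count logic
docs = [['a', 'b', 'c'], ['e', 'f', 'g']]

# Distinct words in first-seen order, each mapped to an integer id
words = []
for d in docs:
    for w in d:
        if w not in words:
            words.append(w)
word_dict = {w: i for i, w in enumerate(words)}

def word_count_perdoc(d):
    """Count word occurrences in one document, keyed by word id."""
    counts = {}
    for w in d:
        counts[word_dict[w]] = counts.get(word_dict[w], 0) + 1
    return counts

print([word_count_perdoc(d) for d in docs])
# With this deterministic ordering: [{0: 1, 1: 1, 2: 1}, {3: 1, 4: 1, 5: 1}]
```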
Running test_spark.py produces:
2018-09-09 11:34:28 ERROR Shell:397 - Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:387)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2467)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2467)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2467)
at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:220)
at org.apache.spark.deploy.SparkSubmit$.secMgr$lzycompute$1(SparkSubmit.scala:408)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$secMgr$1(SparkSubmit.scala:408)
at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironment$7.apply(SparkSubmit.scala:416)
at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironment$7.apply(SparkSubmit.scala:416)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit$.doPrepareSubmitEnvironment(SparkSubmit.scala:415)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:250)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:171)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2018-09-09 11:34:28 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
[{0: 1, 1: 1, 2: 1}, {3: 1, 4: 1, 5: 1}]
successful!
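The winutils error at the top of the log is harmless for this local-mode test, but it can be removed. Hadoop on Windows needs a winutils.exe helper binary, and the "null\bin\winutils.exe" in the message shows that HADOOP_HOME was never set in the steps above. A sketch of the usual fix (paths assume the layout from the steps above; a winutils.exe built for the matching Hadoop version must be obtained separately):

```shell
:: Windows cmd; run once, then open a new shell so the variable is visible
setx HADOOP_HOME "D:\tools\Tools\tool_packages\hadoop-3.1.1"
:: then place the winutils.exe binary at:
::   D:\tools\Tools\tool_packages\hadoop-3.1.1\bin\winutils.exe
```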