Copyright notice: this is an original article by 九师兄 (QQ group: spark源代码 198279782, technical discussion welcome); do not repost without the author's permission. https://blog.csdn.net/qq_21383435/article/details/81585363
1. Environment
Reference: https://blog.csdn.net/qq_21383435/article/details/79240276
1.1 Install Anaconda
https://www.anaconda.com/download/#macos
Note: the Python 2.7 version is recommended here, because the latest Python may not be supported by this Spark release.
1.2 Configure environment variables
lcc@lcc ~$ vim .bash_profile
export SPARK_HOME=/Users/lcc/soft/spark/spark-2.0.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
export PYTHONPATH=/Users/lcc/soft/spark/spark-2.0.1-bin-hadoop2.7/python:/Users/lcc/soft/spark/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip
export PYSPARK_DRIVER_PYTHON=/anaconda2/bin/ipython
# added by Anaconda2 5.2.0 installer
export PATH="/anaconda2/bin:$PATH"
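Typos in these exports are a common source of "module not found" errors later, so a small sketch like the following can sanity-check an environment mapping before launching anything. The checker is purely illustrative (not part of Spark); the paths mirror the exports above.

```python
def check_spark_env(env):
    """Return a list of problems found in an environment mapping."""
    problems = []
    for var in ("SPARK_HOME", "PYTHONPATH"):
        if var not in env:
            problems.append("%s is not set" % var)
    spark_home = env.get("SPARK_HOME", "")
    if spark_home and spark_home not in env.get("PYTHONPATH", ""):
        # PYTHONPATH should include $SPARK_HOME/python, as above.
        problems.append("PYTHONPATH does not point inside SPARK_HOME")
    return problems

# Mirrors the exports above; an empty list means the basics look right.
env = {
    "SPARK_HOME": "/Users/lcc/soft/spark/spark-2.0.1-bin-hadoop2.7",
    "PYTHONPATH": "/Users/lcc/soft/spark/spark-2.0.1-bin-hadoop2.7/python",
}
print(check_spark_env(env))  # []
```

In a real session you would pass `os.environ` instead of the hand-built dict.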
1.3 Configure PyCharm
1.4 Locate the Python site-packages directory with Python code
lcc@lcc anaconda2$ python
Python 2.7.15 |Anaconda, Inc.| (default, May 1 2018, 18:37:05)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> ls
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'ls' is not defined
>>> import os
>>> os.path.dirname(os.__file__)
'/anaconda2/lib/python2.7'
>>> exit()
lcc@lcc site-packages$ pwd
/anaconda2/lib/python2.7/site-packages
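Instead of inferring the directory from `os.__file__` and then `cd`-ing around, the stdlib `site` module can report it directly (available in a standard CPython install; some virtualenv setups lack this function):

```python
import site

# Prints the interpreter's site-packages directories directly,
# e.g. ['/anaconda2/lib/python2.7/site-packages'] on the setup above.
print(site.getsitepackages())
```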
Create a pyspark.pth file in site-packages. A .pth file here lists extra directories that Python appends to sys.path at interpreter startup, so import pyspark works without touching PYTHONPATH:
lcc@lcc site-packages$ vim pyspark.pth
lcc@lcc site-packages$ cat pyspark.pth
/Users/lcc/soft/spark/spark-2.0.1-bin-hadoop2.7/python
lcc@lcc site-packages$
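What the `.pth` file actually does: at startup the `site` module reads every `.pth` file in a site directory and appends each existing path it lists to `sys.path`. The same mechanism can be reproduced on demand with `site.addsitedir()`; the sketch below uses a throwaway temporary directory in place of the real Spark path.

```python
import os
import site
import sys
import tempfile

# Build a throwaway "site directory" holding a .pth file.
site_dir = tempfile.mkdtemp()
target = os.path.join(site_dir, "fake_spark_python")
os.mkdir(target)
with open(os.path.join(site_dir, "pyspark.pth"), "w") as f:
    f.write(target + "\n")

# site.addsitedir() processes .pth files the same way startup does:
# each existing directory listed in the file is appended to sys.path.
site.addsitedir(site_dir)
print(target in sys.path)  # True
```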
1.5 Install py4j (possibly unnecessary if the bundled zip is on PYTHONPATH; the version should also match the one Spark ships)
lcc@lcc site-packages$ pip install py4j
Collecting py4j
Using cached https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl
distributed 1.21.8 requires msgpack, which is not installed.
grin 1.2.1 requires argparse>=1.1, which is not installed.
Installing collected packages: py4j
Successfully installed py4j-0.10.7
You are using pip version 10.0.1, however version 18.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
lcc@lcc site-packages$
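Whether the pip install is needed depends on whether the zip already on PYTHONPATH (py4j-0.10.3-src.zip above) is being picked up; if you do install via pip, the version should match the one Spark bundles, and pip pulled 0.10.7 here. A hypothetical helper (my own, not a Spark API) to read the bundled version out of the path:

```python
import re

def py4j_version_from_path(path):
    """Extract the py4j version from a bundled zip path, or None."""
    m = re.search(r"py4j-([0-9.]+)-src\.zip", path)
    return m.group(1) if m else None

# The zip shipped under $SPARK_HOME/python/lib in this setup:
bundled = "/Users/lcc/soft/spark/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip"
print(py4j_version_from_path(bundled))  # 0.10.3
```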
2. Preparing to run
2.1 The code
#!/usr/bin/python2
# -*- coding: UTF-8 -*-
from pyspark.sql import SparkSession
import os

print(os.environ['SPARK_HOME'])
print(os.environ['HADOOP_HOME'])

if __name__ == '__main__':
    spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate()
    # Read from a file; spark.read.json returns a DataFrame
    df = spark.read.json('/lcc/student3.json')
    df.createOrReplaceTempView("employees")
    df = spark.sql("SELECT * from employees")
    df.printSchema()
    df.show()
    print(df.collect())
    print("aa")
    spark.stop()
2.2 Start HDFS
Start HDFS and make sure it comes up. I originally wanted to read a local file, but bare paths are resolved against HDFS by default; this is probably a consequence of my Hadoop configuration (fs.defaultFS), and a file:// prefix would force the local filesystem.
lcc@lcc ~$ cd $HADOOP_HOME
lcc@lcc hadoop$ sbin/start-all.sh
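The "defaults to HDFS" behavior is just path resolution: a bare path like /lcc/student3.json is resolved against fs.defaultFS from the Hadoop configuration, while a path with an explicit scheme (file://, hdfs://) always wins. A toy sketch of the rule; the default_fs value here is a placeholder, not taken from this cluster's config:

```python
def resolve_path(path, default_fs="hdfs://localhost:9000"):
    """Mimic how Spark/Hadoop resolve a path against fs.defaultFS."""
    if "://" in path:          # an explicit scheme wins
        return path
    return default_fs + path   # bare paths go to the default filesystem

print(resolve_path("/lcc/student3.json"))  # hdfs://localhost:9000/lcc/student3.json
print(resolve_path("file:///tmp/a.json"))  # file:///tmp/a.json
```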
2.3 Create the JSON file
As Spark describes it, this jsonFile is special: it must contain one complete JSON object per line (JSON Lines format), otherwise reading fails.
Here is the content of one such jsonFile:
scala> val path = "examples/src/main/resources/people.json"
path: String = examples/src/main/resources/people.json
scala> Source.fromFile(path).foreach(print)
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
The JSON file used here is:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
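Before handing a file to spark.read.json, a quick stdlib check can confirm it honors the one-object-per-line rule (the helper name is my own, for illustration):

```python
import json

def is_json_lines(text):
    """Every non-empty line must parse as a standalone JSON value."""
    for line in text.splitlines():
        if line.strip():
            try:
                json.loads(line)
            except ValueError:
                return False
    return True

good = '{"name":"Michael"}\n{"name":"Andy", "age":30}\n{"name":"Justin", "age":19}'
bad = '{"name":\n"Michael"}'  # one object split across two lines
print(is_json_lines(good))  # True
print(is_json_lines(bad))   # False
```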
2.4 Upload it to HDFS
lcc@lcc hadoop$ hdfs dfs -put /Users/lcc/PycharmProjects/AllPythonTest/spark/com/lcc/spark/student3.json /lcc
lcc@lcc hadoop$ hdfs dfs -ls /lcc
18/08/11 19:22:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-rw-r--r-- 1 lcc supergroup 72 2018-08-11 19:11 /lcc/student3.json
lcc@lcc hadoop$
3. Run results
The run produces the following output:
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
[Row(age=None, name=u'Michael'), Row(age=30, name=u'Andy'), Row(age=19, name=u'Justin')]
aa
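Why age comes back as null/None for Michael: that record simply has no age key, and Spark fills missing JSON fields with null, which PySpark surfaces as Python None. The stdlib analogue of that behavior:

```python
import json

records = [json.loads(line) for line in (
    '{"name":"Michael"}',
    '{"name":"Andy", "age":30}',
    '{"name":"Justin", "age":19}',
)]
# dict.get returns None for the missing key, just like the null column.
print([r.get("age") for r in records])  # [None, 30, 19]
```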
4. Submitting with spark-submit
The wrong way: --py-files only distributes extra dependency files to the executors and does not name the application itself, so spark-submit fails with a missing application resource.
lcc@lcc spark-2.0.1-bin-hadoop2.7$ spark-submit --py-files /Users/lcc/PycharmProjects/AllPythonTest/spark/com/lcc/spark/SparkSqlTest.py
/Users/lcc/soft/spark/spark-2.0.1-bin-hadoop2.7/conf/spark-env.sh: line 85: spark.driver.extraClassPath: command not found
Exception in thread "main" java.lang.IllegalArgumentException: Missing application resource.
at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:241)
at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitArgs(SparkSubmitCommandBuilder.java:160)
at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:276)
at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:151)
at org.apache.spark.launcher.Main.main(Main.java:86)
lcc@lcc spark-2.0.1-bin-hadoop2.7$
The correct way: pass the main script as the positional argument. (The spark-env.sh message above is a separate issue: spark.driver.extraClassPath is a Spark property that belongs in conf/spark-defaults.conf, not in spark-env.sh, which is executed as a shell script.)
lcc@lcc spark-2.0.1-bin-hadoop2.7$ spark-submit /Users/lcc/PycharmProjects/AllPythonTest/spark/com/lcc/spark/SparkSqlTest.py
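The distinction in one place: the application script is always positional, and --py-files only lists extra modules to ship alongside it. A small sketch of assembling the command line (spark_submit_cmd is my own helper, not a Spark API):

```python
def spark_submit_cmd(app, py_files=None):
    """Assemble a spark-submit argv list."""
    cmd = ["spark-submit"]
    if py_files:
        # Extra dependency files, comma-separated; never the app itself.
        cmd += ["--py-files", ",".join(py_files)]
    cmd.append(app)  # the application resource is positional
    return cmd

print(" ".join(spark_submit_cmd("SparkSqlTest.py", ["helpers.py"])))
# spark-submit --py-files helpers.py SparkSqlTest.py
```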