1. Converting a DataFrame to an RDD
Here the DataFrame holds data read from a relational database, in table form.
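# df.rdd is already an RDD of Row objects; collect() followed by
# sc.parallelize() would pull all the data to the driver first.
rdd = df.rdd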
2. Counting distinct values after groupBy on a DataFrame (distinct() followed by count())
from pyspark.sql.functions import countDistinct
# "some_column" is a placeholder for the column whose distinct values are counted
df.groupBy("a", "b").agg(countDistinct("some_column")).collect()
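To make clear what this aggregation computes, here is a plain-Python sketch of the same logic on a small in-memory list. It is not Spark code; the helper name count_distinct_by_group and the sample rows are made up for illustration. Like countDistinct, it counts each value once per group and skips nulls.

```python
from collections import defaultdict


def count_distinct_by_group(rows, group_keys, value_key):
    """Count distinct non-null values of `value_key` within each group."""
    seen = defaultdict(set)
    for row in rows:
        group = tuple(row[k] for k in group_keys)
        values = seen[group]  # creates the group even if all its values are null
        if row[value_key] is not None:
            values.add(row[value_key])
    return {group: len(values) for group, values in seen.items()}


# Hypothetical sample data with the same column names as the snippet above.
rows = [
    {"a": 1, "b": "x", "some_column": "p"},
    {"a": 1, "b": "x", "some_column": "p"},   # duplicate value, counted once
    {"a": 1, "b": "x", "some_column": "q"},
    {"a": 2, "b": "y", "some_column": None},  # null is not counted
]
result = count_distinct_by_group(rows, ["a", "b"], "some_column")
print(result)
```

In Spark the work is distributed and the result comes back as Row objects, but the per-group counts follow the same rule.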
3. A strange problem: installing Spark on Ubuntu went smoothly, but the final startup step failed with an error
root@chenge:/usr/local/spark/sbin# ./start-all.sh
hostname: Name or service not known
starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-chenge.out
failed to launch: nice -n 0 /usr/local/spark/bin/spark-class org.apache.spark.deploy.master.Master --host --port 7077 --webui-port 8080
    at org.apache.spark.deploy.master.MasterArguments.<init>(MasterArguments.scala:30)
    at org.apache.spark.deploy.master.Master$.main(Master.scala:1049)
    at org.apache.spark.deploy.master.Master.main(Master.scala)
Caused by: java.net.UnknownHostException: chenge: Name or service not known
    at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
    at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
    at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
    at java.net.InetAddress.getLocalHost(InetAddress.java:1500)
    ... 10 more
2018-08-17 14:56:11 INFO ShutdownHookManager:54 - Shutdown hook called
full log in /usr/local/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-chenge.out
hostname: Name or service not known
In the end I found the cause: the hostname chenge (from root@chenge) could not be resolved. The fix is to change the hostname:
hostname ubuntu
Then open a new terminal, and Spark starts normally.
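The stack trace shows the failure coming from java.net.InetAddress.getLocalHost, which needs the machine's own hostname to resolve to an address. A rough way to check this before launching Spark is the sketch below. It uses only Python's standard socket module and only approximates what the JVM does; the helper name hostname_resolves is made up for illustration.

```python
import socket


def hostname_resolves(name):
    """Return True if `name` can be resolved to an IPv4 address."""
    try:
        socket.gethostbyname(name)
        return True
    except socket.gaierror:
        # Raised when the name lookup fails ("Name or service not known").
        return False


if __name__ == "__main__":
    name = socket.gethostname()
    if hostname_resolves(name):
        print(f"{name} resolves to an address")
    else:
        print(f"{name} does not resolve; change the hostname or map it in /etc/hosts")
```

Note that the hostname command only changes the name until the next reboot. Adding an entry for the hostname to /etc/hosts is another common way to make it resolvable.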