scala> val df = spark.createDataset(Seq(
("aaa",1,2),("bbb",3,4),("ccc",3,5),("bbb",4, 6)) ).toDF("key1","key2","key3")
df: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 1 more field]
scala> df.show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   3|   4|
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+
The filter function
According to the official Spark documentation, filter has the following overloads:
def filter(func: (T) ⇒ Boolean): Dataset[T]
def filter(conditionExpr: String): Dataset[T]
def filter(condition: Column): Dataset[T]
So all of the following forms are valid:
scala> df.filter($"key1">"aaa").show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| bbb|   3|   4|
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+
scala> df.filter($"key1"==="aaa").show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
+----+----+----+
scala> df.filter("key1='aaa'").show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
+----+----+----+
scala> df.filter("key2=1").show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
+----+----+----+
scala> df.filter($"key2"===3).show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| bbb|   3|   4|
| ccc|   3|   5|
+----+----+----+
scala> df.filter($"key2"===$"key3"-1).show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   3|   4|
+----+----+----+
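The examples above use the Column and String overloads; the function overload, filter(func: (T) ⇒ Boolean), is not demonstrated. A minimal sketch of it follows. It assumes a local SparkSession (the master setting and app name are illustrative, not from the original session) and rebuilds the same df so that it can be pasted on its own, for example with :paste in spark-shell.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Assumption: a local session; in spark-shell, getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("filter-func-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("aaa", 1, 2), ("bbb", 3, 4), ("ccc", 3, 5), ("bbb", 4, 6))
  .toDF("key1", "key2", "key3")

// A DataFrame is a Dataset[Row], so the function receives a Row.
val byRow = df.filter((r: Row) => r.getAs[Int]("key2") == 3)

// On a typed Dataset the function receives the element type.
// Tuple types are mapped to columns by position.
val byTuple = df.as[(String, Int, Int)].filter((t: (String, Int, Int)) => t._2 == 3)

byRow.show()
byTuple.show()
```

Both calls keep the two rows whose key2 is 3. Unlike the Column and String forms, the function is opaque to Spark's optimizer, so the expression forms are usually the better default when the condition can be written that way.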
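In the Column-based examples above, === is a method defined on the Column class; its inequality counterpart is =!=.
$"column name" is syntactic sugar that returns a Column object. It requires import spark.implicits._, which spark-shell performs automatically.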
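To make the Column-based forms concrete, the sketch below builds the same predicate in three equivalent ways and also shows =!=. It assumes a local SparkSession (the master setting and app name are illustrative) and recreates df so that it is self-contained.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session; in spark-shell, getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("column-sketch").getOrCreate()
import spark.implicits._ // brings the $"..." interpolator into scope

val df = Seq(("aaa", 1, 2), ("bbb", 3, 4), ("ccc", 3, 5), ("bbb", 4, 6))
  .toDF("key1", "key2", "key3")

// Three ways to obtain a Column for key2; each produces the same filter.
val viaDollar = df.filter($"key2" === 3)
val viaCol    = df.filter(col("key2") === 3)
val viaApply  = df.filter(df("key2") === 3)

// =!= is the Column inequality operator.
val notThree = df.filter($"key2" =!= 3)

// Note: plain == on a Column yields a Scala Boolean, not a Column,
// so it cannot be used to build a filter condition.
viaDollar.show()
notThree.show()
```

The first three results each contain the rows (bbb, 3, 4) and (ccc, 3, 5); notThree contains the other two rows.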
The where function
scala> df.where("key1 = 'bbb'").show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| bbb|   3|   4|
| bbb|   4|   6|
+----+----+----+
scala> df.where($"key2"=!= 3).show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   4|   6|
+----+----+----+
scala> df.where($"key3">col("key2")).show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   3|   4|
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+
scala> df.where($"key3">col("key2")+1).show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+
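In the Dataset API, where is documented as an alias for filter, so the two can be used interchangeably. The sketch below illustrates this with a compound condition, which the examples above do not cover. It assumes a local SparkSession (the master setting and app name are illustrative) and recreates df so that it runs on its own.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell, getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("where-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("aaa", 1, 2), ("bbb", 3, 4), ("ccc", 3, 5), ("bbb", 4, 6))
  .toDF("key1", "key2", "key3")

// Column form: && (and ||) combine Column predicates.
val viaWhere = df.where(($"key1" === "bbb") && ($"key2" > 3))

// The same predicate through filter, written as a SQL expression string.
val viaFilter = df.filter("key1 = 'bbb' AND key2 > 3")

viaWhere.show()
viaFilter.show()
```

Both results contain only the row (bbb, 4, 6). The parentheses around each comparison are optional here but keep the precedence explicit.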