版权声明:本文为博主原创文章,未经博主允许不得转载。 https://blog.csdn.net/sdut406/article/details/88710803
JavaPairRDD的filter方法讲解
官方文档
/**
* Return a new RDD containing only the elements that satisfy a predicate.
*/
说明
返回一个新的过滤后的RDD,
过滤规则:只返回条件为true的数据
返回的是map
函数原型
// java
public JavaPairRDD<K,V> filter(Function<scala.Tuple2<K,V>,Boolean> f)
// scala
def filter(f: Function[(K, V), Boolean]): JavaPairRDD[K, V]
示例
public class Filter {
public static void main(String[] args) {
System.setProperty("hadoop.home.dir", "E:\\hadoop-2.7.1");
SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("Spark_DEMO");
final JavaSparkContext sc = new JavaSparkContext(sparkConf);
JavaPairRDD<String, String> javaPairRDD1 = sc.parallelizePairs(Lists.newArrayList(
new Tuple2<String, String>("cat", "11"), new Tuple2<String, String>("dog", "22"),
new Tuple2<String, String>("cat", "13"), new Tuple2<String, String>("pig", "44")), 2);
// 只要cat
JavaPairRDD<String,String> javaPairRDD = javaPairRDD1.filter(new Function<Tuple2<String, String>, Boolean>() {
public Boolean call(Tuple2<String, String> stringStringTuple2) throws Exception {
if (StringUtils.isNotEmpty(stringStringTuple2._1) && StringUtils.equals("cat", stringStringTuple2._1)){
return true;
}
return false;
}
});
// 输出
javaPairRDD.foreach(new VoidFunction<Tuple2<String, String>>() {
public void call(Tuple2<String, String> stringStringTuple2) throws Exception {
System.out.println(stringStringTuple2);
}
});
}
}
结果
19/03/21 11:29:31 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
19/03/21 11:29:31 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 4900 bytes)
19/03/21 11:29:31 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
(cat,11)
19/03/21 11:29:31 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 622 bytes result sent to driver
19/03/21 11:29:31 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 4900 bytes)
19/03/21 11:29:31 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
19/03/21 11:29:31 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 203 ms on localhost (executor driver) (1/2)
(cat,13)
19/03/21 11:29:31 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 536 bytes result sent to driver
JavaPairRDD的first方法讲解
官方文档
/**
* Return the first element in this RDD.
*/
说明
返回RDD的第一个元素
其实调用的take方法 默认为1
函数原型
// java
public scala.Tuple2<K,V> first()
// scala
def first(): (K, V)
返回的是RDD的一个元素数据