【SparkAPI JAVA版】JavaPairRDD——filter、 first(十五)

版权声明:本文为博主原创文章,未经博主允许不得转载。 https://blog.csdn.net/sdut406/article/details/88710803

JavaPairRDD的filter方法讲解

官方文档
 /**
   * Return a new RDD containing only the elements that satisfy a predicate.
   */
说明

返回一个新的过滤后的RDD,
过滤规则:只返回条件为true的数据

返回的是map

函数原型
// java
public JavaPairRDD<K,V> filter(Function<scala.Tuple2<K,V>,Boolean> f)
// scala
def filter(f: Function[(K, V), Boolean]): JavaPairRDD[K, V]
示例
public class Filter {
    public static void main(String[] args) {
        System.setProperty("hadoop.home.dir", "E:\\hadoop-2.7.1");
        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("Spark_DEMO");
        final JavaSparkContext sc = new JavaSparkContext(sparkConf);

        JavaPairRDD<String, String> javaPairRDD1 = sc.parallelizePairs(Lists.newArrayList(
                new Tuple2<String, String>("cat", "11"), new Tuple2<String, String>("dog", "22"),
                new Tuple2<String, String>("cat", "13"), new Tuple2<String, String>("pig", "44")), 2);
        // 只要cat
        JavaPairRDD<String,String> javaPairRDD = javaPairRDD1.filter(new Function<Tuple2<String, String>, Boolean>() {
            public Boolean call(Tuple2<String, String> stringStringTuple2) throws Exception {
                if (StringUtils.isNotEmpty(stringStringTuple2._1) && StringUtils.equals("cat", stringStringTuple2._1)){
                    return true;
                }
                return false;
            }
        });
        // 输出
        javaPairRDD.foreach(new VoidFunction<Tuple2<String, String>>() {
            public void call(Tuple2<String, String> stringStringTuple2) throws Exception {
                System.out.println(stringStringTuple2);
            }
        });
    }
}

结果
19/03/21 11:29:31 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
19/03/21 11:29:31 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 4900 bytes)
19/03/21 11:29:31 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
(cat,11)
19/03/21 11:29:31 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 622 bytes result sent to driver
19/03/21 11:29:31 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 4900 bytes)
19/03/21 11:29:31 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
19/03/21 11:29:31 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 203 ms on localhost (executor driver) (1/2)
(cat,13)
19/03/21 11:29:31 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 536 bytes result sent to driver

JavaPairRDD的first方法讲解

官方文档
/**
   * Return the first element in this RDD.
   */
说明

返回RDD的第一个元素

其实调用的take方法 默认为1

函数原型
// java
public scala.Tuple2<K,V> first()
// scala
def first(): (K, V)

返回的是RDD的一个元素数据

猜你喜欢

转载自blog.csdn.net/sdut406/article/details/88710803