How Spark's distinct Removes Duplicates

Posted 2020-04-10 10:51:06

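How the distinct operator works:

distinct is built from three existing operators. Each element x is first mapped to the pair (x, null), so the element itself becomes the key. reduceByKey then merges all pairs that share a key into a single pair, and the final map discards the placeholder value and returns the key.

The Spark source code: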

  /**
   * Return a new RDD containing the distinct elements in this RDD.
   */
  def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
  }
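
The three steps in the source above can be sketched without Spark, using plain Scala collections. This is only an illustrative simulation, not Spark's implementation: partitions are modelled as Seq[Seq[T]], the names DistinctSketch, reduceByKeyLocal and distinctSketch are hypothetical, and routing keys by hashCode modulo numPartitions is an assumption meant to mirror the hash partitioning that reduceByKey(func, numPartitions) uses.

```scala
object DistinctSketch {

  // Hypothetical helper: merge the values of equal keys inside one "partition".
  def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Seq[(K, V)] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }.toSeq

  // Hypothetical simulation of distinct(numPartitions) over in-memory partitions.
  def distinctSketch[T](partitions: Seq[Seq[T]], numPartitions: Int): Seq[Seq[T]] = {
    // Step 1: map(x => (x, null)), the element becomes the key.
    val paired: Seq[Seq[(T, Null)]] = partitions.map(_.map(x => (x, null)))

    // Step 2a: map-side combine, duplicates inside each input partition collapse first.
    val combined = paired.map(part => reduceByKeyLocal(part)((x, _) => x))

    // Step 2b: "shuffle", every key is routed to one target partition by its hash,
    // so equal keys from different input partitions end up together.
    val all = combined.flatten
    val shuffled = (0 until numPartitions).map { p =>
      all.filter { case (k, _) => Math.floorMod(k.hashCode, numPartitions) == p }
    }

    // Step 2c: reduce-side merge, one pair per key remains in each target partition.
    val reduced = shuffled.map(part => reduceByKeyLocal(part)((x, _) => x))

    // Step 3: map(_._1), drop the placeholder value and keep the key.
    reduced.map(_.map(_._1))
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq(3, 3, 4), Seq(5, 5, 3))
    val result = distinctSketch(parts, numPartitions = 2)
    // The order inside a partition is not guaranteed, so sort before printing.
    println(result.flatten.sorted.mkString(", ")) // prints "3, 4, 5"
  }
}
```

The sketch shows why a shuffle is needed: the value 3 appears in both input partitions, and only after both copies are routed to the same target partition can they be merged into one.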

Example code:

package com.wedoctor.utils.test

import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object Test {
  Logger.getLogger("org").setLevel(Level.ERROR)
  def main(args: Array[String]): Unit = {
    // Needed when running in a local environment
    System.setProperty("HADOOP_USER_NAME", "root")
    val session: SparkSession = SparkSession.builder()
       .master("local[*]")
       .appName(this.getClass.getSimpleName)
       .getOrCreate()

    val value: RDD[Int] = session.sparkContext.makeRDD(Array(3,3,4,5,5))
    value.distinct().foreach(println)
    // Equivalent to
    value.map(x=>(x,null)).reduceByKey((x,y) => x).map(_._1).foreach(println)
    session.close()
  }
}

Reposted from blog.csdn.net/zuochang_liu/article/details/105387704