Big Data Basics: Word Count

2018-12-13 17:44:41

Counting word frequencies in a file is the "hello world" of big data. Here is how simple the implementation can be:

1 Single-machine processing on Linux

grep -Eo "\b[[:alpha:]]+\b" test_word.log | sort | uniq -c | sort -rn | head -10
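
To make the one-liner easier to follow, here is a sketch that names each stage. The function name top_words and the two sample lines are assumptions for illustration (the sample was chosen to be consistent with the output shown later in this post). The \b anchors are dropped for portability, so unlike the command above this variant would also count the alphabetic part of a token such as abc123.

```shell
#!/bin/sh
# top_words (hypothetical helper): read text on stdin, print the 10 most frequent words.
top_words() {
  # 1. grep -Eo: print each run of alphabetic characters on its own line
  # 2. sort:     bring identical words together (uniq only merges adjacent lines)
  # 3. uniq -c:  collapse duplicates and prefix each word with its count
  # 4. sort -rn: order by count, highest first
  # 5. head -10: keep the top 10
  grep -Eo '[[:alpha:]]+' | sort | uniq -c | sort -rn | head -10
}

# Assumed sample input, piped in instead of read from test_word.log.
printf 'hello world\nhello barney\n' | top_words
```

Reading from stdin keeps the sketch independent of any file on disk; to run it against the real file, use top_words < test_word.log.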

2 Distributed processing with Spark (Scala)

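import org.apache.spark.{SparkConf, SparkContext}
// The master URL and app name are expected to be supplied externally, e.g. by spark-submit.
val sparkConf = new SparkConf()
val sc = new SparkContext(sparkConf)
// Split each line on whitespace, count every word, sort by count descending, print the top 10.
sc.textFile("test_word.log").flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _).sortBy(_._2, ascending = false).take(10).foreach(println)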
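
Note that the two versions do not tokenize identically: the Spark job splits on whitespace, while the shell pipeline extracts alphabetic runs, so they can disagree once the input contains punctuation. Below is a small shell sketch of the difference. The sample line is hypothetical and not from the original post, and by_whitespace and by_alpha are illustrative names.

```shell
#!/bin/sh
# Hypothetical sample line containing punctuation (an assumption for illustration).
sample='hello, world hello'

# Whitespace tokenization, analogous to split("\\s+") in the Spark job.
by_whitespace() {
  printf '%s\n' "$1" | tr -s '[:space:]' '\n' | sort | uniq -c | sort -rn
}

# Alphabetic-run tokenization, analogous to the grep-based pipeline.
by_alpha() {
  printf '%s\n' "$1" | grep -Eo '[[:alpha:]]+' | sort | uniq -c | sort -rn
}

by_whitespace "$sample"   # "hello," and "hello" are counted as different tokens
by_alpha "$sample"        # "hello" is counted twice
```

If the two implementations need to agree, one option is to normalize the tokens (for example, strip punctuation) before counting in either version.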

The test file test_word.log contains:

hello world
hello barney

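The output is as follows (this is the shell pipeline's format; the Spark job prints the same counts as tuples such as (hello,2)):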

2 hello
1 world
1 barney

Reposted from www.cnblogs.com/barneywill/p/10115301.html
