Preface
This post records a small problem I ran into while querying Hive data from Spark-Shell. The data in one column of a Hive table looks like this:
+--------------+
| register_time|
+--------------+
|20190824192307|
|20151220134432|
|20150909235908|
|20150418125129|
|20150101103333|
|20140925173645|
|20150628155606|
|20150602170012|
|20180807102621|
|20170506102100|
|20141121074350|
|20151211102837|
|20131230144052|
|20140322135851|
|20151130130000|
|20160823210000|
|20160625210000|
|20160602180000|
|20150927113839|
|20150121181824|
+--------------+
The data has these characteristics:
the values are 14 characters long, but the column also contains NULL values, empty strings (''), and some dirty records whose length is not 14
Operation 1
sql(sqlText = "select register_time from ods.ods_user_5_tuple_m")
.filter("length(register_time) = 14") // filter out empty values and dirty data
.select($"register_time".cast("int"))
.orderBy($"register_time")
.show()
Result:
+-------------+
|register_time|
+-------------+
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
+-------------+
This column holds 14-digit values, while an Int can represent at most 10 digits (Int.MaxValue is 2147483647), so every cast overflows and comes back null; an all-null result is therefore exactly what we should expect
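The overflow is easy to confirm in the spark-shell itself. A minimal sketch (the literal below is one sample value from the table above; toDF relies on spark.implicits._, which spark-shell imports automatically):

```scala
// A 14-digit value exceeds Int.MaxValue (2147483647, 10 digits),
// so the default non-ANSI cast silently produces null; a Long
// (max 9223372036854775807, 19 digits) holds it without trouble.
Seq("20190824192307").toDF("register_time")
  .select(
    $"register_time".cast("int").as("as_int"),   // null
    $"register_time".cast("long").as("as_long")  // 20190824192307
  )
  .show()
```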
Operation 2
sql(sqlText = "select register_time from ods.ods_user_5_tuple_m")
.filter("length(register_time) = 14") // filter out empty values and dirty data
.select($"register_time".cast("long"))
.orderBy($"register_time")
.show()
Result:
+-------------+
|register_time|
+-------------+
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
+-------------+
This one is genuinely strange: step one already pinned the length to 14, so why are there still nulls afterwards? (The explanation comes later in this article.)
Using asc_nulls_last to push the null values to the end
sql(sqlText = "select register_time from ods.ods_user_5_tuple_m")
.filter("length(register_time) = 14")
.select($"register_time".cast("long"))
.orderBy($"register_time".asc_nulls_last)
.show()
Result:
+--------------+
| register_time|
+--------------+
| 10101000000|
| 10101000000|
| 10101000000|
| 10101000000|
| 10101000000|
| 10101000000|
| 10101000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
+--------------+
Now we finally get results, but the values still are not all 14 digits long
Note: asc_nulls_last in Spark is the counterpart of NULLS LAST in SQL
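For reference, the same nulls-last ordering can also be written directly in Spark SQL, which accepts the NULLS LAST clause in ORDER BY; a sketch against the article's table:

```scala
// Pure-SQL equivalent of .orderBy($"register_time".asc_nulls_last)
sql("""
  SELECT CAST(register_time AS long) AS register_time
  FROM ods.ods_user_5_tuple_m
  WHERE length(register_time) = 14
  ORDER BY register_time ASC NULLS LAST
""").show()
```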
Applying the length restriction after the type cast
sql(sqlText = "select register_time from ods.ods_user_5_tuple_m")
.select($"register_time".cast("long"))
.filter("length(register_time) = 14")
.orderBy($"register_time")
.show()
Result:
+--------------+
| register_time|
+--------------+
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
+--------------+
This time everything behaves as expected
Tracing the problems through the whole flow
1. Why does filtering on the length function still leave a pile of nulls?
The reason is that Hive stores NULL as \N at the storage layer; these rows are evidently padded with whitespace, so their raw length still passes the = 14 check, yet they surface as NULL once cast. Wrapping the column in trim before measuring the length excludes them:
sql(sqlText = "select register_time from ods.ods_user_5_tuple_m")
.where("length(trim(register_time)) = 14")
.show()
Result:
+--------------+
| register_time|
+--------------+
|20151128083141|
|20140309165232|
|20140416120642|
|20170210183403|
|20150719182910|
|20150502161826|
|20140920213016|
|20150628091716|
|20140526220130|
|20141109173058|
|20141019182158|
|20160501185629|
|20160801185911|
|20160302144822|
|20150807180000|
|20141030213002|
|20150510101610|
|20150510112424|
|20141019204319|
|20150101111359|
+--------------+
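trim handles the whitespace-padded \N rows, but a regular expression can reject all the bad shapes at once (NULL, '', padded values, non-numeric junk, and leading-zero records); a sketch using Spark's Column.rlike:

```scala
// Keep only strings that are exactly 14 digits and do not start with 0;
// rlike returns null for null input, so those rows are filtered out too.
sql("select register_time from ods.ods_user_5_tuple_m")
  .filter($"register_time".rlike("^[1-9][0-9]{13}$"))
  .select($"register_time".cast("long"))
  .orderBy($"register_time")
  .show()
```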
2. Why does applying the length restriction before the type cast not work?
Quite simply, some of these strings start with zeros: their original length really is 14, but once the value is cast to long the leading zeros are dropped
sql(sqlText = "select register_time from ods.ods_user_5_tuple_m")
.where("length(trim(register_time)) = 14")
//.select($"register_time".cast("long"))
.orderBy($"register_time".asc_nulls_last)
.show()
Result:
+--------------+
| register_time|
+--------------+
|00010101000000|
|00010101000000|
|00010101000000|
|00010101000000|
|00010101000000|
|00010101000000|
|00010101000000|
|00010101000000|
|00010101000000|
|00010101000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
|17520913000000|
+--------------+
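The leading-zero loss is reproducible in plain Scala, without Spark at all:

```scala
val s = "00010101000000"    // one of the 14-character values above
s.length                    // 14 before the cast
s.toLong                    // 10101000000 -- the leading zeros are gone
s.toLong.toString.length    // 11, so a length-14 check after casting rejects this row
```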
Afterword
For the specifics of NULL and '' values in Hive, please refer to