Spark SQL: flagging and extracting outliers
Use the .approxQuantile(…) method to compute the quartiles.
Reference:
https://www.jianshu.com/p/56cff9f6e0be
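Before the code, it may help to see what the third argument of approxQuantile (the relative error) means. As described in the Spark documentation, for N rows, a probability p and an error err, the returned value x is an actual sample whose rank satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N), and an error of 0 requests exact quantiles at a higher cost. The sketch below only evaluates that inequality for the 7-row example used in this post; rank_window is a hypothetical helper name, not part of Spark.

```python
import math


def rank_window(n_rows, probability, relative_error):
    """Rank range allowed for the value returned by approxQuantile.

    Hypothetical helper: it just evaluates the documented inequality
    floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
    """
    low = math.floor((probability - relative_error) * n_rows)
    high = math.ceil((probability + relative_error) * n_rows)
    return low, high


# The example DataFrame in this post has 7 rows and uses relativeError = 0.05
for p in (0.25, 0.75):
    print(p, rank_window(7, p, 0.05))
```

With so few rows the window spans several ranks, so the quartiles returned for a tiny DataFrame like this one can differ from the exact ones.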
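from pyspark.sql import SparkSession

# Reuse the active session (for example in the pyspark shell) or create one
spark = SparkSession.builder.getOrCreate()

df_outliers = spark.createDataFrame(
    [(1, 143.5, 5.3, 28),
     (2, 154.2, 5.5, 45),
     (3, 342.3, 5.1, 99),
     (4, 144.5, 5.5, 33),
     (5, 133.2, 5.4, 54),
     (6, 124.1, 5.1, 21),
     (7, 129.2, 5.3, 42)],
    ["id", "weight", "height", "age"])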
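cols = ["weight", "height", "age"]
# bounds stores the lower and upper limit computed for each column
bounds = {}
for col in cols:
    # Approximate the first and third quartiles (Q1 and Q3); 0.05 is the relative error
    quantiles = df_outliers.approxQuantile(col, [0.25, 0.75], 0.05)
    # Interquartile range
    IQR = quantiles[1] - quantiles[0]
    # Inner fences: 1.5 * IQR below Q1 and 1.5 * IQR above Q3
    bounds[col] = [quantiles[0] - 1.5 * IQR, quantiles[1] + 1.5 * IQR]
print("bounds: ", bounds)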
# Flag outliers: any value outside the inner fences counts as an outlier
outliers = df_outliers.select(
    "id",
    *[((df_outliers[c] < bounds[c][0]) | (df_outliers[c] > bounds[c][1])).alias(c + "_o")
      for c in cols])
outliers.show()
bounds: {'age': [-11.0, 93.0], 'height': [4.499999999999999, 6.1000000000000005], 'weight': [91.69999999999999, 191.7]}
+---+--------+--------+-----+
| id|weight_o|height_o|age_o|
+---+--------+--------+-----+
|  1|   false|   false|false|
|  2|   false|   false|false|
|  3|    true|   false| true|
|  4|   false|   false|false|
|  5|   false|   false|false|
|  6|   false|   false|false|
|  7|   false|   false|false|
+---+--------+--------+-----+
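The printed bounds can be checked by hand. Working backwards from them gives Q1 and Q3 of 129.2 and 154.2 for weight, 5.1 and 5.5 for height, and 28 and 54 for age. These quartiles are inferred from the bounds, the original code does not print them. The following plain-Python sketch (iqr_fences is a hypothetical helper name) recomputes the fences and flags row 3 without Spark.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) inner fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles inferred from the bounds printed above (an assumption, see lead-in)
quartiles = {"weight": (129.2, 154.2), "height": (5.1, 5.5), "age": (28.0, 54.0)}
fences = {name: iqr_fences(q1, q3) for name, (q1, q3) in quartiles.items()}

# Row with id 3 from the example data
row3 = {"weight": 342.3, "height": 5.1, "age": 99}
# A value is an outlier when it falls outside [lower, upper]
flags = {name: not (fences[name][0] <= value <= fences[name][1])
         for name, value in row3.items()}

print(fences)
print(flags)
```

For row 3 the weight and age flags come out true and the height flag false, which agrees with the table above.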
# Extract the outliers
df_outliers = df_outliers.join(outliers, on="id")
# Do not write the join above as df_outliers.join(outliers, df_outliers.id == outliers.id):
# the new df_outliers would then contain two id columns, and a later select would fail with
# AnalysisException: "Reference 'id' is ambiguous"
df_outliers.show()
+---+------+------+---+--------+--------+-----+
| id|weight|height|age|weight_o|height_o|age_o|
+---+------+------+---+--------+--------+-----+
|  7| 129.2|   5.3| 42|   false|   false|false|
|  6| 124.1|   5.1| 21|   false|   false|false|
|  5| 133.2|   5.4| 54|   false|   false|false|
|  1| 143.5|   5.3| 28|   false|   false|false|
|  3| 342.3|   5.1| 99|    true|   false| true|
|  2| 154.2|   5.5| 45|   false|   false|false|
|  4| 144.5|   5.5| 33|   false|   false|false|
+---+------+------+---+--------+--------+-----+
df_outliers.filter('weight_o').select('id','weight').show()
+---+------+
| id|weight|
+---+------+
|  3| 342.3|
+---+------+
df_outliers.filter("age_o").select("id","age").show()
+---+---+
| id|age|
+---+---+
|  3| 99|
+---+---+
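As a possible alternative to building flag columns and joining them back, the bounds can be turned into a SQL predicate string, since DataFrame.filter also accepts a SQL expression (the filter('weight_o') calls above already rely on that). This is only a sketch: outlier_condition is a hypothetical helper, the bounds are rounded copies of the printed ones, and column names are interpolated as-is, so it assumes simple, trusted identifiers.

```python
def outlier_condition(column, lower, upper):
    """Build a SQL predicate that is true for values outside [lower, upper]."""
    return f"({column} < {lower!r} OR {column} > {upper!r})"


# Rounded copies of the bounds printed earlier (assumption for illustration)
bounds = {"weight": [91.7, 191.7], "age": [-11.0, 93.0]}

# True when at least one of the listed columns is outside its fences
any_outlier = " OR ".join(
    outlier_condition(c, lo, hi) for c, (lo, hi) in bounds.items())
print(any_outlier)
```

The resulting string could then be passed to Spark, for example df_outliers.filter(any_outlier).show(), which on the example data should keep only the row with id 3.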
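Author: AntFish
Link: https://www.jianshu.com/p/56cff9f6e0be
Source: Jianshu
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source.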