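Hive lets users write custom functions for logic that the built-in functions cannot handle. A function registered with CREATE TEMPORARY FUNCTION is only valid for the current session; the hive commands that register and call it can be wrapped in a shell script and run from there.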
- UDF
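Takes one row as input and produces one row as output.
Problem
Check whether two comma-separated strings contain the same elements.
Usage
If ignoreNullFlag is 1, two empty strings count as equal; for any other value they count as unequal.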
    add jar /home/mart_wzyf/zhuhongmei/plist_udf_udaf.jar;
    CREATE TEMPORARY FUNCTION compareStringBySplit AS 'com.jd.plist.udf.TestUDF';
    SELECT compareStringBySplit("22,11,33", "11,33,22", 1) FROM scores;
    DROP TEMPORARY FUNCTION compareStringBySplit;
Java code: the class must extend UDF and implement at least one evaluate method.
    package com.jd.plist.udf;

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.commons.lang.StringUtils;
    import org.apache.hadoop.hive.ql.exec.UDF;

    public class TestUDF extends UDF {

        private static final int MATCH = 1;
        private static final int NOT_MATCH = 0;

        /**
         * Compares two comma-separated strings, ignoring element order.
         *
         * @param aids first comma-separated string
         * @param bids second comma-separated string
         * @param ignoreNullFlag 1 to treat two blank inputs as equal; any other value treats them as unequal
         * @return MATCH (1) if both strings contain the same elements, otherwise NOT_MATCH (0)
         */
        public int evaluate(String aids, String bids, int ignoreNullFlag) {
            boolean aBlank = StringUtils.isBlank(aids);
            boolean bBlank = StringUtils.isBlank(bids);
            if (aBlank && bBlank) {
                return ignoreNullFlag == 1 ? MATCH : NOT_MATCH;
            }
            if (aBlank || bBlank) {
                return NOT_MATCH;
            }
            // Compare as sets so the check is symmetric:
            // "11,22" and "11,22,33" must not be reported as equal.
            Set<String> aidSet = new HashSet<String>(Arrays.asList(aids.split(",")));
            Set<String> bidSet = new HashSet<String>(Arrays.asList(bids.split(",")));
            return aidSet.equals(bidSet) ? MATCH : NOT_MATCH;
        }
    }
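To try the comparison rules without a Hive cluster, the contract described above can be sketched in plain Java. This is a minimal sketch under stated assumptions, not the code shipped in the jar: the class name SplitCompareSketch and the helper isBlank are hypothetical, isBlank only approximates StringUtils.isBlank, and the sketch treats each string as a set of tokens so that the comparison is order-insensitive and symmetric.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical standalone class for local experiments; it has no Hive dependency.
class SplitCompareSketch {
    static final int MATCH = 1;
    static final int NOT_MATCH = 0;

    // Approximation of StringUtils.isBlank: null, empty, or whitespace only.
    static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    static int compare(String aids, String bids, int ignoreNullFlag) {
        boolean aBlank = isBlank(aids);
        boolean bBlank = isBlank(bids);
        if (aBlank && bBlank) {
            // Two blank inputs are equal only when the flag is 1.
            return ignoreNullFlag == 1 ? MATCH : NOT_MATCH;
        }
        if (aBlank || bBlank) {
            return NOT_MATCH;
        }
        // Set equality ignores order and checks both directions.
        Set<String> a = new HashSet<>(Arrays.asList(aids.split(",")));
        Set<String> b = new HashSet<>(Arrays.asList(bids.split(",")));
        return a.equals(b) ? MATCH : NOT_MATCH;
    }

    public static void main(String[] args) {
        System.out.println(compare("22,11,33", "11,33,22", 1)); // prints 1
        System.out.println(compare("11,22", "11,22,33", 1));    // prints 0
        System.out.println(compare("", null, 1));               // prints 1
        System.out.println(compare("", null, 0));               // prints 0
    }
}
```

Because the tokens go into sets, repeated elements are ignored: "11,11" and "11" compare as equal in this sketch.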
- UDAF
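Takes multiple rows as input and produces one row as output; usually used with group by.
Problem
Concatenate the child ids that share the same main id into one comma-separated string.
Usage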
    add jar /home/mart_wzyf/zhuhongmei/plist_udf_udaf-0.0.1.jar;
    CREATE TEMPORARY FUNCTION concat_sku_id AS 'com.jd.plist.udaf.TestUDAF';
    select concat_sku_id(item_sku_id, ',')
    from app.app_cate3_sku_info
    where dt = sysdate(-1) and item_third_cate_cd = 870
    group by main_sku_id;
    DROP TEMPORARY FUNCTION concat_sku_id;
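Java code
The evaluator must implement init, iterate, terminatePartial, merge and terminate.
init resets the state; iterate processes each input row; terminatePartial returns the intermediate result built up by iterate; merge folds another intermediate result into the current one; terminate returns the final value.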
    package com.jd.plist.udaf;

    import org.apache.hadoop.hive.ql.exec.UDAF;
    import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

    public class TestUDAF extends UDAF {

        public static class TestUDAFEvaluator implements UDAFEvaluator {

            /** Intermediate state passed between aggregation stages. */
            public static class PartialResult {
                String skuids;
                String delimiter;
            }

            private PartialResult partial;

            /** Resets the state before a new group is aggregated. */
            public void init() {
                partial = null;
            }

            /** Appends one input row to the running result; NULL ids are skipped. */
            public boolean iterate(String item_sku_id, String deli) {
                if (item_sku_id == null) {
                    return true;
                }
                if (partial == null) {
                    partial = new PartialResult();
                    partial.skuids = "";
                    // Fall back to a comma when no delimiter is given.
                    partial.delimiter = (deli == null || deli.equals("")) ? "," : deli;
                }
                if (partial.skuids.length() > 0) {
                    partial.skuids = partial.skuids.concat(partial.delimiter);
                }
                partial.skuids = partial.skuids.concat(item_sku_id);
                return true;
            }

            /** Returns the intermediate result built by iterate. */
            public PartialResult terminatePartial() {
                return partial;
            }

            /** Folds another intermediate result into this one. */
            public boolean merge(PartialResult other) {
                if (other == null) {
                    return true;
                }
                if (partial == null) {
                    partial = new PartialResult();
                    partial.skuids = other.skuids;
                    partial.delimiter = other.delimiter;
                } else {
                    if (partial.skuids.length() > 0) {
                        partial.skuids = partial.skuids.concat(partial.delimiter);
                    }
                    partial.skuids = partial.skuids.concat(other.skuids);
                }
                return true;
            }

            /** Returns the final value, or null if no non-NULL row was seen. */
            public String terminate() {
                return partial == null ? null : partial.skuids;
            }
        }
    }
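The order in which init, iterate, terminatePartial, merge and terminate are called can be illustrated with a plain-Java simulation. This is a sketch under assumptions rather than how Hive schedules the work: the class ConcatEvaluatorSketch and the method simulate are hypothetical names, no Hive classes are involved, and the two "map-side" evaluators plus one "reduce-side" evaluator simply stand in for the stages of a distributed aggregation.

```java
// Hypothetical stand-in for the evaluator; plain Java with no Hive dependency.
class ConcatEvaluatorSketch {
    static class PartialResult {
        String skuIds = "";
        String delimiter = ",";
    }

    private PartialResult partial;

    void init() {
        partial = null;
    }

    boolean iterate(String skuId, String delimiter) {
        if (skuId == null) {
            return true; // NULL ids are skipped
        }
        if (partial == null) {
            partial = new PartialResult();
            if (delimiter != null && !delimiter.isEmpty()) {
                partial.delimiter = delimiter;
            }
        }
        if (!partial.skuIds.isEmpty()) {
            partial.skuIds += partial.delimiter;
        }
        partial.skuIds += skuId;
        return true;
    }

    PartialResult terminatePartial() {
        return partial;
    }

    boolean merge(PartialResult other) {
        if (other == null) {
            return true;
        }
        if (partial == null) {
            partial = new PartialResult();
            partial.skuIds = other.skuIds;
            partial.delimiter = other.delimiter;
        } else {
            if (!partial.skuIds.isEmpty()) {
                partial.skuIds += partial.delimiter;
            }
            partial.skuIds += other.skuIds;
        }
        return true;
    }

    String terminate() {
        return partial == null ? null : partial.skuIds;
    }

    // Simulates two map-side evaluators feeding one reduce-side evaluator.
    static String simulate(String[] split1, String[] split2, String delimiter) {
        ConcatEvaluatorSketch map1 = new ConcatEvaluatorSketch();
        map1.init();
        for (String id : split1) {
            map1.iterate(id, delimiter);
        }
        ConcatEvaluatorSketch map2 = new ConcatEvaluatorSketch();
        map2.init();
        for (String id : split2) {
            map2.iterate(id, delimiter);
        }
        ConcatEvaluatorSketch reducer = new ConcatEvaluatorSketch();
        reducer.init();
        reducer.merge(map1.terminatePartial());
        reducer.merge(map2.terminatePartial());
        return reducer.terminate();
    }

    public static void main(String[] args) {
        String result = simulate(new String[] {"1001", "1002"}, new String[] {"1003"}, ",");
        System.out.println(result); // prints 1001,1002,1003
    }
}
```

A stage that saw no rows hands over a null partial result, which is why merge and terminate both check for null.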
- UDTF
A UDTF turns one input row into multiple output rows.
Purpose
Split a string such as key1:20;key2:30;key3:40 into rows on semicolons and into columns on colons.
Usage
    add jar /home/mart_wzyf/zhuhongmei/plist_udf_udaf-0.0.3.jar;
    CREATE TEMPORARY FUNCTION explode_map AS 'com.jd.plist.udtf.TestUDTF';
    select explode_map(mapstrs) as (col1,col2) from app.app_test_zhuzhu_maps;
    DROP TEMPORARY FUNCTION explode_map;
Java code
initialize validates the arguments and declares the output columns; process handles each input row; forward emits one output row.
    package com.jd.plist.udtf;

    import java.util.ArrayList;

    import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
    import org.apache.hadoop.hive.ql.metadata.HiveException;
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

    public class TestUDTF extends GenericUDTF {

        @Override
        public void close() throws HiveException {
            // No resources to release.
        }

        @Override
        public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
            if (args.length != 1) {
                throw new UDFArgumentLengthException("ExplodeMap takes only one argument");
            }
            if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
                throw new UDFArgumentException("ExplodeMap takes string as a parameter");
            }

            // Declare two string output columns: col1 and col2.
            ArrayList<String> fieldNames = new ArrayList<String>();
            ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
            fieldNames.add("col1");
            fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
            fieldNames.add("col2");
            fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

            return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
        }

        @Override
        public void process(Object[] args) throws HiveException {
            if (args[0] == null) {
                return; // nothing to emit for a NULL input
            }
            String input = args[0].toString();
            // Rows are separated by ';' and columns by ':'.
            for (String pair : input.split(";")) {
                String[] result = pair.split(":");
                // Emit only well-formed key:value pairs so every row has exactly two columns.
                if (result.length == 2) {
                    forward(result);
                }
            }
        }
    }
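The splitting rule itself (rows on semicolons, columns on colons) can be checked locally before packaging the jar. The following is a minimal plain-Java sketch, assuming a hypothetical class ExplodeMapSketch that collects rows into a list instead of calling forward; skipping segments that do not split into exactly two parts is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for local experiments; it has no Hive dependency.
class ExplodeMapSketch {
    // Returns one {key, value} pair per well-formed "key:value" segment.
    static List<String[]> explode(String input) {
        List<String[]> rows = new ArrayList<>();
        if (input == null) {
            return rows; // a NULL input produces no rows
        }
        for (String pair : input.split(";")) {
            String[] kv = pair.split(":");
            if (kv.length == 2) {
                rows.add(kv);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : explode("key1:20;key2:30;key3:40")) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```

For the sample string, main prints three tab-separated lines, one per key. A segment such as "bad" or "a:1:2" is dropped because it does not yield exactly two columns.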
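- Notes on UDTF usage
A UDTF can be used in two ways: directly in the select clause, or together with lateral view.
1: Directly in select
    select explode_map(properties) as (col1,col2) from src;
No other columns can be selected alongside it (invalid):
    select a, explode_map(properties) as (col1,col2) from src
Calls cannot be nested (invalid):
    select explode_map(explode_map(properties)) from src
It cannot be combined with group by / cluster by / distribute by / sort by (invalid):
    select explode_map(properties) as (col1,col2) from src group by col1, col2
2: Together with lateral view
    select src.id, mytable.col1, mytable.col2 from src lateral view explode_map(properties) mytable as col1, col2;
This form is more convenient for everyday use. Lateral view applies the UDTF to each row of src and joins the generated rows back to that source row, so columns of src can be selected next to col1 and col2.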
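The row-by-row behaviour of lateral view can be pictured with a small plain-Java simulation. This is only a conceptual sketch, not Hive's execution plan: the class LateralViewSketch is hypothetical, the source table is modelled as a map from id to properties string, and each output line stands for one joined row (id, col1, col2).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of "src lateral view explode_map(properties)".
class LateralViewSketch {
    static List<String> lateralView(Map<Integer, String> src) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> row : src.entrySet()) {
            if (row.getValue() == null) {
                continue; // the UDTF generates no rows, so the source row disappears
            }
            for (String pair : row.getValue().split(";")) {
                String[] kv = pair.split(":");
                if (kv.length == 2) {
                    // Each generated row is paired with the source row it came from.
                    out.add(row.getKey() + "\t" + kv[0] + "\t" + kv[1]);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> src = new LinkedHashMap<>();
        src.put(1, "key1:20;key2:30");
        src.put(2, "key3:40");
        for (String line : lateralView(src)) {
            System.out.println(line);
        }
    }
}
```

In this sketch a source row whose properties produce no pairs contributes nothing to the output. Hive behaves the same way with a plain lateral view; lateral view outer keeps such rows and fills the generated columns with NULL.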