1. Overview
Writing an Impala UDF and writing a Hive UDF amount to much the same thing; roughly speaking there are two ways to add a UDF:
(1) Write a Hive UDF, then log in to impala-shell and run invalidate metadata;
(2) Write an Impala UDF, specifying the location of the UDF's jar and the return type.
2. Writing a Hive UDF (created as a permanent function, but it is gone once the session ends; in effect it is still temporary)
Minimum Hive version: 0.13.0
2.1 Write the Java class for the UDF; this article uses MD5 hashing as the example.
package com.nanine.md5.utils;

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * Hive/Impala UDF that returns the Base64-encoded MD5 digest of a string.
 */
public class Md5Utils extends UDF {

    // Computes the MD5 digest of str and Base64-encodes it.
    // java.util.Base64 replaces sun.misc.BASE64Encoder, which is an
    // internal JDK API and produces the same standard Base64 output.
    public String getMd5Str(String str) throws NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(str.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(digest);
    }

    // Hive locates this evaluate() method by reflection.
    public Text evaluate(Text s) throws NoSuchAlgorithmException {
        if (s == null) {
            return null;
        }
        return new Text(getMd5Str(s.toString()));
    }
}
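Before packaging, the hashing logic can be sanity-checked locally, outside of Hive. A minimal sketch (the class name `Md5Check` and the sample input "abc" are only for illustration; the digest-then-encode logic mirrors getMd5Str above):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

public class Md5Check {
    // Same logic as Md5Utils.getMd5Str: MD5 digest, then Base64 encoding.
    static String md5Base64(String str) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(str.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(digest);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        String out = md5Base64("abc");
        // An MD5 digest is always 16 bytes, so the Base64 form decodes back to 16 bytes.
        byte[] raw = Base64.getDecoder().decode(out);
        System.out.println(out + " (" + raw.length + " bytes)");
    }
}
```

If the output here matches what the deployed UDF returns for the same input, the jar and class were picked up correctly.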
2.2 Package it as a jar: md5Utils.jar
2.3 Put the jar on an HDFS path (or in a local directory): hdfs dfs -put md5Utils.jar /
2.4 Add the jar to the classpath:
add jar hdfs:///md5Utils.jar;
2.5 Create the Hive md5 function:
CREATE FUNCTION default.mymd5 AS 'com.nanine.md5.utils.Md5Utils' using jar 'hdfs:///md5Utils.jar';
2.6 Run reload function; (so that other Hive sessions pick up the newly created function)
2.7 Test in Hive:
hive> select mymd5(msisdn) from rong_getCustomers_data limit 10;
converting to local hdfs:///md5Utils.jar
Added [/tmp/8315888d-2385-450a-89cf-698b692b38e9_resources/md5Utils.jar] to class path
Added resources: [hdfs:///md5Utils.jar]
OK
dSbzJoHdEK57WxGJEdrG4w==
WhUTClF0lDLiJTp6iKxXqg==
aFxT/6wJBnk0nBKRNJ7zMA==
G4oo6rN/GFCsOd4hzEy5fQ==
iYgsUMJyTvJ42EiJRDWPYg==
I+kU50AcLvo5aXrN0+X+LQ==
vJSo7sXzXpULVX7Gj0RySA==
0AMIyc3+1bH/dm88vuCzFw==
F9VqTHOFr596cGEq4vJu/w==
6vSFVDrim8UM4Mxh4617Yw==
Time taken: 1.426 seconds, Fetched: 10 row(s)
2.8 Test in Impala:
invalidate metadata;
[evercloud113:21000] > select mymd5(msisdn) from rong_getCustomers_data limit 10;
Query: select mymd5(msisdn) from rong_getCustomers_data limit 10
+--------------------------+
| default.mymd5(msisdn) |
+--------------------------+
| dSbzJoHdEK57WxGJEdrG4w== |
| WhUTClF0lDLiJTp6iKxXqg== |
| aFxT/6wJBnk0nBKRNJ7zMA== |
| G4oo6rN/GFCsOd4hzEy5fQ== |
| iYgsUMJyTvJ42EiJRDWPYg== |
| I+kU50AcLvo5aXrN0+X+LQ== |
| vJSo7sXzXpULVX7Gj0RySA== |
| 0AMIyc3+1bH/dm88vuCzFw== |
| F9VqTHOFr596cGEq4vJu/w== |
| 6vSFVDrim8UM4Mxh4617Yw== |
+--------------------------+
Fetched 10 row(s) in 4.95s
3. Writing an Impala UDF
Minimum Impala version: 1.2 (Impala support for UDFs is available in Impala 1.2 and higher)
3.1 Perform steps 2.1 through 2.3.
3.2 Create the md5 function:
create function mymd5(string) returns string location '/md5Utils.jar' symbol='com.nanine.md5.utils.Md5Utils';
3.3 Query:
[evercloud113:21000] > select md5(msisdn) from rong_getCustomers_data limit 10;
Query: select md5(msisdn) from rong_getCustomers_data limit 10
+----------------------------------+
| default.md5(msisdn) |
+----------------------------------+
| 7526f32681dd10ae7b5b118911dac6e3 |
| 5a15130a51749432e2253a7a88ac57aa |
| 685c53ffac090679349c1291349ef330 |
| 1b8a28eab37f1850ac39de21cc4cb97d |
| 89882c50c2724ef278d8488944358f62 |
| 23e914e7401c2efa39697acdd3e5fe2d |
| bc94a8eec5f35e950b557ec68f447248 |
| d00308c9cdfed5b1ff766f3cbee0b317 |
| 17d56a4c7385af9f7a70612ae2f26eff |
| eaf485543ae29bc50ce0cc61e3ad7b63 |
+----------------------------------+
Fetched 10 row(s) in 0.16s
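Note that the Hive test in 2.7 printed Base64-encoded digests while the query above prints hex, but both encode the same 16 MD5 bytes: decoding the first Base64 row yields exactly the first hex row. A minimal sketch of the correspondence (the class and helper names are illustrative):

```java
import java.util.Base64;

public class EncodingCheck {
    // Converts a Base64-encoded MD5 digest to its lowercase hex form.
    static String base64ToHex(String b64) {
        StringBuilder sb = new StringBuilder();
        for (byte b : Base64.getDecoder().decode(b64)) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First row of the Base64 output in 2.7 vs. first row of the hex output above.
        System.out.println(base64ToHex("dSbzJoHdEK57WxGJEdrG4w=="));
        // prints 7526f32681dd10ae7b5b118911dac6e3
    }
}
```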
If you kill the Impala client, the function still works when impala-shell is started again;
but after the Impala server restarts, the jar location has to be specified again, so it is best to re-create the function once in a startup script beforehand.
See the official Impala documentation on UDFs.