Writing a Hive UDF to handle non-UTF-8 data

Hive processes data as UTF-8 by default. If the raw data is not UTF-8, for example GBK, how do we handle it?
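To make the problem concrete, here is a small standalone sketch in plain Java (no Hive involved). The class name CharsetDemo and its decode helper are made up for illustration, and the sketch assumes the JVM ships the GBK charset, which standard JDK builds normally do. It encodes a two-character string as GBK, then decodes the same bytes once as GBK and once as UTF-8.

```java
import java.nio.charset.Charset;

public class CharsetDemo {
    // Hypothetical helper: decode raw bytes using the named charset.
    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as Unicode escapes so that the
        // encoding of this source file does not affect the result.
        String original = "\u4e2d\u6587";
        byte[] gbkBytes = original.getBytes(Charset.forName("GBK"));

        // Decoding with the charset the bytes were written in round-trips.
        System.out.println(decode(gbkBytes, "GBK").equals(original));   // true
        // Decoding the same bytes as UTF-8 yields replacement characters.
        System.out.println(decode(gbkBytes, "UTF-8").equals(original)); // false
    }
}
```

The point is that the bytes themselves are fine; only the charset used to decode them decides whether the text survives.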

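The approach is simple: write the UDF by extending the GenericUDF class. Through the argument's ObjectInspector you can get the input as a Text object, read its raw bytes, and decode them with the correct charset yourself. For example: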

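import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;
import org.apache.hadoop.io.Text;

public class CharsetConvertor extends GenericUDF {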

	private transient StringObjectInspector oi = null;

	@Override
	public ObjectInspector initialize(ObjectInspector[] arguments)
			throws UDFArgumentException {
		oi = (StringObjectInspector) arguments[0];
		
		return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
	}

	@Override
	public Object evaluate(DeferredObject[] arguments) throws HiveException {
		try {
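			Object value = arguments[0].get();
			// A NULL input yields a NULL result
			if (value == null) {
				return null;
			}
			Text t = oi.getPrimitiveWritableObject(value);
			// Text.getBytes() returns the backing array, which may be longer than
			// the valid data, so decode only the first getLength() bytes.
			// This assumes the raw data is GBK-encoded.
			String gbkStr = new String(t.getBytes(), 0, t.getLength(), "GBK");
			// Process gbkStr here...

			// Finally, encode the output in whichever charset is needed; this
			// example writes it back out in the original GBK.
			Text new_str = new Text(gbkStr.getBytes("GBK"));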

			return new_str;
		} catch (Exception e) {
			return new Text("Charset conversion failed.");
		}
	}

	@Override
	public String getDisplayString(String[] children) {
		// Text displayed for this function in EXPLAIN output
		return "CharsetConvertor(" + String.join(", ", children) + ")";
	}
}
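One detail in evaluate deserves care: a Hadoop Text object reuses its internal buffer, so the array returned by getBytes() can be longer than the valid data, and only the first getLength() bytes are meaningful. The following standalone sketch imitates that situation with a plain byte array so it can run without Hive or Hadoop on the classpath. The class GbkBytes and its methods are hypothetical names used only for illustration, and re-encoding to UTF-8 is shown as one possible output choice, not the only one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GbkBytes {
    private static final Charset GBK = Charset.forName("GBK");

    // Decode only the first `length` bytes of a reusable buffer as GBK.
    public static String decode(byte[] buffer, int length) {
        return new String(buffer, 0, length, GBK);
    }

    // Re-encode the valid GBK portion of the buffer as UTF-8.
    public static byte[] toUtf8(byte[] buffer, int length) {
        return decode(buffer, length).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A buffer holding 4 valid GBK bytes followed by stale bytes left
        // over from an earlier, longer value.
        byte[] buffer = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4, 'o', 'l', 'd'};
        int validLength = 4;

        // Decoding the whole array drags the stale bytes into the result.
        System.out.println(new String(buffer, GBK).length());
        // Decoding only the valid prefix gives the intended two characters.
        System.out.println(decode(buffer, validLength).length());
        // Each of these characters takes three bytes in UTF-8.
        System.out.println(toUtf8(buffer, validLength).length);
    }
}
```

Inside the UDF, the equivalent of the helper above would be passing t.getBytes() and t.getLength() together rather than the array alone.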
