107. Character Sets and Garbled Text

Common character encodings:

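Charset	Description
ASCII	7-bit encoding for English letters, digits, and basic symbols
ISO-8859-1	Latin-1: single-byte encoding for Western European (Latin) characters; it cannot represent Chinese, Japanese, or other CJK text
UTF-8	Variable-length Unicode encoding (1-4 bytes per character; 1-3 bytes within the Basic Multilingual Plane), the de facto international standard
UTF-16BE	Unicode encoding with 16-bit code units (2 bytes for most characters, 4 bytes for supplementary characters), big-endian byte order¹
UTF-16LE	Same as UTF-16BE, but with little-endian byte order²
UTF-16	Byte order is declared at the start of the file by a BOM (byte order mark): FE FF means big-endian, FF FE means little-endian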
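To make the byte-order rows concrete, here is a minimal sketch that prints the bytes Java produces for the single letter "A" under the three UTF-16 variants. The class name BomDemo and the helper encodeHex are made up for this illustration.

```java
import java.nio.charset.Charset;

final class BomDemo {
    // Encodes text with the named charset and renders the bytes as hex pairs.
    static String encodeHex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeHex("A", "UTF-16BE")); // 00 41
        System.out.println(encodeHex("A", "UTF-16LE")); // 41 00
        System.out.println(encodeHex("A", "UTF-16"));   // FE FF 00 41
    }
}
```

When encoding with plain "UTF-16", Java writes a big-endian BOM followed by big-endian data, which is why the last line is two bytes longer than the others.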

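In memory, Java stores each char as a 16-bit (2-byte) UTF-16 code unit, but files on disk can use any of a variety of character encodings.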

Character encoding:

package cn.yzy.IO;

import java.io.UnsupportedEncodingException;

/*
 * Encoding
 */
public class ContentEncode {
	public static void main(String[] args) throws UnsupportedEncodingException {
		String msgString = "性命生命使命";
		
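		// Encode: string to byte array
		byte[] datas = msgString.getBytes(); // with no argument, the default charset is used (the project's setting here)
		System.out.println(datas.length);
		// The project default here is GBK, in which each Chinese character takes two bytes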
		
		// Specify the encoding explicitly
		datas = msgString.getBytes("UTF-16LE");
		System.out.println(datas.length);
		
		datas = msgString.getBytes("GBK");
		System.out.println(datas.length);
	}
}

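All three outputs are 12: in both UTF-16LE and GBK, each of these Chinese characters takes two bytes (the first call relies on the project default being GBK).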
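Because the no-argument getBytes() depends on the environment, the comparison is easier to reproduce when every charset is named explicitly. The following is a minimal sketch; the class name EncodedLengths and the helper byteLength are made up for this illustration, it assumes the JDK in use ships the GBK charset, and the sample string is written with Unicode escapes so the result does not depend on the source file's own encoding.

```java
import java.nio.charset.Charset;

final class EncodedLengths {
    // Number of bytes produced when text is encoded with the named charset.
    static int byteLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        // The six characters of the sample string, written as escapes.
        String msg = "\u6027\u547D\u751F\u547D\u4F7F\u547D";
        for (String name : new String[] {"GBK", "UTF-16LE", "UTF-16", "UTF-8"}) {
            System.out.println(name + ": " + byteLength(msg, name));
        }
    }
}
```

GBK and UTF-16LE both give 12 bytes, plain UTF-16 gives 14 because of the 2-byte BOM, and UTF-8 gives 18 because each of these characters needs three bytes.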

Character decoding:

package cn.yzy.IO;

import java.io.UnsupportedEncodingException;

/*
 * Decoding
 */
public class ContentDecode {
	public static void main(String[] args) throws UnsupportedEncodingException {
		String msgString = "性命生命使命";
		byte[] datas = msgString.getBytes(); // default charset, which is GBK in this project
		
		// Decode
		msgString = new String(datas, 0, datas.length, "GBK");
		// Charset names are case-insensitive, so "gbk" works as well
		System.out.println(msgString);
		
		// Garbled output: not enough bytes (the last byte is dropped)
		msgString = new String(datas, 0, datas.length-1, "GBK");
		System.out.println(msgString);
		
		// Garbled output: the decoding charset differs from the encoding charset
		msgString = new String(datas, 0, datas.length, "utf8");
		System.out.println(msgString);
	}
}

Decoding uses the following String constructor:

public String(byte bytes[], int offset, int length, String charsetName)
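The constructor above takes the charset as a name, so callers must handle the checked UnsupportedEncodingException. String also has an overload that accepts a java.nio.charset.Charset object and declares no checked exception. Below is a minimal round-trip sketch using that overload; the class name RoundTrip and the helper roundTrip are made up for this illustration.

```java
import java.nio.charset.Charset;

final class RoundTrip {
    // Encodes text to bytes and decodes it again with the same charset.
    static String roundTrip(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        byte[] data = text.getBytes(cs);
        return new String(data, 0, data.length, cs);
    }

    public static void main(String[] args) {
        String msg = "\u6027\u547D\u751F\u547D\u4F7F\u547D";
        // Same charset on both sides, so the original text comes back.
        System.out.println(msg.equals(roundTrip(msg, "UTF-8")));
    }
}
```

Charset.forName still fails for an unknown name, but with an unchecked exception, so the method signature stays free of throws clauses.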


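The project's default encoding is GBK, so the code above uses GBK for the test. Decoding returns the original text, which confirms that the decoding is correct.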

Note: garbled text appears when bytes are missing or when the encoding and decoding charsets do not match.
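The String constructors do not report such problems: they silently substitute a replacement character (U+FFFD for UTF-8). One way to detect malformed input instead is a CharsetDecoder configured to report errors. The sketch below is only an illustration: the class name StrictDecode and the helper decodesCleanly are made up, and it uses UTF-8 rather than the project's GBK so that it depends only on charsets every JDK is required to provide.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictDecode {
    // Returns true if the first `length` bytes are well-formed in the charset.
    static boolean decodesCleanly(byte[] data, int length, String charsetName) {
        try {
            Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data, 0, length));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Two Chinese characters, three UTF-8 bytes each.
        byte[] data = "\u6027\u547D".getBytes(StandardCharsets.UTF_8);
        // Lenient decoding: the truncated last character becomes U+FFFD.
        String lenient = new String(data, 0, data.length - 1, StandardCharsets.UTF_8);
        System.out.println(lenient.contains("\uFFFD"));
        // Strict decoding: complete input passes, truncated input is rejected.
        System.out.println(decodesCleanly(data, data.length, "UTF-8"));
        System.out.println(decodesCleanly(data, data.length - 1, "UTF-8"));
    }
}
```

Strict decoding only catches byte sequences that are invalid in the target charset. A charset mismatch can still decode without error into the wrong characters, so a clean decode does not prove the charset was the right one.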

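¹ Big-endian: the high-order byte is stored at the lower address.
² Little-endian: the low-order byte is stored at the lower address.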
