[Java] Character Encodings at Each Stage

Over the whole development and execution life cycle of a Java program, the stages where character encoding matters are the following:

.java source files;
.class bytecode files;
run time;
output.

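The encoding of a .java source file is either chosen by the user or falls back to the system default encoding, which is derived from the operating system's language settings. Nothing enforces consistency at this stage: every source file can use a different encoding.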

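Next, the .java source files are compiled into .class files with javac. By default, however, javac decodes source files using the operating system's default encoding. If my system's default encoding is GBK but my source file is saved as UTF-8, javac reads the UTF-8 bytes as GBK and the text comes out garbled. In that case, add the following option to the javac command (I am using JDK 1.8): -encoding utf-8.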

public class Test {

    public static void main(String[] args) {
        String a="你好，世界";
        System.out.println(a);
    }   
}

Microsoft Windows [Version 10.0.16299.371]
(c) 2017 Microsoft Corporation. All rights reserved.

C:\Users\MasterVing>D:

D:\>javac Test.java
Test.java:4: error: unmappable character for encoding GBK
                String a="浣犲ソ锛屼笘鐣?";
                                 ^
1 error

Microsoft Windows [Version 10.0.16299.371]
(c) 2017 Microsoft Corporation. All rights reserved.

C:\Users\MasterVing>D:

D:\>javac -encoding utf-8 Test.java

D:\>java Test
你好，世界

Of course, if we use an IDE, the IDE normally passes this option for us, based on the encoding configured for the project.
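The mismatch behind the javac error above can be reproduced in a few lines. The following is only a sketch: the class name EncodingMismatchDemo and the helper reinterpret are made up for illustration, and the sample string is written with Unicode escapes so that the example itself does not depend on the source file's encoding. GBK is an extended charset, so the sketch checks that the runtime supports it before using it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {

    // Encode text with one charset, then decode the same bytes with another.
    // This mimics a file saved in `writtenAs` being read as `readAs`.
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(writtenAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        // "你好，世界" written as Unicode escapes (hypothetical sample text).
        String original = "\u4f60\u597d\uff0c\u4e16\u754c";

        // Same charset on both sides: the text survives the round trip.
        String same = reinterpret(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println("round trip intact: " + same.equals(original));

        // GBK may be missing from a trimmed-down runtime, so check first.
        if (Charset.isSupported("GBK")) {
            // UTF-8 bytes decoded as GBK: the text no longer matches the original.
            String garbled = reinterpret(original, StandardCharsets.UTF_8, Charset.forName("GBK"));
            System.out.println("decoded as GBK intact: " + garbled.equals(original));
        }
    }
}
```

Passing -encoding utf-8 to javac corresponds to choosing the matching charset on the decoding side.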

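The .class files produced by javac, by contrast, are uniform: the strings stored in them (class names, descriptors, string constants and so on) are always encoded in Modified UTF-8 (see the official documentation).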

4.3. Descriptors
A descriptor is a string representing the type of a field or method. Descriptors are represented in the class file format using modified UTF-8 strings (§4.4.7) and thus may be drawn, where not further constrained, from the entire Unicode codespace.

At run time, Java represents text (char and String values) as UTF-16.

Finally, at the output stage, the encoding can again be chosen freely; no particular encoding is required.
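As a rough illustration of choosing the output encoding, the sketch below prints the same string through two PrintStream instances that use different charsets and compares how many bytes each one writes. The class name OutputEncodingDemo and the helper printWith are hypothetical; the PrintStream(OutputStream, boolean, String) constructor is available in JDK 1.8.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class OutputEncodingDemo {

    // Print text through a PrintStream that encodes with the named charset
    // and return the bytes that reached the underlying stream.
    static byte[] printWith(String text, String charsetName) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
            return sink.toByteArray();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // "你好，世界" written as Unicode escapes (hypothetical sample text).
        String text = "\u4f60\u597d\uff0c\u4e16\u754c";

        // Five characters: 3 bytes each in UTF-8, 2 bytes each in UTF-16BE.
        System.out.println("UTF-8:    " + printWith(text, "UTF-8").length + " bytes");    // 15 bytes
        System.out.println("UTF-16BE: " + printWith(text, "UTF-16BE").length + " bytes"); // 10 bytes
    }
}
```

The in-memory string is identical in both calls; only the bytes leaving the program differ, which is why the terminal or file reader has to use the same encoding as the writer.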

UTF-8 and Modified UTF-8

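UTF-8 is a variable-length encoding. A character takes at least 1 byte (for example, English letters) and, under the current standard, at most 4 bytes (the original design allowed sequences of up to 6 bytes). Chinese characters generally take 3 bytes.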
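The byte counts can be checked directly with String.getBytes. This is a small sketch with a hypothetical class name (Utf8LengthDemo) and arbitrarily chosen sample characters:

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {

    // Number of bytes the string occupies when encoded as standard UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String latin = "A";                                      // U+0041
        String accented = "\u00e9";                              // U+00E9, e with acute accent
        String chinese = "\u4f60";                               // U+4F60
        String emoji = new String(Character.toChars(0x1F600));   // U+1F600, outside the BMP

        System.out.println(utf8Length(latin));     // 1
        System.out.println(utf8Length(accented));  // 2
        System.out.println(utf8Length(chinese));   // 3
        System.out.println(utf8Length(emoji));     // 4
    }
}
```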

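Modified UTF-8 is a variant of UTF-8 that differs from standard UTF-8 in three ways:

The null character '\u0000' is encoded in the 2-byte form instead of a single byte, so an encoded string never contains an embedded null byte;
Only the 1-byte, 2-byte, and 3-byte forms are used;
Supplementary characters are represented as surrogate pairs, with each surrogate encoded separately.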

The differences between this format and the standard UTF-8 format are the following:
1. The null byte ‘\u0000’ is encoded in 2-byte format rather than 1-byte, so that the encoded strings never have embedded nulls.
2. Only the 1-byte, 2-byte, and 3-byte formats are used.
3. Supplementary characters are represented in the form of surrogate pairs.
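These differences can be observed with DataOutputStream.writeUTF, which writes a 2-byte length prefix followed by the string in Modified UTF-8. The sketch below strips that prefix and prints the remaining bytes in hex next to the standard UTF-8 bytes; the class name ModifiedUtf8Demo and the helpers modifiedUtf8 and hex are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {

    // Encode a string as Modified UTF-8 via writeUTF, dropping the
    // 2-byte length prefix that writeUTF puts in front of the data.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF(s);
            out.flush();
            byte[] all = buffer.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Format bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String nul = "\u0000";
        String emoji = new String(Character.toChars(0x1F600)); // supplementary character

        System.out.println(hex(nul.getBytes(StandardCharsets.UTF_8)));   // 00
        System.out.println(hex(modifiedUtf8(nul)));                      // C0 80
        System.out.println(hex(emoji.getBytes(StandardCharsets.UTF_8))); // F0 9F 98 80
        System.out.println(hex(modifiedUtf8(emoji)));                    // ED A0 BD ED B8 80
    }
}
```

The supplementary character takes 4 bytes in standard UTF-8 but 6 bytes in Modified UTF-8, because each half of the surrogate pair is written in the 3-byte form.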

UTF-16

UTF-16 is a variable-length encoding that represents each character with one or two 16-bit code units (2 bytes per code unit).

A character encoded in UTF-16 therefore takes either 2 bytes or 4 bytes.
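Because Java strings expose UTF-16 code units, the difference between one-unit and two-unit characters shows up in String.length(). The following sketch uses a hypothetical class name (Utf16UnitsDemo) and two arbitrarily chosen sample characters; UTF-16BE is used for the byte count because it writes no byte order mark.

```java
import java.nio.charset.StandardCharsets;

class Utf16UnitsDemo {

    // Number of bytes the string occupies in UTF-16 (big-endian, no BOM).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String bmp = "\u4f60";                                          // U+4F60, one code unit
        String supplementary = new String(Character.toChars(0x1F600)); // U+1F600, two code units

        // length() counts 16-bit code units, not characters.
        System.out.println(bmp.length());                                              // 1
        System.out.println(utf16Bytes(bmp));                                           // 2
        System.out.println(supplementary.length());                                    // 2
        System.out.println(supplementary.codePointCount(0, supplementary.length()));   // 1
        System.out.println(utf16Bytes(supplementary));                                 // 4

        // The two code units of a supplementary character form a surrogate pair.
        System.out.println(Character.isHighSurrogate(supplementary.charAt(0)));        // true
        System.out.println(Character.isLowSurrogate(supplementary.charAt(1)));         // true
    }
}
```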