Java String class: self-study notes on the internals (continuously updated), part 1

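Note: this article is based on JDK 1.7. It is a set of public notes from studying the internals of some common classes. I am still learning, so if you find a mistake, please point it out in the comments.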

Following the JDK source, the String class is studied from top to bottom, one member at a time

package java.lang;
import java.io.ObjectStreamField;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Formatter;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

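1. First, the String class implements three interfaces
Serializable: serialization (the object can be converted to a byte stream for storage or transmission)
Comparable: natural ordering (String provides compareTo)
CharSequence: a readable sequence of char values
2. It declares these private fields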

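private final  char value[];  the character array

Analysis: a String stores its content in this character array.

private int hash;  caches the hash code computed by hashCode() (defaults to 0)

private static final ObjectStreamField[] serialPersistentFields =
        new ObjectStreamField[0]; (declares the serializable fields of the class)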

3. Methods

public String() {
        this.value = new char[0];
}
You can create a string object with String i = new String(); this creates a zero-length char array

public String(String original) {
        this.value = original.value;
        this.hash = original.hash;
    }

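Note: this constructor is rarely needed. Strings are immutable, so a plain '=' assignment is usually enough.
As for memory, the new String shares the original's char array and cached hash, so no array is copied; the only cost is the extra String object itself. Use it only when an explicitly distinct object is required.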

Everything so far was a warm-up; now it gets a little harder -.-

public String(char value[]) {
        this.value = Arrays.copyOf(value, value.length);
}

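Note: this constructor is used when a char array is passed to new String.
It uses Arrays.copyOf to copy the whole array and stores the new, identical array in the String.
Assigning the caller's array directly with = would not be safe here: the caller could modify the array later and thereby change a supposedly immutable String.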
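To make the point about the defensive copy concrete, here is a minimal sketch. The class and method names (DefensiveCopyDemo, buildThenMutate) are made up for this illustration and are not part of the JDK.

```java
class DefensiveCopyDemo {
    // Builds a String from a char array, then mutates the array afterwards.
    static String buildThenMutate() {
        char[] source = {'a', 'b', 'c'};
        String s = new String(source); // the constructor copies the array contents
        source[0] = 'X';               // changes only the caller's array, not s
        return s;
    }

    public static void main(String[] args) {
        System.out.println(buildThenMutate()); // prints "abc"
    }
}
```

If the constructor had stored the caller's array directly, the result would have been "Xbc".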

public String(char value[], int offset, int count) {
        if (offset < 0) {
            throw new StringIndexOutOfBoundsException(offset);
        }
        if (count < 0) {
            throw new StringIndexOutOfBoundsException(count);
        }
        if (offset > value.length - count) {
            throw new StringIndexOutOfBoundsException(offset + count);
        }
        this.value = Arrays.copyOfRange(value, offset, offset+count);
    }

Note: this does the same as the constructor above, but lets you choose the starting offset and the number of chars to copy.

public String(int[] codePoints, int offset, int count) {
        if (offset < 0) {
            throw new StringIndexOutOfBoundsException(offset);//offset must not be negative
        }
        if (count < 0) {
            throw new StringIndexOutOfBoundsException(count);//count must not be negative
        }
        if (offset > codePoints.length - count) {//offset + count must not exceed the array length
            throw new StringIndexOutOfBoundsException(offset + count);
        }
        final int end = offset + count;//end position (exclusive)
        int n = count;//number of chars needed; starts at one per code point
        for (int i = offset; i < end; i++) {
            int c = codePoints[i];
            if (Character.isBmpCodePoint(c))//is it in the Basic Multilingual Plane, i.e. less than 2^16
                continue;
            else if (Character.isValidCodePoint(c))//checks that c >>> 16 is less than 17, i.e. c <= 0x10FFFF
                n++;
            else throw new IllegalArgumentException(Integer.toString(c));
        }
	//this loop counts the code points between 0x10000 and 0x10FFFF; each of them needs one extra char, so n is incremented
        final char[] v = new char[n];
        for (int i = offset, j = 0; i < end; i++, j++) {
            int c = codePoints[i];
            if (Character.isBmpCodePoint(c))
                v[j] = (char)c;//a BMP code point is stored directly
            else
                Character.toSurrogates(c, v, j++);//a supplementary code point is stored as a surrogate pair in two chars
        }

        this.value = v;
    }

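Note: from the outside, this constructor takes count code points from an int array, starting at offset, and builds a string from them.
Internally, though, 5 code points in do not necessarily give 5 chars; the result may have 6 or more, because each supplementary code point takes two chars.
This is how String supports Unicode characters outside the BMP, which do not fit in a single 16-bit char.
2018-11-14 15:52, v1.0.0 // this interpretation may be revised, so the version is recorded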
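A small sketch of the "5 in, maybe 6 out" effect, using three code points of which one is supplementary. The class name CodePointConstructorDemo and the sample values are chosen for this example only.

```java
class CodePointConstructorDemo {
    // 'A', U+1F600 (a supplementary code point), 'B'
    static String fromCodePoints() {
        int[] codePoints = {0x41, 0x1F600, 0x42};
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String s = fromCodePoints();
        System.out.println(s.length());                      // 4 chars
        System.out.println(s.codePointCount(0, s.length())); // 3 code points
        // The middle code point is stored as a surrogate pair.
        System.out.println(Integer.toHexString(s.charAt(1))); // d83d
        System.out.println(Integer.toHexString(s.charAt(2))); // de00
    }
}
```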

Next come a few constructors that are already deprecated.
5

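//Purpose: allocates a new String constructed from a subarray of an array of 8-bit integer values.
//hibyte: the top 8 bits of each 16-bit Unicode code unit
@Deprecated //deprecation annotation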
    public String(byte ascii[], int hibyte, int offset, int count) {
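        checkBounds(ascii, offset, count);//checks that offset and count are non-negative and that offset + count does not exceed the array length, otherwise throws
        char value[] = new char[count];//create a new array of length count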
        if (hibyte == 0) {
            for (int i = count; i-- > 0;) {
                value[i] = (char)(ascii[i + offset] & 0xff);
            }
        } else {
            hibyte <<= 8;
            for (int i = count; i-- > 0;) {
                value[i] = (char)(hibyte | (ascii[i + offset] & 0xff)); //bitwise OR: hibyte becomes the high 8 bits, the byte the low 8 bits
            }
        }
        //the else branch runs when a non-zero hibyte was given
        this.value = value;
    }
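//The official deprecation reason: this method does not properly convert bytes into characters

Note: when hibyte is non-zero, the bytes are indeed not converted properly. It does allocate a new String, but in my experiments the result was garbled.
Earlier guess (discarded): the method was meant to change something like a memory address but wrongly changed the stored value instead.
Latest guess: the operation is a bitwise OR, not a NOT. It puts the same hibyte in front of every byte with no regard for any character encoding, so it only works for text whose characters all share one high byte.
2018-11-14 16:46, v1.0.0 // this interpretation may be revised, so the version is recorded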
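The arithmetic inside that deprecated constructor can be reproduced by hand. The sketch below does not call the constructor itself; HiByteDemo and combine are hypothetical names that only mirror the expression hibyte | (b & 0xff) after the shift.

```java
class HiByteDemo {
    // Mirrors the constructor's per-byte step: hibyte supplies the top 8 bits,
    // the byte supplies the bottom 8 bits. The & 0xff prevents sign extension.
    static char combine(int hibyte, byte low) {
        return (char) ((hibyte << 8) | (low & 0xff));
    }

    public static void main(String[] args) {
        System.out.println(combine(0, (byte) 0x41));                          // A
        System.out.println(Integer.toHexString(combine(0x4E, (byte) 0x2D)));  // 4e2d
        System.out.println(Integer.toHexString(combine(0, (byte) 0xE9)));     // e9
    }
}
```

Because every byte gets the same high byte, a mixed text such as ASCII plus Chinese cannot be reconstructed this way, which matches the garbled output seen in the experiment.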

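@Deprecated//deprecated: this(...) delegates to the constructor above, so this one is deprecated too
//this overload simply covers the whole array from start to end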
    public String(byte ascii[], int hibyte) {
        this(ascii, hibyte, 0, ascii.length);
}

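//This is a simple private utility method
//the offset and the requested length must not be negative
//offset + length must not exceed the array length
//it will not be explained again when it shows up later
private static void checkBounds(byte[] bytes, int offset, int length) {
        if (length < 0)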
            throw new StringIndexOutOfBoundsException(length);
        if (offset < 0)
            throw new StringIndexOutOfBoundsException(offset);
        if (offset > bytes.length - length)
            throw new StringIndexOutOfBoundsException(offset + length);
 }

//charsetName: the name of the charset used to decode the bytes
public String(byte bytes[], int offset, int length, String charsetName)
            throws UnsupportedEncodingException {
        if (charsetName == null)
            throw new NullPointerException("charsetName");
        checkBounds(bytes, offset, length);
        this.value = StringCoding.decode(charsetName, bytes, offset, length);//the decoder returns a freshly allocated char[] holding the decoded characters
}

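Note: this constructor is what you typically reach for when fixing garbled text: get the string's original bytes first, then decode them with the right charset.
An overload that decodes the whole array is also provided. Inside StringCoding a null charsetName would fall back to ISO-8859-1, but these constructors throw NullPointerException before that can happen.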

public String(byte bytes[], String charsetName)
            throws UnsupportedEncodingException {
        this(bytes, 0, bytes.length, charsetName);//delegates to the constructor above, from start to end
}

//charset: a Charset object instead of a name; otherwise the same as above
public String(byte bytes[], int offset, int length, Charset charset) {
        if (charset == null)
            throw new NullPointerException("charset");
        checkBounds(bytes, offset, length);
        this.value =  StringCoding.decode(charset, bytes, offset, length);
}
//and its whole-array form
public String(byte bytes[], Charset charset) {
        this(bytes, 0, bytes.length, charset);
}

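//Constructors that decode with the default charset. This decode uses the platform default (Charset.defaultCharset()), which is not necessarily UTF-8, and falls back to ISO-8859-1 if that charset is unavailable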
public String(byte bytes[], int offset, int length) {
        checkBounds(bytes, offset, length);
        this.value = StringCoding.decode(bytes, offset, length);
}
//and its simplified whole-array form
public String(byte bytes[]) {
        this(bytes, 0, bytes.length);
}

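*That is all of the decoding constructors: by charset name, by Charset object, and the simplified default-charset versions.
Usage: creating a String directly from a byte array decodes with the platform default charset. That is convenient, but it only yields UTF-8 where UTF-8 is the default; pass the charset explicitly for portable results.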
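As a rough illustration of why the charset matters, the sketch below decodes the same three bytes twice. DecodeDemo and its method names are invented for this example; StandardCharsets is available from Java 7 on.

```java
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // The UTF-8 encoding of U+4E2D.
    static final byte[] UTF8_BYTES = {(byte) 0xE4, (byte) 0xB8, (byte) 0xAD};

    static String decodeUtf8() {
        return new String(UTF8_BYTES, StandardCharsets.UTF_8);
    }

    static String decodeLatin1() {
        return new String(UTF8_BYTES, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(decodeUtf8().length());   // 1: one Chinese character
        System.out.println(decodeLatin1().length()); // 3: one garbled char per byte
    }
}
```

Decoding with the wrong charset is exactly what produces the garbled text mentioned above.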

//Converts a StringBuffer to a String. StringBuffer and StringBuilder are both backed by a char[], and apart from StringBuffer marking most of its methods synchronized the two are essentially the same.
public String(StringBuffer buffer) {
        synchronized(buffer) {
            this.value = Arrays.copyOf(buffer.getValue(), buffer.length());//the usual char array copy
        }
}

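Arrays.copyOf(char[], length): the array to copy and the new length
Its core:
System.arraycopy(char[],int,char[],int,int);
That is, System.arraycopy(source array (char[]), start position in the source (int), destination array (char[]), start position in the destination (int), number of elements to copy (int));

Inside Arrays.copyOf the number of elements copied is the smaller of the source length and the new length.
Math.min(one(int), two(int)); returns the smaller of one and two
The implementation of Math.min() is also very simple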

public static int min(int a, int b) {
        return (a <= b) ? a : b;
}
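Putting the three pieces together, Arrays.copyOf for char arrays can be approximated as below. This is a sketch under the assumption that only the char[] overload matters; CopyOfDemo and copyOfByHand are hypothetical names.

```java
import java.util.Arrays;

class CopyOfDemo {
    // Allocate the new array, then copy min(old length, new length) elements.
    static char[] copyOfByHand(char[] original, int newLength) {
        char[] copy = new char[newLength];
        System.arraycopy(original, 0, copy, 0,
                Math.min(original.length, newLength));
        return copy;
    }

    public static void main(String[] args) {
        char[] hello = "hello".toCharArray();
        // Truncating copy
        System.out.println(Arrays.equals(copyOfByHand(hello, 3), Arrays.copyOf(hello, 3)));
        // Padding copy: the extra slots keep the default value 0
        System.out.println(Arrays.equals(copyOfByHand(hello, 7), Arrays.copyOf(hello, 7)));
    }
}
```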

The next method is
11

//Same as the StringBuffer version, minus the synchronization, so no time wasted here -.-
public String(StringBuilder builder) {
        this.value = Arrays.copyOf(builder.getValue(), builder.length());
}

String(char[] value, boolean share) {
        // assert share : "unshared not supported";
        this.value = value;
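}//Package-private constructor that shares the given char array instead of copying it, for speed.
//The share parameter does nothing at runtime: it only distinguishes this overload from the public String(char[]), which copies, and callers are expected to pass true
//Only trusted JDK code can call it, because sharing a caller-owned array would break String's immutability

//Next is the substring-style constructor seen earlier; this(...) here calls constructor no. 3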
@Deprecated
    String(int offset, int count, char[] value) {
        this(value, offset, count);
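}//Not much use: it merely reorders the parameters, so its deprecation makes sense
//A book on the Java source from Sun also mentions that constructors differing only in parameter order are discouraged, because they add code size without adding or improving any functionality.

End of constructors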

//Returns the length of the string
public int length() {
        return value.length;//value is a char[], so the array length is the string length (counted in chars, not code points)
}

public boolean isEmpty() {
        return value.length == 0;
}
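//A string of spaces is not empty: it still has a length.
//Calling this on a null reference throws NullPointerException
//For "" the length is 0, so it returns true

//Returns the char at the given index
public char charAt(int index) {
        if ((index < 0) || (index >= value.length)) {//index must be at least 0 and less than the length, otherwise an exception is thrown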
            throw new StringIndexOutOfBoundsException(index);
        }
        return value[index];
}
//Indexes start at 0

//Returns the code point at the given index
public int codePointAt(int index) {
        if ((index < 0) || (index >= value.length)) {//bounds check
            throw new StringIndexOutOfBoundsException(index);
        }
        return Character.codePointAtImpl(value, index, value.length);
}
//Character.codePointAtImpl does more than return the char: it also checks for a surrogate pair

static int codePointAtImpl(char[] a, int index, int limit) {
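        char c1 = a[index++];//post-increment: reads a[index] first, then adds 1 to index
        if (isHighSurrogate(c1)) {//hex comparison: c1 >= '\uD800' && c1 < '\uDBFF' + 1, i.e. between 55296 and 56319 inclusive ①
            if (index < limit) {
                char c2 = a[index];//the high surrogate comes first (left), the low surrogate second (right)
                if (isLowSurrogate(c2)) {//between 56320 and 57343 inclusive
                    return toCodePoint(c1, c2);
                }
            }
        }
        return c1;//an ordinary char is returned as its own code point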
    }

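①: this is the supplementary-character check. Such a character needs two slots in the char array (a surrogate pair), and this is where that matters.
Note: if the char at this position is a high surrogate, the method looks at the following low surrogate and calls toCodePoint, which reverses the earlier split and recombines the two chars into one code point. The bit arithmetic is equivalent to ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000.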
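The recombination step can be checked against the real API. In this sketch the combine method is my own restatement of the formula, not the JDK's toCodePoint source, and CodePointAtDemo is a made-up class name.

```java
class CodePointAtDemo {
    // 'a', then the surrogate pair for U+1F600, then 'b'
    static final String S = "a\uD83D\uDE00b";

    // Hand-written equivalent of recombining a surrogate pair.
    static int combine(char high, char low) {
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        // At the high surrogate the pair is combined.
        System.out.println(Integer.toHexString(S.codePointAt(1)));     // 1f600
        // At the low surrogate alone, the char itself is returned.
        System.out.println(Integer.toHexString(S.codePointAt(2)));     // de00
        // codePointBefore looks backwards and finds the same pair.
        System.out.println(Integer.toHexString(S.codePointBefore(3))); // 1f600
        System.out.println(Integer.toHexString(combine(S.charAt(1), S.charAt(2)))); // 1f600
    }
}
```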

//Returns the code point before the given index.
public int codePointBefore(int index) {
        int i = index - 1;
        if ((i < 0) || (i >= value.length)) {
            throw new StringIndexOutOfBoundsException(index);
        }
        return Character.codePointBeforeImpl(value, index, 0);
}//Compared with the previous method, this returns the code point just before the index; not much else to say

static int codePointBeforeImpl(char[] a, int index, int start) {
        char c2 = a[--index];//pre-decrement: index is reduced first, then a[index] is read into c2
        if (isLowSurrogate(c2)) {
            if (index > start) {
                char c1 = a[--index];
                if (isHighSurrogate(c1)) {
                    return toCodePoint(c1, c2);
                }
            }
        }
        return c2;
}

//Returns the number of Unicode code points between the given begin and end index
public int codePointCount(int beginIndex, int endIndex) {
	    if (beginIndex < 0 || endIndex > value.length || beginIndex > endIndex) {
	       throw new IndexOutOfBoundsException();
	    }
	    return Character.codePointCountImpl(value, beginIndex, endIndex - beginIndex);
}

//Bounds check (this is the public overload in Character)
public static int codePointCount(char[] a, int offset, int count) {
        if (count > a.length - offset || offset < 0 || count < 0) {
            throw new IndexOutOfBoundsException();
        }
        return codePointCountImpl(a, offset, count);//delegates to the method below without doing anything else
    }

    static int codePointCountImpl(char[] a, int offset, int count) {
        int endIndex = offset + count;
        int n = count;
        for (int i = offset; i < endIndex; ) {
            if (isHighSurrogate(a[i++]) && i < endIndex &&
                isLowSurrogate(a[i])) {//on a surrogate pair: skip one more char and reduce the count by 1
                n--;
                i++;
            }
        }
        return n;
}

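Note: this is probably aimed at frameworks or internal callers; ordinary application code rarely needs it

//Returns an index offset by a number of code points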
public int offsetByCodePoints(int index, int codePointOffset) {
        if (index < 0 || index > value.length) {
            throw new IndexOutOfBoundsException();
        }
        return Character.offsetByCodePointsImpl(value, 0, value.length,
                index, codePointOffset);
}
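//Starting from the given index, a positive codePointOffset moves forward (to the right) and a negative one moves backward (to the left). The return value is the resulting index in the char array.
//For example, if the c in abcde takes two chars, then offsetByCodePoints(0, 3) is 4: a is at index 0, b at 1, c at 2 and 3, so three code points later we are at index 4 (d)
//It is used very rarely. Tracing its callers, I found only one package developed by Sun that uses it, and only once.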
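The example in the comment can be run directly if the two-char "c" is replaced by a real supplementary character. OffsetByCodePointsDemo is a name made up for this sketch, and U+1F600 stands in for the wide character.

```java
class OffsetByCodePointsDemo {
    // Layout: a=0, b=1, surrogate pair=2..3, d=4, e=5
    static final String S = "ab\uD83D\uDE00de";

    public static void main(String[] args) {
        System.out.println(S.length());                      // 6 chars
        System.out.println(S.codePointCount(0, S.length())); // 5 code points
        // Three code points forward from index 0 lands on d.
        System.out.println(S.offsetByCodePoints(0, 3));      // 4
        // One code point backward from d lands on the start of the pair.
        System.out.println(S.offsetByCodePoints(4, -1));     // 2
    }
}
```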

//Copies the string's chars into another array
void getChars(char dst[], int dstBegin) {
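        System.arraycopy(value, 0, dst, dstBegin, value.length);//the lowest-level array copy available in Java (a native method)
        //array copies, and copies of array-backed strings, generally end up here.
       //1st argument: the source array
       //2nd argument: the start position in the source
       //3rd argument: the destination array
       //4th argument: the start position in the destination
       //5th argument: the number of elements to copy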
}

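//Same function as the previous method
//Parameters: 1: start index in the string (inclusive)  2: end index in the string (exclusive)  3: destination array  4: start position in the destination array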
public void getChars(int srcBegin, int srcEnd, char dst[], int dstBegin) {
        if (srcBegin < 0) {
            throw new StringIndexOutOfBoundsException(srcBegin);
        }
        if (srcEnd > value.length) {
            throw new StringIndexOutOfBoundsException(srcEnd);
        }
        if (srcBegin > srcEnd) {
            throw new StringIndexOutOfBoundsException(srcEnd - srcBegin);
        }
        System.arraycopy(value, srcBegin, dst, dstBegin, srcEnd - srcBegin);
        //explained in detail for the previous method, so not discussed again here.
}
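A short usage sketch of the public getChars, showing how the four parameters line up. GetCharsDemo and copyIntoMiddle are hypothetical names for this example.

```java
class GetCharsDemo {
    // Copies chars 0..2 of "hello" into dst starting at dst[1].
    static char[] copyIntoMiddle() {
        char[] dst = {'-', '-', '-', '-', '-'};
        "hello".getChars(0, 3, dst, 1); // srcBegin inclusive, srcEnd exclusive
        return dst;
    }

    public static void main(String[] args) {
        System.out.println(new String(copyIntoMiddle())); // prints "-hel-"
    }
}
```

If dst is too small for the requested range, System.arraycopy throws an IndexOutOfBoundsException, since getChars itself only validates the source indexes.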

//A deprecated method dating back to JDK 1.1 that has been carried along ever since.
@Deprecated//deprecated
    public void getBytes(int srcBegin, int srcEnd, byte dst[], int dstBegin) {
        if (srcBegin < 0) {
            throw new StringIndexOutOfBoundsException(srcBegin);
        }
        if (srcEnd > value.length) {
            throw new StringIndexOutOfBoundsException(srcEnd);
        }
        if (srcBegin > srcEnd) {
            throw new StringIndexOutOfBoundsException(srcEnd - srcBegin);
        }
        int j = dstBegin;
        int n = srcEnd;
        int i = srcBegin;
        char[] val = value;   /* avoid getfield opcode */

        while (i < n) {
            dst[j++] = (byte)val[i++];
        }
}
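//I tried this method and ran into all sorts of problems; for example, only chars whose high byte is 0 came through intact. The official explanation is that it does not properly convert characters into bytes.
//The reason is visible in the code: there is no real encoding step, only the (byte) cast, which keeps the low 8 bits of each char and drops the high 8 bits, so any char above 0xFF is corrupted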
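The loss caused by the (byte) cast is easy to reproduce without calling the deprecated method. NarrowingDemo and narrow are names invented for this sketch; the cast is the same one used in the loop above.

```java
class NarrowingDemo {
    // Same narrowing as dst[j++] = (byte)val[i++] in the deprecated getBytes.
    static byte narrow(char c) {
        return (byte) c;
    }

    public static void main(String[] args) {
        System.out.println(narrow('A'));      // 65: ASCII survives
        System.out.println(narrow('\u4E2D')); // 45: only the low byte 0x2D is left
        System.out.println(narrow('\u00E9')); // -23: 0xE9 as a signed byte
    }
}
```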

//A commonly used method: returns the string encoded as a byte array in the named charset
public byte[] getBytes(String charsetName)
            throws UnsupportedEncodingException {
        if (charsetName == null) throw new NullPointerException();
        return StringCoding.encode(charsetName, value, 0, value.length);//looks simple, but goes deep (1)
}
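Before going into the internals, here is a sketch of how getBytes(String) behaves from the caller's side. GetBytesDemo and encodeOrNull are hypothetical helpers; "no-such-charset" is just a deliberately unsupported name.

```java
import java.io.UnsupportedEncodingException;

class GetBytesDemo {
    // Returns the encoded bytes, or null if the charset name is not supported.
    static byte[] encodeOrNull(String s, String charsetName) {
        try {
            return s.getBytes(charsetName);
        } catch (UnsupportedEncodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String zhong = "\u4E2D";
        System.out.println(encodeOrNull(zhong, "UTF-8").length);    // 3
        System.out.println(encodeOrNull(zhong, "UTF-16BE").length); // 2
        System.out.println(encodeOrNull(zhong, "no-such-charset")); // null
    }
}
```

The same character takes a different number of bytes depending on the charset, which is why the encoder below has to size its output buffer for the worst case.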

(1)
static byte[] encode(String charsetName, char[] ca, int off, int len)
        throws UnsupportedEncodingException
    {
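        StringEncoder se = deref(encoder);//encoder is a ThreadLocal, which gives each thread its own cached StringEncoder.
        //an encoder carries internal state while it works, so sharing one between threads could corrupt a conversion in progress; the ThreadLocal avoids that without locking
        String csn = (charsetName == null) ? "ISO-8859-1" : charsetName;//the fallback charset when none is specified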
        if ((se == null) || !(csn.equals(se.requestedCharsetName())
                              || csn.equals(se.charsetName()))) {
            se = null;
            try {
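                Charset cs = lookupCharset(csn);//mostly validation: it checks that the charset is supported
                //and turns the given name into the corresponding Charset object, or returns null if it is not supported
                if (cs != null)
                    se = new StringEncoder(cs, csn);
                    //a plain constructor call that creates the StringEncoder ①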
            } catch (IllegalCharsetNameException x) {}
            if (se == null)
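                throw new UnsupportedEncodingException (csn);//reached when the charset name is illegal or not supported
            set(encoder, se);//stores the StringEncoder in the ThreadLocal. encoder is that ThreadLocal,
            //a static field of StringCoding, so the next call on this thread can reuse the encoder
        }
        return se.encode(ca, off, len);//② preparation is done; this is the final step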
}

(1)①
private StringEncoder(Charset cs, String rcn) {
            this.requestedCharsetName = rcn;
            this.cs = cs;
            this.ce = cs.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
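                //this looks fancy but is just method chaining (a fluent style), another way of setting properties, much like calling setters
                //each method returns the encoder itself, so the next call can be chained right after it. The benefit is that it is simple and clear.
                //the object is configured at creation time, and later maintenance is easy: you can see at a glance exactly what was initialized
            this.isTrusted = (cs.getClass().getClassLoader0() == null);//true if the charset class was loaded by the bootstrap class loader, i.e. it ships with the JDK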
}
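The effect of the two REPLACE settings can be observed with the public CharsetEncoder API. This is a sketch rather than the StringEncoder code: ReplaceActionDemo and encodeAscii are made-up names, and US-ASCII is picked only because it cannot represent Chinese characters.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ReplaceActionDemo {
    static byte[] encodeAscii(String s) {
        // Same chained configuration as in the StringEncoder constructor.
        CharsetEncoder ce = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            ByteBuffer bb = ce.encode(CharBuffer.wrap(s));
            byte[] out = new byte[bb.remaining()];
            bb.get(out);
            return out;
        } catch (CharacterCodingException e) {
            // Not expected: with REPLACE, bad input is substituted, not reported.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = encodeAscii("a\u4E2Db");
        System.out.println(new String(bytes, StandardCharsets.US_ASCII)); // prints "a?b"
    }
}
```

With REPLACE the unmappable character becomes the encoder's replacement byte ('?' by default), which is why getBytes never throws for unencodable characters.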

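(1)②

byte[] encode(char[] ca, int off, int len) {
            int en = scale(len, ce.maxBytesPerChar());//not shown because it is trivial: it just multiplies the two arguments.
            //the first is the number of chars, the second the maximum number of bytes one char can produce, so the result is the largest possible byte length for this string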
            byte[] ba = new byte[en];//创建一个这么长的byte数组
            if (len == 0)
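                return ba;//nothing to encode for length 0, return immediately
            if (ce instanceof ArrayEncoder) {
                int blen = ((ArrayEncoder)ce).encode(ca, off, len, ba);
                return safeTrim(ba, blen, cs, isTrusted);//safeTrim cuts the array down to the blen bytes actually written
                //analysis: ce is declared as the abstract class CharsetEncoder and ArrayEncoder is an interface; neither extends the other,
                //but a concrete encoder class can extend CharsetEncoder and also implement ArrayEncoder, and some of the JDK's built-in encoders do,
                //so this branch is a fast path that works directly on the arrays. ArrayEncoder lives in sun.nio.cs and is not part of the public API
            } else {
                ce.reset();//resets the CharsetEncoder; in effect it just sets its state field back to 0 (ST_RESET)
                ByteBuffer bb = ByteBuffer.wrap(ba);//a byte buffer backed by ba. I have not used it in a project, so I cannot explain it in detail,
                //but the name says it: a class that provides a buffer over bytes
                CharBuffer cb = CharBuffer.wrap(ca, off, len);//as the name suggests, a char buffer over the input
                try {
                    CoderResult cr = ce.encode(cb, bb, true);//①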
                    if (!cr.isUnderflow())
                        cr.throwException();
                    cr = ce.flush(bb);
                    if (!cr.isUnderflow())
                        cr.throwException();
                } catch (CharacterCodingException x) {
                    // Substitution is always enabled,
                    // so this shouldn't happen
                    throw new Error(x);
                }
                return safeTrim(ba, bb.position(), cs, isTrusted);
            }
}
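The buffer-based branch can be rebuilt with public classes only. The sketch below follows the same steps (size for the worst case, wrap, encode, flush, trim) but is my own approximation: EncodePipelineDemo is a hypothetical class, and Arrays.copyOf stands in for the private safeTrim.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodePipelineDemo {
    static byte[] encode(char[] ca, int off, int len, Charset cs) {
        CharsetEncoder ce = cs.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        // Worst case: every char needs maxBytesPerChar bytes.
        int en = (int) (len * (double) ce.maxBytesPerChar());
        byte[] ba = new byte[en];
        if (len == 0) {
            return ba;
        }
        ByteBuffer bb = ByteBuffer.wrap(ba);
        CharBuffer cb = CharBuffer.wrap(ca, off, len);
        CoderResult cr = ce.encode(cb, bb, true); // true: no more input follows
        if (!cr.isUnderflow()) {
            throw new IllegalStateException(cr.toString());
        }
        cr = ce.flush(bb);
        if (!cr.isUnderflow()) {
            throw new IllegalStateException(cr.toString());
        }
        // Keep only the bytes actually written.
        return Arrays.copyOf(ba, bb.position());
    }

    public static void main(String[] args) {
        char[] input = "h\u4E2D".toCharArray();
        byte[] out = encode(input, 0, input.length, StandardCharsets.UTF_8);
        System.out.println(out.length); // 4: one byte for h, three for the Chinese char
    }
}
```

UNDERFLOW is the normal result here: it means all input was consumed, so anything else is treated as an error, just as in the original.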

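(1)②①
public final CoderResult encode(CharBuffer in, ByteBuffer out,
                                    boolean endOfInput){
                                    //endOfInput is always true when called from StringCoding
                                    //the other two are the buffers wrapping the input chars and the output bytes
        int newState = endOfInput ? ST_END : ST_CODING;
        //ST_END is 2 and ST_CODING is 1; endOfInput is true here, so newState is 2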
        if ((state != ST_RESET) && (state != ST_CODING)
            && !(endOfInput && (state == ST_END)))
            throwIllegalStateException(state, newState);
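        state = newState;//state goes from 0 to 2

        for (;;) {//an infinite loop written in the C idiom. As an aside, the JDK's code style is not perfectly uniform; you can still tell who wrote which part
	//it is an ordinary loop; for (;;) and while (true) compile to the same bytecode, so the choice is purely a matter of style
	//personal opinion: coding conventions and code efficiency are not closely tied. Conventions may be followed well elsewhere, yet I feel code written by Chinese programmers
	//tends to be more efficient, since projects here routinely face concurrency in the thousands or tens of thousands and ordinary programmers are expected to cope
            CoderResult cr;
            try {
                cr = encodeLoop(in, out);//the encoding loop: an abstract method implemented by each concrete charset encoder; the returned CoderResult summarizes how the run ended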
            } catch (BufferUnderflowException x) {
                throw new CoderMalfunctionError(x);
            } catch (BufferOverflowException x) {
                throw new CoderMalfunctionError(x);
            }

            if (cr.isOverflow())
                return cr;

            if (cr.isUnderflow()) {
                if (endOfInput && in.hasRemaining()) {
                    cr = CoderResult.malformedForLength(in.remaining());
                    // Fall through to malformed-input case
                } else {
                    return cr;
                }
            }

            CodingErrorAction action = null;
            if (cr.isMalformed())
                action = malformedInputAction;
            else if (cr.isUnmappable())
                action = unmappableCharacterAction;
            else
                assert false : cr.toString();

            if (action == CodingErrorAction.REPORT)
                return cr;

            if (action == CodingErrorAction.REPLACE) {
                if (out.remaining() < replacement.length)
                    return CoderResult.OVERFLOW;
                out.put(replacement);
            }

            if ((action == CodingErrorAction.IGNORE)
                || (action == CodingErrorAction.REPLACE)) {
                // Skip erroneous input either way
                in.position(in.position() + cr.length());
                continue;
            }

            assert false;
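            //assertion: reaching this point is always an error, something unexpected happened, so the assert fails
            //the chain of checks above has no comments in the source, but in short: OVERFLOW and UNDERFLOW are returned to the caller, and malformed or unmappable input is reported, replaced, or skipped according to the configured action. This is about as deep as it goes, and getBytes is finally done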
        }

}

2018-11-14 17:26:00, first update: 1-12
2018-11-15 11:56:25, second update: 13-17
2018-11-17 17:29:38, third update: 18-23
