These are personal learning notes, so accuracy is not guaranteed. Corrections are welcome.
Sometimes we come across strings that start with \u. We know these are Unicode escapes: each \uxxxx group corresponds to one Unicode character. What is the actual binary storage format of these encoded characters?
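As a small illustration of the \uxxxx notation, here is a minimal Java sketch. The class name UnicodeEscapeDemo and the helper toEscapes are made up for this example. It assumes every character fits in one 16-bit Java char; characters outside that range would show up as two escapes.

```java
class UnicodeEscapeDemo {
    // Hypothetical helper: renders each UTF-16 code unit of s as a
    // backslash-u escape with four hex digits.
    static String toEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("\\u%04x", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u6d4b\u8bd5";           // two escapes, two chars
        System.out.println(s.length());       // number of UTF-16 code units
        System.out.println(toEscapes(s));     // back to the escaped form
    }
}
```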
We know that Unicode can represent most of the world's writing systems, and that under its most common encoding, UTF-8, a single character takes 1 to 4 bytes (variable length). The origin and advantages of this design are not covered in detail here; the focus is on how the \u code strings we saw relate to the binary form.
In Java, with UTF-8 as the default charset, printing the characters and bytes of the two-character string "测试" ("test") gives the following:
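import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.stream.Collectors;

// Default charset of the running JVM (UTF-8 in this test)
System.out.println(Charset.defaultCharset());
String s = "测试";
// Each char (UTF-16 code unit) as hex, joined with tabs
System.out.println(s.chars().mapToObj(Integer::toHexString).collect(Collectors.joining("\t")));
// No charset argument, so the default charset is used for encoding
byte[] bs = s.getBytes();
System.out.println(Arrays.toString(bs));
/* Result:
UTF-8
6d4b 8bd5
[-26, -75, -117, -24, -81, -107] */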
The output shows that "测试" occupies six bytes under UTF-8. Java bytes are signed, so converting [-26, -75, -117, -24, -81, -107] to 8-bit two's-complement binary gives the binary storage of the two characters: 11100110 10110101 10001011 11101000 10101111 10010101
The values 6d4b 8bd5, obtained by calling Integer.toHexString on each char, are the Unicode code points of the two characters.
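The conversion from signed byte values to bit strings can be sketched as follows. The class name ByteBitsDemo and the helper toBits are assumptions for this example. Masking with 0xFF is what turns a negative Java byte into its unsigned 8-bit pattern.

```java
import java.nio.charset.StandardCharsets;

class ByteBitsDemo {
    // Formats one byte as exactly 8 binary digits.
    // b & 0xFF drops the sign extension of negative bytes.
    static String toBits(byte b) {
        String bits = Integer.toBinaryString(b & 0xFF);
        return String.format("%8s", bits).replace(' ', '0');
    }

    // Formats a byte array as space-separated 8-bit groups.
    static String toBits(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(toBits(b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Escapes are used so the source file encoding does not matter.
        byte[] bs = "\u6d4b\u8bd5".getBytes(StandardCharsets.UTF_8);
        System.out.println(toBits(bs));
    }
}
```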
How are the two related ?
The encyclopedia page on UTF-8 describes it as follows:
The meaning of UTF-8 encoded bytes
- For any byte B in a UTF-8 encoding, if the first bit of B is 0, then B represents a character on its own (an ASCII character);
- If the first bit of B is 1 and the second bit is 0, then B is a continuation byte inside a multibyte (non-ASCII) character;
- If the first two bits of B are 1 and the third bit is 0, then B is the first byte of a character represented by two bytes;
- If the first three bits of B are 1 and the fourth bit is 0, then B is the first byte of a character represented by three bytes;
- If the first four bits of B are 1 and the fifth bit is 0, then B is the first byte of a character represented by four bytes;
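The five rules above can be written as bit-mask checks. This is only a sketch: the class name Utf8ByteKind, the method classify, and the label strings are invented for this example, and it looks at the leading bits only. A strict decoder would reject a few more byte values, such as 0xC0 and 0xC1.

```java
class Utf8ByteKind {
    // Classifies a byte purely by its leading marker bits.
    static String classify(byte b) {
        int v = b & 0xFF;                                  // unsigned view of the byte
        if ((v & 0x80) == 0x00) return "single-byte";      // 0xxxxxxx
        if ((v & 0xC0) == 0x80) return "continuation";     // 10xxxxxx
        if ((v & 0xE0) == 0xC0) return "lead of 2";        // 110xxxxx
        if ((v & 0xF0) == 0xE0) return "lead of 3";        // 1110xxxx
        if ((v & 0xF8) == 0xF0) return "lead of 4";        // 11110xxx
        return "invalid";                                  // 11111xxx
    }

    public static void main(String[] args) {
        byte[] bs = {-26, -75, -117, -24, -81, -107};
        for (byte b : bs) {
            System.out.println(b + " -> " + classify(b));
        }
    }
}
```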
So, in the binary string obtained above, the leading bits of every 8-bit group are markers. A byte starting with 1110 says that the current character needs 3 bytes and that this byte is the first of them; each following byte starts with 10, which marks it as a continuation of the same character.
Removing the marker bits from the first three bytes and concatenating what is left gives 0110 110101 001011. The hexadecimal Unicode code of "测", 6d4b, converted to binary is 0110 1101 0100 1011, which is the same 16 bits.
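Going in the other direction, the three-byte layout can be assembled from a code point. The sketch below assumes the code point lies in the range U+0800 to U+FFFF; the class name Utf8ThreeByteEncode and the method encode3 are hypothetical, and the sketch does not reject surrogate values the way a real encoder would.

```java
import java.util.Arrays;

class Utf8ThreeByteEncode {
    // Builds 1110xxxx 10xxxxxx 10xxxxxx from a 16-bit code point.
    static byte[] encode3(int codePoint) {
        if (codePoint < 0x0800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("not a three-byte code point");
        }
        int b0 = 0xE0 | (codePoint >> 12);          // marker 1110 + top 4 bits
        int b1 = 0x80 | ((codePoint >> 6) & 0x3F);  // marker 10 + middle 6 bits
        int b2 = 0x80 | (codePoint & 0x3F);         // marker 10 + low 6 bits
        return new byte[] {(byte) b0, (byte) b1, (byte) b2};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode3(0x6d4b)));
    }
}
```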
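The strip-and-merge step can be checked with a few masks and shifts. This is a minimal sketch that assumes the input really is a well-formed three-byte sequence; the names Utf8ThreeByteDecode and decode3 are made up for the example.

```java
class Utf8ThreeByteDecode {
    // Removes the 1110 / 10 / 10 markers and joins the remaining
    // 4 + 6 + 6 payload bits into one code point.
    static int decode3(byte b0, byte b1, byte b2) {
        int high = b0 & 0x0F;   // low 4 bits of the lead byte
        int mid = b1 & 0x3F;    // low 6 bits of the first continuation byte
        int low = b2 & 0x3F;    // low 6 bits of the second continuation byte
        return (high << 12) | (mid << 6) | low;
    }

    public static void main(String[] args) {
        int first = decode3((byte) -26, (byte) -75, (byte) -117);
        int second = decode3((byte) -24, (byte) -81, (byte) -107);
        System.out.println(Integer.toHexString(first) + " " + Integer.toHexString(second));
    }
}
```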
The advantages of this design are clear. It is easy to extend: the original scheme allowed sequences of up to 6 bytes, although the current standard (RFC 3629) limits UTF-8 to 4 bytes. Apart from the marker bits the encoding adds little overhead, so the data stays compact and is easy to transmit. One-byte UTF-8 codes are also fully compatible with ASCII, so UTF-8 is arguably the best choice in most scenarios.
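The variable length and the ASCII compatibility can both be observed from Java. The sketch below uses hypothetical names (Utf8LengthDemo, utf8Length, sameAsAscii), and the sample characters are written as escapes so that the source file encoding does not affect the result.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8LengthDemo {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True when the UTF-8 bytes equal the US-ASCII bytes of the same string.
    static boolean sameAsAscii(String s) {
        return Arrays.equals(
                s.getBytes(StandardCharsets.UTF_8),
                s.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));             // U+0041
        System.out.println(utf8Length("\u00e9"));        // U+00E9
        System.out.println(utf8Length("\u6d4b"));        // U+6D4B
        System.out.println(utf8Length("\ud83d\ude00"));  // U+1F600 as a surrogate pair
        System.out.println(sameAsAscii("hello"));
    }
}
```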