Solving garbled Chinese characters caused by truncation in Java socket transmission
When data is sent over TCP with a socket, the string being transmitted is first encoded into a byte array. With UTF-8 encoding, digits and ASCII letters take 1 byte each, while Chinese characters generally take 3 bytes each.
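These byte lengths are easy to check by encoding a few sample strings and counting the bytes. The snippet below is only an illustrative sketch; the class name Utf8Lengths and the helper utf8Length are made up for this example, and the Chinese character is written as a Unicode escape so that the source file encoding does not matter.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Number of bytes the string occupies once it is encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("7"));        // a digit: 1
        System.out.println(utf8Length("A"));        // a letter: 1
        System.out.println(utf8Length("\u4E2D"));   // U+4E2D (中): 3
        System.out.println(utf8Length("A\u4E2D"));  // mixed string: 4
    }
}
```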
Now suppose the transmitted string mixes digits, letters, and Chinese characters. On the receiving side, each call to the read method fills a byte array of fixed length. Because digits and letters have a different UTF-8 encoded length from Chinese characters, the Chinese character at the end of a read may be cut off partway through. For example, suppose the byte array the receiver passes to each read call has length 20, and the sender transmits the short string
"Hello World! 你好中国!", which converts to the byte array
[0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x20, 0x57, 0x6F, 0x72, 0x6C, 0x64, 0x21, 0x20, 0xE4, 0xBD, 0xA0, 0xE5, 0xA5, 0xBD, 0xE4, 0xB8, 0xAD, 0xE5, 0x9B, 0xBD, 0x21]
The first read at the receiving end fills a byte array of length 20, which contains
[0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x20, 0x57, 0x6F, 0x72, 0x6C, 0x64, 0x21, 0x20, 0xE4, 0xBD, 0xA0, 0xE5, 0xA5, 0xBD, 0xE4]
Converting these bytes to a string gives
"Hello World! 你好�", whose last character is garbled. The first 19 bytes (13 bytes of English letters, punctuation, and spaces, plus 6 bytes for the first two Chinese characters) decode to "Hello World! 你好". The 20th byte is 0xE4, the first of the three bytes [0xE4, 0xB8, 0xAD] that encode the character "中". On its own this byte cannot be converted to a visible character: its highest bit is 1, so it does not correspond to any single-byte character. This is where the truncation happens. The remaining two bytes of "中" arrive as the first two bytes of the byte array filled by the next read.
To solve the truncation problem, we need to determine whether the last one or two bytes of the received byte array belong to a truncated Chinese character.
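Before looking at the fix, the truncation in the example above can be reproduced in a few lines. This is a minimal sketch, not real socket code: the class TruncationDemo and the method decodeFirstChunk are hypothetical names, and the "read" is simulated by copying the first 20 bytes of the encoded message. Java's String constructor replaces bytes it cannot decode with the replacement character U+FFFD, which is what appears at the end of the result.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class TruncationDemo {
    // "Hello World! 你好中国!" written with Unicode escapes.
    static final String MESSAGE = "Hello World! \u4F60\u597D\u4E2D\u56FD!";

    // Decodes only the first `size` bytes of the UTF-8 encoding of `text`,
    // mimicking a receiver that converts each fixed-size read on its own.
    static String decodeFirstChunk(String text, int size) {
        byte[] all = text.getBytes(StandardCharsets.UTF_8);
        byte[] chunk = Arrays.copyOf(all, Math.min(size, all.length));
        return new String(chunk, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The whole message is 26 bytes long.
        System.out.println(MESSAGE.getBytes(StandardCharsets.UTF_8).length);
        // The first 20 bytes end in the middle of U+4E2D (中).
        System.out.println(decodeFirstChunk(MESSAGE, 20));
    }
}
```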
To make that determination, we first need to understand how a Chinese character is laid out in UTF-8. Nearly all common Chinese characters are encoded as three bytes in UTF-8, and the encoding scheme fixes the leading bits of every byte in a three-byte sequence:
In a three-byte encoding, the first three bits of the first byte are 111 and the fourth bit is 0 (1110 xxxx), and the first two bits of each of the remaining two bytes are 10 (10xx xxxx).
Take the Chinese character "中" as an example. Its three bytes, written in binary, are:
[1110 0100, 1011 1000, 1010 1101]
These bytes satisfy the UTF-8 encoding rule just described.
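The binary form of these bytes can be printed directly to see the 1110 and 10 prefixes. The following is a small illustrative sketch; the class Utf8Bits and the helper toBits are names invented for this example.

```java
import java.nio.charset.StandardCharsets;

class Utf8Bits {
    // Formats one byte as eight binary digits, for example 0xE4 -> "11100100".
    static String toBits(byte b) {
        // Java bytes are signed; "& 0xFF" keeps only the low 8 bits.
        String s = Integer.toBinaryString(b & 0xFF);
        // Left-pad with zeros to a width of eight digits.
        return "00000000".substring(s.length()) + s;
    }

    public static void main(String[] args) {
        // U+4E2D is the character 中.
        for (byte b : "\u4E2D".getBytes(StandardCharsets.UTF_8)) {
            System.out.println(toBits(b));
        }
    }
}
```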
How, then, do we use this encoding format to decide whether the last one or two bytes of the byte array belong to a truncated character?
We call the last two bytes of the byte array firstByte and secondByte: firstByte is the second-to-last byte and secondByte is the last byte.
First case: if the last byte, secondByte, matches the 1110 xxxx format, it is the first byte of a Chinese character whose remaining two bytes are in the next received byte array, so truncation has occurred. We need to hold back secondByte and combine it with the first two bytes of the next byte array to decode the character.
Second case: if the second-to-last byte, firstByte, matches the 1110 xxxx format, it is the first byte of a Chinese character, the last byte is the second byte of that character, and the third byte is the first byte of the next received byte array, so truncation has occurred. Here the last two bytes of the byte array must be saved and combined with the first byte of the next byte array to decode the character.
To test whether a byte matches the 1110 xxxx format we use a mask: keep the top four bits of the byte, zero the low four bits, and compare the result with 1110 0000.
Check whether (secondByte & 11110000) == 11100000 holds. If it does, this is the first case; otherwise
check whether (firstByte & 11110000) == 11100000 holds. If it does, this is the second case.
In Java the two constants are written 0xF0 and 0xE0, and the parentheses are required because == binds more tightly than &. If neither test holds, no three-byte character has been cut off at the end of the byte array.
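The two checks can be put together into a small receive loop. The code below is a sketch under stated assumptions rather than a complete implementation: the class ChunkedUtf8 and its methods are hypothetical names, the socket is simulated by slicing an in-memory byte array into fixed-size chunks, and only one-byte and three-byte UTF-8 characters are handled, as in the discussion above. Bytes that are held back are prepended to the next chunk before decoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ChunkedUtf8 {
    // True if b has the form 1110xxxx, the first byte of a three-byte character.
    static boolean isThreeByteLead(byte b) {
        return (b & 0xF0) == 0xE0;
    }

    // Number of bytes at the end of buf[0..len) that start a three-byte
    // character whose remaining bytes have not arrived yet: 0, 1, or 2.
    static int truncatedTail(byte[] buf, int len) {
        if (len >= 1 && isThreeByteLead(buf[len - 1])) return 1; // first case
        if (len >= 2 && isThreeByteLead(buf[len - 2])) return 2; // second case
        return 0;
    }

    // Simulates reading `data` in pieces of `chunkSize` bytes (chunkSize > 0)
    // and decoding each piece, carrying truncated bytes over to the next piece.
    static String decodeInChunks(byte[] data, int chunkSize) {
        StringBuilder out = new StringBuilder();
        byte[] carry = new byte[0]; // bytes held back from the previous chunk
        for (int off = 0; off < data.length; off += chunkSize) {
            int n = Math.min(chunkSize, data.length - off);
            // Prepend the held-back bytes to the newly "received" bytes.
            byte[] buf = new byte[carry.length + n];
            System.arraycopy(carry, 0, buf, 0, carry.length);
            System.arraycopy(data, off, buf, carry.length, n);
            int hold = truncatedTail(buf, buf.length);
            out.append(new String(buf, 0, buf.length - hold, StandardCharsets.UTF_8));
            carry = Arrays.copyOfRange(buf, buf.length - hold, buf.length);
        }
        // Anything still held back means the stream ended in mid-character.
        out.append(new String(carry, StandardCharsets.UTF_8));
        return out.toString();
    }

    public static void main(String[] args) {
        // "Hello World! 你好中国!" written with Unicode escapes.
        String message = "Hello World! \u4F60\u597D\u4E2D\u56FD!";
        byte[] data = message.getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeInChunks(data, 20).equals(message)); // true
    }
}
```

With a chunk size of 20, the first chunk ends in 0xE4, so one byte is held back and the first 19 bytes are decoded; the held byte is then decoded together with the six bytes of the second chunk.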