Fixing garbled text caused by truncated Chinese characters in Java socket transmission

sunfulv 2021-06-23 22:49:53

When data is transmitted over TCP with a socket, the string being sent is encoded into a byte array. In UTF-8, digits and letters each take 1 byte, while Chinese characters generally take 3 bytes. See these references:

In the UTF-8 character set, why does a Chinese character need three bytes? - Bitter tea - Blog Garden (cnblogs.com)

UTF-8: Once Upon a Time (taoshu.in)
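The byte lengths mentioned above can be checked directly. The following is a minimal sketch; the class name Utf8Lengths and the helper utf8Length are made up for illustration, and the Chinese characters are written as Unicode escapes so that the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Number of bytes the given text occupies once encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));       // ASCII letter: 1
        System.out.println(utf8Length("7"));       // ASCII digit: 1
        System.out.println(utf8Length("\u4E2D"));  // the character 中: 3
    }
}
```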

If the transmitted string mixes digits, letters, and Chinese characters, truncation can occur on the receiving side. Each call to read fills a byte array of fixed length, and because digits and letters have a different UTF-8 encoded length from Chinese characters, the Chinese character at the end of a read may be cut off. For example, suppose the receiver reads into a byte array of length 20 on each call, and the sender sends the short string "Hello World! 你好中国!". Converted to a byte array, the string is

[0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x20, 0x57, 0x6F, 0x72, 0x6C, 0x64, 0x21, 0x20, 0xE4, 0xBD, 0xA0, 0xE5, 0xA5, 0xBD, 0xE4, 0xB8, 0xAD, 0xE5, 0x9B, 0xBD, 0x21]

The first read at the receiving end returns 20 bytes. The corresponding byte array is

[0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x20, 0x57, 0x6F, 0x72, 0x6C, 0x64, 0x21, 0x20, 0xE4, 0xBD, 0xA0, 0xE5, 0xA5, 0xBD, 0xE4]

Converted to characters, this becomes "Hello World! 你好�", and the last character is garbled. The first 19 bytes (13 bytes of English letters, punctuation, and spaces, plus 6 bytes for the first two Chinese characters) convert to the string "Hello World! 你好". The last byte, 0xE4, is the first of the three bytes [0xE4, 0xB8, 0xAD] that encode the character "中". On its own, this byte cannot be converted to a visible character: its first bit is 1, so it does not correspond to any single-byte character. This is the truncation. The remaining two bytes of "中" are the first two bytes of the byte array returned by the next read.
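The truncation can be reproduced without a socket by decoding only the first 20 bytes of the encoded string. This is a small sketch: the class TruncationDemo and the helper decodePrefix are illustrative names, and the sent string is assumed to be "Hello World! 你好中国!" (written with Unicode escapes), which matches the byte array shown earlier.

```java
import java.nio.charset.StandardCharsets;

class TruncationDemo {
    // Decodes only the first `count` bytes of `data` as UTF-8,
    // the way a naive receiver would decode a single read.
    static String decodePrefix(byte[] data, int count) {
        return new String(data, 0, count, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Hello World! 你好中国!" encoded as UTF-8 (26 bytes in total).
        byte[] sent = "Hello World! \u4F60\u597D\u4E2D\u56FD!"
                .getBytes(StandardCharsets.UTF_8);

        // 19 bytes end on a character boundary and decode cleanly.
        System.out.println(decodePrefix(sent, 19));

        // 20 bytes end in the lone lead byte 0xE4; the decoder substitutes
        // the replacement character U+FFFD for the incomplete sequence.
        String first = decodePrefix(sent, 20);
        System.out.println(first);
        System.out.println(first.endsWith("\uFFFD")); // true
    }
}
```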

To solve the truncation problem, we need to determine whether the last one or two bytes of the received byte array belong to a truncated Chinese character.

To do that, first look at how a Chinese character is laid out in UTF-8. Nearly all common Chinese characters are encoded as three bytes in UTF-8, and the UTF-8 scheme specifies the form of every byte in a three-byte encoding:

In a three-byte encoding, the first three bits of the first byte are 111 and the fourth bit is 0; the first two bits of each of the remaining two bytes are 10.

For example, the three bytes of the Chinese character "中" can be written in binary as [1110 0100, 1011 1000, 1010 1101],

which satisfies the UTF-8 encoding requirements.
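To see the bit layout for any character, the UTF-8 bytes can be printed in binary. The sketch below uses a hypothetical helper utf8Bits in a made-up class BinaryView; it is only meant to reproduce the binary form quoted above.

```java
import java.nio.charset.StandardCharsets;

class BinaryView {
    // Renders each UTF-8 byte of `text` as an 8-digit binary string,
    // separated by spaces.
    static String utf8Bits(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // b & 0xFF maps the signed Java byte to the range 0..255.
            String bits = Integer.toBinaryString(b & 0xFF);
            // Left-pad with zeros to a width of 8.
            sb.append("00000000".substring(bits.length())).append(bits);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u4E2D is the character 中.
        System.out.println(utf8Bits("\u4E2D")); // 11100100 10111000 10101101
        System.out.println(utf8Bits("A"));      // 01000001
    }
}
```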

How do we use this format to decide whether the last one or two bytes of the byte array have been truncated?

Define the last two bytes of the byte array as firstByte and secondByte: firstByte is the second-to-last byte and secondByte is the last byte.

Situation 1:

If the last byte matches the 1110 xxxx format, it is the first byte of a Chinese character whose remaining two bytes are in the next received byte array, so a truncation has occurred. We need to hold on to the last byte, secondByte, and combine it with the first two bytes of the next byte array to parse the Chinese character.

Situation 2:

If the second-to-last byte matches the 1110 xxxx format, it is the first byte of a Chinese character, the last byte is the second byte of that character, and the third byte is the first byte of the next received byte array, so a truncation has occurred. In this situation the last two bytes of the byte array must be saved and combined with the first byte of the next byte array to parse the corresponding Chinese character.

To test whether a byte matches the 1110 xxxx format, we use a mask: keep the high four bits of the byte and clear the low four bits (set them to zero), then compare the result with 1110 0000.
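In Java the mask test might look like the sketch below. The class LeadByteMask and the method isThreeByteLead are illustrative names, not taken from the original article. Two details are easy to get wrong: the comparison needs parentheses because == binds more tightly than & in Java, and bytes are signed, which is harmless here because the mask discards the sign-extended upper bits.

```java
class LeadByteMask {
    // True if b has the form 1110xxxx, i.e. it is the first byte of a
    // three-byte UTF-8 sequence.
    static boolean isThreeByteLead(byte b) {
        // Keep the high four bits, clear the low four, then compare.
        return (b & 0b1111_0000) == 0b1110_0000;
    }

    public static void main(String[] args) {
        System.out.println(isThreeByteLead((byte) 0xE4)); // true: first byte of 中
        System.out.println(isThreeByteLead((byte) 0xB8)); // false: continuation byte 10xxxxxx
        System.out.println(isThreeByteLead((byte) 0x48)); // false: ASCII 'H'
    }
}
```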

Check whether (secondByte & 11110000) == 11100000 holds. If it does, the last byte starts a three-byte character, which is the first situation. Otherwise,

check whether (firstByte & 11110000) == 11100000 holds. If it does, the second-to-last byte starts a three-byte character, which is the second situation.
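Putting the two situations together, a receiver can hold back the truncated tail of each read and prepend it to the next one. The following is a sketch under several assumptions: the class ChunkedUtf8Reader and its methods are hypothetical names, a ByteArrayInputStream stands in for the socket's input stream, and only three-byte characters are handled, as in the approach described above. The check for situation 2 also verifies that the last byte is a continuation byte (10xxxxxx), which is a small addition to the description.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class ChunkedUtf8Reader {
    // True if b has the form 1110xxxx (first byte of a three-byte sequence).
    static boolean isThreeByteLead(byte b) {
        return (b & 0xF0) == 0xE0;
    }

    // Number of bytes at the end of buf[0..len) that belong to a truncated
    // three-byte character: 1 (situation 1), 2 (situation 2), or 0 (none).
    static int truncatedTailLength(byte[] buf, int len) {
        if (len >= 1 && isThreeByteLead(buf[len - 1])) {
            return 1;
        }
        if (len >= 2 && isThreeByteLead(buf[len - 2])
                && (buf[len - 1] & 0xC0) == 0x80) {
            return 2;
        }
        return 0;
    }

    // Reads `in` in chunks of at most chunkSize bytes, carrying any truncated
    // tail bytes over to the front of the next chunk before decoding.
    static String readAll(InputStream in, int chunkSize) throws IOException {
        StringBuilder out = new StringBuilder();
        byte[] buf = new byte[chunkSize + 2]; // room for up to 2 carried bytes
        int carried = 0;
        int n;
        while ((n = in.read(buf, carried, chunkSize)) != -1) {
            int len = carried + n;
            int tail = truncatedTailLength(buf, len);
            // Decode everything except the held-back tail.
            out.append(new String(buf, 0, len - tail, StandardCharsets.UTF_8));
            // Move the tail to the front of the buffer for the next read.
            System.arraycopy(buf, len - tail, buf, 0, tail);
            carried = tail;
        }
        // The stream ended inside a character: decode what is left as-is.
        if (carried > 0) {
            out.append(new String(buf, 0, carried, StandardCharsets.UTF_8));
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // "Hello World! 你好中国!" written with Unicode escapes.
        String message = "Hello World! \u4F60\u597D\u4E2D\u56FD!";
        byte[] sent = message.getBytes(StandardCharsets.UTF_8);
        String received = readAll(new ByteArrayInputStream(sent), 20);
        System.out.println(received.equals(message)); // true
    }
}
```

With a chunk size of 20, the first read ends in 0xE4 (situation 1), so only 19 bytes are decoded and 0xE4 is joined with the next read. If characters of other lengths can appear in the stream, wrapping the input stream in a java.io.InputStreamReader constructed with StandardCharsets.UTF_8 lets the JDK handle sequences split across reads.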

Copyright notice
This article was written by [sunfulv]. Please include a link to the original when reposting. Thank you.
https://javamana.com/2021/06/20210623224903255n.html
