One 、 Error report review
take emoji Words are written directly into SQL in , perform insert Statement error ;
INSERT INTO `csjdemo`.`student` (`ID`, `NAME`, `SEX`, `AGE`, `CLASS`, `GRADE`, `HOBBY`) VALUES ('20', ' Chen ha ha ', ' male ', '20', '181 class ', '9 grade ', ' Watch the movie ');
[Err] 1366 - Incorrect string value: '\xF0\x9F\x98\x93' for column 'NAME' at row 1
Changed the database code 、 System code and table field encoding format → utf8mb4 after , That's all right. ：
INSERT INTO `student` (`ID`, `NAME`, `SEX`, `AGE`, `CLASS`, `GRADE`, `HOBBY`) VALUES (null, ' Chen ha ha ', ' male ', '20', '181 class ', '9 grade ', ' Watch the movie ');
Two 、MySQL in utf8 Interesting things about
MySQL Of “utf8” It's not really UTF-8.
stay MySQL in ,“utf8” Encoding only supports up to three bytes per character , And the real UTF-8 It's up to four bytes per character .
stay utf8 In the encoding , Chinese is Zhan 3 Bytes , Other figures 、 english 、 The symbol occupies one byte . I've compiled a very detailed report MySQL Learning notes , You can look at it .
but emoji The symbol occupies 4 Bytes , Some more complicated words 、 Traditional Chinese characters are also 4 Bytes . So the write failed , Should be changed into utf8mb4.
As shown in the figure above , It's a code change to utf8mb4 After the storage of data , You can clearly compare the number of characters 、 Number of bytes . Because of this ,4 The contents of bytes go to utf8 Insert... Into the code , Surely not , I can't insert it , Is that so? （ Pan shows his hand ）.
MySQL Never fixed this bug, They are 2010 In, a new one called “utf8mb4” Character set for , Skillfully bypassing the problem .
Of course , They didn't advertise the new character set （ Maybe it's because of this bug Make them feel embarrassed ）, Even now, developers are still recommended to use “utf8”, But these suggestions are all wrong .
1. utf8mb4 It's real UTF-8
Yes ,MySQL Of “utf8mb4” It's real “UTF-8”.
MySQL Of “utf8” It's a kind of “ Exclusive code ”, It can code Unicode There are not many characters .
ad locum Mark once ： All in use “utf8” Of MySQL and MariaDB Users should use “utf8mb4”, Never use “utf8”.
So what is coding ？ What is? UTF-8？
We all know , Computer use 0 and 1 To store text . Like characters “C” Be saved into “01000011”, So the computer needs to go through two steps to display this character ：
- Computer reading “01000011”, Get figures 67, because 67 To be encoded as “01000011”.
- The computer is in Unicode Find... In character set 67, eureka “C”.
- My computer will “C” mapping Unicode In the character set 67.
- My computer will 67 Code as “01000011”, And send it to Web The server .
Almost all web applications use Unicode Character set , Because there's no reason to use other character sets .
Unicode The character set contains millions of characters . The simplest coding is UTF-32, Use... For each character 32 position . It's the easiest thing to do , Because all along , The computer will 32 Bits are treated as numbers , And computers are the best at dealing with numbers . But the problem is , This is a waste of space .
UTF-8 You can save space , stay UTF-8 in , character “C” It only needs 8 position , Some unusual characters , such as “” need 32 position . Other characters may use 16 Bit or 24 position . An article like this one , If you use UTF-8 code , The only space occupied is UTF-32 About a quarter of .
2. utf8 A brief history of
Why? MySQL Developers will let “utf8” invalid ？ We may be able to start with MySQL Find the answer in the version submission log .
MySQL from 4.1 Version starting support UTF-8, That is to say 2003 year , And what we use today UTF-8 standard （RFC 3629） It was then that .
The old version of the UTF-8 standard （RFC 2279） At most, each character is supported 6 Bytes .2002 year 3 month 28 Japan ,MySQL Developers in the first MySQL 4.1 The preview version uses RFC 2279.
Same year 9 month , They are for MySQL The source code has been adjusted ：“UTF8 Now it only supports 3 A sequence of bytes ”.
Who submitted the code ？ Why did he do this ？ The question is not known . Migrating to Git after （MySQL The first thing to use is BitKeeper）,MySQL Many of the names of the committers in the code base are missing .2003 year 9 There is no clue to explain the change in the month's mailing list .
But we can try to guess ：
2002 year ,MySQL Made a decision ： If the user can guarantee that each row of the data table uses the same number of bytes , that MySQL You can make a big improvement in performance . So , The user needs to define the text column as “CHAR”, Every “CHAR” Columns always have the same number of characters . If fewer characters are inserted than defined ,MySQL Will be followed by a space , If you insert more characters than defined , The excess behind will be truncated .
MySQL Developers are at the beginning of trying to UTF-8 Every character is used 6 Bytes ,CHAR(1) Use 6 Bytes ,CHAR(2) Use 12 Bytes , And so on .
It should be said , Their first act was right , Unfortunately, this version has not been released . But that's what the document says , And it's widely spread , All you know UTF-8 People agree with what's written in the document .
But obviously ,MySQL Developers or vendors are worried that users will do these two things ：
- Use CHAR Definition column （ In the present view ,CHAR It's an old thing , But at that time , stay MySQL Use in CHAR Will be faster , But in the 2005 It won't be like this after ）.
- take CHAR The encoding of the column is set to “utf8”.
My guess is MySQL Developers wanted to help users who wanted to win in space and speed , But they screwed up “utf8” code .
So the result is no winner . Those who want to win in space and speed , When they are using “utf8” Of CHAR Column time , The space used is actually larger than expected , It's slower than expected . And users who want to be right , When they use “utf8” When coding , But can't save like “” Such characters , because “” yes 4 Bytes of .
After the illegal character set was released ,MySQL You can't fix it , Because that requires all users to rebuild their databases . Final ,MySQL stay 2010 It was redistributed in “utf8mb4” To support the real UTF-8.
3、 ... and 、 summary
The main reason is that almost all the articles on the Internet put “utf8” Think it's real UTF-8, Including the articles I wrote and the projects I did before （ Over your face ）; So I hope more friends can see this article .
I believe there are many people in the same boat with me , It's inevitable .
therefore , We'll build it later MySQL、MariaDB Database time , Remember to code the database accordingly utf8mb4. One day , After the programmer in your class or your leader shows this problem , I will feel your technical cow in my heart B.