Linux character sets and handling garbled text

1、 A character is a general term for all kinds of letters and symbols, including national scripts, punctuation, graphic symbols, digits, and so on. A character set is a collection of characters. There are many character sets, and each contains a different number of characters. Common ones are ASCII, GB2312, BIG5, GB18030, and Unicode.

  1. The character set is selected through environment variables. To view the one used by the current terminal:

# echo $LANG            # LANG is the environment variable that holds the locale and character set

# env|grep LANG        # env prints the environment variables of the system

# export|grep LANG    # export marks shell variables or functions as environment variables; with no arguments it lists them

declare -x LANG="en_US.UTF-8"

# locale            # Lists the current locale settings

LANG=en_US.UTF-8                # Default for every LC_* category that is not set explicitly

LC_CTYPE="en_US.UTF-8"            # Character classification and encoding

LC_NUMERIC="en_US.UTF-8"            # Number format

LC_TIME="en_US.UTF-8"            # Date and time formats

LC_COLLATE="en_US.UTF-8"            # Sort order

LC_MONETARY="en_US.UTF-8"        # Currency format

LC_MESSAGES="en_US.UTF-8"         # Language of program messages: prompts, errors, status text, titles, labels, buttons, menus

LC_PAPER="en_US.UTF-8"            # Default paper size

LC_NAME="en_US.UTF-8"            # How personal names are written

LC_ADDRESS="en_US.UTF-8"            # How addresses are written

LC_TELEPHONE="en_US.UTF-8"        # How phone numbers are written

LC_MEASUREMENT="en_US.UTF-8"    # Units of measurement

LC_IDENTIFICATION="en_US.UTF-8"    # Metadata about the locale itself


LC_CTYPE (character classification and encoding) shows that this system uses the en_US.UTF-8 locale, so its character encoding is UTF-8.
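To see only the encoding part of the active locale, locale charmap can be queried. This is a minimal sketch; the function name current_charmap is made up for illustration and is not a standard command.

```shell
#!/bin/sh
# current_charmap is a hypothetical helper name, not a standard command.
# It prints the character encoding of the active locale (for example UTF-8).
current_charmap() {
    locale charmap
}

echo "LANG=${LANG-unset}"
echo "charmap=$(current_charmap)"
```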

  2. How to change the character set

1)、 Set the variable directly. There are two ways:

# LANG=xxx    or    export LANG=xxx

# LC_ALL="xxx"    or    export LC_ALL="xxx"

Note: xxx is the locale to switch to.

To list the locales available on the system, run locale -a. Commonly used ones are zh_CN.GB2312, zh_CN.GB18030, zh_CN.UTF-8, and en_US.UTF-8.

However, these changes only take effect in the current shell. A newly opened login shell does not see them.

A common trick after logging in is therefore to run "LANG=". This clears the locale setting, so programs typically fall back to the default C/POSIX locale and their messages are no longer garbled. Running unset LANG has the same effect.
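The scoping described above can be checked directly. The following is a small sketch, assuming a POSIX shell; the variable names before and after are illustrative. It changes LANG for one command and inside a subshell, then confirms that the parent shell keeps its original value.

```shell
#!/bin/sh
# Remember the value the parent shell starts with (may be empty or unset).
before="${LANG-}"

# 1) Prefix form: the new value applies to this one command only.
LANG=C sh -c 'echo "one command: LANG=$LANG"'

# 2) Subshell: the export is discarded when the parentheses close.
( export LANG=C; echo "subshell:    LANG=$LANG" )

# The parent shell still has its original value.
after="${LANG-}"
echo "parent:      LANG=$after"
```

The prefix form is handy for running a single program in a different locale without touching the rest of the session.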

2)、 Edit the configuration file. The system locale is controlled by /etc/sysconfig/i18n (on RHEL/CentOS 6 and earlier; systemd-based releases use /etc/locale.conf instead):

# vim /etc/sysconfig/i18n

LANG="en_US.UTF-8"    # System language

After saving the file and exiting, the change applies to the current shell only once you run:

$ source /etc/sysconfig/i18n
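Before sourcing the file, it can help to confirm that the value it sets names a locale that is actually installed. The sketch below is one way to do that; lang_from_file and locale_installed are hypothetical helper names, and the normalization step assumes glibc-style locale -a output, which lists UTF-8 locales with the suffix "utf8".

```shell
#!/bin/sh
# lang_from_file is a hypothetical helper: it sources the given file in a
# subshell and prints the LANG value that file would set, so the current
# shell is not modified.
lang_from_file() {
    ( unset LANG; . "$1" >/dev/null 2>&1; printf '%s\n' "${LANG-}" )
}

# locale_installed is also hypothetical: it succeeds if the name appears in
# the output of `locale -a`. Both sides are normalized (lowercase, hyphens
# removed) so that en_US.UTF-8 matches en_US.utf8.
locale_installed() {
    want=$(printf '%s' "$1" | tr 'A-Z' 'a-z' | tr -d '-')
    locale -a 2>/dev/null | tr 'A-Z' 'a-z' | tr -d '-' | grep -qxF -- "$want"
}

# Default path taken from the text above; pass another file as argument 1.
conf=${1:-/etc/sysconfig/i18n}
if [ -r "$conf" ]; then
    value=$(lang_from_file "$conf")
    if locale_installed "$value"; then
        echo "$conf sets LANG=$value (installed)"
    else
        echo "$conf sets LANG=$value (not listed by locale -a)"
    fi
fi
```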

4、 vim options related to encoding:

1) fileencoding: the encoding used to open and save the current file. It holds only one value, so it only suits environments where almost every file uses the same encoding, and it is usually not set by hand.

2) fileencodings: as the name suggests, the enhanced version of fileencoding. It takes a list of encodings that vim tries in order when opening a file. Once configured, any file whose encoding is in the list and whose content is valid is read correctly. Recommended configuration: set fileencodings=ucs-bom,utf-8,gbk,gb2312,gb18030,cp936,latin1

3) encoding: vim's internal encoding. After reading a file, vim does not keep the text in the file's own encoding but converts it to this internal encoding. The default generally follows the operating system: mostly utf-8 on Linux, gbk on Chinese Windows. Recommended configuration: set encoding=utf-8

4) termencoding: the encoding vim uses for output, that is, what it sends to the operating system or the terminal. It defaults to the language encoding of the operating system. On a Linux terminal, configure the terminal emulator and the Linux system with the same encoding and set termencoding to match; otherwise either vim or the shell displays correctly, but not both. If the shell never has to show Chinese file names, it is enough for the terminal and termencoding to agree. Windows recognizes gbk and utf-8 automatically and needs no special configuration. Recommended configuration: set termencoding=utf-8

5) fileformats: distinguishes the line endings of different operating systems, mainly \n versus \r\n. Recommended configuration: set fileformats=unix,dos
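The idea behind fileencodings, trying candidate encodings in order and taking the first that decodes cleanly, can be imitated outside vim with iconv. This is only a rough sketch of the principle, not vim's actual detection logic; guess_encoding and its default candidate list are assumptions made for illustration.

```shell
#!/bin/sh
# guess_encoding is a hypothetical helper that mimics the idea behind
# fileencodings: try each candidate in order and report the first one that
# decodes the whole file without error.
# Usage: guess_encoding FILE [ENCODING...]
guess_encoding() {
    file=$1
    shift
    # Default candidates are an assumption chosen for this sketch.
    [ $# -gt 0 ] || set -- UTF-8 GB18030 ISO-8859-1
    for enc in "$@"; do
        if iconv -f "$enc" -t UTF-8 "$file" >/dev/null 2>&1; then
            printf '%s\n' "$enc"
            return 0
        fi
    done
    return 1
}
```

Strict encodings go first and a single-byte encoding such as latin1 goes last, because latin1 accepts any byte sequence and would otherwise always win. The same reasoning explains why latin1 sits at the end of the recommended fileencodings list.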

Common cases of garbled text:

(1) A file uploaded from Windows to Linux with rz shows garbled text

Solution: 1. Before running rz, use Notepad++ to convert the file to UTF-8 without BOM or to ANSI encoding; 2. In vim, set encoding=utf-8.

(2) Garbled text while editing in SecureCRT or xterm2: simply set the character encoding in the session options to GB2312 or UTF-8, whichever matches the system.

(3) A log file is garbled when edited in vim, in most cases because the log file is encoded as GB2312.

Solution: 1. set encoding=GB2312; 2. If option 1 does not work, switch the SecureCRT or xterm2 session encoding to GB2312.
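If the file has already been uploaded, a leftover UTF-8 byte order mark can also be detected and removed on the Linux side. The sketch below uses only head, tail, and od; the helper names has_utf8_bom and strip_utf8_bom are hypothetical.

```shell
#!/bin/sh
# has_utf8_bom and strip_utf8_bom are hypothetical helper names.
# A UTF-8 byte order mark is the three bytes EF BB BF at the start of a file.
has_utf8_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Print the file without its BOM (unchanged if there is none).
strip_utf8_bom() {
    if has_utf8_bom "$1"; then
        tail -c +4 "$1"
    else
        cat "$1"
    fi
}
```

A possible use is strip_utf8_bom uploaded.txt > clean.txt, where both file names are placeholders.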

(4) File names downloaded by wget are garbled

Solution: adding --restrict-file-names=nocontrol is usually enough, for example wget --restrict-file-names=nocontrol -m

(5) cat displays the file correctly, but vim does not

Solution:

a. Edit /etc/vim/vimrc directly and append the following lines at the end:

set fileencodings=ucs-bom,utf-8,gbk,gb2312,latin1

set fileencoding=gb2312

set termencoding=utf-8

b. Transcode the file (iconv writes the converted text to standard output): iconv -f gb2312 -t utf-8 19.txt

Command used for batch transcoding: iconv -c -f gbk -t utf-8 $data_path/$item_uv
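The batch command above converts one file per call, so it is normally wrapped in a loop. The following is a minimal sketch of such a loop; convert_dir and its four parameters are hypothetical, and the converted copies go to a separate directory so the originals are left untouched.

```shell
#!/bin/sh
# convert_dir is a hypothetical helper. It converts every regular file in
# SRC from encoding FROM to encoding TO and writes the results into DST.
# Usage: convert_dir SRC DST FROM TO
convert_dir() {
    src=$1 dst=$2 from=$3 to=$4
    mkdir -p "$dst" || return 1
    for f in "$src"/*; do
        [ -f "$f" ] || continue
        # -c discards characters that cannot be converted instead of
        # stopping at them, as in the command above.
        iconv -c -f "$from" -t "$to" "$f" > "$dst/$(basename "$f")"
    done
}
```

For example, convert_dir ./gbk_files ./utf8_files gbk utf-8 would convert a directory of GBK files, where both directory names are placeholders.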