What is awk?

awk is a command-line tool in the Linux environment. Because awk is so capable, the string we pass to it reads like the syntax of a programming language. We can call that the Awk language, and the awk tool itself can be seen as the interpreter for it, much like the relationship between the python interpreter and the Python language.

What do we usually use awk for, and what kind of work suits it? awk was created to process the text of files. If every line in a file is separated by a specific delimiter (commonly spaces), we can think of the file as consisting of many columns of text, and such a file is the best fit for awk. In practice, awk is used at work mostly to process large log files and to do statistics on them.

The general composition of an awk command

The most common task for awk is to traverse every line of a file and process each line separately. A complete awk command has the following form:

awk  [options]  'BEGIN{ commands } pattern{ commands } END{ commands }'  file

Here options stands for awk's optional command-line options. The most commonly used one is probably -F, which specifies the separator used to split each line of the file into columns. Everything inside the single quotes that follow is the awk program script, which processes the columns of each line after the split. file is the name of the file awk should process. Let's work through a few demos to see what awk can do.

How awk splits each line

echo '11 22 33 44' | awk '{print $3" "$2" "$1}'
Output: 33 22 11

We pass the string 11 22 33 44 to the awk command through a pipe, which is equivalent to awk processing a file whose content is 11 22 33 44. We did not add -F to specify a separator. By default awk splits each line on whitespace (spaces or tabs); to split on another character, specify it with -F. The command above splits 11 22 33 44 on spaces (any number of spaces between columns is treated as one) into 4 columns.

In awk, a variable written as $ followed by a number refers to the content of one column of the current line after the split. Numbering starts from 1: $1 is the content of column 1, $2 is the second column, and so on. $0 is the content of the whole current line. print is a built-in awk statement that prints the values of variables. We put a space in double quotes between $3, $2 and $1; without it, the values would be printed joined together.

The {} part of this awk command is the middle part of the complete form shown earlier. We omitted the BEGIN block and the END block, and also the pattern in front of the middle block. In other words, a block that is not marked BEGIN or END is the middle block of the complete form. The middle block runs once for every line of the input: if the file has 10 lines, the block runs 10 times, handling one line each time and then moving on to the next line automatically.
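
The -F option mentioned above can be sketched as follows. This is only an illustration: the comma-separated sample line is invented, and printf is used to feed it to awk.

```shell
# Split on commas instead of whitespace (sample data is made up for illustration).
out=$(printf 'alice,30,london\n' | awk -F, '{print $3" "$1}')
echo "$out"    # london alice
```

Without -F, the whole line would contain no whitespace and therefore count as a single column, so $3 would be empty.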

Let's see what happens when the input has two lines, for example:

echo -e '11 22 33 44\naa bb cc dd' | awk '{print $3" "$2" "$1}'
Output:
33 22 11
cc bb aa

Note that echo is used here with the -e option so that the \n escape in the string takes effect; otherwise it would not be turned into a line break. How is the command above executed? Let's simulate what awk does. First awk reads the first line, splits it into columns on spaces, and assigns the string 11 to $1, 22 to $2, 33 to $3 and 44 to $4. It then prints them with print. Next it reads the second line and does the same thing.

Using the pattern part

We have learned the simplest awk command. Now let's add a little more by putting a pattern in front of the block, for example:

echo -e '1 2 3 4\n5 6 7 8' | awk '$1>2{print $3" "$2" "$1}'
Output: 7 6 5

This program is almost the same as the one above, except that we added $1>2 in front of the block. It means: if the value of column 1 of the current line is greater than 2, process the line; otherwise skip it. Put plainly, the pattern filters out the lines that need processing; without a pattern, awk loops over all lines of the file. The pattern can be any conditional expression, for example one using >, <, ==, >=, <= or !=. Compound expressions that combine +, -, *, / with comparisons are allowed, and so are the logical operators &&, || and !. A pattern can also be a /regular expression/ that selects the lines to process.
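
A small sketch of the other pattern forms described above: a compound expression with a logical operator, and a regular-expression pattern. The three-line sample input is invented for illustration.

```shell
data='1 2 3 4
5 6 7 8
9 10 11 12'

# Arithmetic plus comparison, combined with &&:
# keep lines whose first two columns sum to more than 5 and whose fourth column is not 12.
out1=$(printf '%s\n' "$data" | awk '$1+$2>5 && $4!=12 {print $0}')
echo "$out1"   # 5 6 7 8

# Regular-expression pattern: keep lines containing a 1 followed by another digit.
out2=$(printf '%s\n' "$data" | awk '/1[0-9]/ {print $1}')
echo "$out2"   # 9
```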

awk's BEGIN block

The BEGIN block runs before awk reads the first line of the file. Because no line has been read yet, $n holds no data inside the BEGIN block. The BEGIN block is typically used to initialize variables (awk lets you define your own variables: assigning a value to a name defines it, and there is no dedicated keyword for declaring variables) and to print output that is needed only once at the start, such as a table header. For example:

echo -e '1 2 3 4\n5 6 7 8' | awk 'BEGIN{print "c1 c2 c3";print ""}{print $3"  "$2"  "$1}'  
Output:
c1 c2 c3

3  2  1
7  6  5

Note that a block (enclosed in curly braces) can contain several statements separated by semicolons, just as in C. To print a blank line on its own, use print "". The example above uses this to output a header.

awk's END block

The END block runs only after awk's loop has finished processing all lines. Like BEGIN, the END block is executed only once. Here is a complete example.

echo -e '1\n2\n3' | awk 'BEGIN{print "begin"}{print $1}END{print "end"}'
Output:
begin
1
2
3
end

Defining a variable in awk to sum a column

The contents of test.txt are as follows:
11 22 33
23 45 34
22 32 43

awk 'BEGIN{sum=0}{sum+=$1}END{print sum}' test.txt
Output: 56

First, the BEGIN block assigns 0 to the variable sum. Then the loop block adds the value of column 1 of each line to sum. When all lines of the file have been processed, the END block prints the value of sum. In this case the BEGIN block can be omitted: we can use sum directly in the loop block. The first time sum is used, the variable is created automatically with a default initial value of 0.
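
The point about omitting BEGIN can be sketched like this. The data is the same as test.txt but is piped in with printf so the snippet stands on its own; printing the average with NR (the number of lines read) is an addition for illustration.

```shell
# sum is never initialized; awk creates it with the value 0 on first use.
out=$(printf '11 22 33\n23 45 34\n22 32 43\n' | awk '{sum+=$1} END{print sum, sum/NR}')
echo "$out"    # 56 18.6667
```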

Conditional statements in awk

Conditional statements can be used in any awk block, and their syntax is the same as in C.

//The contents of test.txt are as follows
1 2 3
4 5 6
7 8 9
10 11 12

awk '{if($1%2==0)print $1" "$2" "$3}' test.txt
Output:
4 5 6
10 11 12
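
Since the syntax follows C, an else branch works as well. A minimal sketch on the same data, piped in with printf so it is self-contained; the "even" and "odd" labels are made up for illustration.

```shell
out=$(printf '1 2 3\n4 5 6\n7 8 9\n10 11 12\n' |
  awk '{ if ($1 % 2 == 0) { print $1, "even" } else { print $1, "odd" } }')
echo "$out"
# 1 odd
# 4 even
# 7 odd
# 10 even
```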

Loops in awk

//while loop
awk 'BEGIN{count=0;while(count<5){print count;count ++;}}'
Output:
0
1
2
3
4

As you can see, an awk block can contain complex compound statements, used almost exactly as in C: statements are separated by semicolons, and compound statement blocks are enclosed in curly braces.

//do..while loop
awk 'BEGIN{count=0;do{print count;count++}while(count<5)}'
Output:
0
1
2
3
4

//for loop
awk 'BEGIN{for(count=0;count<5;count++)print count}'
Output: same as above

As we can see, these loops take the same form as in C, so they are easy to understand. awk also uses break to exit a loop and continue to skip the rest of the current iteration, with the same meaning as in C.
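
A minimal sketch of break and continue inside a for loop; the loop bounds and conditions are chosen only for illustration.

```shell
out=$(awk 'BEGIN {
  for (i = 0; i < 10; i++) {
    if (i % 2 == 1) continue   # skip odd numbers
    if (i > 6) break           # leave the loop once i exceeds 6
    print i
  }
}')
echo "$out"
# 0
# 2
# 4
# 6
```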

Grouping and summing with arrays, and the for..in loop

Arrays in awk are essentially dictionaries (associative arrays). See the following example:
//The contents of test.txt
zhangsan 2 3
lisi 5 6
zhangsan 8 9
lisi 11 12
wangwu 33 11

Group the lines by the first column and sum the second column within each group.
awk '{sum[$1]+=$2}END{for(k in sum)print k" "sum[k]}' test.txt
Output:
zhangsan 10
lisi 16
wangwu 33

This example uses a for..in loop to walk over the array's keys and uses each key to fetch the corresponding value. The order in which for..in visits the keys is not specified, so the lines may come out in a different order. When the keys are not numbers, an ordinary for loop with a numeric index cannot be used to access the elements. The length() function returns the number of elements in an array, for example length(array).
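
As a sketch of the same grouping idea, the variant below keeps a second array that counts the lines per key and pipes the result through sort to get a stable order. The cnt array and the sort step are additions for illustration, not part of the original example.

```shell
out=$(printf 'zhangsan 2 3\nlisi 5 6\nzhangsan 8 9\nlisi 11 12\nwangwu 33 11\n' |
  awk '{ sum[$1] += $2; cnt[$1]++ }
       END { for (k in sum) print k, sum[k], cnt[k] }' | LC_ALL=C sort)
echo "$out"
# lisi 16 2
# wangwu 33 1
# zhangsan 10 2
```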

Using the value of a shell variable in awk

Sometimes we want to use a value computed in a shell variable inside an awk command. We cannot simply write $VAR in the awk program, because the dollar sign is a special symbol in awk: $n refers to the value of column n of the current line. Instead, awk provides the -v option for passing variables in. awk has two kinds of variables. One is the $n form, used while looping over the lines of the file to refer to column n of the current line. The other is an ordinary variable, which can be used without being declared and is referenced without a dollar sign. The following shows how the values of shell variables are used in awk:

a=22
b=33
awk -v x="$a" -v y="$b" 'BEGIN{print x" "y}'

As you can see, we only need to pass -v when invoking awk to name the variables awk will use, and their values can be taken from shell variables. In other words, shell variables can be referenced only in the options part of the awk command. A dollar sign inside an awk block is interpreted by awk as a reference to its own variables, not to shell variables.
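
A variable passed with -v can also be used in a pattern. The sketch below assumes a hypothetical shell variable named threshold; the second command shows the ENVIRON array (listed among the built-in variables) as another way to read a value from the environment, using a made-up GREETING variable.

```shell
threshold=4

# min is an awk variable set from the shell variable; it is referenced without a dollar sign.
out=$(printf '1 2 3 4\n5 6 7 8\n' | awk -v min="$threshold" '$1 > min {print $1}')
echo "$out"       # 5

# ENVIRON gives access to environment variables of the awk process.
env_out=$(GREETING=hello awk 'BEGIN { print ENVIRON["GREETING"] }')
echo "$env_out"   # hello
```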

awk operators and their precedence

[Table of awk operators and their precedence: the original image was not recovered.]

awk's built-in functions

awk defines many built-in functions. The commonly used ones are listed below by category. This is only a subset; for the complete list, consult the official awk documentation.

Arithmetic:
atan2(y,x) Returns the arctangent of y/x.
cos(x) Returns the cosine of x; x is in radians.
sin(x) Returns the sine of x; x is in radians.
exp(x) Returns e raised to the power x.
log(x) Returns the natural logarithm of x.
sqrt(x) Returns the square root of x.
int(x) Returns x truncated to an integer.
rand() Returns a random number n, where 0 <= n < 1.
srand([expr]) Sets the seed of the rand function to the value of expr, or to the time of day if expr is omitted. Returns the previous seed.
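
A short sketch exercising a few of the arithmetic functions above; the input values are arbitrary. Since rand() is random, the last line only checks that the result falls in the documented range.

```shell
out=$(awk 'BEGIN {
  print int(3.9)               # 3, truncated toward zero
  print int(-3.9)              # -3
  print sqrt(16)               # 4
  print exp(0)                 # 1
  srand(1); r = rand()
  print (r >= 0 && r < 1)      # 1 when r is within [0, 1)
}')
echo "$out"
```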

String:
gsub(reg,str1,str2) Replaces every substring of str2 that matches the regular expression reg with str1.
sub(reg,str1,str2) Same as gsub, except that gsub replaces all matches while sub replaces only the first match.
index(str,substr) Returns the index of the first occurrence of substr in str. Indexes start from 1; returns 0 if substr is not found.
length(str) Returns the length of the string str. length can also return the number of elements in an array.
blength(str) Returns the number of bytes in the string (not available in every awk implementation).
match(str,reg) Works like index, but reg is a regular expression, for example match("hello",/lo/).
split(str,array,reg) Splits str into pieces stored in array, using the regular expression or string reg as the separator. Returns the number of elements.
tolower(str) Converts str to lowercase.
toupper(str) Converts str to uppercase.
substr(str,start,length) Extracts length characters of str starting at index start. If length is omitted, extracts through the end of the string. Indexes start from 1.
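
The string functions above can be tried out with a sketch like the following. The sample strings are invented; note that gsub modifies its third argument in place, so a copy named t is used here.

```shell
out=$(awk 'BEGIN {
  s = "hello world"
  print length(s)                  # 11
  print index(s, "world")          # 7
  print substr(s, 1, 5)            # hello
  print toupper(s)                 # HELLO WORLD
  n = split("a:b:c", parts, ":")
  print n, parts[1], parts[3]      # 3 a c
  t = s
  c = gsub(/o/, "0", t)            # replace every "o" in t; c is the number of replacements
  print c, t                       # 2 hell0 w0rld
  pos = match(s, /wor/)            # also sets RSTART and RLENGTH
  print pos, RSTART, RLENGTH       # 7 7 3
}')
echo "$out"
```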

Other:
system(command) Executes a system command and returns its exit code.
mktime("YYYY MM DD HH MM SS[ DST]") Converts a date specification string into a timestamp.
strftime(format,timestamp) Formats a time for output, converting a timestamp to a time string.
systime() Returns the current timestamp: the whole number of seconds since January 1, 1970 (not counting leap seconds).
The three time functions come from gawk and are not part of POSIX awk.

awk's built-in variables

awk also defines many built-in variables, which can be used just like ordinary variables. Because there are many versions of awk, some built-in variables are not supported by every version.

Note: [A][N][P][G] indicate the tool that supports the variable: [A]=awk, [N]=nawk, [P]=POSIX awk, [G]=gawk
$n Field n of the current record; for example, n=1 is the first field and n=2 is the second field.
$0 The full text of the current line during execution.
[N] ARGC Number of command-line arguments.
[G] ARGIND Position in ARGV of the file currently being processed (counting from 0).
[N] ARGV Array containing the command-line arguments.
[P] CONVFMT Number conversion format (default %.6g).
[P] ENVIRON Associative array of environment variables.
[G] ERRNO Description of the last system error.
[G] FIELDWIDTHS List of field widths (separated by spaces).
[A] FILENAME Name of the current input file.
[P] FNR Same as NR, but relative to the current file.
[A] FS Field separator (default: whitespace).
[G] IGNORECASE If true, matching ignores case.
[A] NF Number of fields in the current record.
[A] NR Number of records read so far; corresponds to the current line number during execution.
[A] OFMT Number output format (default %.6g).
[A] OFS Output field separator (default: a space).
[A] ORS Output record separator (default: a newline).
[A] RS Record separator (default: a newline).
[N] RSTART Start position of the string matched by the match function.
[N] RLENGTH Length of the string matched by the match function.
[N] SUBSEP Array subscript separator (default "\034").
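
A minimal sketch of a few of these variables in use: NR (line number), NF (number of fields), $NF (the last field) and OFS (output field separator). The two-line input is made up for illustration.

```shell
# OFS is set in BEGIN; print with commas joins its arguments using OFS.
out=$(printf 'a b c\nd e\n' | awk 'BEGIN { OFS = "-" } { print NR, NF, $NF }')
echo "$out"
# 1-3-c
# 2-2-e
```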

Official awk documentation:
https://www.gnu.org/software/gawk/manual/gawk.html