When a server handles hundreds of thousands of concurrent TCP connections, how much kernel memory each connection consumes becomes a real concern. The socket API provides the SO_SNDBUF and SO_RCVBUF options for setting a connection's write and read buffers, and Linux also provides the system-level settings below for controlling TCP memory usage across the whole server. The names of these settings overlap and are easy to misread, however (sysctl -a lists them):
net.ipv4.tcp_rmem = 8192 87380 16777216
net.ipv4.tcp_wmem = 8192 65536 16777216
net.ipv4.tcp_mem = 8388608 12582912 16777216
net.core.rmem_default = 262144
net.core.wmem_default = 262144
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
There are also some less frequently mentioned settings that relate to TCP memory:
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_adv_win_scale = 2
(Note: for convenience, the prefixes are omitted when these settings are discussed below, and a setting with several space-separated numbers is treated as an array. For example, tcp_rmem[2] is the last column of the first row above, 16777216.)
Many descriptions of these settings can be found online, but they are often hard to follow. For example, tcp_rmem[2] and rmem_max both seem to be about the maximum size of the receive buffer, yet they can be set to different values, so what is the difference? Likewise, tcp_wmem[1] and wmem_default both seem to give the default size of the send buffer, so which one wins when they conflict? And when you capture the SYN handshake packets with a packet capture tool, why does the TCP receive window size seem to have nothing to do with any of these settings?
The memory a TCP connection uses inside a process varies widely. A complex program is usually not written directly against the socket API, and platform-level components may wrap the user-space memory used by TCP connections; platforms, components, middleware and network libraries all differ greatly. The kernel's algorithm for allocating memory to a TCP connection, by contrast, is essentially fixed. This article tries to explain how much kernel memory a TCP connection uses and what strategy the operating system follows to balance overall throughput against the transfer speed of individual connections. As always, it is written for application developers rather than kernel developers, so it does not go into how many bytes the operating system allocates for one TCP connection or one TCP segment. Kernel data structures are not the focus here, and they are not what application programmers care about either. The article mainly describes how the Linux kernel manages the read and write buffers for data transferred over a TCP connection.
I. What does the buffer upper limit mean?
(1) Let's start with SO_SNDBUF and SO_RCVBUF, which can be set from application code.
Whatever the language, a TCP connection exposes the SO_SNDBUF and SO_RCVBUF options through a setsockopt-style call. What do these two options mean?
SO_SNDBUF and SO_RCVBUF are per-connection settings: they affect only the connection they are set on and have no effect on any other connection. SO_SNDBUF is the upper limit of the kernel write buffer on this connection. Strictly speaking, the value a process sets is not the real limit: the kernel doubles it and uses the result as the write buffer limit. We need not dwell on that detail; it is enough to know that setting SO_SNDBUF bounds the maximum memory the write buffer on that TCP connection can use. The process cannot choose this value freely, though, because it is constrained by system-level upper and lower bounds. If it is larger than wmem_max (net.core.wmem_max), it is replaced by wmem_max (which is then also doubled). If it is very small, it is replaced by the kernel minimum, for example the 2 KB minimum write buffer in the 2.6.18 kernel.
SO_RCVBUF is the upper limit of the read buffer on the connection. Like SO_SNDBUF, it is capped by the rmem_max setting, and the kernel likewise uses twice the value as the read buffer limit. It has a lower bound too: in the 2.6.18 kernel, a value below 256 bytes is replaced by 256.
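To see what the kernel actually does with a requested size, one option is to set the options and read them back. The following is a minimal sketch in Python; the helper name effective_buffer_sizes is made up for this illustration, and the values it returns depend on the operating system and on the wmem_max/rmem_max settings of the machine it runs on.

```python
import socket


def effective_buffer_sizes(requested_bytes):
    """Request SO_SNDBUF/SO_RCVBUF on a fresh TCP socket and return the
    (send, receive) sizes the kernel reports back.

    effective_buffer_sizes is a hypothetical helper name, not a library call.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested_bytes)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        sndbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
        rcvbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sndbuf, rcvbuf


if __name__ == "__main__":
    requested = 64 * 1024
    sndbuf, rcvbuf = effective_buffer_sizes(requested)
    # On Linux the reported values are normally twice the request,
    # unless the request was clamped by wmem_max / rmem_max first.
    print(f"requested={requested} SO_SNDBUF={sndbuf} SO_RCVBUF={rcvbuf}")
```

Comparing the printed values with the request is a quick way to check both the doubling and the clamping described above on a given machine.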
(2) So how does the upper limit set through SO_SNDBUF and SO_RCVBUF relate to the memory actually used?
The memory a TCP connection uses is determined mainly by its read and write buffers, and the size of those buffers depends only on how the connection is actually used. Until the upper limit is reached, SO_SNDBUF and SO_RCVBUF have no effect. Take the read buffer: receiving a TCP segment from the peer makes the read buffer grow. If adding the segment would push the read buffer past its upper limit, the segment is discarded and the read buffer size stays unchanged. When does the read buffer shrink? When the process calls read, recv or a similar method to read the TCP stream. The read buffer is therefore dynamic, and memory is allocated according to how much is actually buffered. When the connection is idle and the process has consumed all the data received on it, the read buffer uses 0 bytes of memory.
The same goes for the write buffer. When the process calls send or write to send a TCP stream, the write buffer grows. If the write buffer has already reached its upper limit, it stays as it is and the call does not succeed: a non-blocking socket gets an error back, while a blocking socket waits until space is freed. Whenever an ACK arrives from the peer confirming that segments were delivered, the write buffer shrinks. This follows from TCP's reliability: a segment that has been sent cannot be discarded right away, because it may be lost and the retransmission timer may need to send it again. So the write buffer is dynamic as well, and on an idle, healthy connection it usually uses 0 bytes of memory.
Therefore, the read buffer can reach its upper limit only when segments arrive from the network faster than the application reads them, and only then does the limit take effect. The effect is that newly received segments are discarded, which keeps this one TCP connection from consuming too many server resources. Similarly, when the application sends data faster than the peer acknowledges it, the write buffer may reach its upper limit, so that send cannot make progress and the kernel allocates no more memory for it.
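The write-buffer behaviour can be observed on a loopback connection whose receiving side never reads. This is a minimal sketch in Python under that assumption; fill_send_path is a made-up helper name, and the number it returns counts everything the kernel accepted, which includes data already moved into the peer's read buffer as well as data still in the sender's write buffer.

```python
import socket


def fill_send_path(chunk_size=64 * 1024, limit=1 << 30):
    """Send on a non-blocking loopback TCP socket until the kernel refuses
    more data, while the receiver never reads. Returns the bytes accepted.

    fill_send_path is a hypothetical helper; `limit` is only a safety cap.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    sender = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sender.connect(listener.getsockname())
    receiver, _ = listener.accept()  # accepted but deliberately never read

    sender.setblocking(False)
    chunk = b"x" * chunk_size
    queued = 0
    try:
        while queued < limit:
            try:
                queued += sender.send(chunk)
            except BlockingIOError:
                # The write buffer is full: a non-blocking send fails here,
                # whereas a blocking socket would simply wait.
                break
    finally:
        sender.close()
        receiver.close()
        listener.close()
    return queued


if __name__ == "__main__":
    print(f"kernel accepted {fill_send_path()} bytes before refusing more")
```

The exact number varies between runs and machines, because both buffer limits are adjusted by the kernel while the loop is running.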
II. How does buffer size relate to the TCP sliding window?
(1) The size of the sliding window is certainly related to the size of the buffer, but the two do not correspond one to one, and the window does not correspond one to one with the buffer upper limit either. Many sources online claim that rmem_max sets the maximum sliding window, which does not match the win values seen in tcpdump captures at all, and that mismatch is to be expected. Let's look at the difference more closely.
The read buffer serves two purposes: 1. it holds out-of-order TCP segments that fall inside the receive sliding window; 2. once segments are in order and readable by the application, it also holds them until the application gets around to reading them, because that read is delayed. The read buffer is therefore split in two: one part holds out-of-order segments, and the other holds in-order data waiting to be read. The two parts share one upper limit, so they affect each other. When the application reads too slowly, the large amount of unread in-order data squeezes the space left for the sliding window, which makes the receive window smaller and tells the peer to slow down, avoiding pointless network traffic. When the application reads nothing for a long time and the unread data leaves no space at all, the receiver advertises a window of 0, which tells the peer: I cannot take any more data right now.
The receive sliding window, for its part, changes all the time. Here is one packet of a three-way handshake captured with tcpdump:
14:49:52.421674 IP houyi-vm02.dev.sd.aliyun.com.6400 > r14a02001.dg.tbsite.net.54073: S 2736789705:2736789705(0) ack 1609024383 win 5792 <mss 1460,sackOK,timestamp 2925954240 2940689794,nop,wscale 9>
You can see that the initial receive window is 5792, far smaller than the maximum receive buffer (tcp_rmem, discussed later).
There is a reason for this. TCP has to cope with complex network conditions, so it uses slow start and a congestion window (see "High Performance Network Programming 2: Sending TCP Messages"), and the initial window of a new connection is not set from the maximum receive buffer size. From the overall point of view, an initial window that is too large may overload the network and start a vicious circle: the many routers and switches along the path, especially across a WAN, may not withstand the load and keep dropping packets. From the point of view of a single connection, if each side simply used its own read buffer limit as its receive window, then the larger each side's send window (that is, the peer's receive window), the worse the effect on the network. Slow start keeps the initial window as small as possible and grows it only after valid segments from the peer have confirmed how much the network can actually carry.
Different Linux kernels use different initial windows. Take the widely used Linux 2.6.18 kernel as an example: on Ethernet the MSS is 1460 and the initial window is 4 times the MSS. The code, in brief (*rcv_wnd is the initial receive window):
int init_cwnd = 4;
if (mss > 1460*3)
    init_cwnd = 2;
else if (mss > 1460)
    init_cwnd = 3;
if (*rcv_wnd > init_cwnd*mss)
    *rcv_wnd = init_cwnd*mss;
You may ask why the window in the capture above is 5792 rather than 1460 * 4 = 5840. The figure 1460 is the 1500-byte MTU minus a 20-byte IP header and a 20-byte TCP header, that is, the largest payload a single segment can carry. On some networks, however, 12 bytes of the TCP header are used for the timestamp option, so the payload is the MSS minus 12, and the initial window is (1460 - 12) * 4 = 5792. This matches what the window is meant to express: the length of payload data I can handle.
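The arithmetic can be checked with a direct Python transcription of the 2.6.18 excerpt quoted above. This is only a sketch: initial_receive_window and TIMESTAMP_OPTION_BYTES are names invented here, and 87380 is used simply because it is the tcp_rmem[1] value from the listing at the top of the article.

```python
TIMESTAMP_OPTION_BYTES = 12  # TCP header space taken by the timestamp option


def initial_receive_window(mss, rcv_wnd):
    """Mirror the quoted kernel logic: cap rcv_wnd at init_cwnd * mss.

    Hypothetical helper name; the thresholds are copied from the excerpt.
    """
    init_cwnd = 4
    if mss > 1460 * 3:
        init_cwnd = 2
    elif mss > 1460:
        init_cwnd = 3
    return min(rcv_wnd, init_cwnd * mss)


if __name__ == "__main__":
    plain = initial_receive_window(1460, 87380)
    with_timestamps = initial_receive_window(1460 - TIMESTAMP_OPTION_BYTES, 87380)
    print(plain, with_timestamps)  # 5840 5792
```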
From Linux 3 onward, the initial window was raised to 10 MSS, mainly at Google's suggestion. The reasoning is as follows. The window often grows exponentially (growth is exponential below the congestion threshold and linear in the congestion avoidance phase above it, and the threshold itself can also rise quickly once more than 128 packets have been received). For a large transfer such as video, the window eventually grows to (nearly) the maximum read buffer and data then flows at full power. For a typical web page of a few dozen KB, however, a small initial window means the connection ends before the window has grown to a suitable size. Compared with a larger initial window, the user needs more round trips (RTTs) to finish the transfer, and the experience suffers.
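A rough way to see the effect is to count round trips under idealized exponential growth. The Python sketch below assumes the window doubles every RTT, that nothing is lost, and that no other limit applies; round_trips_to_send is a made-up name and the model is deliberately much simpler than real TCP.

```python
import math


def round_trips_to_send(payload_bytes, initial_segments, mss=1460):
    """Count RTTs needed to send payload_bytes when the window starts at
    initial_segments and doubles each round trip (idealized slow start)."""
    segments_left = math.ceil(payload_bytes / mss)
    window = initial_segments
    round_trips = 0
    while segments_left > 0:
        segments_left -= window  # one window of segments per round trip
        window *= 2              # exponential growth, no loss assumed
        round_trips += 1
    return round_trips


if __name__ == "__main__":
    page = 60 * 1024  # a web page of a few dozen KB
    print(round_trips_to_send(page, 4))   # 4
    print(round_trips_to_send(page, 10))  # 3
```

Under these assumptions a 60 KB page takes 4 round trips with an initial window of 4 segments and 3 with an initial window of 10, which is the kind of saving the larger initial window is meant to provide.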
The next question is: once the window has grown from the initial window all the way to the maximum receive window, is the maximum receive window equal to the maximum read buffer?
No, because part of the buffer must be set aside for in-order data that the application has not read yet. How much is set aside? This is a configurable system option:
net.ipv4.tcp_adv_win_scale = 2
Here tcp_adv_win_scale means that 1/(2^tcp_adv_win_scale) of the buffer is set aside for data waiting to be read by the application. With the default tcp_adv_win_scale of 2, at least 1/4 of the memory is reserved for that purpose, so the receive sliding window can grow to at most 3/4 of the read buffer.
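As a worked example of that split, the short Python sketch below applies it to the 16777216-byte tcp_rmem[2] from the listing at the top of the article. max_receive_window is a name invented here, and the sketch covers only positive tcp_adv_win_scale values, which is the case described above.

```python
def max_receive_window(read_buffer_bytes, adv_win_scale=2):
    """Largest receive window that fits in a read buffer once
    1/(2^adv_win_scale) of it is reserved for unread application data."""
    if adv_win_scale <= 0:
        raise ValueError("this sketch only covers positive tcp_adv_win_scale")
    reserved_for_app = read_buffer_bytes >> adv_win_scale
    return read_buffer_bytes - reserved_for_app


if __name__ == "__main__":
    print(max_receive_window(16777216))  # 12582912, i.e. 3/4 of 16 MiB
```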
(2) How large should the maximum read buffer be?
Once tcp_adv_win_scale has fixed the share reserved for unread application data, the upper limit of the read buffer should follow from the maximum TCP receive window. The initial window may be only 4 or 10 MSS, but as long as no packets are lost the window grows as segments are exchanged. What does it mean for the window to be "too large"? Not that it takes too much memory on the two communicating machines, but that it puts too much load on the network as a whole, which drives network devices into a vicious circle of being busy and dropping packets. A window that is too small, on the other hand, cannot make full use of the network. The maximum receive window is therefore usually set from the BDP, and the maximum read buffer is derived from it. BDP stands for bandwidth-delay product, the product of bandwidth and network delay. For example, with a bandwidth of 2 Gbps and a delay of 10 ms, the BDP is 2 Gbps / 8 * 0.01 s = 2.5 MB. On such a network the maximum receive window can be set to 2.5 MB, and the maximum read buffer to 4/3 * 2.5 MB = 3.3 MB.
Why? Because the BDP represents how much the network can carry, and the maximum receive window represents the data that can be in flight, sent but not yet acknowledged, within that capacity, as shown in the figure below:
The "long fat network" that is often mentioned is exactly this: "long" means high delay and "fat" means high bandwidth. If either is large, the BDP is large, the maximum window should grow with it, and so should the upper limit of the read buffer. Servers on long fat networks therefore have large buffer limits. (Of course, TCP originally represented the window with a 16-bit number, which imposes an upper limit, but the window scaling defined in RFC 1323 lets the sliding window grow large enough.)
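The sizing rule can be written down as two small functions. This Python sketch reproduces the 2 Gbps, 10 ms example in exact integer arithmetic; bdp_bytes and read_buffer_for_window are names invented for the illustration, and the buffer size is rounded up to a whole byte.

```python
def bdp_bytes(bandwidth_bits_per_s, delay_ms):
    """Bandwidth-delay product in bytes: bits/s * seconds / 8."""
    return bandwidth_bits_per_s * delay_ms // (8 * 1000)


def read_buffer_for_window(window_bytes, adv_win_scale=2):
    """Smallest read buffer whose window share covers window_bytes, given
    that 1/(2^adv_win_scale) of the buffer is reserved for unread data."""
    if adv_win_scale <= 0:
        raise ValueError("this sketch only covers positive tcp_adv_win_scale")
    parts = 1 << adv_win_scale
    # Ceiling division: buffer * (parts - 1) / parts must be >= window_bytes.
    return -(-window_bytes * parts // (parts - 1))


if __name__ == "__main__":
    window = bdp_bytes(2_000_000_000, 10)
    print(window)                          # 2500000 bytes, about 2.5 MB
    print(read_buffer_for_window(window))  # 3333334 bytes, about 3.3 MB
```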
The send window is simply the receive window of the peer, so it can be worked out from the discussion of the receive window above and is not repeated here.
III. Linux's automatic adjustment of TCP buffer limits
So, once the maximum buffer limit is set, is the job done? For a single TCP connection, the network may indeed be fully used: a large window and a large buffer keep the transfer fast. On a long fat network, for example, the buffer limit may be set to tens of megabytes. But total system memory is finite. If every connection ran at full speed with the largest window, 10,000 connections would take hundreds of gigabytes of memory. That rules out high-concurrency workloads and gives no guarantee of fairness. What we want is this: when there are few concurrent connections, raise the buffer limits so that every TCP connection can work at full power; when there are many concurrent connections and system memory is running short, lower the limits so that each connection's buffers stay as small as possible and more connections fit.
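To put a number on the worst case, the Python sketch below multiplies it out for 10,000 connections, using the 16777216-byte read and write limits from the listing at the top of the article as the per-connection figures. worst_case_buffer_bytes is a name invented here, and the calculation ignores every other per-connection cost.

```python
GIB = 1 << 30


def worst_case_buffer_bytes(connections, read_limit_bytes, write_limit_bytes):
    """Memory needed if every connection fills both buffers to their limits."""
    return connections * (read_limit_bytes + write_limit_bytes)


if __name__ == "__main__":
    total = worst_case_buffer_bytes(10_000, 16777216, 16777216)
    print(total / GIB)  # 312.5 (GiB)
```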
To achieve this, Linux introduced automatic adjustment of memory allocation, controlled by the tcp_moderate_rcvbuf setting:
net.ipv4.tcp_moderate_rcvbuf = 1
tcp_moderate_rcvbuf defaults to 1, which turns on automatic TCP memory adjustment. Setting it to 0 turns the feature off (use with caution).
Note also: when a program sets SO_SNDBUF or SO_RCVBUF on a connection, the Linux kernel no longer auto-tunes that connection!
So how does this feature work? Look at the following settings:
net.ipv4.tcp_rmem = 8192 87380 16777216
net.ipv4.tcp_wmem = 8192 65536 16777216
net.ipv4.tcp_mem = 8388608 12582912 16777216
The tcp_rmem array gives the upper limit of the read buffer on any single TCP connection: tcp_rmem[0] is the minimum upper limit, tcp_rmem[1] is the initial upper limit (note that it overrides rmem_default, which applies to all protocols), and tcp_rmem[2] is the maximum upper limit.
The tcp_wmem array does the same for the write buffer and works like tcp_rmem, so it is not repeated here.
The tcp_mem array sets the overall memory usage of TCP, which is why its values are so large (its unit is not bytes but pages, for example 4 KB or 8 KB per page!). The 3 values define the no-pressure level of overall TCP memory, the threshold that switches pressure mode on, and the maximum usage. With these 3 values as markers, memory usage falls into 4 cases:
1. When overall TCP memory is below tcp_mem[0], system memory is under no pressure. If memory had earlier exceeded tcp_mem[1] and put the system into memory pressure mode, pressure mode is switched off at this point.
In this case, as long as the buffer used by a TCP connection has not reached its upper limit (note that although the initial upper limit is tcp_rmem[1], the value can change, as detailed below), allocating new memory is certain to succeed.
2. When TCP memory is between tcp_mem[0] and tcp_mem[1], the system may be in memory pressure mode, for example when total memory has just come down from above tcp_mem[1]. It may also be in non-pressure mode, for example when total memory has just come up from below tcp_mem[0].
Here, whether or not the system is in pressure mode, as long as the buffer used by a TCP connection does not exceed tcp_rmem[0] or tcp_wmem[0], new memory can always be allocated. Otherwise allocation will generally fail. (Note: a few exceptional cases still allow allocation to succeed, but they add little to understanding these settings and are omitted.)
3. When TCP memory is between tcp_mem[1] and tcp_mem[2], the system is certainly in pressure mode. Everything else is the same as above.
4. When TCP memory is above tcp_mem[2], the system is without doubt in pressure mode, and every new TCP buffer allocation fails.
The following figure shows the kernel's simplified logic when new buffer memory is needed:
When the system is in non-pressure mode, the per-connection read and write buffer limits mentioned above may be increased, though never beyond tcp_rmem[2] or tcp_wmem[2]. In pressure mode, by contrast, the read and write buffer limits may be reduced, and the limit may even drop below tcp_rmem[0] or tcp_wmem[0].
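The four cases can be condensed into a small state machine. The Python sketch below is a simplified model of the description in this article, not the kernel's actual code: TcpMemoryModel and may_allocate are names invented here, the exceptional cases mentioned above are left out, and the adjustment of per-connection limits is not modelled (the caller passes the current limit in).

```python
from dataclasses import dataclass


@dataclass
class TcpMemoryModel:
    """Toy model of the tcp_mem thresholds (hypothetical, for illustration)."""
    tcp_mem: tuple          # (tcp_mem[0], tcp_mem[1], tcp_mem[2]) in pages
    min_buffer: int         # tcp_rmem[0] or tcp_wmem[0], in bytes
    under_pressure: bool = False

    def may_allocate(self, total_pages, conn_bytes, conn_limit):
        """Decide whether a connection that already uses conn_bytes may get
        more buffer memory while TCP as a whole uses total_pages."""
        low, pressure, high = self.tcp_mem
        if total_pages > high:
            # Case 4: over the hard limit, every new allocation fails.
            self.under_pressure = True
            return False
        if total_pages < low:
            # Case 1: no pressure; only the connection's own limit matters.
            self.under_pressure = False
            return conn_bytes < conn_limit
        if total_pages > pressure:
            # Case 3: pressure mode is switched on.
            self.under_pressure = True
        # Cases 2 and 3: only connections below the minimum are guaranteed.
        return conn_bytes < self.min_buffer


if __name__ == "__main__":
    model = TcpMemoryModel(tcp_mem=(100, 200, 300), min_buffer=8192)
    print(model.may_allocate(50, 50_000, 87_380), model.under_pressure)
    print(model.may_allocate(250, 50_000, 87_380), model.under_pressure)
    print(model.may_allocate(150, 4_096, 87_380), model.under_pressure)
```

Note how the last call stays in pressure mode even though usage is back below tcp_mem[1]: in this model the flag is cleared only once usage falls below tcp_mem[0].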
As a rough summary, the 3 arrays can be viewed like this:
1. Whenever the system's total TCP memory exceeds tcp_mem[2], new memory allocation fails.
2. tcp_rmem[0] and tcp_wmem[0] also have high priority: as long as condition 1 is not violated, a connection whose memory is below these two values is guaranteed that new memory allocation succeeds.
3. As long as total memory does not exceed tcp_mem[0], new memory is also guaranteed to be allocated, provided the connection stays within its own buffer upper limit.
4. tcp_mem[1] and tcp_mem[0] are the switches that turn memory pressure mode on and off. In pressure mode, a connection's buffer limit may be reduced. In non-pressure mode, the limit may be increased, up to tcp_rmem[2] or tcp_wmem[2].
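Because tcp_mem is expressed in pages while tcp_rmem and tcp_wmem are in bytes, it helps to convert before comparing them. The following Python sketch does the conversion for the tcp_mem line shown earlier, assuming 4 KB pages; tcp_mem_in_bytes is a made-up helper name.

```python
def tcp_mem_in_bytes(sysctl_value, page_size=4096):
    """Convert a 'low pressure high' tcp_mem string from pages to bytes."""
    low, pressure, high = (int(field) for field in sysctl_value.split())
    return {
        "low": low * page_size,            # tcp_mem[0]: below this, no pressure
        "pressure": pressure * page_size,  # tcp_mem[1]: pressure mode turns on
        "high": high * page_size,          # tcp_mem[2]: allocations fail above
    }


if __name__ == "__main__":
    limits = tcp_mem_in_bytes("8388608 12582912 16777216")
    for name, value in limits.items():
        print(f"{name}: {value // (1 << 30)} GiB")
```

On a Linux machine the input string can be read from /proc/sys/net/ipv4/tcp_mem, and the real page size is available from os.sysconf("SC_PAGE_SIZE").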