Memory usage of TCP connections in Linux high-performance network programming

Linux background development 2020-11-09 16:06:48
Tags: memory usage, TCP connection, Linux

When a server handles hundreds of thousands of concurrent TCP connections, we naturally care how much kernel memory each TCP connection consumes. The socket API provides the SO_SNDBUF and SO_RCVBUF options to set a connection's read/write caches, and Linux also provides system-level settings to control the server's overall TCP memory usage. But the names of these settings overlap and their meanings are easy to confuse (run `sysctl -a` to view them):

net.ipv4.tcp_rmem = 8192 87380 16777216
net.ipv4.tcp_wmem = 8192 65536 16777216
net.ipv4.tcp_mem = 8388608 12582912 16777216
net.core.rmem_default = 262144
net.core.wmem_default = 262144
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

There are also two less frequently mentioned settings that are likewise related to TCP memory:

net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_adv_win_scale = 2

(Note: for brevity, the prefixes are omitted below when referring to these settings, and multiple numbers separated by spaces are treated as a zero-indexed array; e.g. tcp_rmem[2] denotes the last column of the first row above, 16777216.)

Descriptions of these settings can be found all over the Internet, but they are often hard to make sense of. For example, tcp_rmem[2] and rmem_max both seem related to the maximum receive cache, yet they can be inconsistent; what is the difference? Or tcp_wmem[1] and wmem_default both seem to state the default send-cache size; which wins when they conflict? And when you capture a SYN handshake packet, why does the TCP receive window size seem unrelated to any of these settings?

The amount of user-space memory a TCP connection uses in a process varies widely. Complex programs often do not use sockets directly; platform-level components, middleware, and network libraries may each wrap the TCP connection and manage their own user-space buffers, and they all differ greatly. The kernel's algorithm for allocating memory to a TCP connection, however, is essentially fixed. This article tries to explain how much kernel-mode memory a TCP connection uses, and what strategy the operating system applies to balance macro-level throughput against the transmission speed of individual connections. As always, it is aimed at application developers rather than kernel developers, so it does not count exactly how many bytes the kernel allocates for a given TCP connection or segment; kernel data structures are not the focus here, and they are not an application programmer's concern either. The article mainly describes how the Linux kernel manages the read/write caches for the data carried over TCP connections.

1. What is the upper limit of the cache?

(1) Let's start with SO_SNDBUF and SO_RCVBUF, which applications can set.

Regardless of the language, TCP connections can be configured via setsockopt with the SO_SNDBUF and SO_RCVBUF options. What do these two attributes mean?

SO_SNDBUF and SO_RCVBUF are per-connection settings: they only affect the connection on which they are set and have no effect on other connections. SO_SNDBUF indicates the upper limit of the kernel write cache for the connection. In fact, the value a process sets via SO_SNDBUF is not used as the limit directly: the kernel doubles it and uses the doubled value as the write-cache limit. We need not wrestle with that detail; just know that setting SO_SNDBUF bounds the memory the write cache may use on the connection. The value is not entirely up to the process, either: it is subject to system-level upper and lower limits. If it exceeds wmem_max (net.core.wmem_max), it is replaced by wmem_max (which is likewise doubled). At the low end, for example, the 2.6.18 kernel enforces a minimum write cache of 2 KB, so a smaller value is simply replaced by 2 KB.

SO_RCVBUF indicates the upper limit of the read cache on the connection. Like SO_SNDBUF, it is subject to the rmem_max setting, and the kernel likewise uses double the set value as the read-cache limit. SO_RCVBUF also has a lower limit: in the 2.6.18 kernel, a value below 256 bytes is replaced by 256.
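The doubling behavior is easy to observe: the sketch below (the function name granted_rcvbuf is ours; the setsockopt/getsockopt calls are standard POSIX) requests a receive buffer with SO_RCVBUF and reads back what the kernel actually granted.

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Request `bytes` of receive buffer with SO_RCVBUF and return what the
 * kernel actually granted, or -1 on error.  On Linux the value read back
 * is typically double the requested size (the extra accounts for kernel
 * bookkeeping), clamped by net.core.rmem_max and the per-socket minimum. */
int granted_rcvbuf(int bytes) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    int granted = -1;
    socklen_t len = sizeof(granted);
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) == 0)
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len);
    close(fd);
    return granted;
}
```

On a stock Linux box, granted_rcvbuf(16384) usually reports 32768.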

(2) So how does the maximum cache usage set via SO_SNDBUF and SO_RCVBUF relate to the memory actually used?

The memory a TCP connection uses is mainly determined by its read and write caches, and their sizes depend only on actual usage; until a limit is reached, SO_SNDBUF and SO_RCVBUF have no effect. For the read cache: receiving a TCP segment from the peer grows the read cache. Of course, if adding the segment would push the read cache past its limit, the segment is discarded and the read cache stays unchanged. When does the read cache shrink? When the process calls read, recv, and the like to consume the TCP stream. The read cache is therefore dynamic: buffer memory is allocated only as actually needed, and on a very idle connection whose received data has all been consumed by the user process, the read cache uses zero memory.

The same goes for the write cache. When the user process calls send or write to transmit on the TCP stream, the write cache grows. Of course, if the write cache has already reached its limit, it stays unchanged and the call fails back to the user process. And whenever an ACK arriving from the peer confirms that a segment was successfully delivered, the write cache shrinks. This follows from TCP's reliability: a segment cannot be destroyed the moment it is sent, because it might be lost and the retransmission timer may need to resend it. So the write cache is dynamic too, and on an idle, healthy connection it usually uses zero memory.

Therefore, the read-cache limit only comes into play when segments arrive from the network faster than the application reads them. Its effect is to discard newly received segments, preventing the TCP connection from consuming too many server resources. Likewise, when the application sends faster than the peer acknowledges with ACKs, the write cache may hit its limit, causing send to fail; the kernel allocates no memory for the excess.
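The write-cache limit can be observed directly. The sketch below (our own illustration, not kernel code) connects a TCP pair over loopback, never reads on the peer, and writes on a non-blocking sender until the kernel answers EAGAIN, i.e. the write cache (and the peer's read cache) are full.

```c
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fill a TCP send buffer over loopback until send() returns EAGAIN.
 * Returns the number of bytes the kernel buffered, or -1 on error. */
long fill_write_cache(void) {
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0)
        return -1;
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                        /* any free port */
    socklen_t alen = sizeof(addr);
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &alen) < 0)
        return -1;

    int sfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sfd < 0 || connect(sfd, (struct sockaddr *)&addr, alen) < 0)
        return -1;
    int pfd = accept(lfd, NULL, NULL);        /* peer that never reads */
    if (pfd < 0)
        return -1;

    fcntl(sfd, F_SETFL, O_NONBLOCK);
    char chunk[4096];
    memset(chunk, 'x', sizeof(chunk));
    long total = 0;
    for (;;) {
        ssize_t n = send(sfd, chunk, sizeof(chunk), 0);
        if (n > 0) { total += n; continue; }
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            break;                            /* write cache is full */
        total = -1;                           /* unexpected error */
        break;
    }
    close(pfd); close(sfd); close(lfd);
    return total;
}
```

Once EAGAIN is reached, a blocking sender would simply sleep at this point instead, waiting for ACKs to free cache space.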


2. How does cache size relate to the TCP sliding window?

(1) The size of the sliding window is certainly related to cache size, but it is not a one-to-one relationship, nor does it correspond one-to-one with the cache limit. So the claim, found in many online references, that rmem_max sets the maximum sliding window, is completely inconsistent with the win values we see when capturing packets with tcpdump, and for good reason. Let's look at the differences carefully.

The read cache serves two purposes: (1) it buffers out-of-order TCP segments that fall within the receive sliding window; (2) it holds in-order segments that are ready for the application to read but that the application, being slow, has not yet read. The read cache is thus split in two: one part caches out-of-order segments, the other holds in-order segments pending a delayed read. The two parts share a single limit, so they affect each other: when the application reads too slowly, the growing application-side portion squeezes the socket-side portion, shrinking the receive sliding window and telling the peer to slow down, avoiding pointless network transmission. If the application reads nothing for a long time and the application-side portion squeezes the socket portion down to nothing, a zero receive window is advertised, telling the peer: I cannot digest any more segments right now.

Conversely, the receive sliding window is constantly changing. Let's capture a three-way handshake with tcpdump:

14:49:52.421674 IP > S 2736789705:2736789705(0) ack 1609024383 win 5792 <mss 1460,sackOK,timestamp 2925954240 2940689794,nop,wscale 9>

You can see the initial receive window is 5792, much smaller than the maximum receive cache (tcp_rmem[1], discussed later).

There is a reason for this. TCP must cope with complex network environments, so it uses slow start and a congestion window (see "High-performance network programming 2: sending TCP segments"); the initial window when a connection is established is not initialized to the maximum receive-cache size. That is because an overly large initial window can, at the macro level, overload the network and trigger a vicious circle: the routers and switches along the path (especially on a WAN) may be unable to keep up and drop packets continuously, while at the micro level the two endpoints only see their own read-cache limits as the receive window, so the larger both send windows (each being the peer's receive window) grow, the worse the impact on the network. Slow start keeps the initial window small, and only after receiving valid segments from the peer, confirming the network's effective carrying capacity, does it begin to grow the window.

Different Linux kernels use different initial windows. Take the widely deployed 2.6.18 kernel as an example: on Ethernet the MSS is 1460 and the initial window is 4 times the MSS. A brief excerpt of the code (*rcv_wnd is the initial receive window):

int init_cwnd = 4;
if (mss > 1460*3)
    init_cwnd = 2;
else if (mss > 1460)
    init_cwnd = 3;
if (*rcv_wnd > init_cwnd*mss)
    *rcv_wnd = init_cwnd*mss;

You may ask: why does the capture above show a window of 5792 rather than 1460*4 = 5840? Because 1460 means: the maximum payload a segment can carry once a 1500-byte MTU sheds its 20-byte IP header and 20-byte TCP header. On some networks, 12 more bytes of the TCP header are used for timestamps, so the effective payload is the MSS minus 12, and the initial window is (1460-12)*4 = 5792. That matches what the window is meant to express: the amount of payload I can handle.
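The quoted snippet plus the timestamp arithmetic can be restated as a standalone function (our own paraphrase of the 2.6.18-style logic, not the kernel source):

```c
/* Initial receive window per the 2.6.18-style logic quoted above:
 * roughly 4 MSS on Ethernet, capped by the configured window limit. */
int initial_rcv_wnd(int mss, int rcv_wnd_limit) {
    int init_cwnd = 4;
    if (mss > 1460 * 3)
        init_cwnd = 2;
    else if (mss > 1460)
        init_cwnd = 3;
    if (rcv_wnd_limit > init_cwnd * mss)
        rcv_wnd_limit = init_cwnd * mss;
    return rcv_wnd_limit;
}
```

With timestamps enabled the effective MSS is 1460 - 12 = 1448, and initial_rcv_wnd(1448, 87380) gives the 5792 seen in the capture; without timestamps, initial_rcv_wnd(1460, 87380) gives 5840.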

In Linux 3 and later, the initial window was raised to 10 MSS, mainly on Google's recommendation. The reasoning: although the receive window often grows quickly in size (exponentially below the congestion threshold, linearly in the congestion-avoidance phase above it, and the congestion threshold itself also has a chance to jump up quickly after more than 128 packets are received), for bulk transfers like video the window eventually grows to (near) the maximum read cache and runs "at full throttle"; but a typical web page of a few dozen KB finishes before a too-small initial window has grown to a useful size. With a small initial window, users need more round trips (RTTs) before their data gets through, which makes for a poor experience.

You may then wonder: when the window has grown all the way from the initial window to the maximum receive window, is the maximum receive window equal to the maximum read cache?

No, because part of the cache must be set aside for the application's delayed reads. How much? That is a configurable system option:

net.ipv4.tcp_adv_win_scale = 2

Here tcp_adv_win_scale means: carve out 1/(2^tcp_adv_win_scale) of the cache as the application cache. That is, with the default tcp_adv_win_scale = 2, at least 1/4 of the memory is reserved for the application read cache, so the receive sliding window can reach at most 3/4 of the read cache.
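The split can be written down in one line; this mirrors the kernel's tcp_win_from_space calculation for positive tcp_adv_win_scale values (negative values shift the other way; omitted here for simplicity):

```c
/* Maximum receive window that `space` bytes of read cache can advertise:
 * 1/2^scale of the space is reserved as the application cache. */
int window_from_space(int space, int adv_win_scale) {
    return space - (space >> adv_win_scale);
}
```

For example, window_from_space(65536, 2) yields 49152, i.e. 3/4 of a 64 KB cache.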

(2) How large should the maximum read cache be?

Once the application cache's share is fixed by tcp_adv_win_scale, the read-cache limit should be decided by the maximum TCP receive window. The initial window may be only 4 or 10 MSS, but without packet loss the window grows with the traffic. What does "too large" mean for a window? Not too large for the memory of the two communicating machines, but too large for the load on the network as a whole: it drives network devices into a vicious circle of packet loss under congestion. A window that is too small, on the other hand, cannot make full use of network capacity. So the maximum receive window is usually sized from the BDP (from which the maximum read cache can be calculated). BDP is the bandwidth-delay product, the product of bandwidth and network latency. For example, with 2 Gbps of bandwidth and 10 ms of delay, BDP = 2G/8 × 0.01 = 2.5 MB, so in such a network the maximum receive window can be set to 2.5 MB and the maximum read cache to 2.5 MB × 4/3 ≈ 3.3 MB.
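The arithmetic above, as a sketch (function names are ours):

```c
/* Bandwidth-delay product in bytes: bandwidth in bits/s times RTT in
 * seconds, divided by 8 bits per byte. */
long bdp_bytes(long bandwidth_bps, double rtt_seconds) {
    return (long)(bandwidth_bps / 8.0 * rtt_seconds);
}

/* Read cache that yields a given maximum window when tcp_adv_win_scale = 2:
 * 1/4 of the cache is reserved, so cache = window * 4/3. */
long read_cache_for_window(long window_bytes) {
    return window_bytes * 4 / 3;
}
```

For the 2 Gbps / 10 ms example, bdp_bytes(2000000000L, 0.01) gives 2500000 bytes (2.5 MB), and read_cache_for_window on that gives roughly 3.3 MB.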

Why? Because BDP represents the network's carrying capacity, and the maximum receive window represents how much data can be in flight, unacknowledged, within that capacity. As shown in the figure below:

[Figure: memory usage of TCP connections in Linux high-performance network programming]

The often-mentioned "long fat network": "long" means high latency, "fat" means high bandwidth. Either one makes the BDP large, which should push the maximum window up, and with it the read-cache limit. Hence servers on long fat networks carry large cache limits. (Of course, TCP's original 16-bit window field caps the window, but the window scaling defined in RFC 1323 lets the sliding window extend to a sufficiently large size.)

The send window is simply the peer's receive window on the TCP connection, so it can be inferred from the receive window; no need to belabor it here.

3. Linux's automatic adjustment of TCP cache limits

So, can we relax once the maximum cache limits are set? For a single TCP connection, perhaps: you may be fully exploiting the network, using a big window and a big cache to sustain high speed. On a long fat network the cache limit might be set to tens of MB, but system memory is finite, and if every connection runs at full speed with the maximum window, 10,000 connections would occupy hundreds of GB, which rules out high-concurrency scenarios and provides no fairness. What we want is: with few concurrent connections, relax the cache limits so each TCP connection can run flat out; with many concurrent connections, when system memory runs short, tighten the cache limits a bit so every TCP connection keeps its cache as small as possible and more connections fit.

To achieve this, Linux introduced automatic tuning of memory allocation, controlled by the tcp_moderate_rcvbuf setting:

net.ipv4.tcp_moderate_rcvbuf = 1

tcp_moderate_rcvbuf defaults to 1, meaning TCP automatic memory tuning is enabled. Setting it to 0 disables the feature (use with caution).

Also note: once we set SO_SNDBUF or SO_RCVBUF on a connection in our program, the Linux kernel no longer auto-tunes that connection!

So how does this feature work? Look at the following settings:

net.ipv4.tcp_rmem = 8192 87380 16777216
net.ipv4.tcp_wmem = 8192 65536 16777216
net.ipv4.tcp_mem = 8388608 12582912 16777216

The tcp_rmem array of three values bounds the read cache of any single TCP connection: tcp_rmem[0] is the minimum limit, tcp_rmem[1] the initial limit (note: for TCP it overrides the rmem_default setting, which applies to all protocols), and tcp_rmem[2] the maximum limit.

The tcp_wmem array of three values does the same for the write cache, analogous to tcp_rmem; no need to repeat it.

The tcp_mem array of three values controls TCP's overall memory usage, so its values are large (and its unit is not bytes but pages: 4 KB, 8 KB, or the like!). The three values define the no-pressure threshold, the pressure-mode-on threshold, and the maximum usage for TCP memory as a whole. With these three marks, memory falls into four situations:

1. When total TCP memory is below tcp_mem[0], there is no memory pressure on the system. If memory had previously risen above tcp_mem[1] and put the system into memory-pressure mode, pressure mode is now switched off.

In this situation, as long as a TCP connection's cache has not hit its limit (note: although the initial limit is tcp_rmem[1], that value is variable, detailed below), new memory allocations are guaranteed to succeed.

2. When total TCP memory is between tcp_mem[0] and tcp_mem[1], the system may be in memory-pressure mode, e.g. if the total just dropped from above tcp_mem[1]; or it may not be, e.g. if the total just rose from below tcp_mem[0].

In this range, whether or not pressure mode is active, allocations succeed as long as the connection's cache stays below tcp_rmem[0] or tcp_wmem[0]; otherwise, allocation basically fails. (Note: a few exceptional scenarios still let allocation succeed; they add little to understanding these settings, so they are omitted.)

3. When total TCP memory is between tcp_mem[1] and tcp_mem[2], the system is definitely in pressure mode. The other behavior is the same as above.

4. When total TCP memory exceeds tcp_mem[2], the system is unquestionably in pressure mode, and all new TCP cache allocations fail.

The figure below shows the kernel's simplified logic when new cache is needed:

[Figure: memory usage of TCP connections in Linux high-performance network programming]

When the system is not in pressure mode, the per-connection read/write cache limits mentioned above may be raised, though never above tcp_rmem[2] or tcp_wmem[2]. Conversely, in pressure mode the read/write cache limits may be lowered, possibly even below tcp_rmem[0] or tcp_wmem[0].

So, as a rough summary, the three arrays can be read like this:

1. Once the system's total TCP memory exceeds tcp_mem[2], new memory allocation always fails.

2. tcp_rmem[0] and tcp_wmem[0] also have high priority: as long as rule 1 is not violated, a connection whose memory is below these values is guaranteed a successful allocation.

3. As long as total memory does not exceed tcp_mem[0], a new allocation is guaranteed to succeed provided the connection stays within its cache limit.

4. tcp_mem[1] and tcp_mem[0] are the switches that turn memory-pressure mode on and off. In pressure mode, the per-connection cache limits may shrink; out of pressure mode, they may grow, up to tcp_rmem[2] or tcp_wmem[2].
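To make the summary concrete, here is a hedged sketch of the allocation decision (names and structure are ours; the real kernel logic has more cases). As with the real settings, tcp_mem values are in pages while per-connection values are in bytes; the two are never compared against each other here.

```c
enum alloc_result { ALLOC_FAIL = 0, ALLOC_OK = 1 };

/* Decide whether a TCP connection may grow its cache, following the
 * four summary rules above.  total_pages: current TCP memory in pages;
 * tcp_mem: the three thresholds; conn_bytes: this connection's cache;
 * conn_floor: tcp_rmem[0]/tcp_wmem[0]; conn_limit: the connection's
 * current (auto-tuned) cache limit. */
enum alloc_result tcp_try_alloc(long total_pages, const long tcp_mem[3],
                                long conn_bytes, long conn_floor,
                                long conn_limit) {
    if (total_pages >= tcp_mem[2])    /* rule 1: global hard limit */
        return ALLOC_FAIL;
    if (conn_bytes < conn_floor)      /* rule 2: per-connection floor */
        return ALLOC_OK;
    if (total_pages < tcp_mem[0] && conn_bytes < conn_limit)
        return ALLOC_OK;              /* rule 3: no global pressure */
    return ALLOC_FAIL;                /* rule 4: pressure region */
}
```

For example, with tcp_mem = {8388608, 12582912, 16777216}: a connection under the 8192-byte floor allocates successfully even under pressure, while one already holding 50000 bytes fails once total memory climbs past tcp_mem[0].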

This article was created by [Linux background development]; please include a link to the original when reposting. Thanks.
