1. An overview of the process of receiving packets

This section introduces the packet receiving process, which helps locate where Linux kernel network device drivers sit in it. From a macro point of view, here is the whole journey of a packet from the NIC to a socket's receive queue:

  • The NIC driver is loaded and initialized
  • A packet arrives at the NIC from the network
  • The NIC copies the packet (via DMA) into a ring buffer in kernel memory
  • The NIC raises a hardware interrupt to notify the system that a packet has arrived
  • The driver calls into NAPI; if polling is not already running, it is started
  • The ksoftirqd softirq thread calls the NAPI poll function to harvest packets from the ring buffer (the poll function is registered by the NIC driver during initialization; there is one ksoftirqd process per CPU, registered at system startup)
  • The corresponding memory region in the ring buffer is unmapped
  • If packet steering is enabled, or the NIC has multiple queues, received packets are distributed across multiple CPUs
  • Packets are handed from the queue to the protocol layers
  • The protocol layers process the packets
  • The packets are placed on the receive queue of the corresponding socket

2. Network device initialization

Let's walk through this using igb, the driver for the common Intel I350 NIC, as an example:

2.1 Initialization

The driver registers an initialization function with the kernel via module_init, and the kernel calls it when the driver is loaded. The initialization function igb_init_module is in drivers/net/ethernet/intel/igb/igb_main.c:

/*
 *  igb_init_module - Driver Registration Routine
 *
 *  igb_init_module is the first routine called when the driver is
 *  loaded. All it does is register with the PCI subsystem.
 */
static int __init igb_init_module(void)
{
  int ret;

  pr_info("%s - version %s\n", igb_driver_string, igb_driver_version);
  pr_info("%s\n", igb_copyright);

  /* ... */
  ret = pci_register_driver(&igb_driver);
  return ret;
}

Most of the initialization work happens inside pci_register_driver.

2.2 PCI initialization

The Intel I350 is a PCI Express device. PCI devices identify themselves through registers in their PCI configuration space.

PCI Express is a bus specification quite different from the old PCI bus. Where PCI used a shared parallel architecture — only one device could communicate on the bus at a time, so attaching more devices lowered the effective transfer rate of each — PCI Express connects devices point-to-point over serial links. Each PCI Express device gets its own dedicated data connection, so devices transfer data concurrently without affecting one another; while a device transmits, its channel is closed to everyone else, shielding it from interference.

When the driver is compiled, the MODULE_DEVICE_TABLE macro (defined in include/linux/module.h) exports a table of PCI device IDs. The driver uses it to declare which devices it can control, and the kernel uses the same table to load the right driver for each device.

The igb device table of PCI device IDs is in drivers/net/ethernet/intel/igb/igb_main.c and drivers/net/ethernet/intel/igb/e1000_hw.h:

static DEFINE_PCI_DEVICE_TABLE(igb_pci_tbl) = {
  { PCI_VDEVICE(INTEL, E1000_DEV_ID_I211_COPPER), board_82575 },
  { PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_COPPER), board_82575 },
  { PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_FIBER), board_82575 },
  { PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_SERDES), board_82575 },
  { PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_SGMII), board_82575 },
  { PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_COPPER_FLASHLESS), board_82575 },
  { PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_SERDES_FLASHLESS), board_82575 },
  /* ... */
};
MODULE_DEVICE_TABLE(pci, igb_pci_tbl);

As mentioned earlier, the driver calls pci_register_driver at initialization time; this registers the driver's callback methods, collected in a struct pci_driver variable (drivers/net/ethernet/intel/igb/igb_main.c):

static struct pci_driver igb_driver = {
  .name     = igb_driver_name,
  .id_table = igb_pci_tbl,
  .probe    = igb_probe,
  .remove   = igb_remove,
  /* ... */
};

2.3 Network device initialization

Once a device is identified by its PCI IDs, the kernel can select the proper driver for it. Every PCI driver registers a probe() method; the kernel calls candidate drivers' probe methods for the device in turn, and once a driver claims the device, no other drivers are tried.

Many drivers need a fair amount of code to get a device ready, and the details vary, but a typical process looks like this:

  • Enable the PCI device
  • Request memory ranges and I/O ports
  • Set the DMA mask
  • Register the ethtool methods the driver supports (covered later)
  • Set up any watchdogs that are needed (for example, e1000e has a watchdog that checks whether the hardware is hung)
  • Other device-specific work, such as workarounds or special handling of hardware quirks
  • Create, initialize, and register a struct net_device_ops variable, which holds the device-related callbacks for opening the device, transmitting data to the network, setting the MAC address, and so on
  • Create, initialize, and register a higher-level struct net_device variable (one variable represents one device)

Here is some of this work in the igb driver's igb_probe (drivers/net/ethernet/intel/igb/igb_main.c):

err = pci_enable_device_mem(pdev);
/* ... */
err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
/* ... */
err = pci_request_selected_regions(pdev, pci_select_bars(pdev,
                                   IORESOURCE_MEM), igb_driver_name);

See the kernel's PCI documentation for more details: https://github.com/torvalds/linux/blob/v3.13/Documentation/PCI/pci.txt

3. The network device starts up

igb_probe does a lot of important device initialization work. Besides the PCI-specific parts, it also does general networking and network-device work:

  • Registers the struct net_device_ops variable
  • Registers the ethtool methods
  • Reads the default MAC address from the NIC
  • Sets the net_device feature flags

3.1 struct net_device_ops

All operations on a network device are registered in a variable of type struct net_device_ops (drivers/net/ethernet/intel/igb/igb_main.c):

static const struct net_device_ops igb_netdev_ops = {
  .ndo_open               = igb_open,
  .ndo_stop               = igb_close,
  .ndo_start_xmit         = igb_xmit_frame,
  .ndo_get_stats64        = igb_get_stats64,
  .ndo_set_rx_mode        = igb_set_rx_mode,
  .ndo_set_mac_address    = igb_set_mac,
  .ndo_change_mtu         = igb_change_mtu,
  .ndo_do_ioctl           = igb_ioctl,
  /* ... */
};

This variable is assigned to the netdev_ops field of struct net_device in igb_probe():

static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
{
  /* ... */
  netdev->netdev_ops = &igb_netdev_ops;
  /* ... */
}

3.2 ethtool Function registration

ethtool is a command-line tool for viewing and changing some network device settings, commonly used to collect NIC statistics. On Ubuntu it can be installed with apt-get install ethtool. Later we will use it to monitor NIC counters.

ethtool communicates with device drivers via ioctl. The kernel implements a generic ethtool interface, and NIC drivers implement the callbacks behind it so ethtool can invoke them. When ethtool issues its system call, the kernel finds the callback for the requested operation. The callbacks implement everything from simple functions, like changing a flag value, to complex ones, like changing how the NIC hardware works.

For the igb implementation, see drivers/net/ethernet/intel/igb/igb_ethtool.c.

3.3 Soft interrupt

After a data frame is written into RAM via DMA, how does the NIC tell the rest of the system that a packet is ready to be processed?

Traditionally, the NIC raises a hardware interrupt (IRQ) to signal that a packet has arrived. There are three common kinds of hardware interrupts:

  • MSI-X
  • MSI
  • legacy IRQ

If a large number of packets arrive, a large number of hardware interrupts are generated. A CPU that is busy servicing hardware interrupts has less time available for other tasks.

NAPI (New API) is a newer mechanism that reduces the number of hardware interrupts generated (though it cannot eliminate them entirely).

3.4 NAPI

NAPI receives packets differently from the traditional path: it lets the device driver register a poll method, which is then called to harvest packets.

Using NAPI goes like this:

  • The driver enables NAPI, which is initially idle (not harvesting packets)
  • A packet arrives and the NIC writes it to memory via DMA
  • The NIC raises a hardware interrupt and the interrupt handler starts executing
  • A softirq wakes the NAPI subsystem; running in a separate thread of execution, it calls the driver-registered poll method to harvest packets
  • The driver tells the NIC to stop raising hardware interrupts, so that NAPI can harvest packets without being disturbed by new interrupts
  • Once there are no packets left to harvest, NAPI is turned off and the NIC's hardware interrupts are re-enabled
  • Go back to step 2

Compared to the traditional model, NAPI harvests multiple packets per interrupt, which is how it reduces the hardware interrupt count.

The poll method is registered with NAPI by calling netif_napi_add, which also takes a weight; most drivers hardcode it to 64.

Generally speaking, drivers register their NAPI poll method during initialization.

3.5 NAPI initialization in the igb driver

The igb driver's initialization is a long call chain:

  • igb_probe -> igb_sw_init
  • igb_sw_init -> igb_init_interrupt_scheme
  • igb_init_interrupt_scheme -> igb_alloc_q_vectors
  • igb_alloc_q_vectors -> igb_alloc_q_vector
  • igb_alloc_q_vector -> netif_napi_add

At a high level, this call chain does the following:

  • If MSI-X is supported, enable it with pci_enable_msix
  • Compute and initialize various settings, including the number of NIC receive and transmit queues
  • Call igb_alloc_q_vector to create each transmit and receive queue
  • igb_alloc_q_vector in turn calls netif_napi_add to register the poll method with NAPI

Let's look at how igb_alloc_q_vector registers the poll method and its private data (drivers/net/ethernet/intel/igb/igb_main.c):

static int igb_alloc_q_vector(struct igb_adapter *adapter,
                              int v_count, int v_idx,
                              int txr_count, int txr_idx,
                              int rxr_count, int rxr_idx)
{
  /* ... */

  /* allocate q_vector and rings */
  q_vector = kzalloc(size, GFP_KERNEL);
  if (!q_vector)
          return -ENOMEM;

  /* initialize NAPI */
  netif_napi_add(adapter->netdev, &q_vector->napi, igb_poll, 64);

  /* ... */
}

q_vector represents the newly allocated queue, and igb_poll is the poll method; when packets are harvested, the NAPI variable associated with the receive queue (q_vector->napi) is found through it.

4. Enable network card (Bring A Network Device Up)

As mentioned earlier, the struct net_device_ops variable holds callbacks (function pointers) for bringing the NIC up, transmitting packets, setting the MAC address, and so on.

When a NIC is brought up (for example, with ifconfig eth0 up), the ndo_open method of net_device_ops is called. It usually does the following:

  • Allocates RX and TX queue memory
  • Enables NAPI
  • Registers an interrupt handler
  • Enables hardware interrupts
  • Other work

In the igb driver, this method is the igb_open function.

4.1 Ready to receive data from the network

Most NICs today use DMA to write data directly into memory, where the operating system can then read it. The data structure used for this is a ring buffer.

To make this work, the device driver cooperates with the operating system to reserve a region of memory for the NIC. Once the reservation succeeds, the NIC knows the address of this memory; incoming packets are placed there and later picked up by the operating system.

Because this memory region is finite, if packets arrive fast enough that a single CPU cannot keep up with harvesting them, new packets are dropped. This is where Receive Side Scaling (RSS), also known as multiqueue, can come in handy.

Some NICs can write received packets to different memory regions, each of which is a separate receive queue. The operating system can then use multiple CPUs to process received packets in parallel, starting at the hardware level. Only some NICs support this.

The Intel I350 supports multiple queues, as we can see in the igb driver. One of the first things the driver does when brought up is call igb_setup_all_rx_resources, which calls igb_setup_rx_resources once per RX queue to set up the DMA-able memory.

The number and size of RX queues can be configured with ethtool; tuning these two parameters has a visible effect on packet processing and packet drops.

The NIC decides which RX queue a packet goes to by hashing its header fields (such as source address, destination address, and port). Only a few NICs let you adjust the hash algorithm; if yours does, you can steer specific flows to specific queues, or even drop certain packets at the hardware level.

Some NICs also support adjusting the weights of the RX queues, deliberately steering more traffic to particular queues.

4.2 Enable NAPI

We described how drivers register their NAPI poll method, but NAPI is generally not activated until the NIC is brought up.

Enabling NAPI is simple: a call to napi_enable sets a flag bit in the NAPI variable (struct napi_struct) marking it enabled. As noted before, NAPI does not start working the moment it is enabled; it waits for the first hardware interrupt.

For igb, NAPI is enabled on each q_vector during driver initialization, and again whenever the queue count or size is changed via ethtool (drivers/net/ethernet/intel/igb/igb_main.c):

for (i = 0; i < adapter->num_q_vectors; i++)
  napi_enable(&(adapter->q_vector[i]->napi));

4.3 Register interrupt handler

With NAPI enabled, the next step is to register an interrupt handler. A device can signal interrupts in several ways:

  • MSI-X
  • MSI
  • legacy interrupts

Drivers implement this differently. A driver must determine which interrupt modes the device supports, then register the corresponding interrupt handler functions, which execute when an interrupt fires.

Some drivers, igb among them, try to register a handler for each interrupt type in turn, falling back to the next type when registration fails.

MSI-X interrupts are the recommended method, especially for NICs that support multiple queues. Each RX queue gets its own MSI-X interrupt, which can be routed to a different CPU (via irqbalance, or by writing /proc/irq/IRQ_NUMBER/smp_affinity). The CPU that handles the interrupt is also the CPU that processes the packet, so received packets can be spread across CPUs starting from the hardware interrupt level.

If MSI-X is unavailable, MSI still has advantages over legacy interrupts, so the driver prefers it next.

In the igb driver, the functions igb_msix_ring, igb_intr_msi, and igb_intr are the MSI-X, MSI, and legacy interrupt handlers respectively.

Here is how the driver tries each interrupt type (drivers/net/ethernet/intel/igb/igb_main.c):

static int igb_request_irq(struct igb_adapter *adapter)
{
  struct net_device *netdev = adapter->netdev;
  struct pci_dev *pdev = adapter->pdev;
  int err = 0;

  if (adapter->msix_entries) {
    err = igb_request_msix(adapter);
    if (!err)
      goto request_done;
    /* fall back to MSI */
    /* ... */
  }

  /* ... */

  if (adapter->flags & IGB_FLAG_HAS_MSI) {
    err = request_irq(pdev->irq, igb_intr_msi, 0,
          netdev->name, adapter);
    if (!err)
      goto request_done;

    /* fall back to legacy interrupts */
    /* ... */
  }

  err = request_irq(pdev->irq, igb_intr, IRQF_SHARED,
        netdev->name, adapter);

  if (err)
    dev_err(&pdev->dev, "Error %d getting interrupt\n", err);

request_done:
  return err;
}

This is how the igb driver registers its interrupt handler, the function that executes when a packet arrives at the NIC and triggers a hardware interrupt.

4.4 Enable Interrupts

At this point, almost all the preparation is done. The only thing left is to enable hardware interrupts and wait for packets to come in. How interrupts are enabled varies with the hardware; the igb driver does it in __igb_open, via the helper function igb_irq_enable.

Interrupts are enabled by writing registers:

static void igb_irq_enable(struct igb_adapter *adapter)
{
  /* ... */
  wr32(E1000_IMS, IMS_ENABLE_MASK | E1000_IMS_DRSTA);
  wr32(E1000_IAM, IMS_ENABLE_MASK | E1000_IMS_DRSTA);
  /* ... */
}

At this point the NIC is up. The driver may do some extra work, such as starting timers or work queues, or other hardware-specific setup; once that is done, the NIC is ready to receive packets.

5. Network card monitoring

There are several different ways to monitor network devices, each with a different monitoring granularity and complexity. Let's start with the coarsest granularity and gradually refine.

5.1 ethtool -S

ethtool -S shows NIC statistics (for example, totals of packets received and sent, bytes received and sent, dropped packets, error packets, and so on):

(screenshot of ethtool -S output omitted)


Monitoring these counters is harder than getting them: they are easy to read from the command line, but there is no unified standard for the field names. Different drivers, and even different versions of the same driver, may expose different fields.

You can start with a rough scan for values like "drop", "buffer", "miss", and so on. Then find where the driver source updates each field; some are updated in software, others through hardware registers. For the latter you have to consult the NIC's data sheet to learn what the register really counts. Some of the field names ethtool reports are misleading.

5.2 sysfs

sysfs also provides statistics, at a higher level than the NIC-layer counters above.

For example, to see what receive-side counters are available for ens33:

(screenshot of the ens33 rx_* statistics files omitted)


And to get the total number of packets received:

(screenshot of reading the rx_packets counter omitted)


Statistics of different types live in separate files under /sys/class/net/<NIC>/statistics/, including collisions, rx_dropped, rx_errors, rx_missed_errors, and so on.

Note that what each type means is driver-dependent, so when and where these counters are updated is also up to the driver. You may find that one driver classifies a given class of errors as drops while another classifies them as misses.

If these values are critical to you, you must read the corresponding NIC driver source to learn what they really stand for.

5.3 /proc/net/dev

/proc/net/dev provides a yet higher-level view of NIC statistics:

(screenshot of /proc/net/dev contents omitted)


The statistics shown in this file are only a subset of the sysfs ones, but the file makes a convenient routine reference.

If you need high confidence in these numbers, you have to read the kernel source, the driver source, and the driver documentation to figure out what each field really means and how and when the counters are updated. That's all for Linux kernel network device drivers; thanks for reading.


Original: https://www.toutiao.com/i6907115700997915147/