| 1 | +Linux kernel driver for Elastic Network Adapter (ENA) family: |
| 2 | +============================================================= |
| 3 | + |
| 4 | +Overview: |
| 5 | +========= |
| 6 | +ENA is a networking interface designed to make good use of modern CPU |
| 7 | +features and system architectures. |
| 8 | + |
| 9 | +The ENA device exposes a lightweight management interface with a |
| 10 | +minimal set of memory-mapped registers and an extendable command set |
| 11 | +through an Admin Queue. |
| 12 | + |
| 13 | +The driver supports a range of ENA devices, is link-speed independent |
| 14 | +(i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc.), and has |
| 15 | +a negotiated and extendable feature set. |
| 16 | + |
| 17 | +Some ENA devices support SR-IOV. This driver is used for both the |
| 18 | +SR-IOV Physical Function (PF) and Virtual Function (VF) devices. |
| 19 | + |
| 20 | +ENA devices enable high-speed, low-overhead network traffic |
| 21 | +processing by providing multiple Tx/Rx queue pairs (the maximum number |
| 22 | +is advertised by the device via the Admin Queue), a dedicated MSI-X |
| 23 | +interrupt vector per Tx/Rx queue pair, adaptive interrupt moderation, |
| 24 | +and CPU cacheline optimized data placement. |
| 25 | + |
| 26 | +The ENA driver supports industry standard TCP/IP offload features such |
| 27 | +as checksum offload and TCP transmit segmentation offload (TSO). |
| 28 | +Receive-side scaling (RSS) is supported for multi-core scaling. |
| 29 | + |
| 30 | +The ENA driver and its corresponding devices implement health |
| 31 | +monitoring mechanisms such as a watchdog, enabling the device and driver |
| 32 | +to recover in a manner transparent to the application, as well as |
| 33 | +debug logs. |
| 34 | + |
| 35 | +Some of the ENA devices support a working mode called Low-latency |
| 36 | +Queue (LLQ), which reduces latency by several more microseconds. |
| 37 | + |
| 38 | +Supported PCI vendor ID/device IDs: |
| 39 | +=================================== |
| 40 | +1d0f:0ec2 - ENA PF |
| 41 | +1d0f:1ec2 - ENA PF with LLQ support |
| 42 | +1d0f:ec20 - ENA VF |
| 43 | +1d0f:ec21 - ENA VF with LLQ support |
| 44 | + |
| 45 | +ENA Source Code Directory Structure: |
| 46 | +==================================== |
| 47 | +ena_com.[ch] - Management communication layer. This layer is |
| 48 | + responsible for handling all the management |
| 49 | + (admin) communication between the device and the |
| 50 | + driver. |
| 51 | +ena_eth_com.[ch] - Tx/Rx data path. |
| 52 | +ena_admin_defs.h - Definition of ENA management interface. |
| 53 | +ena_eth_io_defs.h - Definition of ENA data path interface. |
| 54 | +ena_common_defs.h - Common definitions for ena_com layer. |
| 55 | +ena_regs_defs.h - Definition of ENA PCI memory-mapped (MMIO) registers. |
| 56 | +ena_netdev.[ch] - Main Linux kernel driver. |
| 57 | +ena_sysfs.[ch] - Sysfs files. |
| 58 | +ena_ethtool.c - ethtool callbacks. |
| 59 | +ena_pci_id_tbl.h - Supported device IDs. |
| 60 | + |
| 61 | +Management Interface: |
| 62 | +===================== |
| 63 | +The ENA management interface is exposed by means of: |
| 64 | +- PCIe Configuration Space |
| 65 | +- Device Registers |
| 66 | +- Admin Queue (AQ) and Admin Completion Queue (ACQ) |
| 67 | +- Asynchronous Event Notification Queue (AENQ) |
| 68 | + |
| 69 | +ENA device MMIO Registers are accessed only during driver |
| 70 | +initialization and are not involved in further normal device |
| 71 | +operation. |
| 72 | + |
| 73 | +AQ is used for submitting management commands, and the |
| 74 | +results/responses are reported asynchronously through ACQ. |
| 75 | + |
| 76 | +ENA introduces a very small set of management commands with room for |
| 77 | +vendor-specific extensions. Most of the management operations are |
| 78 | +framed in a generic Get/Set feature command. |
| 79 | + |
| 80 | +The following admin queue commands are supported: |
| 81 | +- Create I/O submission queue |
| 82 | +- Create I/O completion queue |
| 83 | +- Destroy I/O submission queue |
| 84 | +- Destroy I/O completion queue |
| 85 | +- Get feature |
| 86 | +- Set feature |
| 87 | +- Configure AENQ |
| 88 | +- Get statistics |
| 89 | + |
| 90 | +Refer to ena_admin_defs.h for the list of supported Get/Set Feature |
| 91 | +properties. |
| 92 | + |
| 93 | +The Asynchronous Event Notification Queue (AENQ) is a uni-directional |
| 94 | +queue used by the ENA device to send to the driver events that cannot |
| 95 | +be reported using ACQ. AENQ events are subdivided into groups. Each |
| 96 | +group may have multiple syndromes, as shown below. |
| 97 | + |
| 98 | +The events are: |
| 99 | + Group              Syndrome |
| 100 | + Link state change  - X - |
| 101 | + Fatal error        - X - |
| 102 | + Notification       Suspend traffic |
| 103 | + Notification       Resume traffic |
| 104 | + Keep-Alive         - X - |
| 105 | + |
| 106 | +ACQ and AENQ share the same MSI-X vector. |
| 107 | + |
| 108 | +Keep-Alive is a special mechanism that allows monitoring of the |
| 109 | +device's health. The driver maintains a watchdog (WD) handler which, |
| 110 | +if fired, logs the current state and statistics, then resets and |
| 111 | +restarts the ENA device and driver. A Keep-Alive event is delivered by |
| 112 | +the device every second. The driver re-arms the WD upon reception of a |
| 113 | +Keep-Alive event. A missed Keep-Alive event causes the WD handler to |
| 114 | +fire. |
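
The re-arm logic described above can be sketched as a small Python model.
This is only an illustration, not driver code: the class name, the
timeout_s grace period and the polling style are assumptions introduced
here, and the reset is reduced to a counter.

```python
class KeepAliveWatchdog:
    """Toy model of the Keep-Alive watchdog (hypothetical names)."""

    def __init__(self, timeout_s: float, now: float = 0.0):
        self.timeout_s = timeout_s       # assumed grace period, not from the text
        self.last_keep_alive = now
        self.resets = 0

    def on_keep_alive(self, now: float) -> None:
        # Re-arm the watchdog each time a Keep-Alive event arrives.
        self.last_keep_alive = now

    def check(self, now: float) -> bool:
        # Fire if no Keep-Alive arrived within the timeout window.
        if now - self.last_keep_alive > self.timeout_s:
            self.resets += 1             # stands in for log + reset + restart
            self.last_keep_alive = now   # start over after the simulated reset
            return True
        return False


wd = KeepAliveWatchdog(timeout_s=3.0)
for t in (1.0, 2.0, 3.0):                # device delivers one event per second
    wd.on_keep_alive(t)
healthy_check = wd.check(4.0)            # last event was 1 s ago
missed_check = wd.check(7.5)             # last event was 4.5 s ago
```

In this model the first check leaves the device alone and the second one
triggers a simulated reset.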
| 115 | + |
| 116 | +Data Path Interface: |
| 117 | +==================== |
| 118 | +I/O operations are based on Tx and Rx Submission Queues (Tx SQ and Rx |
| 119 | +SQ, respectively). Each SQ has a completion queue (CQ) associated |
| 120 | +with it. |
| 121 | + |
| 122 | +The SQs and CQs are implemented as descriptor rings in contiguous |
| 123 | +physical memory. |
| 124 | + |
| 125 | +The ENA driver supports two Queue Operation modes for Tx SQs: |
| 126 | +- Regular mode |
| 127 | + * In this mode the Tx SQs reside in the host's memory. The ENA |
| 128 | + device fetches the ENA Tx descriptors and packet data from host |
| 129 | + memory. |
| 130 | +- Low Latency Queue (LLQ) mode or "push-mode". |
| 131 | + * In this mode the driver pushes the transmit descriptors and the |
| 132 | + first 128 bytes of the packet directly to the ENA device memory |
| 133 | + space. The rest of the packet payload is fetched by the |
| 134 | + device. For this operation mode, the driver uses a dedicated PCI |
| 135 | + device memory BAR, which is mapped with write-combine capability. |
| 136 | + |
| 137 | +The Rx SQs support only the regular mode. |
| 138 | + |
| 139 | +Note: Not all ENA devices support LLQ, and this feature is negotiated |
| 140 | + with the device upon initialization. If the ENA device does not |
| 141 | + support LLQ mode, the driver falls back to the regular mode. |
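
The difference between the two Tx modes can be illustrated with a short
Python sketch. It only models which bytes are pushed and which are fetched;
the function name and the llq_enabled flag are hypothetical, and descriptor
handling is left out entirely.

```python
LLQ_PUSH_BYTES = 128  # per the text: the first 128 bytes are pushed


def split_for_llq(packet: bytes, llq_enabled: bool):
    """Return (pushed_by_driver, fetched_by_device) for one Tx packet."""
    if not llq_enabled:
        # Regular mode: the device fetches the whole packet from host memory.
        return b"", packet
    # LLQ mode: the head of the packet is written to device memory,
    # the remainder is still fetched by the device.
    return packet[:LLQ_PUSH_BYTES], packet[LLQ_PUSH_BYTES:]


pkt = bytes(range(256)) * 2              # 512-byte dummy packet
pushed, fetched = split_for_llq(pkt, llq_enabled=True)
fallback = split_for_llq(pkt, llq_enabled=False)
```

A packet shorter than 128 bytes ends up entirely in the pushed part in this
model.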
| 142 | + |
| 143 | +The driver supports multi-queue for both Tx and Rx. This has various |
| 144 | +benefits: |
| 145 | +- Reduced CPU/thread/process contention on a given Ethernet interface. |
| 146 | +- Cache miss rate on completion is reduced, particularly for data |
| 147 | + cache lines that hold the sk_buff structures. |
| 148 | +- Increased process-level parallelism when handling received packets. |
| 149 | +- Increased data cache hit rate, by steering kernel processing of |
| 150 | + packets to the CPU where the application thread consuming the |
| 151 | + packet is running. |
| 152 | +- In-hardware interrupt redirection. |
| 153 | + |
| 154 | +Interrupt Modes: |
| 155 | +================ |
| 156 | +The driver assigns a single MSI-X vector per queue pair (for both Tx |
| 157 | +and Rx directions). The driver assigns an additional dedicated MSI-X vector |
| 158 | +for management (for ACQ and AENQ). |
| 159 | + |
| 160 | +Management interrupt registration is performed when the Linux kernel |
| 161 | +probes the adapter, and it is de-registered when the adapter is |
| 162 | +removed. I/O queue interrupt registration is performed when the Linux |
| 163 | +interface of the adapter is opened, and it is de-registered when the |
| 164 | +interface is closed. |
| 165 | + |
| 166 | +The management interrupt is named: |
| 167 | + ena-mgmnt@pci:<PCI domain:bus:slot.function> |
| 168 | +and for each queue pair, an interrupt is named: |
| 169 | + <interface name>-Tx-Rx-<queue index> |
| 170 | + |
| 171 | +The ENA device operates in auto-mask and auto-clear interrupt |
| 172 | +modes. That is, once an MSI-X interrupt is delivered to the host, its |
| 173 | +Cause bit is automatically cleared and the interrupt is masked. The |
| 174 | +interrupt is unmasked by the driver after NAPI processing is complete. |
| 175 | + |
| 176 | +Interrupt Moderation: |
| 177 | +===================== |
| 178 | +The ENA driver and device can operate in conventional or adaptive |
| 179 | +interrupt moderation mode. |
| 180 | + |
| 181 | +In conventional mode, the driver instructs the device to postpone interrupt |
| 182 | +posting according to a static interrupt delay value. The interrupt delay |
| 183 | +value can be configured through ethtool(8). The following ethtool |
| 184 | +parameters are supported by the driver: tx-usecs, rx-usecs. |
| 185 | + |
| 186 | +In adaptive interrupt moderation mode, the interrupt delay value is |
| 187 | +updated by the driver dynamically and adjusted every NAPI cycle |
| 188 | +according to the nature of the traffic. |
| 189 | + |
| 190 | +By default, the ENA driver applies adaptive coalescing on Rx traffic and |
| 191 | +conventional coalescing on Tx traffic. |
| 192 | + |
| 193 | +Adaptive coalescing can be switched on or off through the ethtool(8) |
| 194 | +adaptive-rx on|off parameter. |
| 195 | + |
| 196 | +The driver chooses the interrupt delay value according to the number of |
| 197 | +bytes and packets received between interrupt unmasking and interrupt |
| 198 | +posting. The driver uses an interrupt delay table that subdivides the |
| 199 | +range of received bytes/packets into 5 levels and assigns an interrupt |
| 200 | +delay value to each level. |
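
A table-driven lookup of this kind could look roughly like the Python sketch
below. Every threshold and delay value in it is a made-up placeholder, since
the text does not give the defaults, and the selection rule (first level
whose packet and byte limits both hold) is an assumption rather than the
driver's actual algorithm.

```python
from typing import NamedTuple


class Level(NamedTuple):
    max_packets: int
    max_bytes: int
    delay_usecs: int


# Hypothetical 5-level table; real default values are not given in the text.
DELAY_TABLE = [
    Level(4, 1_024, 0),
    Level(16, 16_384, 16),
    Level(64, 65_536, 32),
    Level(256, 262_144, 64),
    Level(2**31, 2**31, 128),            # catch-all top level
]


def pick_delay(packets: int, nbytes: int, table=DELAY_TABLE) -> int:
    """Pick the delay of the first level that covers the observed traffic."""
    for level in table:
        if packets <= level.max_packets and nbytes <= level.max_bytes:
            return level.delay_usecs
    return table[-1].delay_usecs         # traffic beyond every level


light = pick_delay(packets=1, nbytes=64)
bulk = pick_delay(packets=10, nbytes=100_000)
```

Running the lookup once per NAPI cycle, with the counts gathered since the
interrupt was last unmasked, matches the cadence described above.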
| 201 | + |
| 202 | +The user can enable/disable adaptive moderation, modify the interrupt |
| 203 | +delay table and restore its default values through sysfs. |
| 204 | + |
| 205 | +The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK |
| 206 | +and can be configured by the ETHTOOL_STUNABLE command of the |
| 207 | +SIOCETHTOOL ioctl. |
| 208 | + |
| 209 | +SKB: |
| 210 | +The driver allocates SKBs for received frames during Rx handling, in |
| 211 | +NAPI context. The allocation method depends on the size of the packet. |
| 212 | +If the frame length is larger than rx_copybreak, napi_get_frags() |
| 213 | +is used. Otherwise, netdev_alloc_skb_ip_align() is used, the buffer |
| 214 | +content is copied (by the CPU) to the SKB, and the buffer is recycled. |
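
The size-based choice can be summarized in a few lines of Python. This is a
decision model only: the function returns the name of the kernel helper the
text mentions plus a flag saying whether the Rx buffer is recycled, and the
rx_copybreak value used in the example is a placeholder, not
ENA_DEFAULT_RX_COPYBREAK.

```python
def rx_alloc_method(frame_len: int, rx_copybreak: int):
    """Return (allocation helper named in the text, buffer_recycled)."""
    if frame_len > rx_copybreak:
        # Large frame: the Rx buffer itself is attached to the SKB.
        return "napi_get_frags", False
    # Small frame: payload is copied, so the Rx buffer can be reused.
    return "netdev_alloc_skb_ip_align", True


EXAMPLE_COPYBREAK = 256                  # placeholder value for illustration
small = rx_alloc_method(64, EXAMPLE_COPYBREAK)
large = rx_alloc_method(1500, EXAMPLE_COPYBREAK)
```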
| 215 | + |
| 216 | +Statistics: |
| 217 | +=========== |
| 218 | +The user can obtain ENA device and driver statistics using ethtool. |
| 219 | +The driver can collect regular or extended statistics (including |
| 220 | +per-queue stats) from the device. |
| 221 | + |
| 222 | +In addition the driver logs the stats to syslog upon device reset. |
| 223 | + |
| 224 | +MTU: |
| 225 | +==== |
| 226 | +The driver supports an arbitrarily large MTU with a maximum that is |
| 227 | +negotiated with the device. The driver configures MTU using the |
| 228 | +SetFeature command (ENA_ADMIN_MTU property). The user can change MTU |
| 229 | +via ip(8) and similar tools. |
| 230 | + |
| 231 | +Stateless Offloads: |
| 232 | +=================== |
| 233 | +The ENA driver supports: |
| 234 | +- TSO over IPv4/IPv6 |
| 235 | +- TSO with ECN |
| 236 | +- IPv4 header checksum offload |
| 237 | +- TCP/UDP over IPv4/IPv6 checksum offloads |
| 238 | + |
| 239 | +RSS: |
| 240 | +==== |
| 241 | +- The ENA device supports RSS that allows flexible Rx traffic |
| 242 | + steering. |
| 243 | +- Toeplitz and CRC32 hash functions are supported. |
| 244 | +- Different combinations of L2/L3/L4 fields can be configured as |
| 245 | + inputs for hash functions. |
| 246 | +- The driver configures RSS settings using the AQ SetFeature command |
| 247 | + (ENA_ADMIN_RSS_HASH_FUNCTION, ENA_ADMIN_RSS_HASH_INPUT and |
| 248 | + ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG properties). |
| 249 | +- If the NETIF_F_RXHASH flag is set, the 32-bit result of the hash |
| 250 | + function delivered in the Rx CQ descriptor is set in the received |
| 251 | + SKB. |
| 252 | +- The user can provide a hash key, hash function, and configure the |
| 253 | + indirection table through ethtool(8). |
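
To make the hash-then-steer flow concrete, here is a Python sketch of a
generic Toeplitz hash followed by an indirection table lookup. It is a
software model written for illustration: the key, the table contents and the
modulo-based table index are assumptions, and nothing here reflects how the
ENA device computes or masks the hash internally.

```python
def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Generic 32-bit Toeplitz hash of data under key."""
    if len(key) < len(data) + 4:
        raise ValueError("key must be at least 4 bytes longer than data")
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i in range(len(data) * 8):
        # For every set input bit (MSB first), XOR in the 32-bit key
        # window that starts at the same bit offset.
        if data[i // 8] & (0x80 >> (i % 8)):
            result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return result


def pick_queue(hash32: int, indirection_table) -> int:
    # Assumed lookup rule: index the table with the hash modulo its size.
    return indirection_table[hash32 % len(indirection_table)]


example_key = bytes(range(1, 41))        # arbitrary 40-byte key
example_table = [0, 1, 2, 3] * 32        # 128 entries spread over 4 queues
flow = bytes([10, 0, 0, 1, 10, 0, 0, 2, 0x1F, 0x90, 0x00, 0x50])
queue = pick_queue(toeplitz_hash(example_key, flow), example_table)
```

Because the hash depends only on the selected header fields, all packets of
one flow land on the same queue until the key or the table is changed.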
| 254 | + |
| 255 | +DATA PATH: |
| 256 | +========== |
| 257 | +Tx: |
| 258 | +--- |
| 259 | +ena_start_xmit() is called by the stack. This function does the following: |
| 260 | +- Maps data buffers (skb->data and frags). |
| 261 | +- Populates ena_buf for the push buffer (if the driver and device are |
| 262 | + in push mode). |
| 263 | +- Prepares ENA bufs for the remaining frags. |
| 264 | +- Allocates a new request ID from the empty req_id ring. The request |
| 265 | + ID is the index of the packet in the Tx info. This is used for |
| 266 | + out-of-order TX completions. |
| 267 | +- Adds the packet to the proper place in the Tx ring. |
| 268 | +- Calls ena_com_prepare_tx(), an ENA communication layer function that |
| 269 | + converts the ena_bufs to ENA descriptors (and adds meta ENA |
| 270 | + descriptors as needed). |
| 271 | + * This function also copies the ENA descriptors and the push buffer |
| 272 | + to the device memory space (if in push mode). |
| 273 | +- Writes doorbell to the ENA device. |
| 274 | +- When the ENA device finishes sending the packet, a completion |
| 275 | + interrupt is raised. |
| 276 | +- The interrupt handler schedules NAPI. |
| 277 | +- The ena_clean_tx_irq() function is called. This function handles the |
| 278 | + completion descriptors generated by the ENA, with a single |
| 279 | + completion descriptor per completed packet. |
| 280 | + * req_id is retrieved from the completion descriptor. The tx_info of |
| 281 | + the packet is retrieved via the req_id. The data buffers are |
| 282 | + unmapped and req_id is returned to the empty req_id ring. |
| 283 | + * The function stops when the completion descriptors are completed or |
| 284 | + the budget is reached. |
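
The request ID bookkeeping in the steps above can be modelled in Python as
follows. The class and method names are hypothetical, packets are plain
strings, and DMA mapping, descriptors and doorbells are omitted; the sketch
only shows how a free req_id ring supports out-of-order completions.

```python
from collections import deque


class TxRing:
    """Toy model of req_id allocation and Tx completion (not driver code)."""

    def __init__(self, size: int):
        self.free_req_ids = deque(range(size))   # the "empty req_id ring"
        self.tx_info = [None] * size             # per-packet context by req_id

    def start_xmit(self, skb):
        if not self.free_req_ids:
            return None                          # ring full in this model
        req_id = self.free_req_ids.popleft()
        self.tx_info[req_id] = skb
        return req_id

    def clean_tx(self, completed_req_ids, budget: int):
        done = []
        # Stop when completions run out or the budget is reached.
        for req_id in completed_req_ids[:budget]:
            done.append(self.tx_info[req_id])
            self.tx_info[req_id] = None          # stands in for unmapping
            self.free_req_ids.append(req_id)     # return req_id to the ring
        return done


ring = TxRing(size=4)
ids = [ring.start_xmit(f"pkt{i}") for i in range(4)]
overflow = ring.start_xmit("pkt4")               # no free req_id left
completed = ring.clean_tx([2, 0, 3], budget=2)   # out-of-order completions
reused = ring.start_xmit("pkt5")
```

Since completions carry the req_id, the order in which the device finishes
packets does not have to match the order in which they were queued.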
| 285 | + |
| 286 | +Rx: |
| 287 | +--- |
| 288 | +- When a packet is received, the ENA device raises an interrupt. |
| 289 | +- The interrupt handler schedules NAPI. |
| 290 | +- The ena_clean_rx_irq() function is called. This function calls |
| 291 | + ena_rx_pkt(), an ENA communication layer function, which returns the |
| 292 | + number of descriptors used for a new unhandled packet, and zero if |
| 293 | + no new packet is found. |
| 294 | +- Then it calls the ena_eth_rx_skb() function. |
| 295 | +- ena_eth_rx_skb() checks packet length: |
| 296 | + * If the packet is small (len < rx_copybreak), the driver allocates |
| 297 | + an SKB for the new packet, and copies the packet payload into the |
| 297 | + a SKB for the new packet, and copies the packet payload into the |
| 298 | + SKB data buffer. |
| 299 | + - In this way the original data buffer is not passed to the stack |
| 300 | + and is reused for future Rx packets. |
| 301 | + * Otherwise the function unmaps the Rx buffer, then allocates the |
| 302 | + new SKB structure and hooks the Rx buffer to the SKB frags. |
| 303 | +- The new SKB is updated with the necessary information (protocol, |
| 304 | + checksum hw verify result, etc.), and then passed to the network |
| 305 | + stack, using the NAPI interface function napi_gro_receive(). |