NAPI_HOWTO.txt
HISTORY:
February 16/2002 -- revision 0.2.1:
  COR typo corrected
February 10/2002 -- revision 0.2:
  some spell checking ;->
January 12/2002 -- revision 0.1
  This is still a work in progress, so it may change.
  To keep up to date please watch this space.
Introduction to NAPI
====================

NAPI is a proven (www.cyberus.ca/~hadi/usenix-paper.tgz) technique
to improve network performance on Linux. For more details please
read that paper.

NAPI provides an "inherent mitigation" which is bound by system capacity,
as can be seen from the following data collected by Robert on Gigabit
Ethernet (e1000):
 Psize    Ipps       Tput     Rxint     Txint    Done     Ndone
 ---------------------------------------------------------------
   60    890000     409362        17     27622       7      6823
  128    758150     464364        21      9301      10      7738
  256    445632     774646        42     15507      21     12906
  512    232666     994445    241292     19147  241192      1062
 1024    119061    1000003    872519     19258  872511         0
 1440     85193    1000003    946576     19505  946569         0

Legend:
"Psize" == packet size in bytes.
"Ipps"  == input packets per second.
"Tput"  == packets out of a total of 1M that made it out.
"Rxint" == receive interrupts seen.
"Txint" == transmit completion interrupts seen.
"Done"  == the number of times that poll() managed to pull all
           packets out of the rx ring. Note from this that the lower
           the load, the more often we could clean up the rx ring.
"Ndone" == the converse of "Done". Note again that the higher the
           load, the more often we couldn't clean up the rx ring.
Observe that:
when the NIC receives 890K packets/sec, only 17 rx interrupts are
generated. The system can't handle the processing at 1 interrupt/packet
at that load level. At lower rates, on the other hand, rx interrupts go
up and therefore the interrupt/packet ratio goes up (as observable from
the table). So there is the possibility that, under low enough input,
you get one poll call for each input packet, caused by a single
interrupt each time. And if the system can't handle an interrupt/packet
ratio of 1, then it will just have to chug along ....
0) Prerequisites:
==================
A driver MAY continue using the old 2.4 technique for interfacing
to the network stack and not benefit from the NAPI changes.
NAPI additions to the kernel do not break backward compatibility.
NAPI, however, requires the following features to be available:

A) A DMA ring, or enough RAM to store packets in software devices.

B) The ability to turn off interrupts, or at least the events that send
packets up the stack.
NAPI processes packet events in what is known as the dev->poll() method.
Typically, only packet receive events are processed in dev->poll().
The rest of the events MAY be processed by the regular interrupt handler
to reduce processing latency (justified also because there are not that
many of them).
Note, however, that NAPI does not enforce that dev->poll() only processes
receive events.
Tests with the tulip driver indicated slightly increased latency if
all of the interrupt handling is moved to dev->poll(). Also, MII handling
gets a little trickier.
The example used in this document moves only the receive processing
to dev->poll(); this is shown with the patch for the tulip driver.
For an example of code that moves all of the interrupt handling to
dev->poll(), look at the ported e1000 code.

There are caveats that might force you to move everything to
dev->poll(). Different NICs work differently depending on their
status/event acknowledgement setup.
There are two types of event register ACK mechanisms.

I) What is known as Clear-on-read (COR):
when you read the status/event register, it clears everything!
The natsemi and sunbmac NICs are known to do this.
In this case your only choice is to move everything to dev->poll().

II) Clear-on-write (COW):
 i) You clear the status by writing a 1 in the bit location you want
    cleared. These are the majority of the NICs, and they work best
    with NAPI. Put only receive events in dev->poll(); leave the rest
    in the old interrupt handler. (A small sketch of this selective
    acknowledgement appears at the end of this section.)
 ii) Whatever you write in the status register clears everything ;->
    We can't seem to find any chip supported by Linux which does this.
    If someone knows of such a chip, please email us.
    Move everything to dev->poll().
C) The ability to detect new work correctly.
NAPI works by shutting down event interrupts when there's work and
turning them on when there's none.
New packets might show up in the small window while interrupts are being
re-enabled (refer to appendix 2). A packet might sneak in during the period
we are enabling interrupts. We only get to know about such a packet when the
next new packet arrives and generates an interrupt.
Essentially, there is a small window of opportunity for a race condition,
which for clarity we'll refer to as the "rotting packet".
This is a very important topic, and appendix 2 is dedicated to more
discussion of it.
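As promised above, here is a minimal sketch of the selective ACK for a
COW type (i) NIC. The register and bit names (MY_STATUS_REG, RX_INT,
RX_NOBUF) are hypothetical; substitute your hardware's own:

static inline void ack_non_rx_ints(long ioaddr, u32 status)
{
	/* Write 1s back to every status bit EXCEPT the rx-related ones;
	 * rx and rxnobuff are acknowledged later, from within
	 * dev->poll()/refill_rx_ring(), where the real work is done. */
	writel(status & ~(RX_INT | RX_NOBUF), ioaddr + MY_STATUS_REG);
}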
Locking rules and environmental guarantees
==========================================

-Guarantee: Only one CPU at any time can call dev->poll(); this is
because only one CPU can pick up the initial interrupt and hence the
initial netif_rx_schedule(dev);
- The core layer invokes devices to send packets in a round-robin format.
This implies receive is totally lockless because of the guarantee that
only one CPU is executing it.
- Contention can only be the result of some other CPU accessing the rx
ring. This happens only in close() and suspend() (when these methods
try to clean the rx ring);
****guarantee: driver authors need not worry about this; synchronization
is taken care of for them by the top net layer.
- Local interrupts are enabled (if you don't move everything to
dev->poll()). For example, link/MII and tx-complete continue functioning
just the same old way. This improves the latency of processing these
events. It is also assumed that the receive interrupt is the largest
cause of noise. Note this might not always be true.
[According to Manfred Spraul, the winbond insists on sending one
tx-complete interrupt for each packet (although this can be mitigated)].
For these broken drivers, move everything to dev->poll().

For the rest of this text, we'll assume that dev->poll() only
processes receive events.
New methods introduced by NAPI
==============================

a) netif_rx_schedule(dev)
Called by an IRQ handler to schedule a poll for the device.

b) netif_rx_schedule_prep(dev)
Puts the device in a state which allows it to be added to the
CPU polling list if it is up and running. You can look at this as
the first half of netif_rx_schedule(dev) above; the second half
is c) below.

c) __netif_rx_schedule(dev)
Adds the device to the poll list for this CPU, assuming that _prep above
has already been called and returned 1.

d) netif_rx_reschedule(dev, undo)
Called to reschedule polling for the device, specifically for some
deficient hardware. Read Appendix 2 for more details.

e) netif_rx_complete(dev)
Removes the interface from the CPU poll list: it must be in the poll
list on the current CPU. This primitive is called by dev->poll() when
it completes its work. The device cannot be out of the poll list at this
call; if it is, then clearly it is a BUG(). You'll know ;->

All the above methods are used below, so keep reading for clarity.
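For orientation, netif_rx_schedule() is (roughly, in kernels of this
era) just the composition of the two halves described above:

/* Rough sketch of the relationship between a), b) and c); the actual
 * kernel inline is equivalent in spirit. */
static inline void netif_rx_schedule(struct net_device *dev)
{
	if (netif_rx_schedule_prep(dev))
		__netif_rx_schedule(dev);
}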
Device driver changes to be made when porting NAPI
==================================================

Below we describe what kind of changes are required for NAPI to work.

1) Introduction of the dev->poll() method
=========================================
This is the method that is invoked by the network core when it requests
new packets from the driver. A driver is allowed to send up to
dev->quota packets by the current CPU before yielding to the network
subsystem (so other devices can also get the opportunity to send to
the stack).

The dev->poll() prototype looks as follows:

int my_poll(struct net_device *dev, int *budget)

budget is the remaining number of packets the network subsystem on the
current CPU can send up the stack before yielding to other system tasks.
*Each driver is responsible for decrementing budget by the total number
of packets sent.
The total number of packets cannot exceed dev->quota.
The dev->poll() method is invoked by the top layer; the driver just
sends up the stack the packet quantity requested, if it can.
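Stripped of all driver detail, the contract looks like this (a minimal
skeleton only; the pseudo-condition no_more_packets_on_ring stands in
for the real ring check, and the full version is developed in section 4):

/* Minimal sketch of the dev->poll() contract; the real work
 * (ring walking, refill, irq re-enabling) is filled in later. */
static int my_poll(struct net_device *dev, int *budget)
{
	int received = 0;

	/* ... send up to min(*budget, dev->quota) packets, counting
	 * each one in 'received' ... */

	dev->quota -= received;
	*budget -= received;

	if (no_more_packets_on_ring) {
		netif_rx_complete(dev);		/* leave the poll list */
		enable_rx_and_rxnobuff_ints();
		return 0;			/* done */
	}
	return 1;				/* not done: poll me again */
}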
More on dev->poll() below, after the interrupt changes are explained.
2) Registering the dev->poll() method
=====================================

dev->poll should be set in the dev->probe() method,
e.g.:

dev->open = my_open;
.
.
/* two new additions */
/* first register my poll method */
dev->poll = my_poll;
/* next register my weight/quanta; can be overridden in /proc */
dev->weight = 16;
.
.
dev->stop = my_close;
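For background (this is core behaviour, not something the driver
implements): the weight is the amount by which the core replenishes
dev->quota each time the device gets a turn. A very simplified sketch
of the 2.4-era net_rx_action() logic; the helper names here
(polling_list_not_empty etc.) are pseudo-code, for illustration only:

/* Simplified: how the core consumes budget/quota. The real
 * net_rx_action() also handles locking and time limits. */
while (polling_list_not_empty && budget > 0) {
	dev = first_device_on_poll_list();
	if (dev->quota <= 0 || dev->poll(dev, &budget)) {
		/* not done: rotate to the back and refill its quota */
		move_to_tail_of_poll_list(dev);
		dev->quota += dev->weight;	/* roughly what the core does */
	}
	/* else: done; the driver already removed itself from the
	 * poll list via netif_rx_complete() */
}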
3) Scheduling dev->poll()
=========================

This involves modifying the interrupt handler and the code
path which takes the packets off the NIC and sends them to the
stack.

It's important at this point to introduce the classical D. Becker
interrupt processor:
------------------
static void
netdevice_interrupt(int irq, void *dev_instance, struct pt_regs *regs)
{
	struct net_device *dev = (struct net_device *)dev_instance;
	struct my_private *tp = (struct my_private *)dev->priv;
	int work_count = my_work_count;
	u32 status;

	status = read_interrupt_status_reg();
	if (status == 0)
		return;		/* Shared IRQ: not us */
	if (status == 0xffff)
		return;		/* Hot unplug */
	if (status & error)
		do_some_error_handling();

	do {
		acknowledge_ints_ASAP();

		if (status & link_interrupt) {
			spin_lock(&tp->link_lock);
			do_some_link_stat_stuff();
			spin_unlock(&tp->link_lock);
		}

		if (status & rx_interrupt) {
			receive_packets(dev);
		}

		if (status & rx_nobufs) {
			make_rx_buffs_avail();
		}

		if (status & tx_related) {
			spin_lock(&tp->lock);
			tx_ring_free(dev);
			if (tx_died)
				restart_tx();
			spin_unlock(&tp->lock);
		}

		status = read_interrupt_status_reg();
	} while (!(status & error) && more_work_to_be_done);
}
----------------------------------------------------------------------
We now change this to what is shown below to NAPI-enable it:

----------------------------------------------------------------------
static void
netdevice_interrupt(int irq, void *dev_instance, struct pt_regs *regs)
{
	struct net_device *dev = (struct net_device *)dev_instance;
	struct my_private *tp = (struct my_private *)dev->priv;
	u32 status;

	status = read_interrupt_status_reg();
	if (status == 0)
		return;		/* Shared IRQ: not us */
	if (status == 0xffff)
		return;		/* Hot unplug */
	if (status & error)
		do_some_error_handling();

	do {
/************************ start note *********************************/
		acknowledge_ints_ASAP();  /* dont ack rx and rxnobuff here */
/************************ end note *********************************/

		if (status & link_interrupt) {
			spin_lock(&tp->link_lock);
			do_some_link_stat_stuff();
			spin_unlock(&tp->link_lock);
		}

/************************ start note *********************************/
		if ((status & rx_interrupt) || (status & rx_nobuffs)) {
			if (netif_rx_schedule_prep(dev)) {
				/* disable interrupts caused
				 * by arriving packets */
				disable_rx_and_rxnobuff_ints();
				/* tell system we have work to be done. */
				__netif_rx_schedule(dev);
			} else {
				printk("driver bug! interrupt while in poll\n");
				/* FIX by disabling interrupts */
				disable_rx_and_rxnobuff_ints();
			}
		}
/************************ end note *********************************/

		if (status & tx_related) {
			spin_lock(&tp->lock);
			tx_ring_free(dev);
			if (tx_died)
				restart_tx();
			spin_unlock(&tp->lock);
		}

		status = read_interrupt_status_reg();

/************************ start note *********************************/
	} while (!(status & error) && more_work_to_be_done(status));
/************************ end note *********************************/
}
---------------------------------------------------------------------
We note several things from above:

I) Any interrupt source which is caused by arriving packets is now
turned off when it occurs. Depending on the hardware, there could be
several reasons that arriving packets would cause interrupts; these are
the interrupt sources we wish to avoid. The two common ones are:
a) a packet arriving (rxint), and b) a packet arriving and finding no
DMA buffers available (rxnobuff).
This also means that acknowledge_ints_ASAP() will not clear the status
register bits for those two items above; clearing is done in the place
where the proper work is done within NAPI: in poll() and
refill_rx_ring(), discussed further below.
netif_rx_schedule_prep() returns 1 if the device is in a running state
and was successfully added to the core poll list. If we get a zero value,
we can _almost_ assume we are already on the list (rather than not
running; this logic is based on the fact that you shouldn't get an
interrupt if the device is not running). We rectify this by disabling rx
and rxnobuff interrupts.

II) receive_packets(dev) and make_rx_buffs_avail() may have disappeared.
These functionalities are actually still around:
in fact, receive_packets(dev) is very close to my_poll(), and
make_rx_buffs_avail() is invoked from my_poll().
4) Converting receive_packets() to dev->poll()
===============================================

We need to convert the classical D. Becker receive_packets(dev) to
my_poll(). First, the typical receive_packets() below:
-------------------------------------------------------------------
/* this is called by the interrupt handler */
static void receive_packets(struct net_device *dev)
{
	struct my_private *tp = (struct my_private *)dev->priv;
	unsigned char *rx_ring = tp->rx_ring;
	unsigned int cur_rx = tp->cur_rx;
	int entry = cur_rx % RX_RING_SIZE;
	int received = 0;
	int rx_work_limit = tp->dirty_rx + RX_RING_SIZE - tp->cur_rx;

	while (rx_ring_not_empty) {
		u32 rx_status;
		unsigned int rx_size;
		unsigned int pkt_size;
		struct sk_buff *skb;

		/* read size+status of next frame from DMA ring buffer */
		/* the numbers 16 and 4, and ring_offset, are
		   driver-specific examples */
		rx_status = le32_to_cpu(*(u32 *)(rx_ring + ring_offset));
		rx_size = rx_status >> 16;
		pkt_size = rx_size - 4;

		/* process errors */
		if ((rx_size > (MAX_ETH_FRAME_SIZE + 4)) ||
		    (!(rx_status & RxStatusOK))) {
			netdrv_rx_err(rx_status, dev, tp, ioaddr);
			return;
		}

		if (--rx_work_limit < 0)
			break;

		/* grab a skb */
		skb = dev_alloc_skb(pkt_size + 2);
		if (skb) {
			.
			.
			netif_rx(skb);
			.
			.
		} else {	/* OOM */
			/* seems very driver specific ... some just pass
			   whatever is on the ring already. */
		}

		/* move to the next skb on the ring */
		entry = (++cur_rx) % RX_RING_SIZE;
		received++;
	}

	/* store current ring pointer state */
	tp->cur_rx = cur_rx;

	/* Refill the Rx ring buffers if they are needed */
	refill_rx_ring();
	.
	.
}
-------------------------------------------------------------------
We change it to the new one below; note the additional parameter in
the call.

-------------------------------------------------------------------
/* this is called by the network core */
static int my_poll(struct net_device *dev, int *budget)
{
	struct my_private *tp = (struct my_private *)dev->priv;
	unsigned char *rx_ring = tp->rx_ring;
	unsigned int cur_rx = tp->cur_rx;
	int entry = cur_rx % RX_RING_SIZE;
	int received = 0;
	u32 status;
	/* maximum packets to send to the stack */
/************************ note note *********************************/
	int rx_work_limit = dev->quota;
/************************ end note note *********************************/

	do {	/* outer beginning loop starts here */
		clear_rx_status_register_bit();

		while (rx_ring_not_empty) {
			u32 rx_status;
			unsigned int rx_size;
			unsigned int pkt_size;
			struct sk_buff *skb;

			/* read size+status of next frame from DMA ring buffer */
			/* the numbers 16 and 4, and ring_offset, are
			   driver-specific examples */
			rx_status = le32_to_cpu(*(u32 *)(rx_ring + ring_offset));
			rx_size = rx_status >> 16;
			pkt_size = rx_size - 4;

			/* process errors */
			if ((rx_size > (MAX_ETH_FRAME_SIZE + 4)) ||
			    (!(rx_status & RxStatusOK))) {
				netdrv_rx_err(rx_status, dev, tp, ioaddr);
				return 1;
			}

/************************ note note *********************************/
			if (--rx_work_limit < 0) { /* we got packets, but no quota */
				/* store current ring pointer state */
				tp->cur_rx = cur_rx;

				/* Refill the Rx ring buffers if they are needed */
				refill_rx_ring(dev);
				goto not_done;
			}
/********************** end note **********************************/

			/* grab a skb */
			skb = dev_alloc_skb(pkt_size + 2);
			if (skb) {
				.
				.
/************************ note note *********************************/
				netif_receive_skb(skb);
/********************** end note **********************************/
				.
				.
			} else {	/* OOM */
				/* seems very driver specific ... common is to
				   just pass whatever is on the ring already. */
			}

			/* move to the next skb on the ring */
			entry = (++cur_rx) % RX_RING_SIZE;
			received++;
		}

		/* store current ring pointer state */
		tp->cur_rx = cur_rx;

		/* Refill the Rx ring buffers if they are needed */
		refill_rx_ring(dev);

		/* no packets on ring; but new ones can arrive since we last
		   checked */
		status = read_interrupt_status_reg();
		if (!rx_status_is_set) {
			/* If something arrives in this narrow window,
			   an interrupt will be generated */
			goto done;
		}

		/* done! at least that's what it looks like ;->
		   if new packets came in after our last check on the status
		   bits, they'll be caught by the while check, and we go back
		   and clear them since we haven't exceeded our quota */
	} while (rx_status_is_set);

done:
/************************ note note *********************************/
	dev->quota -= received;
	*budget -= received;

	/* If the RX ring is not full, we are out of memory. */
	if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		goto oom;

	/* we are happy/done, no more packets on ring; put us back
	   where we can start processing interrupts again */
	netif_rx_complete(dev);
	enable_rx_and_rxnobuff_ints();

	/* The last op happens after poll completion. Which means the following:
	 * 1. it can race with disabling irqs in the irq handler (which are
	 *    done to schedule polls)
	 * 2. it can race with dis/enabling irqs in other poll threads
	 * 3. if an irq is raised after the beginning of the outer loop
	 *    (marked in the code above), it will be immediately
	 *    triggered here.
	 *
	 * Summarizing: the logic may result in some redundant irqs, both
	 * due to races in masking and due to acking already-processed irqs
	 * too late. The good news: no events are ever lost.
	 */

	return 0;	/* done */

not_done:
	if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
	    tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		refill_rx_ring(dev);

	if (!received) {
		printk("received==0\n");
		received = 1;
	}
	dev->quota -= received;
	*budget -= received;
	return 1;	/* not_done */

oom:
	/* Start timer, stop polling, but do not enable rx interrupts. */
	start_poll_timer(dev);
	return 0;	/* we'll take it from here, so tell the core "done" */
/************************ End note note *********************************/
}
-------------------------------------------------------------------
From the above we note that:

0) rx_work_limit = dev->quota.
1) refill_rx_ring() is in charge of clearing the status bit for rxnobuff
when it does the work (a sketch of it appears right after this list).
2) We have a done and a not_done state.
3) Instead of netif_rx() we call netif_receive_skb() to pass the skb up.
4) We have a new way of handling the OOM condition.
5) A new outer do { } while loop has been added. This serves the purpose
of ensuring that, if a new packet comes in after we are all set and done
and we have not yet exceeded our quota, we continue sending packets up.

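As promised in note 1), here is a hedged sketch of what refill_rx_ring()
might look like for a COW NIC. The field and register names (tp->ioaddr,
MY_STATUS_REG, RX_NOBUF, PKT_BUF_SZ) are hypothetical:

/* Sketch only: replenish rx buffers, then acknowledge the rxnobuff
 * condition -- but only after buffers actually exist again. */
static void refill_rx_ring(struct net_device *dev)
{
	struct my_private *tp = (struct my_private *)dev->priv;

	for (; tp->cur_rx - tp->dirty_rx > 0; tp->dirty_rx++) {
		unsigned int entry = tp->dirty_rx % RX_RING_SIZE;

		if (tp->rx_buffers[entry].skb == NULL) {
			struct sk_buff *skb = dev_alloc_skb(PKT_BUF_SZ);

			if (skb == NULL)
				break;	/* OOM: the caller handles this */
			tp->rx_buffers[entry].skb = skb;
			/* ... map skb->data and hand the descriptor
			 *     back to the NIC ... */
		}
	}
	/* COW ack: clear the rxnobuff status bit now that buffers
	 * are available again */
	writel(RX_NOBUF, tp->ioaddr + MY_STATUS_REG);
}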
-----------------------------------------------------------
The poll timer code will need to do the following:

a)
	if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
	    tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		refill_rx_ring(dev);

	/* If the RX ring is still not full, we are still out of memory.
	   Restart the timer again. Else we re-add ourselves
	   to the master poll list.
	 */
	if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		restart_timer();
	else
		netif_rx_schedule(dev);	/* we are back on the poll list */

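For completeness, a hedged sketch of the start_poll_timer() used in the
OOM path above, using the 2.4-era timer API. The field name
tp->oom_timer and the retry interval are made up for illustration, and
my_poll_timer_handler() is the routine performing step a) above:

/* Sketch only: arm a one-shot timer that retries the refill.
 * A real driver would typically init_timer() once in open(). */
static void start_poll_timer(struct net_device *dev)
{
	struct my_private *tp = (struct my_private *)dev->priv;

	init_timer(&tp->oom_timer);
	tp->oom_timer.function = my_poll_timer_handler;
	tp->oom_timer.data = (unsigned long)dev;
	mod_timer(&tp->oom_timer, jiffies + HZ/10);	/* retry in ~100 ms */
}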
5) dev->close() and dev->suspend() issues
==========================================
The driver writer needn't worry about these; the top net layer takes
care of them.
6) Adding new stats to /proc
=============================
In order to debug some of the new features, we introduce new stats
that need to be collected.
TODO: Fill this in later.
APPENDIX 1: discussion on using ethernet HW FC
==============================================
Most chips with FC only send a pause packet when they run out of Rx
buffers. Since packets are pulled off the DMA ring by a softirq in NAPI,
if the system is slow in grabbing them and we have a high input
rate (faster than the system's capacity to remove packets), then
theoretically there will only be one rx interrupt for all packets during
a given packet storm. Under low load, we might have a single interrupt
per packet.
FC should be programmed to apply in the case when the system can't pull
out packets fast enough, i.e. send a pause only when you run out of rx
buffers.
Note that FC in itself is a good solution, but we have found it to not
be much of a commodity feature (in both NICs and switches), and hence it
falls under the same category as using NIC-based mitigation. Also,
experiments indicate that it is much harder to resolve the resource
allocation issue (aka the lazy receiving that NAPI offers), and hence
quantifying its usefulness proved harder. In any case, FC works even
better with NAPI, but it is not necessary.
APPENDIX 2: the "rotting packet" race-window avoidance scheme
=============================================================

There are two types of associations seen here:

1) status/int which honors level-triggered IRQs

If a status bit for receive or rxnobuff is set and the corresponding
interrupt-enable bit is not on, then no interrupts will be generated.
However, as soon as the "interrupt-enable" bit is unmasked, an immediate
interrupt is generated [assuming the status bit was not turned off].
Generally, the concept of level-triggered IRQs in association with a
status and interrupt-enable CSR register set is used to avoid the race.

If we take the example of the tulip:
"pending work" is indicated by the status bit (CSR5 in the tulip).
The corresponding interrupt-enable bit (CSR7 in the tulip) might be
turned off (but CSR5 will continue to be turned on with new packet
arrivals, even if we clear it the first time).
Very important is the fact that if we turn the interrupt-enable bit on
while the status bit is set, an immediate irq is triggered.

If we cleared the rx ring and proclaimed there was "no more work
to be done", and then went on to do a few other things, then when we
enable interrupts there is a possibility that a new packet might have
sneaked in during this phase. It helps to look at the pseudo-code for
the tulip poll routine:
--------------------------
do {
	ACK;
	while (ring_is_not_empty()) {
		work-work-work
		if quota is exceeded: exit, no touching irq status/mask
	}
	/* No packets, but new ones can arrive while we are doing this */
	CSR5 := read
	if (CSR5 is not set) {
		/* If something arrives in this narrow window here,
		 * where the comments are ;-> an irq will be generated */
		unmask irqs;
		exit poll;
	}
} while (rx_status_is_set);
------------------------
The CSR5 bit of interest is only the rx status.
Looking at the last if statement:
you have just finished grabbing all the packets from the rx ring; you
check if the status bit says there are more packets just in; it says
none; you then enable rx interrupts again. If a new packet just came in
during this check, we are counting on the fact that CSR5 will be set in
that small window of opportunity, so that by re-enabling interrupts we
would actually trigger an interrupt to register the new packet for
processing.

[The above description may be very verbose. If you have better wording
that will make this more understandable, please suggest it.]
2) non-capable hardware

These do not generally respect level-triggered IRQs. Normally,
irqs may be lost while being masked, and the only way to leave poll is
to do a double check for new input after netif_rx_complete() is invoked,
and to re-enable polling (after seeing this new input).
Sample code:
---------
.
.
restart_poll:
	while (ring_is_not_empty()) {
		work-work-work
		if quota is exceeded: exit, not touching irq status/mask
	}
.
.
.
	enable_rx_interrupts()
	netif_rx_complete(dev);
	if (ring_has_new_packet() && netif_rx_reschedule(dev, received)) {
		disable_rx_and_rxnobuff_ints()
		goto restart_poll
	}
---------

Basically, netif_rx_complete() removes us from the poll list, but
because a new packet might come in which would never be noticed (due to
the race), we check again and attempt to re-add ourselves to the poll
list.
--------------------------------------------------------------------

Relevant sites:
===============
ftp://robur.slu.se/pub/Linux/net-development/NAPI/

--------------------------------------------------------------------
TODO: Write net-skeleton.c driver.
-------------------------------------------------------------
Authors:
========
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Jamal Hadi Salim <hadi@cyberus.ca>
Robert Olsson <Robert.Olsson@data.slu.se>

Acknowledgements:
================
People who made this document better:

Lennert Buytenhek <buytenh@gnu.org>
Andrew Morton <akpm@zip.com.au>
Manfred Spraul <manfred@colorfullife.com>
Donald Becker <becker@scyld.com>
Jeff Garzik