irqbalance on HT capable PE4600?

Arjan van de Ven arjanv at redhat.com
Sun Oct 19 04:01:00 CDT 2003


On Sun, Oct 19, 2003 at 08:19:12AM +1000, jason andrade wrote:
> On Sat, 18 Oct 2003, Arjan van de Ven wrote:

> the current network is gigE which is pushing around 100-140Mbit/sec on
> an Intel Pro/1000.  it is anticipated this may need to double over the
> next year.  my own testing of the pro/1000 is that you can achieve full
> line rate gigE on the card (if the machine is doing nothing else) so
> i'm pretty confident it can handle 300-400Mbit/sec in normal operation.
> however i need to do some more work to see why the current
> config spends 70-80% of its time in 'system' - i assume a chunk of that
> is the nfsd but wanted to make sure there weren't other things i'd
> been overlooking previously (like the irq rate)

Reducing the irq rate via mitigation has two main effects:
1) IRQ entry/exit (which can amount to a context switch) is not exactly cheap.
2) Without mitigation, the pattern for networking can be:
   <userspace>
   IRQ entry
   retrieve one packet from the NIC, queue it
   hardIRQ exit
   SoftIRQ context entry
   inspect packet, discover it's a fragment of a UDP packet, keep it on a
			fragment list
   SoftIRQ context exit back to original context
   <userspace, but briefly>
   IRQ entry
   retrieve one packet from the NIC, queue it
   hardIRQ exit
   SoftIRQ context entry
   inspect packet, discover it's a fragment of a UDP packet, find
		it's the second half of the first packet, merge them and send to the app
   SoftIRQ context exit back to original context
   <userspace>

With mitigation the NIC will delay notifying the
OS about new packets for a small amount of time in the hope of getting
multiple packets ready:
   <userspace>
   IRQ entry
   retrieve one packet from the NIC, queue it
   retrieve second packet from the NIC, queue it
   hardIRQ exit
   SoftIRQ context entry
   inspect packet, discover it's a fragment of a UDP packet, keep it on a
			fragment list
   inspect the second packet, merge it with the first one and queue the
 			complete UDP packet to userspace
   SoftIRQ context exit back to original context
   <userspace>

This batching is clearly more efficient (far fewer switches), and in
addition to the explicit steps shown above it also improves cache
locality considerably.
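
To make the "far fewer switches" point concrete, here is a tiny userspace
toy model (not kernel code; the 1000-frame burst, the two-fragments-per-packet
assumption and the "3 transitions per interrupt" accounting are all just
invented for illustration) that counts the transitions needed at different
interrupt batch sizes:

/* irq_batch_model.c -- toy model only, not kernel code: counts the context
 * transitions needed to receive a burst of two-fragment UDP packets when
 * the NIC raises one interrupt per 'batch' frames.                        */
#include <stdio.h>

int main(void)
{
    int frames = 1000;            /* e.g. 500 UDP packets, 2 fragments each */
    int batch;

    for (batch = 1; batch <= 4; batch++) {
        int irqs = (frames + batch - 1) / batch;
        /* per interrupt: hardIRQ entry, one softIRQ pass, and one return
         * to the interrupted (userspace) context                          */
        int transitions = irqs * 3;
        printf("batch=%d  interrupts=%d  transitions=%d\n",
               batch, irqs, transitions);
    }
    return 0;
}

batch=2 needs half the interrupts and half the transitions of batch=1,
which is exactly the difference between the two traces above.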

About system time: you can't assume system time will scale linearly with
network rate, because of batching; the higher the system time, the better
things will get batched (and thus the far lower additional cost of
an extra packet). This can go to the point where even at
100% utilisation you can keep increasing the load for a while.
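
As a back-of-the-envelope sketch of why (the two cost constants below are
made up, only the shape of the curve matters): if every interrupt has a
fixed overhead and every packet adds a smaller amount on top, the cost per
packet is roughly overhead/batch + per-packet, so the marginal cost of an
extra packet keeps dropping as the load, and with it the batch size, rises:

/* percpkt_cost.c -- illustrative only; the two cost constants are invented,
 * only the shape of the curve matters.                                     */
#include <stdio.h>

int main(void)
{
    double irq_overhead = 10.0;   /* fixed cost per interrupt (arbitrary units) */
    double per_packet   = 2.0;    /* extra cost per packet on top of that       */
    int batch;

    for (batch = 1; batch <= 8; batch *= 2)
        printf("batch=%d  cost per packet=%.2f\n",
               batch, irq_overhead / batch + per_packet);
    return 0;
}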



> > kernel-utils rpm and make sure the irqbalance service is started during
> > boot (default for RHL9, the RHL8 erratum defaults to off)
> 
> (i assume the values below have wrapped)
> 
>   8:17am  up 59 days, 18:59,  5 users,  load average: 11.25, 11.44, 12.11
> 
>            CPU0       CPU1
>   0: 1322107066 1322871970    IO-APIC-edge  timer
>   1:        273        264    IO-APIC-edge  keyboard
>   2:          0          0          XT-PIC  cascade
>   8:          0          1    IO-APIC-edge  rtc
>  16:  253487196  253651272   IO-APIC-level  eth0
>  20:  799452491  799875894   IO-APIC-level  qla2200
>  22: 2366882687 2368370608   IO-APIC-level  qla2200
>  24: 1316729851 1317548294   IO-APIC-level  qla2200
>  26: 1147601730 1148380357   IO-APIC-level  eth2
>  28:  959647645 1005132173   IO-APIC-level  eth1
>  30:       2022       1991   IO-APIC-level  aic7xxx
>  31:  518955144  519175867   IO-APIC-level  aacraid
> NMI:          0          0

This behavior may be the least optimal possible ;)
If the irqs alternate between CPUs, it's quite likely that fragments of
your UDP packets (NFS sends 4KB or 8KB UDP packets over Ethernet with a
1500-byte MTU, so those UDP packets get split up) end up on alternating
CPUs, which means that in order to combine them, half the
packet (well, technically only the metadata) needs to be transported
from one CPU to the other... which is relatively expensive.
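
For reference, irqbalance essentially works by writing CPU masks into
/proc/irq/<n>/smp_affinity; you can pin an interrupt by hand the same way.
A minimal sketch (the program name, and the IRQ/mask pair in the usage
example, are just picked for illustration from your table above; error
handling is kept short):

/* set_irq_affinity.c -- minimal sketch, not irqbalance itself: writes a hex
 * CPU bitmask into /proc/irq/<irq>/smp_affinity so that interrupt is only
 * serviced by the CPUs in the mask.  Needs root.
 * Example: ./set_irq_affinity 28 2    (IRQ 28 -> CPU1 only)                */
#include <stdio.h>

int main(int argc, char **argv)
{
    char path[64];
    FILE *f;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <irq> <hex cpu mask>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/irq/%s/smp_affinity", argv[1]);
    f = fopen(path, "w");
    if (!f) {
        perror(path);
        return 1;
    }
    fprintf(f, "%s\n", argv[2]);            /* e.g. "1" = CPU0, "2" = CPU1 */
    fclose(f);
    return 0;
}

Running the irqbalance service from kernel-utils gets you roughly the same
effect without hand-tuning individual interrupts.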

Greetings,
    Arjan van de Ven







