[Keepalived-devel] ksoftirqd/0 100% Processor Usage

Discussion:

David Rusenko

2005-09-28 05:16:03 UTC

Hello Alexandre and friends,

I am experiencing a very weird issue with keepalived 1.1.11 and
ksoftirqd. This issue did not crop up in testing, and has only
manifested once the servers were put in production (!).

In the interest of time, for you and others, I have included the details
of the Keepalived setup, and server setup, at the bottom of the message.

Here is what happens: When keepalived is started on the second director,
ksoftirqd/0 slowly begins to use more and more CPU, and the
load slowly begins to rise. If keepalived is not shut off on the second
server, the two will eventually become unresponsive, and "sweat it out"
with loads of 100+ for about 45 mins, until the load comes back down.
During the periods of high load, services become unresponsive and don't
reply to queries. The servers operate fine if keepalived is only running
on the Master Director.

advert_int is set to 1 -- could this be a problem? I've included the
vrrp_instance definition at the bottom of this message as well.

All interrupts look good for systems with disk usage and high network
usage, except perhaps the "timer" interrupt. Does Keepalived use any
soft interrupts? I've included a listing of /proc/interrupts at the
bottom of the message.

Is there any other information that might be useful in debugging the
problem? I have thought of emailing the linux-kernel mailing list, but
would prefer to ask here first.

Thanks in advance for your help on this issue, and for your great work
on Keepalived. I look forward to hearing from you soon.

Sincerely,

David Rusenko
President/CEO
Aderes, Inc - www.aderes.net

KEEPALIVED SETUP
2 servers both configured as Real Servers, and as Directors. Localnode
is used to point services from the director back to itself, and LVS-DR
is used for the services. With the new 2.6 kernel, and a recent
Keepalived code base, the VIP is added to the NIC on MASTER transition,
and a gARP is sent out -- all is fine. No ARPs are sent out by the
backup director, as it does not have the IP address on its interface.
All is well at this point, and the setup worked perfectly in testing.

SERVER SETUP
The systems in question are IBM xSeries 305 servers, with "Broadcom
Corporation NetXtreme BCM5703 Gigabit Ethernet (rev 02)" network cards.
Both have sufficient RAM (1.5+ GB) and CPU (2.4 GHz). They are both
running SuSE 9.1 Linux.

VRRP_INSTANCE DEFINITION
vrrp_instance VI_1 {
    state MASTER
    interface eth1
    virtual_router_id 51
    priority 200
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass somepassword
    }
    virtual_ipaddress {
        x.x.x.x
    }
}

LISTING OF /proc/interrupts
# cat /proc/interrupts
           CPU0
  0: 1250767820    XT-PIC  timer
  2:          0    XT-PIC  cascade
  5:   99823249    XT-PIC  eth1
  7:   42909170    XT-PIC  eth0
  8:          2    XT-PIC  rtc
  9:         14    XT-PIC  acpi
 10:          0    XT-PIC  ohci_hcd
 14:   34660278    XT-PIC  ide0
NMI:       5844
LOC:          0
ERR:          0
MIS:          0

David Rusenko

2005-10-11 20:43:07 UTC

Dear Alexandre et al:

I haven't heard back from you on this subject, so I am considering
taking it to the linux-kernel mailing list. I understand you may not
have any real idea of what is going on, but if you have any quick tips,
ideas, or suggestions, it would all be very helpful.

Thanks again,

David Rusenko
Aderes, Inc - www.aderes.net

Alexandre Cassen

2005-10-12 06:03:28 UTC

Hi David,

I am firmly convinced that this is due to a system misconfiguration. I
have never received any reports of ksoftirqd issues.

I would recommend profiling your kernel to trace the source of the
trouble. I would also recommend trying different hardware.

Best regards,
Alexandre

--
Alexandre Cassen <***@freebox.fr>
Freebox SA

David Rusenko

2006-02-20 01:22:02 UTC

Hello,

I've since found a solution to the problem mentioned in this thread, so
I thought I'd share it with the rest of the list for completeness.

Essentially, what was happening can be described as a "packet storm" or
infinite redirection loop: one director would forward a packet to the
other, which would forward it back to the original, and so on. Whenever
this situation arose, the loop caused the high load.

The solution is to create a script to be called by keepalived on state
change, as described in this message [
http://sourceforge.net/mailarchive/message.php?msg_id=12901268 ]. This
backs up and flushes the ipvs table when a director enters BACKUP state,
to prevent the infinite redirection loop problem.
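
A minimal sketch of what such a state-change script could look like is shown
here. It is not the script from the linked message, which is not reproduced in
this thread. The script path, the RULES_FILE location, and the function names
are hypothetical, and restoring the saved table on MASTER is an assumption
rather than something stated above. It also assumes keepalived calls a generic
"notify" script with the instance type, the instance name, and the new state
as arguments.

```shell
#!/bin/sh
# Hypothetical keepalived notify script: save and flush the IPVS table on
# BACKUP, reload it on MASTER. Assumed wiring inside the vrrp_instance block:
#     notify /usr/local/bin/ipvs-notify.sh     (hypothetical path)
# Assumed arguments: $1 = GROUP or INSTANCE, $2 = name, $3 = new state.

IPVSADM="${IPVSADM:-ipvsadm}"                    # overridable for dry runs
RULES_FILE="${RULES_FILE:-/var/run/ipvs.rules}"  # hypothetical backup file

# Save the current IPVS table, then clear it, so that a BACKUP director
# no longer forwards packets back to its peer.
enter_backup() {
    "$IPVSADM" -S -n > "$RULES_FILE" && "$IPVSADM" -C
}

# Assumption: on promotion, reload the table saved earlier, if there is one.
enter_master() {
    if [ -s "$RULES_FILE" ]; then
        "$IPVSADM" -R < "$RULES_FILE"
    fi
}

case "${3:-}" in
    BACKUP) enter_backup ;;
    MASTER) enter_master ;;
    *)      : ;;  # FAULT or no argument: leave the table untouched
esac
```

If keepalived itself manages the virtual_server entries, reloading a saved
table on MASTER may be unnecessary or may conflict with it, so that half of
the sketch should be tested before use.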

Thanks again Alexandre for this great software!

David Rusenko
Aderes, Inc. - www.aderes.net

Francois JEANMOUGIN

2005-10-12 06:12:44 UTC

Post by Alexandre Cassen
Hi David,
I am firmly convinced that this is due to a system misconfiguration. I
have never received any reports of ksoftirqd issues.
I would recommend profiling your kernel to trace the source of the
trouble. I would also recommend trying different hardware.

Post by David Rusenko
SERVER SETUP
The systems in question are IBM xSeries 305 servers, with "Broadcom
Corporation NetXtreme BCM5703 Gigabit Ethernet (rev 02)" network cards.
Both have sufficient RAM (1.5+ GB) and CPU (2.4 Ghz). They are both
running SuSE 9.1 Linux.

The x305 is quite a good server, a little oversized for an LVS director.
Anyway, always be careful with Broadcom NICs. You find them in HP and IBM
servers, but they are not always well supported. Sometimes I had to use the
official bcm5700 driver from Broadcom, sometimes an older one from HP, and
sometimes the official tg3 driver. Even if in theory they all share the same
code, you can get errors if you do not use the right one. To me, ksoftirqd
using CPU clearly points to a hardware problem.

François.

j***@sitadelle.com

2005-10-12 08:23:15 UTC

Hi David,

Post by David Rusenko
I am experiencing a very weird issue with keepalived 1.1.11 and
ksoftirqd. This issue did not crop up in testing, and has only
manifested once the servers were put in production (!).

I ran into the same problem once. It happened only in production
because Netfilter was generating logs that were sent to the console,
which was not a CRT but a serial link. Given the limited bandwidth
of a serial link, ksoftirqd ate the whole CPU.
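
One way to check for this situation is to compare the kernel's console
loglevel with the level of the messages being logged. The sketch below is an
illustration, not something taken from the setup described above. The function
name reaches_console is hypothetical, and it assumes the logs use the warning
level (4), which is the default for the iptables LOG target.

```shell
#!/bin/sh
# Sketch: would kernel messages of a given level be printed to the console?
# The first field of /proc/sys/kernel/printk is the console loglevel;
# messages with a numerically lower level are written to the console.

# usage: reaches_console LEVEL [PRINTK_FILE]
reaches_console() {
    level="$1"
    file="${2:-/proc/sys/kernel/printk}"
    read -r console_level _rest < "$file" || return 2
    [ "$level" -lt "$console_level" ]
}

# Assumption: the Netfilter logs are emitted at level 4 (warning).
if reaches_console 4 2>/dev/null; then
    echo "warning-level kernel messages reach the console; consider 'dmesg -n 1'"
fi
```

Lowering the console loglevel with "dmesg -n 1" keeps such messages in the
kernel log while no longer pushing them through the slow serial line.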

Regards,

--
Jeremie LE HEN aka TtZ/TataZ jeremie.le-***@sitadelle.com
***@sitadelle.com
Q: Because it reverses the logical flow of conversation.
A: Why is putting a reply at the top of the message frowned upon?