David Rusenko
2005-09-28 05:16:03 UTC
Hello Alexandre and friends,
I am experiencing a very weird issue with keepalived 1.1.11 and
ksoftirqd. This issue did not crop up in testing, and has only
manifested once the servers were put in production (!).
In the interest of time, for you and others, I have included the details
of the Keepalived setup, and server setup, at the bottom of the message.
Here is what happens: when keepalived is started on the second director,
ksoftirqd/0 slowly begins to use more and more CPU, and the
load slowly begins to rise. If keepalived is not shut off on the second
server, both servers eventually become unresponsive and "sweat it out"
with loads of 100+ for about 45 minutes, until the load comes back down.
During the periods of high load, services become unresponsive and don't
reply to queries. The servers operate fine if keepalived is only running
on the Master Director.
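
To put a number on the ksoftirqd/0 growth described above, a minimal Python sketch follows. It compares two snapshots of /proc/stat and reports what fraction of CPU time went to softirq processing in between. It assumes the 2.6-style field order on the aggregate "cpu" line (user, nice, system, idle, iowait, irq, softirq); the function names and the 5-second interval are illustrative, not part of keepalived or any existing tool.

```python
import time


def cpu_fields(stat_text):
    """Return the numeric fields of the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return [int(value) for value in parts[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def softirq_share(before, after):
    """Fraction of CPU time spent in softirq context between two snapshots."""
    first, second = cpu_fields(before), cpu_fields(after)
    if len(first) < 7 or len(second) < 7:
        raise ValueError("expected at least 7 fields on the 'cpu' line")
    deltas = [b - a for a, b in zip(first, second)]
    total = sum(deltas)
    # Assumed field order: user nice system idle iowait irq softirq ...
    return deltas[6] / total if total else 0.0


def measure(interval=5.0, path="/proc/stat"):
    """Take two snapshots `interval` seconds apart and return the softirq share."""
    with open(path) as handle:
        before = handle.read()
    time.sleep(interval)
    with open(path) as handle:
        after = handle.read()
    return softirq_share(before, after)
```

Calling measure() once with keepalived stopped on the second director and once with it running would show whether the softirq share really climbs over time, independent of what top reports for ksoftirqd/0.
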
advert_int is set to 1 -- could this be a problem? I've included the
vrrp_instance definition at the bottom of this message as well.
All interrupt counts look normal for systems with heavy disk and network
usage, except perhaps the "timer" interrupt. Does Keepalived use any
soft interrupts? I've included a listing of /proc/interrupts at the
bottom of the message.
Is there any other information that might be useful in debugging the
problem? I have thought of emailing the linux-kernel mailing list, but
would prefer to ask here first.
Thanks in advance for your help on this issue, and for your great work
on Keepalived. I look forward to hearing from you soon.
Sincerely,
David Rusenko
President/CEO
Aderes, Inc - www.aderes.net
KEEPALIVED SETUP
Two servers, both configured as Real Servers and as Directors. Localnode
is used to point services from the director back to itself, and LVS-DR
is used for the services. With the new 2.6 kernel, and a recent
Keepalived code base, the VIP is added to the NIC on MASTER transition,
and a gARP is sent out -- all is fine. No ARPs are sent out by the
backup director, as it does not have the IP address on its interface.
All is well at this point, and the setup worked perfectly in testing.
SERVER SETUP
The systems in question are IBM xSeries 305 servers, with "Broadcom
Corporation NetXtreme BCM5703 Gigabit Ethernet (rev 02)" network cards.
Both have sufficient RAM (1.5+ GB) and CPU (2.4 GHz). They are both
running SuSE 9.1 Linux.
VRRP_INSTANCE DEFINITION
vrrp_instance VI_1 {
    state MASTER
    interface eth1
    virtual_router_id 51
    priority 200
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass somepassword
    }
    virtual_ipaddress {
        x.x.x.x
    }
}
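
On the advert_int question, a small Python sketch of the timer arithmetic from the VRRPv2 specification (RFC 3768) may help. It assumes keepalived follows the RFC formulas for skew time and master-down interval, which has not been verified against the 1.1.11 source. The backup priority of 100 in the usage note is hypothetical, since only the MASTER definition is shown above.

```python
def skew_time(priority):
    """Skew_Time in seconds, per RFC 3768: (256 - Priority) / 256."""
    if not 1 <= priority <= 255:
        raise ValueError("priority must be between 1 and 255")
    return (256 - priority) / 256.0


def master_down_interval(advert_int, priority):
    """Seconds a backup waits without advertisements before taking over.

    Per RFC 3768: 3 * Advertisement_Interval + Skew_Time, where the
    priority is that of the router running the timer (the backup).
    """
    return 3 * advert_int + skew_time(priority)
```

With advert_int 1, the master sends one advertisement per second, which is also the default interval in the RFC. A backup with a hypothetical priority of 100 would wait master_down_interval(1, 100), about 3.6 seconds, before declaring the master down. On paper that is a very low packet rate, so the interval by itself would not obviously explain the load.
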
LISTING OF /proc/interrupts
# cat /proc/interrupts
           CPU0
  0: 1250767820          XT-PIC  timer
  2:          0          XT-PIC  cascade
  5:   99823249          XT-PIC  eth1
  7:   42909170          XT-PIC  eth0
  8:          2          XT-PIC  rtc
  9:         14          XT-PIC  acpi
 10:          0          XT-PIC  ohci_hcd
 14:   34660278          XT-PIC  ide0
NMI:       5844
LOC:          0
ERR:          0
MIS:          0
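
The counters above are totals since boot, so a single listing says little about the current rate. A minimal Python sketch for turning two such listings into per-second rates follows; the function names are illustrative, and the parser assumes the layout shown above (an IRQ label, a colon, then one count per CPU followed by the controller and device names).

```python
def parse_interrupts(text):
    """Map each IRQ label (e.g. '0', '5', 'NMI') to its count summed over CPUs."""
    counts = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # header line such as 'CPU0' contains no colon
        total = 0
        for token in rest.split():
            if not token.isdigit():
                break  # reached the controller/device name columns
            total += int(token)
        counts[label.strip()] = total
    return counts


def interrupt_rates(before, after, seconds):
    """Interrupts per second for every IRQ present in both snapshots."""
    first, second = parse_interrupts(before), parse_interrupts(after)
    return {
        irq: (second[irq] - first[irq]) / seconds
        for irq in second
        if irq in first
    }
```

Feeding it two copies of the listing taken a known number of seconds apart, once with keepalived running on both directors and once without, would show whether the eth1 or timer rate actually changes. As a rough sanity check, if the kernel ticks at HZ=1000 (an assumption; this was the usual i386 setting for 2.6 kernels of this era), a timer count of 1250767820 corresponds to about 14.5 days of uptime and may not be abnormal in itself.
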