On 6/15/2014 3:39 PM, Mark van der Meulen wrote:
> Hi List,
>
> I'm wondering if anyone can help me with this problem, or at least point
> me in the direction of where to start looking. I have FreeBSD 9 based
> servers which are crashing every 4-10 days and producing crash dumps
> similar to this one: http://pastebin.com/F82Jc08C
>
> All crash dumps seem to involve the netgraph code, and the current process
> is always ng_queueX.
>
> In summary, we have 4 x FreeBSD servers running as LNS (MPD5) for around
> 2000 subscribers, with 3 of the servers running a modified version of
> BSDRP and the fourth running a FreeBSD 9 install built from what I thought
> was the latest stable kernel source, since I fetched it from stable/9;
> however it shows up as 9.3-BETA in uname (the linked crash dump is from
> that server).
>
> 3 x LNS running modified BSDRP: DELL PowerEdge 2950, 2 x Xeon E5320, 4GB
> RAM, igb Quad Port NIC in LAGG, Quagga, MPD5, IPFW for Host Access
> Control, NTPD, BSNMPD
> 1 x LNS running latest FreeBSD 9 code: HP ProLiant DL380, 2 x Xeon X5465,
> 36GB RAM, em Quad Port NIC in LAGG, BIRD, MPD5, IPFW for Host Access
> Control, NTPD, BSNMPD
>
> The reason I built the fresh server on FreeBSD 9 is that I cannot save
> crash dumps for BSDRP easily. In short, the problem is this: servers with
> 10-50 clients will run indefinitely (as long as we have had them, which is
> probably about 1.5 years) without errors and serve clients fine, however
> any with over 300 clients appear to stay online for only 4-10 days at most
> before crashing and rebooting. I have attached the crash file from the
> latest crash on the LNS running the latest FreeBSD 9 code, however I am
> unsure what to do with it and where to look.
>
> When these devices crash they are often pushing in excess of 200Mbps
> (anywhere between 200Mbps and 450Mbps) with very little load (3-4.5 on
> the first three, less than 2 on the fourth).
>
> Things I've done to attempt resolution:
>
> - Replaced bce network cards with em network cards. This produced far
> fewer errors on the interfaces (many before, none now) and I think let the
> machines stay up longer between reboots; before, it would happen up to
> once a day.
> - Replaced em network cards with igb network cards. All this did was lower
> load and give us a little more time between reboots.
> - Tried an implementation using FreeBSD 10 (this lasted less than 4 hours
> under load before rebooting)
> - Replaced memory
> - Increased memory on LNS4 to 36GB.
> - Various kernel rebuilds
> - Tweaked various kernel settings. This appears to have helped a little
> and given us more time between reboots.
> - Disabled IPv6
> - Disabled IPFW
> - Disabled BSNMPD
> - Disabled Netflow
> - Tried versions 5.6 and 5.7 of MPD5
>
> Anyone able to help me work out what the crash dump means? It only happens
> on servers running MPD5 (e.g. the exact same boxes with the exact same
> code pushing 800Mbps+ of routing have no crashes), and I can see the crash
> relates to netgraph, however I am unsure where to go from there...
>
> Thanks,
>
> Mark
>
>
> Relevant Current Settings:
>
> net.inet.ip.fastforwarding=1
> net.inet.ip.fw.default_to_accept=1
> net.bpf.zerocopy_enable=1
> net.inet.raw.maxdgram=16384
> net.inet.raw.recvspace=16384
> hw.intr_storm_threshold=64000
> net.inet.ip.fastforwarding=1
> net.inet.ip.fw.default_to_accept=1
> net.inet.ip.intr_queue_maxlen=10240
> net.inet.ip.redirect=0
> net.inet.ip.sourceroute=0
> net.inet.ip.rtexpire=2
> net.inet.ip.rtminexpire=2
> net.inet.ip.rtmaxcache=256
> net.inet.ip.accept_sourceroute=0
> net.inet.ip.process_options=0
> net.inet.icmp.log_redirect=0
> net.inet.icmp.drop_redirect=1
> net.inet.tcp.drop_synfin=1
> net.inet.tcp.blackhole=2
> net.inet.tcp.sendbuf_max=16777216
> net.inet.tcp.recvbuf_max=16777216
> net.inet.tcp.sendbuf_auto=1
> net.inet.tcp.recvbuf_auto=1
> net.inet.udp.recvspace=262144
> net.inet.udp.blackhole=0
> net.inet.udp.maxdgram=57344
> net.route.netisr_maxqlen=4096
> net.local.stream.recvspace=65536
> net.local.stream.sendspace=65536
> net.graph.maxdata=65536
> net.graph.maxalloc=65536
> net.graph.maxdgram=2096000
> net.graph.recvspace=2096000
> kern.ipc.somaxconn=32768
> kern.ipc.nmbclusters=524288
> kern.ipc.maxsockbuf=26214400
> kern.ipc.shmmax="2147483648"
> kern.ipc.nmbjumbop="53200"
> kern.ipc.maxpipekva="536870912"
> kern.random.sys.harvest.ethernet="0"
> kern.random.sys.harvest.interrupt="0"
> vm.kmem_size="4096M" # Only on box with over 12G RAM. Otherwise 2G.
>
>
> vm.kmem_size_max="8192M" # Only on box with over 12G RAM.
> hw.igb.rxd="4096"
> hw.igb.txd="4096"
> hw.em.rxd="4096"
> hw.em.txd="4096"
> hw.igb.max_interrupt_rate="32000"
>
> hw.igb.rx_process_limit="4096"
> hw.em.rx_process_limit="500"
> net.link.ifqmaxlen="20480"
> net.isr.dispatch="direct"
> net.isr.direct_force="1"
> net.isr.direct="1"
> net.isr.maxthreads="8"
> net.isr.numthreads="4"
> net.isr.bindthreads="1"
> net.isr.maxqlimit="20480"
> net.isr.defaultqlimit="8192"
>
>
The following workarounds have worked for some people.
They may not solve your problem, but they are worth a try:
1. Increase the netgraph limits (a quick way to check the current values
and zone usage is sketched after this list):
net.graph.maxdata=262140    # /boot/loader.conf
net.graph.maxalloc=262140   # /boot/loader.conf
2. Remove the FLOWTABLE kernel option.
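As a rough sanity check before raising the limits, something like the
following (a sketch; zone names may differ slightly between releases)
should show whether the netgraph item zones are near exhaustion, and
whether your running kernel was built with FLOWTABLE:

    # current netgraph limits and per-zone usage
    sysctl net.graph.maxdata net.graph.maxalloc
    vmstat -z | grep -i netgraph

    # if this prints any net.flowtable.* OIDs, FLOWTABLE is compiled in
    sysctl net.flowtable

If the FAIL column in the vmstat -z output is non-zero for the NetGraph
zones, the current limits are very likely too low for your load.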
It would also help if you put your kernel and core dump somewhere for
download so we can have a closer look at the panic trace.
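If you already have savecore(8) dumps on the FreeBSD 9 box, you can also
pull the panic backtrace yourself; roughly (a sketch assuming the default
dump location, adjust the vmcore number and kernel path for your system):

    # open the crash dump against the kernel that produced it
    kgdb /boot/kernel/kernel /var/crash/vmcore.0

    # then at the (kgdb) prompt:
    bt            # backtrace of the panicking thread
    info threads  # other threads, in case the panic is elsewhere

The /var/crash/core.txt.N file that crashinfo(8) normally writes after the
reboot already contains most of this and is usually small enough to attach
to a mail.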
--
Best regards.
Hooman Fazaeli
_______________________________________________
freebsd-bugs@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscribe@freebsd.org"