How to analyze kernel Oops dump

Wed Feb 6 14:44:16 UTC 2013

On 5 February 2013 13:27, Manavendra Nath Manav <mnm.kernel at gmail.com> wrote:
> I am running Linux 3.4.0 on embedded ARM target and getting following
> Oops every-time. I am not able to pin-point the reason for crash and
> which driver module triggered it. How can I decode the values in the
> registers at the time of crash. It's showing all in hex.
>
> [  492.713897] ------------[ cut here ]------------
> [  492.718841] WARNING: at mm/slub.c:3415 ksize+0x70/0xc4()
> [  492.725311] ---[ end trace 90a5ae2bdb3ab657 ]---
> [  492.915618] ------------[ cut here ]------------
> [  492.920593] WARNING: at mm/slub.c:3415 ksize+0x70/0xc4()
> [  492.927032] ---[ end trace 90a5ae2bdb3ab658 ]---
> [  493.113464] Unable to handle kernel paging request at virtual
> address f6b9f777
<snip>
> [  494.068664] Backtrace:
> [  494.071289] [<80109434>] (__kmalloc_track_caller+0x0/0x1ec) from
> [<80335ec0>] (__alloc_skb+0x60/0xfc)
> [  494.081085] [<80335e60>] (__alloc_skb+0x0/0xfc) from [<80336530>]
> (__netdev_alloc_skb+0x2c/0x54)
> [  494.090423] [<80336504>] (__netdev_alloc_skb+0x0/0x54) from
> [<7f078788>] (stmmac_poll+0x590/0x794 [stmmac])
> [  494.100738]  r4:ed0b84c0 r3:00000000
> [  494.104553] [<7f0781f8>] (stmmac_poll+0x0/0x794 [stmmac]) from
> [<8033f23c>] (net_rx_action+0x88/0x1f0)
> [  494.114440] [<8033f1b4>] (net_rx_action+0x0/0x1f0) from
> [<80045fb4>] (__do_softirq+0x12c/0x260)
> [  494.123657] [<80045e88>] (__do_softirq+0x0/0x260) from [<8004659c>]
> (irq_exit+0x58/0xb0)
> [  494.132263] [<80046544>] (irq_exit+0x0/0xb0) from [<8000fa08>]
> (handle_IRQ+0x8c/0xc8)
> [  494.140563]  r4:00000078 r3:0000020c
> [  494.144378] [<8000f97c>] (handle_IRQ+0x0/0xc8) from [<80008658>]
> (gic_handle_irq+0x48/0x6c)
> [  494.153228]  r5:80569f40 r4:fa212000
> [  494.157043] [<80008610>] (gic_handle_irq+0x0/0x6c) from
> [<8000e600>] (__irq_svc+0x40/0x70)
<snip>

I have experienced such errors before on embedded platforms. The
following may not be the problem at all, but it is worth ruling out.

One thing to bear in mind is that memory errors such as these
(especially on embedded platforms) can be caused by bad memory
reads/writes, usually due to hardware bugs or incorrectly set up
memory chips.
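
As a rough first check, a fill-and-verify pattern test can be run from
user space. The following is only a minimal sketch, assuming Python is
available on the target; the function name, the patterns and the 16 MiB
buffer size are my own choices for illustration. It only touches
whatever pages the allocator hands out and goes through the CPU caches,
so a clean run proves very little; a dedicated tool such as memtester,
or a test run from the bootloader, is far more thorough.

```python
def pattern_test(size_bytes, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Fill a buffer with each pattern, read it back, report mismatches.

    Returns a list of (offset, expected, actual) tuples.
    The pattern values are illustrative, not a standard set.
    """
    buf = bytearray(size_bytes)
    mismatches = []
    for pattern in patterns:
        expected = bytes([pattern]) * size_bytes
        buf[:] = expected              # write the pattern
        if buf != expected:            # read back and compare
            mismatches.extend(
                (i, pattern, actual)
                for i, actual in enumerate(buf)
                if actual != pattern
            )
    return mismatches


if __name__ == "__main__":
    bad = pattern_test(16 * 1024 * 1024)  # 16 MiB, an arbitrary size
    print("mismatches:", len(bad))
```

Any non-zero mismatch count would point strongly at the hardware or
the memory controller setup rather than at a driver.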

Because network devices can use a lot of memory caching in the
background and perform a large number of reads/writes, statistically
they are usually the first to trip over such issues (the backtrace
above runs through the skb allocation path called from stmmac_poll,
which fits that pattern). If the hardware platform is new or largely
untested, I would double-check the memory setup.
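
On the question of decoding the backtrace itself: each entry has the
form [<address>] (symbol+0xoffset/0xsize [module]), and the
symbol+offset pair can be handed to gdb to find the source line. Below
is a small sketch that pulls those fields out of a mail-quoted dump and
prints a gdb command for each entry. The helper names are hypothetical,
and it assumes you have a vmlinux and stmmac.ko built with debug info
from the same build as the running kernel. On a cross-compiled ARM
target you would use the gdb from your toolchain rather than plain gdb.

```python
import re

# One backtrace entry, e.g. "[<7f078788>] (stmmac_poll+0x590/0x794 [stmmac])"
FRAME_RE = re.compile(
    r"\[<([0-9a-f]+)>\]\s*"                     # address
    r"\(([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"  # symbol+offset/size
    r"(?:\s+\[(\w+)\])?\)"                      # optional [module]
)


def parse_frames(text):
    """Return one dict per backtrace entry found in a (quoted) dump."""
    # Strip mail quoting and undo the line wrapping before matching.
    flat = " ".join(line.lstrip("> ").strip() for line in text.splitlines())
    return [
        {
            "addr": int(addr, 16),
            "symbol": sym,
            "offset": int(off, 16),
            "size": int(size, 16),
            "module": mod or None,
        }
        for addr, sym, off, size, mod in FRAME_RE.findall(flat)
    ]


def gdb_hint(frame):
    """Build a gdb command line that lists the source at symbol+offset."""
    # Module code is looked up in its .ko file, everything else in vmlinux.
    target = frame["module"] + ".ko" if frame["module"] else "vmlinux"
    return "gdb %s -ex 'list *(%s+0x%x)'" % (
        target, frame["symbol"], frame["offset"])


SAMPLE = """\
> [  494.090423] [<80336504>] (__netdev_alloc_skb+0x0/0x54) from
> [<7f078788>] (stmmac_poll+0x590/0x794 [stmmac])
"""

for frame in parse_frames(SAMPLE):
    print(gdb_hint(frame))
```

The entry after "from" is the caller, so stmmac_poll+0x590 is the point
in the driver that asked for the new skb.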

Cheers,

Mark