How to Troubleshoot a Hard Kernel Panic
Hard panic symptoms:
- Machine is completely locked up and unusable.
- Num Lock / Caps Lock / Scroll Lock keys usually blink.
- If in console mode, a dump is displayed on the monitor (including the phrase "Aiee").
- Similar to the Windows Blue Screen.
Hard panic causes:
- The most common cause of a hard kernel panic is a driver crashing within an interrupt handler, usually because it tried to access a null pointer there.
- When this happens, that driver cannot handle any new interrupts, and eventually the system crashes.
- This is not exclusive to Dialogic drivers.
Hard panic information to collect:
- Depending on the nature of the panic, the kernel will log all the information it can before locking up.
- Since a kernel panic is a drastic failure, it is uncertain how much information will be logged.
- Below are key pieces of information to collect. It is important to collect as many of these as possible, but there is no guarantee that all of them will be available, especially the first time a panic is seen.
- /var/log/messages: sometimes the entire kernel panic stack trace will be logged there.
- Application / library logs (RTF, cheetah, etc.): these may show what was happening just prior to the panic, or how to reproduce it.
- Screen dump from console. Since the OS is locked, you cannot cut and paste from the screen. There are two common ways to get this info:
- Digital picture of the screen (preferred, since it's quicker and easier).
- Copying the screen with pen and paper, or typing it into another computer.
- If the dump is not available either in /var/log/messages or on the screen, follow these tips to get a dump:
- If in GUI mode, switch to full console mode. No dump info is passed to the GUI (not even to a GUI shell).
- Make sure the screen stays on during the full test run. If a screen saver kicks in, the screen won't return after a kernel panic. Use these settings to ensure the screen stays on:
setterm -blank 0
setterm -powerdown 0
setvesablank off
- From console, copy dump from screen (see above).
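For the /var/log/messages check described above, a minimal Python sketch can pull out any logged panic together with the lines just before it, where the stack trace usually sits. The marker strings come from the panic output quoted in this article. The function names and the `context` parameter are hypothetical, chosen for illustration, and the number of context lines is a guess to tune.

```python
import re
from pathlib import Path

# Markers taken from the panic text quoted in this article; extend as needed.
PANIC_MARKERS = re.compile(r"Kernel panic|Aiee|\bEIP\b")


def find_panic_lines(lines, context=20):
    """Return (line_number, text) pairs for every marker hit plus the
    `context` lines that precede it."""
    wanted = set()
    for i, line in enumerate(lines):
        if PANIC_MARKERS.search(line):
            wanted.update(range(max(0, i - context), i + 1))
    return [(i + 1, lines[i].rstrip("\n")) for i in sorted(wanted)]


def scan_log(path="/var/log/messages", context=20):
    """Read a log file and return the panic-related lines found in it."""
    # errors="replace" keeps the scan going if the log contains stray bytes.
    text = Path(path).read_text(errors="replace")
    return find_panic_lines(text.splitlines(), context)
```

Calling `scan_log()` after rebooting the machine (reading the log typically needs root) returns the matching lines with their line numbers. An empty result means the panic never reached the log, and the screen dump is the only source.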
Hard panic troubleshooting when a full trace is available
- The stack trace is the most important piece of information to use in troubleshooting a kernel panic.
- It is often crucial to have a full stack trace, something that may not be available if only a screen dump is provided: the top of the stack may scroll off the screen, leaving only a partial stack trace.
- If a full trace is available, it is usually sufficient to isolate the root cause.
- To identify whether or not you have a large enough stack trace, look for a line with EIP, which will show what function call and module caused the panic. In the example below, this is shown in the following line:
Hard panic troubleshooting when a full trace is not available
If only a partial stack trace is available, it can be tricky to isolate the root cause, since there is no explicit information about what module or function call caused the panic. Instead, only commands leading up to the final command will be seen in a partial stack trace. In this case, it is very important to collect as much information as possible about what happened leading up to the kernel panic (application logs, library traces, steps to reproduce, etc.).
Hard panic partial trace example (note there is no line with EIP information):
[] ip_rcv [kernel] 0x357
[] sramintr [streams_dlgnDriver] 0x32d
[] lis_spin_lock_irqsave_fcn [streams] 0x7d
[] inthw_lock [streams_dlgnDriver] 0x1c
[] pwswtbl [streams_dlgnDriver] 0x0
[] dlgnintr [streams_dlgnDriver] 0x4b
[] Gn_Maxpm [streams_dlgnDriver] 0x7ae
[] __run_timers [kernel] 0xd1
[] handle_IRQ_event [kernel] 0x5e
[] do_IRQ [kernel] 0xa4
[] default_idle [kernel] 0x0
[] default_idle [kernel] 0x0
[] call_do_IRQ [kernel] 0x5
[] default_idle [kernel] 0x0
[] default_idle [kernel] 0x0
[] default_idle [kernel] 0x2d
[] cpu_idle [kernel] 0x2d
[] __call_console_drivers [kernel] 0x4b
[] call_console_drivers [kernel] 0xeb
Code: 8b 50 0c 85 d2 74 31 f6 42 0a 02 74 04 89 44 24 08 31 f6 0f
<0> Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing
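When a trace has been typed in or transcribed from a photo, a short Python sketch can check for the EIP line and tally which non-kernel modules appear in the frames. It assumes each frame reads `[address] symbol [module] 0xoffset`, with the address possibly blank. The function names are hypothetical. A frequent module is only a hint about where to look first, not proof of the root cause.

```python
import re
from collections import Counter

# One frame per line: "[address] symbol [module] 0xoffset".
# The address may be empty, and the module name may carry stray spaces.
FRAME = re.compile(
    r"^\[[^\]]*\]\s+(\S+)\s+\[\s*([^\]\s]+)\s*\]\s+(0x[0-9a-fA-F]+)"
)


def parse_frames(trace_text):
    """Return (symbol, module, offset) tuples in the order they were printed."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME.match(line.strip())
        if match:
            frames.append(match.groups())
    return frames


def has_eip_line(trace_text):
    """True if any line mentions EIP, which suggests the trace is complete."""
    return any(re.search(r"\bEIP\b", line) for line in trace_text.splitlines())


def modules_by_frequency(frames):
    """Count frames per non-kernel module, most frequent first."""
    counts = Counter(module for _, module, _ in frames if module != "kernel")
    return counts.most_common()
```

If `has_eip_line` returns `False`, the trace is partial, and the module tally together with the application logs is what remains to work from.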
Hard panics: using the kernel debugger (KDB)
- If only a partial trace is available and the supporting information is not sufficient to isolate the root cause, it may be useful to use KDB.
- KDB is a tool compiled into the kernel that causes the kernel to break into a shell rather than lock up when a panic occurs.
- This enables you to collect additional information about the panic, which is often useful in determining the root cause.
Some important things to note about using KDB:
- If this is a potential Dialogic issue, technical support should be contacted prior to the use of KDB.
- You must use a base kernel, i.e. the 2.4.18 kernel instead of 2.4.18-5 from RedHat. This is because KDB is only available for the base kernels, and not for the builds created by RedHat. While this does create a slight deviation from the original configuration, it usually does not interfere with root cause analysis.
- You need different Dialogic drivers compiled to handle the specific kernel.
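Before setting up KDB, it can help to confirm which kind of kernel is running. The Python sketch below treats a plain `x.y.z` release string as a base kernel and anything with a trailing build suffix as a vendor build. That rule is an assumption drawn only from the 2.4.18 versus 2.4.18-5 example above, and custom builds may not follow it. The function name `is_base_kernel` is hypothetical.

```python
import platform
import re


def is_base_kernel(release):
    """True for a plain x.y.z release such as "2.4.18"; False when a build
    suffix follows, as in "2.4.18-5". Assumption: suffix means vendor build."""
    return re.fullmatch(r"\d+\.\d+\.\d+", release) is not None


# platform.release() reports the running kernel release (like `uname -r`).
current = platform.release()
if is_base_kernel(current):
    print(current, "looks like a base kernel")
else:
    print(current, "carries a build suffix")
```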