Hi, my name is Kim and I'm a support engineer on the Core Performance team. I joined this group about six months ago and noticed that there are a variety of ways to gather the same data. I thought I'd share some of the tricks I've learned for gathering memory dumps.
You could say I'm bad at fractions. I liked to explain my last job as:
1. Half technical.
2. Half process.
3. Half babysitting (no one mentions the third half. Fight Club rules, I guess)
I recently tried to explain my new role as:
1. One-third "knowing how things Should work"
2. One-third "knowing what data to capture when it's Not working"
3. One-third trying to figure out the difference.
4. (Fight Club rules apply to the fourth third)
Knowing how things should work depends on the issue at hand. Typically you can be as generic or as granular as the situation requires. For example: all I need to know about my car is that I turn the key and it starts. That's how it should work. When it doesn't start I need to know more: Is it out of gas? Bad spark plug? Alternator? Battery? Bad key-chip? The more you know, the more you can eliminate until you pinpoint an area to dig into.
Once we have an area to focus on, we typically need to capture data. Capturing data can be broken down into two areas: a snapshot of a single moment and a collection of snapshots over time. The best example of a single snapshot is a memory dump.
The concept is basic: whatever the computer is doing at any one moment in time, freeze it, and put all that info into a file. In practice there are several ways to take that picture.
The old dog - Control Scroll Scroll
The most common, hands-on way to force a memory dump is to configure the server to dump on a specific keystroke combination. Specifically, hold down the right CTRL key and press the SCROLL LOCK key twice.
PS/2 keyboard
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters
Name : CrashOnCtrlScroll
Data Type : REG_DWORD
Value : 1
- A reboot is required before this becomes active.
USB Keyboard
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\kbdhid\Parameters
Name : CrashOnCtrlScroll
Data Type : REG_DWORD
Value : 1
- A reboot is not required, but unplugging and plugging in the keyboard is needed.
- Control Scroll Scroll can be used anytime, but it is typically used when the server is non-responsive, or "in-state".
- In general this should work for any OS. Some operating systems may need hotfixes.
- The dump that is created is a Stop 0xE2.
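If you have to apply these values on more than one box, it can help to generate a .reg file rather than typing them in by hand. Here is a minimal sketch that only builds the text of such a file from the two keys listed above. The helper names (`reg_dword_entry`, `build_reg_file`) are made up for illustration and are not part of any tool.

```python
# Sketch: build the text of a .reg file for the CrashOnCtrlScroll values above.
# The function names are hypothetical helpers, not an existing API.

def reg_dword_entry(key_path, name, value):
    """Return one .reg section that sets a single REG_DWORD value."""
    return f'[{key_path}]\n"{name}"=dword:{value:08x}\n'


def build_reg_file(entries):
    """Join (key_path, name, value) tuples into a complete .reg file body."""
    header = "Windows Registry Editor Version 5.00\n\n"
    return header + "\n".join(reg_dword_entry(*entry) for entry in entries)


# Keys and values exactly as listed in this section.
CRASH_ON_CTRL_SCROLL = [
    (r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters",
     "CrashOnCtrlScroll", 1),
    (r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\kbdhid\Parameters",
     "CrashOnCtrlScroll", 1),
]

if __name__ == "__main__":
    print(build_reg_file(CRASH_ON_CTRL_SCROLL))
```

Saving the printed text as a .reg file and importing it with regedit is one way to apply it. The reboot (PS/2) or keyboard re-plug (USB) noted above is still needed afterwards.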
Remote Old Dog
When we don't have the option to connect a keyboard directly, we can configure the system so that it can be crashed remotely via a Non-Maskable Interrupt (NMI).
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl
Name : NMICrashDump
Data Type : REG_DWORD
Value : 1
- A reboot is required before this becomes active.
- The NMI must be enabled in the BIOS or via the Integrated Lights-Out (iLO) web interface.
- Since there is no local connection to the server some typical behaviors are:
- Not responsive to ping
- Cannot connect to the server via terminal services
- Cannot connect via UNC path
- Cannot connect via remote registry (connecting to the server via event viewer or server management console etc)
- It's always good to try multiple methods of connecting to the server and note what works and what doesn't.
- In general this should work for any OS that has compatible hardware (NMI aware/iLO).
- The dump that is created is a Stop 0x80.
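To keep track of what works and what doesn't before sending the NMI, a quick script can probe the server and write the results down for you. The sketch below assumes the default TCP ports (3389 for terminal services, 445 for UNC/SMB access). It does not cover ping, which needs ICMP and is easier to run by hand. The function names are illustrative only.

```python
import socket

# Default ports (an assumption; your environment may use different ones).
CHECKS = {
    "terminal services (RDP)": 3389,
    "UNC path (SMB)": 445,
}


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def survey(host, probe=tcp_reachable):
    """Run every check against host and return {label: True/False}."""
    return {label: probe(host, port) for label, port in CHECKS.items()}


def format_report(host, results):
    """Turn the survey results into notes you can paste into the case."""
    lines = [f"Connectivity to {host}:"]
    for label, ok in results.items():
        lines.append(f"  {label}: {'works' if ok else 'no response'}")
    return "\n".join(lines)
```

For a server called `server01` (a placeholder name), `print(format_report("server01", survey("server01")))` gives a short list to record alongside the dump.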
Just Do It!
There's a tool called NotMyFault that will crash the box on demand.
Click Start
Locate and right-click Command Prompt
Select Run as administrator.
Type NotMyFault.exe /crash
- No registry changes are needed to get the server to crash.
- This is used when the server is behaving badly
- Out of resource errors
- Very slow to respond
- Generally, anything that is not normal server operation
- The server isn't completely locked up (we need to run the executable to force the dump)
- In general this should work for any OS.
- The dump that is created is a Stop 0xD1.
Whatchu Talkin About?
NotMyFault.exe can be triggered automatically when an event is recorded in the event log. For example, if you're getting
Event ID: 2019
Source: Srv
Description: The server was unable to allocate from the system nonpaged pool because the pool was empty.
intermittently, the issue may no longer be present by the time we get around to forcing the dump. In that case we can set up a trigger to call NotMyFault as soon as 2019 pops up its pretty little head.
Example 1:
Setting up event triggering on 2003
- In Notepad type the line NotMyFault.exe /crash
- Save the file as NMF.Bat
- Open a Command Prompt
- Run the command
eventtriggers /create /tr "Non Paged Pool Event" /eid 2019 /so SRV /tk \\server\share\NMF.bat
http://technet.microsoft.com/en-us/library/bb490901.aspx
Example 2:
Setting up event triggering on 2008
- In Notepad type the line NotMyFault.exe /crash
- Save the file as NMF.Bat
- In Administrative Tools open Task Scheduler
- Select Create Basic Task
- On the Task Trigger page select "When a specific event is logged"
- On the When a Specific Event Is Logged page select
Log: System
Source: SRV
Event ID: 2019
- On the Action page select "Start a program"
- Select the location of the NMF.Bat file.
- No registry changes are needed to get the server to crash.
- This is used when we want to capture a dump as soon as a specific event is triggered
- In general this should work for any OS.
- The dump that is created is a Stop 0xD1.
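The Task Scheduler steps in Example 2 can also be scripted with schtasks, which supports event-based triggers through /SC ONEVENT, /EC (the log) and /MO (an XPath event query). The sketch below only builds the command line so you can review it before running anything. The task name and share path are reused from Example 1. Treat the provider name `srv` in the query as an assumption and check it against the event's XML view on your own box.

```python
import subprocess  # used only to render the argument list as one command line


def event_query(source, event_id):
    """Build the XPath filter for one provider and one event ID."""
    return f"*[System[Provider[@Name='{source}'] and (EventID={event_id})]]"


def build_schtasks_args(task_name, batch_path, log, source, event_id):
    """Return the schtasks arguments for an on-event task (nothing is executed)."""
    return [
        "schtasks", "/Create",
        "/TN", task_name,    # task name
        "/TR", batch_path,   # program to start: the NMF.bat wrapper
        "/SC", "ONEVENT",    # trigger when a matching event is logged
        "/EC", log,          # event log (channel) to watch
        "/MO", event_query(source, event_id),
    ]


args = build_schtasks_args(
    "Non Paged Pool Event", r"\\server\share\NMF.bat", "System", "srv", 2019
)
print(subprocess.list2cmdline(args))
```

Run the printed command from an elevated Command Prompt on the server. As written, the task runs under the account that creates it, so the same caveats as the wizard-created task apply.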
That special case: Event 333
The server can be set up to reboot automatically when an Event 333 is triggered on Windows 2003 servers via a hotfix and a registry setting:
- Registry change:
Location: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager
Name: RegistryFlushErrorSubside
Type: REG_DWORD
Value: 2
Virtual Environments:
Hyper-V:
When running a Hyper-V server with a problematic Hyper-V guest machine we can do the following to generate a memory dump from that guest:
- NotMyFault can be used on the system
- NotMyFault via EventTriggers/Scheduled Task can be used
- Both of those options will dump the box.
- A less invasive method is to Save State on the guest and convert the vsv file to a dmp file. The benefit here is that the guest is not shut down forcefully.
VMware:
When running a VMware server with a problematic VMware guest machine we can do the following to generate a memory dump from that guest:
- NotMyFault can be used on the system
- NotMyFault via EventTriggers/Scheduled Task can be used
- NMI
- Log in as root in a terminal session on the ESX host where the virtual machine is running.
- Run the following command to determine the world ID number for the virtual machine: vm-support -x
- Match the name of the virtual machine with the world ID.
- Run the following command to cause the virtual machine to fail: /usr/lib/vmware/bin/vmdumper <world_id> nmi
- We can also save state on the VMware guest and convert that into a memory dump for review
http://www.vmware.com/pdf/snapshot2core_technote.pdf
- Create a snapshot or suspend the virtual machine.
- Locate the snapshot (.vmsn) or suspend file (.vmss) in the virtual machine directory. The vmss2core tool also accepts monolithic or non-monolithic memory (.vmem) files. Copy the checkpoint file to a Workstation 7.1 or Fusion 3.1 host for debugging.
- Run the vmss2core tool with options selecting your preferred output type. Run this command to generate a memory.dmp file: vmss2core -W <vmName>.vmss
Disclaimers:
All of the above depends on the machine being configured correctly to save the memory.dmp file. The server may reboot when triggered, but if it does not write a memory.dmp file then something is preventing the data from being written from memory to disk. This includes but is not limited to:
- Pagefile is too small (should be ~110% of RAM)
- Pagefile is not located on the system drive (DedicatedDumpFile setting may work on 2008 to get around this)
- Not enough hard drive space available on the system drive (need to be able to write the memory.dmp file which may be the full amount of RAM)
- ASR is enabled (can prevent the server from rebooting or saving a dmp file)
- Hard drive is corrupt (unable to actually write the file)
- Memory is corrupt (information paged is corrupt and unreadable)
A great tool to use is dumpconfigurator which automates much of the above.
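The sizing items in that list are simple arithmetic, so they are easy to sanity-check before you crash the box. Here is a rough sketch using the ~110% pagefile rule of thumb and the "free space at least equal to RAM" worst case from above. You supply the numbers yourself. The function name and messages are illustrative, and the other items (ASR, corrupt disk or memory) are not covered.

```python
GIB = 1024 ** 3  # bytes per GiB


def dump_readiness_problems(ram_bytes, pagefile_bytes, system_drive_free_bytes,
                            pagefile_on_system_drive=True):
    """Return a list of sizing problems that could stop a full dump being saved."""
    problems = []

    # Rule of thumb from the list above: pagefile should be about 110% of RAM.
    required_pagefile = ram_bytes * 110 // 100
    if pagefile_bytes < required_pagefile:
        problems.append(
            f"Pagefile is too small: {pagefile_bytes / GIB:.1f} GiB, "
            f"want about {required_pagefile / GIB:.1f} GiB"
        )

    if not pagefile_on_system_drive:
        problems.append("Pagefile is not located on the system drive")

    # Worst case: memory.dmp may be the full amount of RAM.
    if system_drive_free_bytes < ram_bytes:
        problems.append(
            f"Not enough free space on the system drive: "
            f"{system_drive_free_bytes / GIB:.1f} GiB free, "
            f"may need {ram_bytes / GIB:.1f} GiB"
        )

    return problems
```

For example, `dump_readiness_problems(16 * GIB, 16 * GIB, 8 * GIB)` flags both the pagefile size and the free space on a 16 GiB server.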
Are there other ways to trigger the server to crash and save a memory dump? Sure! But hopefully everything listed above covers enough situations to make the other options unnecessary. In the last few months of ramping up in Perf, I've used each of the above at least once to get the data needed to fix the issues at hand :)
Kim Johnson
Senior Support Escalation Engineer
Microsoft Customer Services and Support