0CCh Blog

调试挂死的Explorer

一个同事前几天告诉我说,他的explorer.exe总是挂死,不知道是什么情况导致的。于是我让他下次挂死的时候抓个dump我。抓Dump的工具很多,例如用Win7的TaskMgr,sysinternals的Procexp,或者Windbg本身。不过考虑到explorer挂死了,操作桌面起来不方便,所以最好选择能够自动检测挂死并且抓住dump的工具。这里比较推荐的是sysinternals的Procdump以及我开发的proc_dump_study(带UI)。

20130706003632

第二天,同事把explorer.exe挂死的Dump传给了我,200多MB。用Windbg打开Dump文件,第一反应就是看看有多少线程再说吧。

0:000> ~
. 0 Id: b14.b18 Suspend: 0 Teb: 7ffde000 Unfrozen
1 Id: b14.b1c Suspend: 0 Teb: 7ffdd000 Unfrozen
...
54 Id: b14.1a94 Suspend: 0 Teb: 7ff75000 Unfrozen
55 Id: b14.18c8 Suspend: 0 Teb: 7ff74000 Unfrozen

56个线程,肯定不能依次看栈回溯。按照尝试判断,explorer界面挂死,肯定是刷新界面的线程挂死了。所以栈回溯里肯定有explorer的身影。于是找找哪个线程有explorer模块。

0:000> !findstack explorer!
Thread 000, 2 frame(s) match
* 04 001bf924 0087aa50 explorer!wWinMain+0x54a
* 05 001bf9b8 75771154 explorer!_initterm_e+0x1b1

Thread 003, 2 frame(s) match
* 11 0324f714 008757a6 explorer!CTray::_MessageLoop+0x265
* 12 0324f724 75b346bc explorer!CTray::MainThreadProc+0x8a

Thread 008, 1 frame(s) match
* 03 04a1fc0c 75b346bc explorer!CSoundWnd::s_ThreadProc+0x3a

从上面的结果看来,3号线程最可疑,于是看看完整的堆栈情况。

0:003> kv
# ChildEBP RetAddr Args to Child
00 0324f508 76eb5aec 75236924 00000002 0324f55c ntdll!KiFastSystemCallRet (FPO: [0,0,0])
01 0324f50c 75236924 00000002 0324f55c 00000001 ntdll!NtWaitForMultipleObjects+0xc (FPO: [5,0,0])
02 0324f5a8 7576f10a 0324f55c 0324f5d0 00000000 KERNELBASE!WaitForMultipleObjectsEx+0x100 (FPO: [Non-Fpo])
03 0324f5f0 75fa90be 00000002 7ffdf000 00000000 kernel32!WaitForMultipleObjectsExImplementation+0xe0 (FPO: [Non-Fpo])
04 0324f644 73d51717 000002fc 0324f678 ffffffff user32!RealMsgWaitForMultipleObjectsEx+0x13c (FPO: [Non-Fpo])
05 0324f664 73d517b8 000024ff ffffffff 00000000 duser!CoreSC::Wait+0x59 (FPO: [Non-Fpo])
06 0324f68c 73d51757 000024ff 00000000 0324f6b8 duser!CoreSC::WaitMessage+0x54 (FPO: [Non-Fpo])
07 0324f69c 75fa949f 000024ff 00000000 0324f68c duser!MphWaitMessageEx+0x2b (FPO: [Non-Fpo])
08 0324f6b8 76eb60ce 0324f6d0 00000008 0324f7e8 user32!__ClientWaitMessageExMPH+0x1e (FPO: [Non-Fpo])
09 0324f6d4 75fa93f3 00851dee 00000000 80000000 ntdll!KiUserCallbackDispatcher+0x2e (FPO: [0,0,0])
0a 0324f6d8 00851dee 00000000 80000000 00901180 user32!NtUserWaitMessage+0xc (FPO: [0,0,0])
0b 0324f714 008757a6 00000000 75b318f2 0324f7ac explorer!CTray::_MessageLoop+0x265 (FPO: [Non-Fpo])
0c 0324f724 75b346bc 00901180 00000000 00000000 explorer!CTray::MainThreadProc+0x8a (FPO: [Non-Fpo])
0d 0324f7ac 75771154 001bf810 0324f7f8 76ecb299 shlwapi!WrapperThreadProc+0x1b5 (FPO: [Non-Fpo])
0e 0324f7b8 76ecb299 001bf810 75d00467 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])
0f 0324f7f8 76ecb26c 75b345e9 001bf810 00000000 ntdll!__RtlUserThreadStart+0x70 (FPO: [Non-Fpo])
10 0324f810 00000000 75b345e9 001bf810 00000000 ntdll!_RtlUserThreadStart+0x1b (FPO: [Non-Fpo])

可以看出线程正在调用WaitForMultipleObjectsEx等待两个内核对象。那么看看这两个内核对象是什么吧。

0:003> dp 0324f55c L2
0324f55c 00000318 000002fc
0:003> !handle 00000318
Handle 00000318
Type Event
0:003> !handle 000002fc
Handle 000002fc
Type Event

很不幸,两个内核对象都是Event,这样就没有什么可参考的价值了,因为我们没办法知道谁应该去设置两个event。那么好吧,从其他方面下手,看能不能发现问题。看看关键区的情况。

0:003> !cs -l
-----------------------------------------
DebugInfo = 0x76f47540
Critical section = 0x76f47340 (ntdll!LdrpLoaderLock+0x0)
LOCKED
LockCount = 0x6
WaiterWoken = No
OwningThread = 0x000013e0
RecursionCount = 0x1
LockSemaphore = 0x220
SpinCount = 0x00000000
-----------------------------------------
DebugInfo = 0x002ac6e0
Critical section = 0x765ea0f0 (shell32!CMountPoint::_csDL+0x0)
LOCKED
LockCount = 0x0
WaiterWoken = No
OwningThread = 0x000013e0
RecursionCount = 0x1
LockSemaphore = 0xA50
SpinCount = 0x00000000

看到一个很可疑的情况了,两个cs都被一个线程占用,更可疑的是这个线程居然还占用了LdrpLoaderLock。这就很有可能是引起死锁的原因了。来看看这个线程的完整堆栈。

0:053> ~~[13e0]s
eax=06d02f00 ebx=00000000 ecx=03920000 edx=06ce0000 esi=000015ac edi=00000000
eip=76eb6194 esp=0b1ad1cc ebp=0b1ad238 iopl=0 nv up ei pl zr na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246
ntdll!KiFastSystemCallRet:
76eb6194 c3 ret
0:053> k
# ChildEBP RetAddr
00 0b1ad1c8 76eb5b0c ntdll!KiFastSystemCallRet
01 0b1ad1cc 7523179c ntdll!ZwWaitForSingleObject+0xc
02 0b1ad238 7576efe3 KERNELBASE!WaitForSingleObjectEx+0x98
03 0b1ad250 7576ef92 kernel32!WaitForSingleObjectExImplementation+0x75
04 0b1ad264 7622399a kernel32!WaitForSingleObject+0x12
05 0b1ad294 7622299c shell32!CMountPoint::_InitLocalDrives+0xcd
...
14 0b1ad81c 762aa690 shell32!SHGetFolderLocation+0x121
15 0b1ad838 0a5c0bf5 shell32!SHGetSpecialFolderLocation+0x17
WARNING: Stack unwind information not available. Following frames may be wrong.
16 0b1ae0bc 0a5bfceb HaoZipExt!DllUnregisterServer+0x1cd04
17 0b1ae240 0a5e0200 HaoZipExt!DllUnregisterServer+0x1bdfa
18 0b1ae284 0a5e02b9 HaoZipExt!DllUnregisterServer+0x3c30f
19 0b1ae2ac 76ecfbdf HaoZipExt!DllUnregisterServer+0x3c3c8
1a 0b1ae3a0 76ed008b ntdll!LdrpRunInitializeRoutines+0x26f
1b 0b1ae50c 76ecf499 ntdll!LdrpLoadDll+0x4d1
1c 0b1ae540 7523b96d ntdll!LdrLoadDll+0x92
1d 0b1ae57c 7534a333 KERNELBASE!LoadLibraryExW+0x1d3
1e 0b1ae598 7534a2b8 ole32!LoadLibraryWithLogging+0x16
...
39 0b1afbac 76241ee6 shell32!CShellExecute::_DoExecute+0x5a
3a 0b1afbc0 75b346bc shell32!CShellExecute::s_ExecuteThreadProc+0x30
3b 0b1afc48 75771154 shlwapi!WrapperThreadProc+0x1b5
3c 0b1afc54 76ecb299 kernel32!BaseThreadInitThunk+0xe
3d 0b1afc94 76ecb26c ntdll!__RtlUserThreadStart+0x70
3e 0b1afcac 00000000 ntdll!_RtlUserThreadStart+0x1b

首先一眼就看到了一个非系统模块HaoZipExt。再扫一眼,发现他在LdrpLoaderLock的时候又去等待了某个内核对象。那么再来看看这个内核对象是什么吧。

0:053> kv L5
# ChildEBP RetAddr Args to Child
00 0b1ad1c8 76eb5b0c 7523179c 000015ac 00000000 ntdll!KiFastSystemCallRet (FPO: [0,0,0])
01 0b1ad1cc 7523179c 000015ac 00000000 00000000 ntdll!ZwWaitForSingleObject+0xc (FPO: [3,0,0])
02 0b1ad238 7576efe3 000015ac ffffffff 00000000 KERNELBASE!WaitForSingleObjectEx+0x98 (FPO: [Non-Fpo])
03 0b1ad250 7576ef92 000015ac ffffffff 00000000 kernel32!WaitForSingleObjectExImplementation+0x75 (FPO: [Non-Fpo])
04 0b1ad264 7622399a 000015ac ffffffff 00000000 kernel32!WaitForSingleObject+0x12 (FPO: [Non-Fpo])
0:053> !handle 000015ac f
Handle 000015ac
Type Thread
Attributes 0
GrantedAccess 0x1fffff:
Delete,ReadControl,WriteDac,WriteOwner,Synch
Terminate,Suspend,Alert,GetContext,SetContext,SetInfo,QueryInfo,SetToken,Impersonate,DirectImpersonate
HandleCount 4
PointerCount 7
Name <none>
Object specific information
Thread Id b14.1a94
Priority 10
Base Priority 0

原来他在等待1a94这个线程结束啊,那么这个1a94线程又在干嘛呢?

0:054> ~~[1a94]s
eax=0bbdfbb4 ebx=00000000 ecx=00000000 edx=00000000 esi=76f47340 edi=00000000
eip=76eb6194 esp=0bbdfa24 ebp=0bbdfa88 iopl=0 nv up ei pl nz ac pe cy
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000217
ntdll!KiFastSystemCallRet:
76eb6194 c3 ret
0:054> kv
# ChildEBP RetAddr Args to Child
00 0bbdfa20 76eb5b0c 76e9f98e 00000220 00000000 ntdll!KiFastSystemCallRet (FPO: [0,0,0])
01 0bbdfa24 76e9f98e 00000220 00000000 00000000 ntdll!ZwWaitForSingleObject+0xc (FPO: [3,0,0])
02 0bbdfa88 76e9f872 00000000 00000000 00000000 ntdll!RtlpWaitOnCriticalSection+0x13e (FPO: [Non-Fpo])
03 0bbdfab0 76ecb31d 76f47340 7d4908db 7ff75000 ntdll!RtlEnterCriticalSection+0x150 (FPO: [Non-Fpo])
04 0bbdfb44 76ecb13c 0bbdfbb4 7d49080f 00000000 ntdll!LdrpInitializeThread+0xc6 (FPO: [Non-Fpo])
05 0bbdfb90 76ecb169 0bbdfbb4 76e70000 00000000 ntdll!_LdrpInitialize+0x1ad (FPO: [Non-Fpo])
06 0bbdfba0 00000000 0bbdfbb4 76e70000 00000000 ntdll!LdrInitializeThunk+0x10 (FPO: [Non-Fpo])

原来这个线程在等待LdrpLoaderLock这个锁啊,真相大白了。这里理一下思路,线程13e0,创建后,调用Loadlibrary,装载HaoZipExt。这个时候HaoZipExt获得LdrpLoaderLock,但是HaoZipExt犯了编写DLL的大忌。在DLLMain里面做了一些不能预期的事情。HaoZipExt调用了SHGetSpecialFolderLocation,这个函数在内部会创建一个线程,运行一个叫做FirstHardwareEnumThreadProc 的子过程。这个线程起来之后,就会通知所用的DllMain,告诉他们DLL_THREAD_ATTACH的消息。但是告诉他们这个消息之前,首先要获得LdrpLoaderLock这个锁。但是LdrpLoaderLock这个锁正在被创建他的线程使用,而且还在等自己结束,就这样死锁了。这也是MSDN特别强调告诉我们,不要在DllMain里有过多自己不能预期的操作的原因。

那么看看这个罪魁祸首是什么模块吧。

0:054> lmvm HaoZipExt
Browse full module list
start end module name
0a5a0000 0a605000 HaoZipExt (export symbols) HaoZipExt.dll
Loaded symbol image file: HaoZipExt.dll
Image path: C:\Program Files\HaoZip\HaoZipExt.dll
Image name: HaoZipExt.dll
Browse all global symbols functions data
Timestamp: Wed Jul 25 17:16:06 2012 (500FB956)
CheckSum: 0006C14B
ImageSize: 00065000
File version: 3.0.1.9002
Product version: 3.0.1.9002
File flags: 0 (Mask 3F)
File OS: 40004 NT Win32
File type: 2.0 Dll
File date: 00000000.00000000
Translations: 0804.04b0
CompanyName: 瑞创网络
ProductName: 2345好压(HaoZip)
InternalName: HaoZipExt
OriginalFilename: HaoZipExt.dll
ProductVersion: 3.0
FileVersion: 3.0.1.9002
FileDescription: 2345好压-Windows扩展模块
LegalCopyright: 版权所有(c) 2012 瑞创网络
Comments: www.haozip.com

知道问题后,我感觉这应该就是explorer挂死的原因,虽然没有100%的证据,但是至少也是一个造成死锁的程序,早卸载为妙,于是我告诉了同事,卸载了这个叫做好压的软件。之后几天,explorer运行正常,再也没有出现过挂死现象了。

最后总结HaoZipExt犯的错误
1.在DllMain里面的做了线程创建的操作。
2.跟挂死无关,只是吐槽一下他在DllMain里面调用了SHGetSpecialFolderLocation这个函数。因为这个函数已经不被支持,而且有可能在将来被废弃。以下是MSDN的原话:[SHGetSpecialFolderLocation is not supported and may be altered or unavailable in the future. Instead, useSHGetFolderLocation.]

感叹一下,写一个健壮的程序真的不是件容易的事啊。