I. Background
1. The story
A while back, a friend reached out on WeChat saying his program was pegging the CPU at 100%, and the only remedy each time was to step in and restart it by hand. He asked me to help figure out what was going on. Incidents like a maxed-out CPU put enormous pressure on a programmer, so I asked him to capture two dumps while the CPU was high and send them over for analysis.
II. WinDbg Analysis
1. Is the CPU really maxed out?
To avoid heading off in the wrong direction, always verify with the !tp command that the CPU really is maxed out.
0:000> !tp
CPU utilization: 100%
Worker Thread: Total: 21 Running: 7 Idle: 0 MaxLimit: 32767 MinLimit: 4
Work Request in Queue: 3
    AsyncTimerCallbackCompletion TimerInfo@00000000042d2430
    AsyncTimerCallbackCompletion TimerInfo@00000000042d2f90
    AsyncTimerCallbackCompletion TimerInfo@000000000420c150
--------------------------------------
Number of Timers: 0
--------------------------------------
Completion Port Thread:Total: 18 Free: 9 MaxFree: 8 CurrentLimit: 18 MaxLimit: 1000 MinLimit: 4
The output shows utilization really is at 100%, and on top of that there are work requests piling up in the queue. With that confirmed, the next step is to find out what triggered it.
2. What drove the CPU up?
Following the usual routine, the first suspect is a GC-triggered spike, so list the threads with !t and look for any GC markers.
0:000> !t
ThreadCount:      53
UnstartedThread:  0
BackgroundThread: 53
PendingThread:    0
DeadThread:       0
Hosted Runtime:   no
                                                                             Lock
 ID OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
  4    1 1240 00000000021cdf30    2a220 Preemptive  0000000000000000:0000000000000000 00000000021d94c0 0     MTA
 23    2 4db4 00000000041cdaa0    2b220 Preemptive  0000000000000000:0000000000000000 00000000021d94c0 0     MTA (Finalizer)
...
 65  156 22f4 000000000b1a3f60  8029220 Preemptive  00000004527751F0:0000000452775EE8 00000000021d94c0 0     MTA (Threadpool Completion Port)
 66  205 2ef8 000000000b1a1080  8029220 Preemptive  0000000157641DE0:00000001576435B0 00000000021d94c0 0     MTA (Threadpool Completion Port)
...
There is no GC marker in the list, so this spike was not triggered by the GC. Where to look next? A CPU spike is generally driven by threads, so the next step is to check the machine's CPU configuration and then walk every thread's stack for fresh clues, using !cpuid and ~*e !clrstack.
0:000> !cpuid
CP  F/M/S  Manufacturer     MHz
 0  6,79,1  <unavailable>   2299
 1  6,79,1  <unavailable>   2299
 2  6,79,1  <unavailable>   2299
 3  6,79,1  <unavailable>   2299
0:000> ~*e !clrstack
OS Thread Id: 0x2cc4 (68)
        Child SP               IP Call Site
000000000c14e758 00007ffadeb86e4a [GCFrame: 000000000c14e758]
000000000c14e840 00007ffadeb86e4a [GCFrame: 000000000c14e840]
000000000c14e878 00007ffadeb86e4a [HelperMethodFrame: 000000000c14e878] System.Threading.Monitor.Enter(System.Object)
000000000c14e970 00007ffaceb40491 System.Net.ConnectionGroup.Disassociate(System.Net.Connection) [f:\dd\NDP\fx\src\net\System\Net\_ConnectionGroup.cs @ 148]
000000000c14e9d0 00007ffaceb3fc93 System.Net.Connection.PrepareCloseConnectionSocket(System.Net.ConnectionReturnResult ByRef) [f:\dd\NDP\fx\src\net\System\Net\_Connection.cs @ 3048]
000000000c14eaa0 00007ffacf139bfb System.Net.Connection.HandleError(Boolean, Boolean, System.Net.WebExceptionStatus, System.Net.ConnectionReturnResult ByRef) [f:\dd\NDP\fx\src\net\System\Net\_Connection.cs @ 3119]
000000000c14eb00 00007ffacebc4118 System.Net.Connection.ReadComplete(Int32, System.Net.WebExceptionStatus) [f:\dd\NDP\fx\src\net\System\Net\_Connection.cs @ 3387]
000000000c14eb80 00007ffacebe3dc5 System.Net.LazyAsyncResult.Complete(IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_LazyAsyncResult.cs @ 415]
000000000c14ebe0 00007ffacebe3d07 System.Net.LazyAsyncResult.ProtectedInvokeCallback(System.Object, IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_LazyAsyncResult.cs @ 368]
000000000c14ec20 00007ffacf3a476f System.Net.Security._SslStream.StartFrameBody(Int32, Byte[], Int32, Int32, System.Net.AsyncProtocolRequest)
000000000c14ec80 00007ffacebb3ed8 System.Net.Security._SslStream.ReadHeaderCallback(System.Net.AsyncProtocolRequest) [f:\dd\NDP\fx\src\net\System\Net\SecureProtocols\_SslStream.cs @ 1007]
000000000c14ece0 00007ffacebae5ee System.Net.AsyncProtocolRequest.CompleteRequest(Int32) [f:\dd\NDP\fx\src\net\System\Net\SecureProtocols\_HelperAsyncResults.cs @ 142]
000000000c14ed10 00007ffacf3a3567 System.Net.FixedSizeReader.CheckCompletionBeforeNextRead(Int32)
000000000c14ed40 00007ffacebae507 System.Net.FixedSizeReader.ReadCallback(System.IAsyncResult) [f:\dd\NDP\fx\src\net\System\Net\SecureProtocols\_FixedSizeReader.cs @ 148]
000000000c14ed90 00007ffacebe3dc5 System.Net.LazyAsyncResult.Complete(IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_LazyAsyncResult.cs @ 415]
000000000c14edf0 00007ffadcbe3a63 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean) [f:\dd\ndp\clr\src\BCL\system\threading\executioncontext.cs @ 954]
000000000c14eec0 00007ffadcbe38f4 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean) [f:\dd\ndp\clr\src\BCL\system\threading\executioncontext.cs @ 902]
000000000c14eef0 00007ffadcbe38c2 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [f:\dd\ndp\clr\src\BCL\system\threading\executioncontext.cs @ 891]
000000000c14ef40 00007ffaceba60cf System.Net.ContextAwareResult.Complete(IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_ContextAwareResult.cs @ 463]
000000000c14ef90 00007ffacebe3d07 System.Net.LazyAsyncResult.ProtectedInvokeCallback(System.Object, IntPtr) [f:\dd\NDP\fx\src\net\System\Net\_LazyAsyncResult.cs @ 368]
000000000c14efd0 00007ffaceba5e2f System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32, UInt32, System.Threading.NativeOverlapped*) [f:\dd\NDP\fx\src\net\System\Net\Sockets\_BaseOverlappedAsyncResult.cs @ 399]
000000000c14f040 00007ffadcc2ffef System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*) [f:\dd\ndp\clr\src\BCL\system\threading\overlapped.cs @ 135]
000000000c14f1f0 00007ffade9a6d93 [GCFrame: 000000000c14f1f0]
OS Thread Id: 0x5d6c (70)
        Child SP               IP Call Site
000000000943e920 00007ffacf139bfb System.Net.Connection.HandleError(Boolean, Boolean, System.Net.WebExceptionStatus, System.Net.ConnectionReturnResult ByRef)
OS Thread Id: 0x4ad4 (75)
        Child SP               IP Call Site
...
000000000c94e5a0 00007ffacf139bfb System.Net.Connection.HandleError(Boolean, Boolean, System.Net.WebExceptionStatus, System.Net.ConnectionReturnResult ByRef)
...
OS Thread Id: 0x1d70 (80)
        Child SP               IP Call Site
...
000000000d24e3a0 00007ffacf139bfb System.Net.Connection.HandleError(Boolean, Boolean, System.Net.WebExceptionStatus, System.Net.ConnectionReturnResult ByRef) [f:\dd\NDP\fx\src\net\System\Net\_Connection.cs @ 3119]
...
The stacks show the machine has 4 cores, and there happen to be 4 threads failing in HandleError, so it looks like something went wrong at the network level. Next, switch to thread 80 and check its stack for exception objects (the !mdso and !mdt commands below come from the sosex extension).
0:000> ~80s
clr!AwareLock::Contention+0x194:
00007ffa`deb86e40 4883e801        sub     rax,1
0:080> !mdso
Thread 80:
Location          Object            Type
------------------------------------------------------------
000000000d24e098  000000015765e028  System.Net.WebException
000000000d24e0f8  0000000340b07110  System.Collections.ArrayList
000000000d24e110  000000015765e2b8  System.Net.HttpWebRequest[]
000000000d24e1c0  0000000340b070b8  System.Net.ConnectionGroup
000000000d24e258  0000000144a79678  System.Net.Connection
0:080> !mdt 000000015765e028
000000015765e028 (System.Net.WebException)
    _className:NULL (System.String)
    _exceptionMethod:NULL (System.Reflection.MethodBase)
    _exceptionMethodString:NULL (System.String)
    _message:000000015765df70 (System.String) Length=77, String="The underlying connection was closed: The connection was closed unexpectedly."
    ...
Sure enough, there is a System.Net.WebException, and its message says the connection was closed. At this point I had a bold guess: could a high rate of exception throwing be what is driving the CPU up? To verify, count the WebException instances on the managed heap.
0:080> !dumpheap -stat
Statistics:
              MT    Count    TotalSize Class Name
...
00007ffacecc38b0    13315      2343440 System.Net.WebException
00007ffadcdf6570    11369      1909992 System.IO.IOException
00007ffadcdf5fb8    13380      2247840 System.ObjectDisposedException
...
Seeing that many exceptions is alarming. Conveniently, my friend captured two dumps, so we can compare them.
0:048> !dumpheap -stat
Statistics:
              MT    Count    TotalSize Class Name
00007ffacecc38b0    26745      4707120 System.Net.WebException
00007ffadcdf6570    26722      4489296 System.IO.IOException
00007ffadcdf5fb8    28745      4829160 System.ObjectDisposedException
Between the two dumps, taken about 2 minutes apart, these three exception types grew by roughly 40k in total (WebException alone went from 13,315 to 26,745). That confirms the program really was throwing exceptions at a frantic pace. On Windows, every exception, hardware or software, is dispatched through the SEH (Structured Exception Handling) framework, which involves transitions between user mode and kernel mode, so throwing at this rate will inevitably drive the CPU through the roof. Root cause found; next, track down the trigger.
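To get an intuition for why a flood of throws is so expensive, here is a minimal, hypothetical C# micro-benchmark (not taken from the incident) that contrasts a plain call path with one that throws and catches on every iteration:

using System;
using System.Diagnostics;

// Illustrative only: each throw captures a stack trace and goes through
// SEH dispatch, which is why tens of thousands of throws per minute can
// peg a CPU.
class ThrowCost
{
    static int NoThrow(int i) { return i + 1; }

    static int Throws(int i) { throw new InvalidOperationException("boom"); }

    static void Main()
    {
        const int iterations = 100000;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            NoThrow(i);
        sw.Stop();
        Console.WriteLine("plain calls: {0} ms", sw.ElapsedMilliseconds);

        sw.Restart();
        for (int i = 0; i < iterations; i++)
        {
            try { Throws(i); }
            catch (InvalidOperationException) { /* swallowed, as in a retry loop */ }
        }
        sw.Stop();
        Console.WriteLine("throw/catch: {0} ms", sw.ElapsedMilliseconds);
    }
}

On a typical machine the throwing loop runs orders of magnitude slower than the plain one, which is the overhead those 40k throws were paying.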
3. What triggered the exceptions?
Looking back at the HandleError call stacks, every frame is low-level library code. The PerformIOCompletionCallback frame at the bottom shows these exceptions surface on IO threads; they land there because the requests are processed asynchronously. And since this is async IO, there should also be a large number of OverlappedData objects.
0:080> !gchandles -stat
Statistics:
              MT    Count    TotalSize Class Name
00007ffadc6f7b98    14511      1625232 System.Threading.OverlappedData
Total 17550 objects
Handles:
    Strong Handles:       426
    Pinned Handles:       23
    Async Pinned Handles: 14511
    Ref Count Handles:    24
    Weak Long Handles:    2430
    Weak Short Handles:   132
    SizedRef Handles:     4
So there were roughly 15k async requests still awaiting their callbacks, which is a fairly heavy load. But user code still hasn't shown up anywhere, so the only option left is to find out who issued what requests.
0:080> !mdso
Thread 80:
Location          Object            Type
------------------------------------------------------------
...
000000000d24e488  0000000358c57918  System.Net.HttpWebRequest
000000000d24e2e8  00000001407b5b40  System.String "net_io_readfailure"
...
0:080> !mdt -r:2 0000000358c57918
0000000358c57918 (System.Net.HttpWebRequest)
    _Uri:0000000358c57210 (System.Uri)
        m_String:00000002407ee430 (System.String) Length=98, String="https://api.xxxx/peer_messages"
    ...
The request URL is https://api.xxxx/peer_messages, a third-party API. The underlying connection being closed is what ultimately produced the net_io_readfailure.
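For context, this matches the classic APM-style asynchronous HttpWebRequest pattern. Here is a hypothetical sketch (reusing the placeholder URL from the dump) of why the failure surfaces in a callback on an IO completion-port thread rather than at the call site:

using System;
using System.IO;
using System.Net;

// Hypothetical sketch, not the friend's actual code. When the server
// closes the underlying connection, the WebException is raised inside
// EndGetResponse, which runs on an IO completion-port thread -- matching
// the CompletionPortCallback / PerformIOCompletionCallback frames above.
class AsyncRequestSketch
{
    static void Send()
    {
        var request = (HttpWebRequest)WebRequest.Create("https://api.xxxx/peer_messages");

        request.BeginGetResponse(ar =>
        {
            try
            {
                using (var response = (HttpWebResponse)request.EndGetResponse(ar))
                using (var reader = new StreamReader(response.GetResponseStream()))
                {
                    reader.ReadToEnd();
                }
            }
            catch (WebException ex)
            {
                // "The underlying connection was closed" lands here, on the
                // IO thread, once per failed request.
                Console.WriteLine(ex.Message);
            }
        }, null);
    }
}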
Pulling all the information together:
Once the request volume grew, calls to https://api.xxxx/peer_messages started to break down: the remote side closed the underlying connections, and the client was flooded with IO-callback exceptions (IOException: Unable to read data from the transport connection: The connection was closed.), roughly 40k of them within 2 minutes, which in turn drove the CPU to 100%. I passed all this along and asked my friend to keep a close eye on the https://api.xxxx/peer_messages endpoint.
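As a lightweight way to catch a recurrence without pulling dumps, a first-chance exception counter could be wired into the process. This is a hypothetical sketch, not part of the original fix:

using System;
using System.Threading;

// Hypothetical monitoring hook: FirstChanceException fires for every
// throw in the AppDomain, handled or not, so a per-minute counter makes
// an exception storm visible in the logs long before the CPU pegs.
static class ExceptionRateMonitor
{
    private static int _count;
    private static Timer _timer; // held in a field so the timer isn't collected

    public static void Start()
    {
        AppDomain.CurrentDomain.FirstChanceException +=
            (sender, e) => Interlocked.Increment(ref _count);

        _timer = new Timer(_ =>
        {
            int n = Interlocked.Exchange(ref _count, 0);
            Console.WriteLine("{0:u} first-chance exceptions/min: {1}", DateTime.UtcNow, n);
        }, null, TimeSpan.FromMinutes(1), TimeSpan.FromMinutes(1));
    }
}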
III. Summary
This production incident came down to heavy peak-hour traffic: for some reason the socket connections were closed, which generated a massive number of async-callback exceptions.
The remedy is to throttle on the calling side. According to my friend, after trimming unnecessary calls to https://api.xxxx/peer_messages, the CPU spikes have not recurred.
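One simple way to implement that kind of client-side throttling (an illustrative sketch, not my friend's actual change; the endpoint and concurrency limit are assumptions) is to gate in-flight requests with a SemaphoreSlim:

using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical throttle: a SemaphoreSlim caps the number of in-flight
// calls so a slow or failing endpoint can't flood the process with
// thousands of doomed async requests.
class ThrottledClient
{
    // Allow at most 16 concurrent requests; tune to what the endpoint can take.
    private static readonly SemaphoreSlim Gate = new SemaphoreSlim(16);
    private static readonly HttpClient Http = new HttpClient();

    public static async Task<string> PostMessageAsync(string payload)
    {
        await Gate.WaitAsync();
        try
        {
            var response = await Http.PostAsync(
                "https://api.xxxx/peer_messages",
                new StringContent(payload));
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
        finally
        {
            Gate.Release();
        }
    }
}

Capping concurrency means a degraded dependency queues callers up instead of fanning out into an exception storm.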