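Problem Description
After the operating system has been running continuously for 208.5 days, an overflow in the sched_clock() counter causes a kernel panic. If kdump is configured, the panic triggers the kdump mechanism and the system reboots automatically.
Some processes may also log errors similar to the following: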
BUG: soft lockup - CPU#N stuck for 4278190091s!
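The 208.5-day figure can be reproduced with a little arithmetic. The sketch below is illustrative only: it assumes the overflow occurs in a 64-bit intermediate value that holds the nanosecond count shifted left by a 10-bit fixed-point scale factor. Neither number is stated in this article; they are assumptions chosen because they match the uptime reported here.

```python
# Assumptions (not stated in the article): a 64-bit intermediate value and
# a 10-bit fixed-point scale shift in the cycles-to-nanoseconds conversion.
WORD_BITS = 64
SCALE_SHIFT = 10

def overflow_uptime_days(word_bits=WORD_BITS, scale_shift=SCALE_SHIFT):
    """Return the uptime, in days, at which (ns << scale_shift) no longer fits in word_bits."""
    max_ns = 2 ** (word_bits - scale_shift)  # largest nanosecond count before overflow
    seconds = max_ns / 1e9                   # nanoseconds -> seconds
    return seconds / 86400                   # seconds -> days

if __name__ == "__main__":
    print(f"{overflow_uptime_days():.1f} days")
```

Under these assumptions the limit is 2^54 nanoseconds, which rounds to 208.5 days.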
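Cause
The problem is related to the Time Stamp Counter (TSC) clock source: systems that use the TSC clock source can trigger this bug.
The following command shows which clock source the system is currently using: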
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
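To check many hosts, the same test can be scripted. The following is a minimal sketch, not a vendor tool: it reads the sysfs file shown above together with /proc/uptime and reports how far the host is from the 208.5-day mark. The function names and the THRESHOLD_DAYS constant are hypothetical names introduced for this example. The reported clock source is only an indicator, because the sched_clock() code involved may use the TSC regardless of which clock source is selected.

```python
from pathlib import Path

CLOCKSOURCE_PATH = "/sys/devices/system/clocksource/clocksource0/current_clocksource"
UPTIME_PATH = "/proc/uptime"
THRESHOLD_DAYS = 208.5  # uptime at which the article says the overflow occurs

def read_clocksource(path=CLOCKSOURCE_PATH):
    """Return the name of the clock source currently in use, e.g. 'tsc'."""
    return Path(path).read_text().strip()

def read_uptime_days(path=UPTIME_PATH):
    """Return system uptime in days; the first field of /proc/uptime is seconds."""
    first_field = Path(path).read_text().split()[0]
    return float(first_field) / 86400

def days_until_threshold(uptime_days, threshold_days=THRESHOLD_DAYS):
    """Return days remaining before the threshold (negative if already past it)."""
    return threshold_days - uptime_days

if __name__ == "__main__":
    source = read_clocksource()
    uptime = read_uptime_days()
    print(f"clock source: {source}")
    print(f"uptime: {uptime:.1f} days")
    print(f"days until {THRESHOLD_DAYS}-day mark: {days_until_threshold(uptime):.1f}")
```

A negative last value means the host has already run past the threshold.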
Red Hat's official description is as follows:
An insufficiently designed calculation in the CPU accelerator in the previous kernel caused an arithmetic overflow in the sched_clock() function
This overflow led to a kernel panic or any other unpredictable trouble on the systems using the TSC clock source
This problem will occur only when system uptime reaches or exceeds 208.5 days
This update corrects the aforementioned calculation so that this arithmetic overflow and kernel panic can no longer occur under these circumstances
On RHEL5, this problem is a timing issue and is very unlikely to be encountered.
Switching to another clocksource is usually not a workaround for most workloads
The TSC is a fast access clock, whereas the HPET and PMTimer are both slow access clocks
Using notsc would be a significant performance hit
In RHEL5, the affected sched_clock() uses the TSC regardless of clock source selection.
Also, in some situations, the system may hit this issue even if you set notsc to current_clocksource.
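Solution
This is a bug in the Red Hat kernel. Contact Red Hat to obtain the kernel update (updates are available only to customers with a paid license), then upgrade the kernel to resolve the problem.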