A wee while ago (yes, I’m catching up on things I’d hoped to blog about for a while!), I had a problem with my home PC. This culminated in a post to the Ubuntu Forums.
General stability of this machine is great – it’s normally on for weeks at a time serving the family’s various document/web/email/printing needs – and has done this for about four years, with the only major hardware changes being a new 7600GT graphics card (most recently, about 12 months ago) and a new Socket 478 P4 Extreme Edition CPU (about 18 months ago).
So, what do you guys think? Hardware or software? And how do I troubleshoot this one further? (BTW, I’ve been a Linux user for about 8 years now, so I’m not really a guru and definitely not a noob. Perhaps more of a goob. 😀 )
Basically, I had an issue where, whenever I’d do some ‘heavy lifting’ tasks – like audio or video encoding – the app would just disappear. Very odd it was – I tried all sorts of things to fix it: new Linux distros, replacement RAM, etc.
Starting the processes from the command line, I was able to see that the app termination was actually a segfault – which I subsequently found in the dmesg log. On top of that, two other distros (Debian lenny and a Fedora 10 live CD) gave errors in dmesg about the CPU overheating:
Turned out to be the CPU overheating. Interestingly, there was nothing in dmesg about the CPU overheating – though, when I had Debian on, it did show messages about that – and, when I booted into the Fedora 10 Live CD, it also complained about the CPU overheating in dmesg.
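For anyone chasing a similar problem, here is a minimal Python sketch of the kind of dmesg check described above: it sorts kernel log lines into "segfault" and "thermal" buckets. The keyword patterns and the function names (`classify_kernel_lines`, `read_dmesg`) are my own assumptions for illustration; the exact wording of kernel messages differs between kernel versions and distros, so treat the patterns as a starting point rather than a definitive list.

```python
import re
import subprocess

# Assumed keywords, chosen for illustration only. Real kernel messages
# vary by kernel version and distro, so adjust these to what you see.
PATTERNS = {
    "segfault": re.compile(r"segfault", re.IGNORECASE),
    "thermal": re.compile(r"temperature|thermal|overheat|throttl", re.IGNORECASE),
}


def classify_kernel_lines(lines):
    """Group log lines by the first pattern they match; ignore the rest."""
    hits = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
                break  # count each line once
    return hits


def read_dmesg():
    """Return the current kernel log as a list of lines.

    Assumes the `dmesg` command is available and readable by the
    current user (some systems restrict it to root).
    """
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()
```

Calling `classify_kernel_lines(read_dmesg())` right after an app vanishes should show whether the kernel logged a segfault, a thermal warning, both, or neither.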
So, to solve the problem, I transplanted the guts of my box into a new case which breathes better and also used the correct heatsink for my CPU (one with a copper core).
The problem was that I was using the same case and heatsink from my old P4 2.8GHz, which weren’t cutting it with the new P4EE 3.4GHz and the amount of heat it generates.
Once the correct heatsink and better case with more efficient thermal dynamics were in place, the differences in internal temperature were quite remarkable:
If anyone is interested, here are some temps from lm-sensors that show the difference in internal temps between the two cases and heatsinks. These are both just at system idle with no load.
Before
SDA: 37C | SDB: 34C | GPU: 57C | CPU: 40C
After
SDA: 33C | SDB: 28C | GPU: 40C | CPU: 23C
Under load, the CPU was getting to around 70C; now it’s able to stay around 57C – and with no segfaulting going on! Yaay!
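To put numbers on the idle improvement, here is a small Python sketch that parses the two "Before" and "After" lines above and works out how much each component cooled down. The helper names (`parse_readings`, `temperature_drop`) are hypothetical, and the parser assumes the exact `NAME: 37C | NAME: 34C` layout used in this post rather than any real lm-sensors output format.

```python
def parse_readings(line):
    """Turn 'SDA: 37C | SDB: 34C' into {'SDA': 37, 'SDB': 34}."""
    readings = {}
    for field in line.split("|"):
        name, value = field.split(":")
        # Drop surrounding spaces and the trailing 'C' unit marker.
        readings[name.strip()] = int(value.strip().rstrip("C"))
    return readings


def temperature_drop(before, after):
    """Degrees Celsius each component cooled by (before minus after)."""
    return {name: before[name] - after[name] for name in before}


# Idle readings copied from the post.
before = parse_readings("SDA: 37C | SDB: 34C | GPU: 57C | CPU: 40C")
after = parse_readings("SDA: 33C | SDB: 28C | GPU: 40C | CPU: 23C")

drops = temperature_drop(before, after)
for name, drop in drops.items():
    print(f"{name}: {drop}C cooler at idle")
```

On those figures, the drives cooled by 4C and 6C, while the GPU and CPU each dropped by 17C at idle.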
The rather cool thing – from my point of view anyway – is that Windows would merely have blue screened under the same circumstances (or just rebooted as the default blue screen setting dictates). Obviously that would make things much harder to troubleshoot.
So Linux dealt with the overheating by terminating the offending process. A much more elegant way of handling things – don’t you think?