Disk troubles
Well, today we had some rather unpleasant news. One of the webpages on http://hants.lug.org.uk was returning:
Creating cache file (20080331.224307.b25e9866.en.html:Read-only file system):
We could still log into the box, but on typing “dmesg” to see the kernel messages we were greeted by pages of:
EXT3-fs error (device ide0(3,1)) in start_transaction: Journal has aborted
Sure enough, / had been remounted read-only (NB: “cat /proc/mounts” will say read-only, but “mount”, which reads /etc/mtab, will be outdated and hence incorrect). This normally happens when the kernel hits a serious filesystem error - that’s why /etc/fstab normally says:
/dev/hda1 / ext3 errors=remount-ro 0 1
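As a rough illustration of the /proc/mounts check, here is a minimal Python sketch that lists the mount points the kernel currently reports as read-only. The function name read_only_mounts and the sample text are made up for this example; on a real box you would pass it the contents of /proc/mounts rather than the inline sample.

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose option list contains the 'ro' flag.

    mounts_text is expected to be in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3]
        # Compare whole options, so "errors=remount-ro" is not mistaken for "ro".
        if "ro" in options.split(","):
            result.append(mount_point)
    return result


# Hypothetical sample resembling what the kernel might report after the remount.
sample = (
    "/dev/root / ext3 ro,data=ordered 0 0\n"
    "proc /proc proc rw,nosuid,nodev,noexec 0 0\n"
)
print(read_only_mounts(sample))
```

Reading /proc/mounts rather than /etc/mtab matters here because /etc/mtab is an ordinary file on the now read-only filesystem, so it cannot be updated to reflect the remount.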
Next thing to look at was the logs - the last file was:
-rw-r----- 1 root adm 4828 2008-04-01 04:17 daemon.log
There was nothing relevant in the logs. We have backups from 4am and file checksum logs from the same time.
The filesystem may have been remounted read-only but the disk was okay - we could run programs etc. In particular we could run “smartctl -a /dev/hda” which showed something like this (truncated):
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      RAW_VALUE
...
  5 Reallocated_Sector_Ct   0x0033   079   079   005    Pre-fail  526
  9 Power_On_Hours          0x0012   095   095   000    Old_age   36261
196 Reallocated_Event_Count 0x0032   079   079   000    Old_age   513
...
Error 7 occurred at disk power-on lifetime: 36245 hours (1510 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 59 08 2f 14 01 e0  Error: UNC at LBA = 0x0001142f = 70703

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  24 00 08 2f 14 01 e0 00  21d+16:07:34.100  READ SECTOR(S) EXT
  34 00 08 08 ec 2a e0 00  21d+16:07:34.100  WRITE SECTORS(S) EXT
  34 00 08 bf 1a 08 e0 00  21d+16:07:34.100  WRITE SECTORS(S) EXT
  34 00 08 b7 2f 23 e0 00  21d+16:07:34.000  WRITE SECTORS(S) EXT
  34 00 28 07 66 19 e0 00  21d+16:07:34.000  WRITE SECTORS(S) EXT
If you compare the current Power_On_Hours value (36261) with the power-on time of that last error (36245), you can see that the disk failed to read a sector (UNC, an uncorrectable error, straight after a run of writes) 16 hours ago - which is 4am +/- 1 hour. That agrees with the logs.
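The arithmetic behind that estimate can be checked with a few lines of Python. This is only a sketch: the function names are invented for the example, the 20:00 check time is a hypothetical value chosen to match "16 hours after 4am", and it assumes the drive stayed powered on between the error and the smartctl run. The +/- 1 hour comes from the counters being whole hours.

```python
from datetime import datetime, timedelta

POWER_ON_HOURS_NOW = 36261       # raw value of attribute 9 in the smartctl output
POWER_ON_HOURS_AT_ERROR = 36245  # "Error 7 occurred at disk power-on lifetime"


def hours_since_error(now_hours, error_hours):
    """Power-on hours elapsed between the logged error and the smartctl run."""
    return now_hours - error_hours


def error_window(checked_at, now_hours, error_hours):
    """Return (earliest, latest) wall-clock estimates for the error.

    Both counters are whole hours, so the estimate is only good to +/- 1 hour.
    """
    centre = checked_at - timedelta(hours=hours_since_error(now_hours, error_hours))
    return centre - timedelta(hours=1), centre + timedelta(hours=1)


# Hypothetical time at which smartctl was run (not stated in the post).
checked_at = datetime(2008, 4, 1, 20, 0)

print(hours_since_error(POWER_ON_HOURS_NOW, POWER_ON_HOURS_AT_ERROR))  # 16
print(error_window(checked_at, POWER_ON_HOURS_NOW, POWER_ON_HOURS_AT_ERROR))

# Cross-checks against the figures smartctl printed itself:
print(divmod(POWER_ON_HOURS_AT_ERROR, 24))  # (1510, 5) -> 1510 days + 5 hours
print(int("0001142f", 16))                  # 70703, the failing LBA
```

With those inputs the window runs from 03:00 to 05:00 on 1 April, which brackets the 04:17 timestamp on the last log file.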
So we have good confirmation now that there was a disk error (no RAID unfortunately) which corrupted the ext3 journal and caused the filesystem to be mounted read-only.
I wondered if we should remove the ext3 journal or leave it, but a bit of googling for this error shows that most people remove it first, so we do that:
- tune2fs -O ^has_journal /dev/hda1 (removes journal)
- (reboot, hope fsck doesn’t ask “are you sure?”)
  - we should have set “FSCKFIX=yes” in /etc/default/rcS, but as the filesystem is read-only we can’t
  - (NB: do _not_ set FSCKFIX=yes if you use reiserfs, as it then forces a fsck, IIRC)
- tune2fs -j /dev/hda1 (adds journal back in)
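For reference, the sequence above can be written down as a dry-run plan in Python. This sketch deliberately executes nothing; it only builds and prints the commands, since running them against the wrong device would be destructive. The function name journal_rebuild_plan is hypothetical, and the middle step simply stands in for "reboot and let the boot-time fsck run".

```python
import shlex


def journal_rebuild_plan(device):
    """Return the steps from the list above as argument lists, without running them."""
    return [
        ["tune2fs", "-O", "^has_journal", device],  # remove the ext3 journal
        ["reboot"],                                  # boot-time fsck checks the filesystem
        ["tune2fs", "-j", device],                   # add a fresh journal back
    ]


# Print the plan in a form that is safe to paste into a shell.
for argv in journal_rebuild_plan("/dev/hda1"):
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Keeping each command as an argument list avoids any shell quoting surprises with the ^ in ^has_journal, and shlex.quote takes care of it when the plan is printed.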
Posted: April 1st, 2008 under HantsLUG, Linux.
Comments: 1
Comments
Comment from Andy Smith
Time: Wednesday 2 April, 2008, 08:43
The offer of a free VPS for Hants LUG is still there btw.
Cheers,
Andy