Disk troubles

Well, today we had some rather unpleasant news. One of the web pages on http://hants.lug.org.uk was returning:

Creating cache file (20080331.224307.b25e9866.en.html:Read-only file system):

We could still log into the box, but on typing “dmesg” to see the kernel messages we were greeted by pages of:

EXT3-fs error (device ide0(3,1)) in start_transaction: Journal has aborted

Sure enough, / had been remounted read-only (NB: “cat /proc/mounts” will say read-only, but “mount”, which reads /etc/mtab, will be out of date and hence incorrect). This normally happens when the kernel hits a serious filesystem error - that’s why /etc/fstab normally says:

/dev/hda1	/		ext3	errors=remount-ro	0	1
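Since “mount” can't be trusted here, a check like this has to read /proc/mounts directly. Below is a minimal Python sketch of that idea; the function names and the sample text are made up for illustration (the sample is not a capture from the affected box), and it assumes the usual whitespace-separated /proc/mounts layout of device, mount point, type, options.

```python
def mount_options(proc_mounts_text, mount_point="/"):
    """Return the option list of the last /proc/mounts entry for mount_point.

    The root filesystem can appear more than once (e.g. a rootfs entry
    followed by the real device), so the last matching entry wins.
    Returns None if the mount point is not listed.
    """
    options = None
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")
    return options


def is_read_only(proc_mounts_text, mount_point="/"):
    """True if the kernel currently has mount_point mounted read-only."""
    options = mount_options(proc_mounts_text, mount_point)
    return options is not None and "ro" in options


# Illustrative sample only (assumption), not taken from the real machine.
SAMPLE = """\
rootfs / rootfs rw 0 0
/dev/hda1 / ext3 ro,data=ordered 0 0
proc /proc proc rw 0 0
"""

print("/ read-only:", is_read_only(SAMPLE))
```

On a live system the text would come from open("/proc/mounts").read() rather than from the sample string.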

The next thing to look at was the logs - the last file written to was:

  -rw-r-----  1 root        adm        4828 2008-04-01 04:17 daemon.log

There was nothing relevant in the logs. We have backups from 4am and file checksum logs from the same time.

The filesystem may have been remounted read-only but the disk was still readable - we could run programs etc. In particular we could run “smartctl -a /dev/hda” which showed something like this (truncated):

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE       RAW_VALUE
...
  5 Reallocated_Sector_Ct   0x0033   079   079   005    Pre-fail    526
  9 Power_On_Hours          0x0012   095   095   000    Old_age     36261
196 Reallocated_Event_Count 0x0032   079   079   000    Old_age     513
...
Error 7 occurred at disk power-on lifetime: 36245 hours (1510 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 59 08 2f 14 01 e0  Error: UNC at LBA = 0x0001142f = 70703

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  24 00 08 2f 14 01 e0 00  21d+16:07:34.100  READ SECTOR(S) EXT
  34 00 08 08 ec 2a e0 00  21d+16:07:34.100  WRITE SECTORS(S) EXT
  34 00 08 bf 1a 08 e0 00  21d+16:07:34.100  WRITE SECTORS(S) EXT
  34 00 08 b7 2f 23 e0 00  21d+16:07:34.000  WRITE SECTORS(S) EXT
  34 00 28 07 66 19 e0 00  21d+16:07:34.000  WRITE SECTORS(S) EXT
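
As an aside, the LBA on the “Error: UNC” line can be rebuilt from the register dump next to it. The sketch below assumes the usual 28-bit layout (SN holds the low byte, then CL, then CH, with the low nibble of DH on top) and that bit 6 of the error register is the UNC flag; the function name is made up for illustration.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine ATA task-file registers into a 28-bit LBA.

    Assumed layout: SN = bits 0-7, CL = bits 8-15, CH = bits 16-23,
    low nibble of DH = bits 24-27.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values copied from the error entry above.
NAMES = ("ER", "ST", "SC", "SN", "CL", "CH", "DH")
regs = dict(zip(NAMES, (int(v, 16) for v in "40 59 08 2f 14 01 e0".split())))

lba = lba28_from_registers(regs["SN"], regs["CL"], regs["CH"], regs["DH"])
uncorrectable = bool(regs["ER"] & 0x40)  # bit 6 of the error register = UNC

print(f"UNC={uncorrectable} LBA = 0x{lba:08x} = {lba}")
```

This gives the same 0x0001142f = 70703 that smartctl printed.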

If you compare the current Power_On_Hours (36261) with the time of that last error (36245), you can see that the disk hit an uncorrectable read error, right after a burst of writes, 16 hours ago - which is 4am +/- 1 hour. That agrees with the logs.
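
The arithmetic, spelled out as a small Python sketch. The time smartctl was run is an assumption (20:00 on 1 April 2008, which isn't stated above); the hour counts and the 04:17 daemon.log timestamp are the ones quoted earlier. Power_On_Hours only counts whole hours, hence the one-hour slack either side.

```python
from datetime import datetime, timedelta


def error_time_window(now, power_on_hours_now, power_on_hours_at_error):
    """Estimate when an error happened from two power-on-hour counters.

    Returns (hours_ago, earliest, latest), allowing +/- 1 hour because
    the counters have whole-hour resolution.
    """
    hours_ago = power_on_hours_now - power_on_hours_at_error
    centre = now - timedelta(hours=hours_ago)
    return hours_ago, centre - timedelta(hours=1), centre + timedelta(hours=1)


# ASSUMPTION: smartctl was run at about 20:00 on 1 April 2008.
now = datetime(2008, 4, 1, 20, 0)
hours_ago, earliest, latest = error_time_window(now, 36261, 36245)

# Last write to daemon.log, from the directory listing quoted earlier.
last_log_write = datetime(2008, 4, 1, 4, 17)

print(f"error was {hours_ago} hours ago, between {earliest:%H:%M} and {latest:%H:%M}")
print("last log write inside window:", earliest <= last_log_write <= latest)
print("lifetime at error (days, hours):", divmod(36245, 24))
```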

So we now have good confirmation that there was a disk error (no RAID unfortunately) which caused the ext3 journal to abort and the filesystem to be remounted read-only.

I wondered if we should remove the ext3 journal or leave it, but a bit of googling for this error shows that most people remove it first, so that is the plan:

  • tune2fs -O ^has_journal /dev/hda1 (removes journal)
  • (reboot, hope fsck doesn’t ask “are you sure?”)
    • we should have set “FSCKFIX=yes” in /etc/default/rcS but as the filesystem is read-only we can’t
    • (NB: do _not_ set FSCKFIX=yes if you use reiserfs, as IIRC it then forces a fsck)
  • tune2fs -j /dev/hda1 (adds journal back in)
We are, however, waiting until tomorrow to do this so that we have remote support available just in case.

Comments

Comment from Andy Smith
Time: Wednesday 2 April, 2008, 08:43

The offer of a free VPS for Hants LUG is still there btw.

Cheers,
Andy