
Recovery of store5 corruption

Date: 2023-10-08
store5 was a single-vdev, single-drive zpool on a Proxmox hypervisor, used to provision KVM volumes. The single drive was an SMR 5TB Seagate Barracuda 2.5” at 5400 rpm. OpenZFS detected uncorrectable I/O failures on the drive and suspended the zpool. The following journals the recovery process.
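For context, a minimal sketch of how a pool and guest volume like this might be provisioned. The by-id path comes from the kernel logs below, and the dataset layout follows the file paths that appear later in this journal; the 100G image size is an assumption, not the original command history:
# single-drive pool, no redundancy (illustrative, not the original commands)
zpool create store5 /dev/disk/by-id/ata-ST5000LM000-2AN170_WCXXXXD6
# dataset and directory layout matching the paths seen in the error list later
zfs create -p store5/data
mkdir -p /store5/data/vm/raw/images/102
# raw image for VM 102 (size assumed), later attached to the Proxmox guest
qemu-img create -f raw /store5/data/vm/raw/images/102/vm-102-disk-0.raw 100G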

How the issue was detected: a KVM guest using the disk hung

Under normal KVM usage there were kernel errors regarding store5, and the related I/O stalls caused the KVM guest to hang waiting for I/O from that zpool:
Oct 08 15:57:03 viper kernel: WARNING: Pool 'store5' has encountered an uncorrectable I/O failure and has been suspended.
Oct 08 15:59:34 viper kernel: INFO: task kvm:405360 blocked for more than 120 seconds.
During the monthly scrub, store5 ran into issues towards the end:
Oct 08 07:15:20 viper kernel: sd 11:0:5:0: attempting task abort!scmd(0x000000006f922384), outstanding for 31500 ms & timeout 30000 ms
Oct 08 07:15:20 viper kernel: sd 11:0:5:0: [sdk] tag#257 CDB: Read(16) 88 00 00 00 00 02 3e f2 c6 f8 00 00 08 00 00 00
Oct 08 07:15:20 viper kernel: scsi target11:0:5: handle(0x001f), sas_address(0x4433221110000000), phy(16)
Oct 08 07:15:20 viper kernel: scsi target11:0:5: enclosure logical id(0x500062b200f20aa0), slot(11)
Oct 08 07:15:20 viper kernel: scsi target11:0:5: enclosure level(0x0000), connector name( )
Oct 08 07:15:24 viper kernel: sd 11:0:5:0: task abort: SUCCESS scmd(0x000000006f922384)
Oct 08 07:15:24 viper kernel: sd 11:0:5:0: [sdk] tag#257 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=35s
Oct 08 07:15:24 viper kernel: sd 11:0:5:0: [sdk] tag#257 CDB: Read(16) 88 00 00 00 00 02 3e f2 c6 f8 00 00 08 00 00 00
Oct 08 07:15:24 viper kernel: I/O error, dev sdk, sector 9646032632 op 0x0:(READ) flags 0x700 phys_seg 8 prio class 2
Oct 08 07:15:24 viper kernel: zio pool=store5 vdev=/dev/disk/by-id/ata-ST5000LM000-2AN170_WCXXXXD6-part1 error=5 type=1 offset=4938767659008 size=1048576 flags=40080cb0
Oct 08 07:15:24 viper kernel: sd 11:0:5:0: attempting task abort!scmd(0x00000000b44a3020), outstanding for 33520 ms & timeout 30000 ms
Oct 08 07:15:24 viper kernel: sd 11:0:5:0: [sdk] tag#271 CDB: Read(16) 88 00 00 00 00 02 3e f2 ce f8 00 00 08 00 00 00
Oct 08 07:15:24 viper kernel: scsi target11:0:5: handle(0x001f), sas_address(0x4433221110000000), phy(16)
Oct 08 07:15:24 viper kernel: scsi target11:0:5: enclosure logical id(0x500062b200f20aa0), slot(11)
Oct 08 07:15:24 viper kernel: scsi target11:0:5: enclosure level(0x0000), connector name( )
Oct 08 07:15:24 viper kernel: sd 11:0:5:0: No reference found at driver, assuming scmd(0x00000000b44a3020) might have completed
Oct 08 07:15:24 viper kernel: sd 11:0:5:0: task abort: SUCCESS scmd(0x00000000b44a3020)
Oct 08 07:15:24 viper kernel: sd 11:0:5:0: [sdk] tag#271 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=33s
Oct 08 07:15:24 viper kernel: sd 11:0:5:0: [sdk] tag#271 CDB: Read(16) 88 00 00 00 00 02 3e f2 ce f8 00 00 08 00 00 00
...
...
Oct 08 07:15:25 viper kernel: I/O error, dev sdk, sector 9646040824 op 0x0:(READ) flags 0x700 phys_seg 8 prio class 2
Oct 08 07:15:25 viper kernel: zio pool=store5 vdev=/dev/disk/by-id/ata-ST5000LM000-2AN170_WCXXXXD6-part1 error=5 type=1 offset=4938771853312 size=1048576 flags=40080cb0
Oct 08 07:15:25 viper kernel: WARNING: Pool 'store5' has encountered an uncorrectable I/O failure and has been suspended.
Oct 08 07:16:01 viper kernel: sd 11:0:5:0: device_block, handle(0x001f)
Oct 08 07:16:03 viper kernel: sd 11:0:5:0: device_unblock and setting to running, handle(0x001f)
Oct 08 07:16:03 viper kernel: sd 11:0:5:0: [sdk] Synchronizing SCSI cache
Oct 08 07:16:03 viper kernel: sd 11:0:5:0: [sdk] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Oct 08 07:16:03 viper kernel: mpt3sas_cm1: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221110000000)
Oct 08 07:16:03 viper kernel: mpt3sas_cm1: removing handle(0x001f), sas_addr(0x4433221110000000)
Oct 08 07:16:03 viper kernel: mpt3sas_cm1: enclosure logical id(0x500062b200f20aa0), slot(11)
Oct 08 07:16:03 viper kernel: mpt3sas_cm1: enclosure level(0x0000), connector name( )
Oct 08 07:17:58 viper kernel: INFO: task txg_sync:10274 blocked for more than 120 seconds.
...
...
Oct 08 07:28:42 viper kernel: mpt3sas_cm1: handle(0x1f) sas_address(0x4433221110000000) port_type(0x1)
Oct 08 07:28:47 viper kernel: scsi 11:0:8:0: Direct-Access ATA ST5000LM000-2AN1 0001 PQ: 0 ANSI: 6
Oct 08 07:28:47 viper kernel: scsi 11:0:8:0: SATA: handle(0x001f), sas_addr(0x4433221110000000), phy(16), device_name(0x0000000000000000)
Oct 08 07:28:47 viper kernel: scsi 11:0:8:0: enclosure logical id (0x500062b200f20aa0), slot(11)
Oct 08 07:28:47 viper kernel: scsi 11:0:8:0: enclosure level(0x0000), connector name( )
Oct 08 07:28:47 viper kernel: scsi 11:0:8:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Oct 08 07:28:47 viper kernel: scsi 11:0:8:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
Oct 08 07:28:47 viper kernel: sd 11:0:8:0: Attached scsi generic sg10 type 0
Oct 08 07:28:47 viper kernel: sd 11:0:8:0: Power-on or device reset occurred
Oct 08 07:28:47 viper kernel: end_device-11:8: add: handle(0x001f), sas_addr(0x4433221110000000)
Oct 08 07:28:52 viper kernel: sd 11:0:8:0: [sdo] 9767541168 512-byte logical blocks: (5.00 TB/4.55 TiB)
Oct 08 07:28:52 viper kernel: sd 11:0:8:0: [sdo] 4096-byte physical blocks
Oct 08 07:28:53 viper kernel: sd 11:0:8:0: [sdo] Write Protect is off
Oct 08 07:28:53 viper kernel: sd 11:0:8:0: [sdo] Mode Sense: 9b 00 10 08
Oct 08 07:28:53 viper kernel: sd 11:0:8:0: [sdo] Write cache: enabled, read cache: enabled, supports DPO and FUA
Oct 08 07:28:54 viper kernel: sdo: sdo1 sdo9
Oct 08 07:28:54 viper kernel: sd 11:0:8:0: [sdo] Attached SCSI disk
From the log it is possible to see when ZFS encountered read errors during the scrub, and also when the disk disappeared from and reappeared on the SAS bus (re-enumerating from sdk to sdo), suggesting a hardware fault.
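The excerpts above can be pulled out of the journal with a kernel-only query along these lines (the time window and grep pattern are assumptions about what to match, not the exact command used):
# kernel messages around the scrub failure, filtered to the disk and pool
journalctl -k --since "2023-10-08 07:00" --until "2023-10-08 08:00" \
    | grep -E 'sdk|sdo|store5|11:0:5:0'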
At this point no alerts had been sent from the node.
This is how I found the zpool when inspecting it around 16:10 UTC:
root@viper:~# zpool status store5 store5-backup
  pool: store5
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-HC
  scan: scrub in progress since Sun Oct  8 00:24:38 2023
        2.29T scanned at 0B/s, 2.27T issued at 41.3M/s, 2.29T total
        0B repaired, 98.93% done, 00:10:23 to go
config:

        NAME                               STATE     READ WRITE CKSUM
        store5                             ONLINE       0     0     0
          ata-ST5000LM000-2AN170_WCXXXXD6  ONLINE      10     0    48

errors: List of errors unavailable: pool I/O is currently suspended

errors: 67 data errors, use '-v' for a list

# store5-backup as comparison

  pool: store5-backup
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 04:59:44 with 0 errors on Mon Mar 13 02:42:57 2023
config:

        NAME                               STATE     READ WRITE CKSUM
        store5-backup                      ONLINE       0     0     0
          ata-ST5000LM000-2U8170_WCXXXX30  ONLINE       0     0     0

errors: No known data errors

Why didn’t zed raise the alert on the suspended pool?

Only once I cleared the errors on the store5 pool did zed send its first email:
2023-10-08 16:33:33 zpool clear store5
(screenshot: first alert email from zed)
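One thing worth checking when zed stays quiet like this is its notification configuration in /etc/zfs/zed.d/zed.rc, since the defaults rate-limit repeat events. The values below are illustrative, not my actual config:
# /etc/zfs/zed.d/zed.rc -- settings that govern email alerts (example values)
ZED_EMAIL_ADDR="root"          # where notifications go
ZED_EMAIL_PROG="mail"          # mailer invoked by the notify scripts
ZED_NOTIFY_INTERVAL_SECS=3600  # minimum interval between emails for the same event class
ZED_NOTIFY_VERBOSE=1           # also mail events that complete without errors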
After the clear, the pool entered the following state:
root@viper:~# zpool status store5
  pool: store5
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-HC
  scan: scrub in progress since Sun Oct  8 00:24:38 2023
        2.29T scanned at 0B/s, 2.27T issued at 41.0M/s, 2.29T total
        0B repaired, 98.93% done, 00:10:28 to go
config:

        NAME                               STATE     READ WRITE CKSUM
        store5                             UNAVAIL      0     0     0  insufficient replicas
          ata-ST5000LM000-2AN170_WCXXXXD6  FAULTED      1     2     0  too many errors

errors: List of errors unavailable: pool I/O is currently suspended

errors: 67 data errors, use '-v' for a list
Around the same time the pool's disk became unresponsive and disappeared from the kernel partition list. This may or may not have been triggered by some smartctl --all requests to the disk, but I think it was more likely the clearing of the zpool errors.
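For reference, the kind of SMART queries involved; the device node is illustrative, since by this point the drive had re-enumerated and the node may differ:
# full SMART report: identity, health, attributes and error log
smartctl --all /dev/sdk
# lighter-weight checks that avoid pulling the full report
smartctl -H /dev/sdk          # overall health self-assessment only
smartctl -l error /dev/sdk    # ATA error log only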
I went to the physical machine and used sas3ircu to illuminate the drive bay holding the faulting disk (sketched below). I pulled the disk, verified the serial number, and re-inserted it.
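A sketch of the sas3ircu invocations; the controller index and the enclosure:bay pair are assumptions (the kernel logs above point at slot 11):
sas3ircu LIST                  # enumerate controllers and their indices
sas3ircu 1 DISPLAY             # map drives to enclosure:bay and serial numbers
sas3ircu 1 LOCATE 2:11 ON      # light the bay LED (enclosure:bay assumed)
sas3ircu 1 LOCATE 2:11 OFF     # turn the LED off again after the swap
After re-seating the drive I cleared the zpool errors again, which gave the following status: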
root@viper:~# zpool status -v store5
  pool: store5
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub paused since Sun Oct  8 16:54:23 2023
        scrub started on Sun Oct  8 00:24:38 2023
        2.29T scanned, 2.27T issued, 2.29T total
        0B repaired, 98.95% done
config:

        NAME                               STATE     READ WRITE CKSUM
        store5                             ONLINE       0     0     0
          ata-ST5000LM000-2AN170_WCXXXXD6  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        <metadata>:<0x1>
        <metadata>:<0x34>
        <metadata>:<0x84>
        <metadata>:<0x87>
        <metadata>:<0xb2>
        <metadata>:<0x12c8>
        /store5/data/vm/raw/
        /store5/data/vm/raw/images/102/vm-102-disk-0.raw
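Entries of the form <metadata>:<0x...> refer to objects in the pool's internal metadata rather than regular files; with no redundancy in a single-drive pool, ZFS cannot rebuild them, which is why the action text points at restoring from backup. Assuming store5-backup holds replicated snapshots of the affected dataset (an assumption about my snapshot layout; the @latest snapshot name is hypothetical), a restore would look roughly like:
# rough restore sketch, assuming store5-backup carries snapshots of store5/data
zfs list -t snapshot -r store5-backup          # find the latest good snapshot
zfs send -R store5-backup/data@latest \
    | zfs receive -F store5/data               # -F rolls back the damaged dataset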