store5 was a single-vdev, single-drive zpool on a Proxmox hypervisor, used to provision kvm volumes. The single drive was an SMR 5TB Seagate Barracuda 2.5” 5600 rpm. OpenZFS detected uncorrectable I/O failures on the drive and suspended the zpool. The following journals the recovery process.
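For context, a pool and Proxmox storage entry like this are typically set up along the following lines. The exact commands used for store5 were not recorded, so the device id and storage options below are assumptions:

# Create a single-disk pool from the whole drive (no redundancy), using the stable by-id name
zpool create store5 /dev/disk/by-id/ata-ST5000LM000-2AN170_WCXXXXD6
# Register the pool as a Proxmox storage backend so kvm volumes can be provisioned from it
pvesm add zfspool store5 --pool store5 --content images,rootdir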
How the issue was detected - a kvm using the disk hung
Under normal kvm usage conditions there were kernel errors regarding store5, and related messages showing the kvm task hung waiting for I/O from that zpool:
Oct 08 15:57:03 viper kernel: WARNING: Pool 'store5' has encountered an uncorrectable I/O failure and has been suspended.
Oct 08 15:59:34 viper kernel: INFO: task kvm:405360 blocked for more than 120 seconds.
During the monthly scrub, store5 ran into issues towards the end:
Oct 08 07:15:20 viper kernel: sd 11:0:5:0: attempting task abort!scmd(0x000000006f922384), outstanding for 31500 ms & timeout 30000 ms
Oct 08 07:28:53 viper kernel: sd 11:0:8:0: [sdo] Write cache: enabled, read cache: enabled, supports DPO and FUA
Oct 08 07:28:54 viper kernel: sdo: sdo1 sdo9
Oct 08 07:28:54 viper kernel: sd 11:0:8:0: [sdo] Attached SCSI disk
In the kernel log it is possible to see when ZFS encountered read errors during the scrub, and also when the disk disappeared and reappeared, suggesting a hardware fault.
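One way to pull those messages back out of the kernel log for the scrub window; the device name sdo comes from the excerpts above and the grep pattern is only an example:

journalctl -k --since "2023-10-08 00:00" --until "2023-10-08 08:00" \
    | grep -Ei 'store5|sdo|task abort'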
At this point no alerts had been sent from the node.
This is how I found the zpool when inspecting it around 16:10 UTC:
root@viper:~# zpool status store5 store5-backup
  pool: store5
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
  scan: scrub in progress since Sun Oct 8 00:24:38 2023
        2.29T scanned at 0B/s, 2.27T issued at 41.0M/s, 2.29T total
        0B repaired, 98.93% done, 00:10:28 to go
config:

        NAME                               STATE     READ WRITE CKSUM
        store5                             UNAVAIL      0     0     0  insufficient replicas
          ata-ST5000LM000-2AN170_WCXXXXD6  FAULTED      1     2     0  too many errors

errors: List of errors unavailable: pool I/O is currently suspended
errors: 67 data errors, use '-v' for a list
Around the same time the pool disk became unresponsive and disappeared from the kernel partition list. This may or may not have been triggered by some smartctl --all requests to the disk, but I think it was more likely caused by clearing the zpool errors.
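A couple of quick checks along these lines show whether the kernel can still see the disk (the device name is assumed from the earlier log lines):

# List block devices with serial numbers; a dropped disk simply no longer appears
lsblk -o NAME,SERIAL,SIZE
# Full SMART report for the suspect disk; this can hang or fail outright if the drive has gone away
smartctl --all /dev/sdo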
I went to the physical machine and used sas3ircu LIST and sas3ircu 1 LOCATE to illuminate the drive bay holding the faulting disk. I pulled the disk, verified the serial number and re-inserted it, then cleared the zpool errors again; the rough sequence was as follows.
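The controller number and the enclosure:bay address below are assumptions and have to be read off the LIST and DISPLAY output for the actual chassis:

# Enumerate SAS controllers to find the controller index
sas3ircu LIST
# Show attached devices with enclosure/slot and serial numbers to identify the faulting disk
sas3ircu 1 DISPLAY
# Turn the locate LED on for that bay (enclosure 2, bay 5 is only an example), then off again after the swap
sas3ircu 1 LOCATE 2:5 ON
sas3ircu 1 LOCATE 2:5 OFF
# Clear the error counters and resume I/O on the pool
zpool clear store5

Clearing the errors gave the following status: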
root@viper:~# zpool status -v store5
  pool: store5
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the