store5 was a single-vdev, single-drive zpool on a Proxmox hypervisor, used to provision kvm volumes. The single drive was an SMR 5TB Seagate Barracuda 2.5” 5600 rpm. OpenZFS detected uncorrectable I/O failures on the drive and suspended the zpool. The following journals the recovery process.
How the issue was detected - a kvm using the disk hung
Under normal kvm usage conditions there were kernel errors regarding store5, followed by errors showing the kvm hung waiting for I/O from that zpool:
Oct 08 15:57:03 viper kernel: WARNING: Pool 'store5' has encountered an uncorrectable I/O failure and has been suspended.
Oct 08 15:59:34 viper kernel: INFO: task kvm:405360 blocked for more than 120 seconds.
During the monthly scrub, store5 ran into issues towards the end of the scrub:
Oct 08 07:15:20 viper kernel: sd 11:0:5:0: attempting task abort!scmd(0x000000006f922384), outstanding for 31500 ms & timeout 30000 ms
Oct 08 07:28:53 viper kernel: sd 11:0:8:0: [sdo] Write cache: enabled, read cache: enabled, supports DPO and FUA
Oct 08 07:28:54 viper kernel: sdo: sdo1 sdo9
Oct 08 07:28:54 viper kernel: sd 11:0:8:0: [sdo] Attached SCSI disk
From the kernel log it is possible to see when ZFS encountered read errors during the scrub, and also when the disk disappeared and reappeared, suggesting a hardware fault.
At this point no alerts had been sent from the node.
This is how I found the zpool when inspecting it around 16:10 UTC:
root@viper:~# zpool status store5 store5-backup
  pool: store5
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
  scan: scrub in progress since Sun Oct 8 00:24:38 2023
        2.29T scanned at 0B/s, 2.27T issued at 41.0M/s, 2.29T total
        0B repaired, 98.93% done, 00:10:28 to go
config:

        NAME                               STATE    READ WRITE CKSUM
        store5                             UNAVAIL     0     0     0  insufficient replicas
          ata-ST5000LM000-2AN170_WCXXXXD6  FAULTED     1     2     0  too many errors

errors: List of errors unavailable: pool I/O is currently suspended
errors: 67 data errors, use '-v' for a list
Around the same time the pool disk became unresponsive and disappeared from the kernel partition list. This may or may not have been triggered by some smartctl --all requests to the disk, but I think it was more likely triggered when clearing the zpool errors.
I went to the physical machine and used sas3ircu LIST and sas3ircu 1 LOCATE to illuminate the drive bay with the faulting disk. I pulled the disk, verified the serial number and re-inserted it. Then I cleared the zpool errors again, which gave the following status (the commands used are sketched after the output below):
root@viper:~# zpool status -v store5
  pool: store5
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
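For reference, the locate-and-clear sequence looked roughly like the sketch below; the controller index and the Enclosure:Bay value are illustrative and must be read from the sas3ircu DISPLAY output for the matching serial number:
# list controllers, then display attached drives to find the bay holding serial WCXXXXD6
sas3ircu LIST
sas3ircu 1 DISPLAY | less
# blink the bay LED (Enclosure:Bay 2:5 is an example value), pull/reseat the disk, then stop blinking
sas3ircu 1 LOCATE 2:5 ON
sas3ircu 1 LOCATE 2:5 OFF
# clear the error counters / suspended state so zfs retries I/O on the pool
zpool clear store5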
After re-plugging the drive, smartctl sent some alerts via email:
ID# ATTRIBUTE_NAME RAW_VALUE
197 Current_Pending_Sector 760
198 Offline_Uncorrectable 760
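The same counters can be polled on demand using the stable by-id device name from the pool config; a quick sketch:
# dump SMART attributes for the failing drive and filter the sector-health counters
smartctl --all /dev/disk/by-id/ata-ST5000LM000-2AN170_WCXXXXD6 | grep -E 'Current_Pending_Sector|Offline_Uncorrectable|Reallocated_Sector'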
How old is the failing disk?
The failure occurred 2023-Oct-08. The DOM is 2016-DEC and the drive has recorded ~5.12 years of Power_On_Hours (44895 hours) and ~2.99 years of Head_Flying_Hours (26220 hours).
SMART Attributes for disk WCXXXXD6
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED RAW_VALUE
syncoid reported a ~103.4 GiB delta between the last known good snapshot store5-backup/data@2023-07-24-summer-hol and the reported failing disk store5/data@2023-10-08-ioerrors. This means the ~100 GiB data delta must be either:
Recovered from the failing drive for example with syncoid
OR
Recovered from snapraid parity
After recovery, snapraid check -a -d store5 can be used to verify the checksums of the recovered data/disk.
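A dry-run incremental send is a cheap way to confirm the size of that delta before choosing a recovery path; this is a sketch assuming the snapshot names referenced above:
# -n = dry run, -v = verbose: prints the estimated incremental stream size without streaming any data
zfs send -n -v -i store5/data@2023-07-24-summer-hol store5/data@2023-10-08-ioerrors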
Prerequisite: I need to interrogate the most recent snapraid sync and scrub outcomes on the omv kvm. Assuming they are healthy, I will attempt the following:
✅ Boot the omv kvm and put the failing disk into read-only mode. Then check the relevant logs.
alt: if booting is not possible, the root filesystem of the kvm could be mounted and the logs checked that way.
✅ The latest snapraid sync and scrub were healthy, which means the snapraid parity is healthy and up to date.
✅ Check snapraid status and snapraid diff and take any required actions.
e.g. move any added files to different storage and revert any changes so there are no deltas
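For reference, this pre-flight on the omv kvm is just the two stock commands; interpreting the output is the important part:
# confirm parity age, that no sync is in progress and that no errors are recorded
snapraid status
# list added/removed/updated/moved files since the last sync; the aim is an empty delta before recovery
snapraid diff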
✅ Create a new zpool store5b and seed it with data from store5-backup
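A minimal sketch of that step; the device path and the pool/syncoid options are illustrative assumptions, not the exact values used:
# create the replacement single-drive pool on the new disk (device path is an example)
zpool create -o ashift=12 store5b /dev/disk/by-id/ata-REPLACEMENT_DISK_SERIAL
# seed it from the backup pool up to its newest existing snapshot (includes @2023-07-24-summer-hol)
syncoid --no-sync-snap store5-backup/data store5b/data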
✅ Attempt to create a new snapshot on the failing store5 disk:
zfs snapshot -r store5/data@manual_viper_$(date +%Y-%m-%d)-ioerrors
✅ Attempt to mount (-o ro,norecovery) the store5 and store5b zfs snapshots for the related XFS partitions on viper, share them via smbd, and use a compare tool to see the delta (a mount sketch follows this step's notes).
I was able to produce a list of the filesystem deltas in an HTML report.
I was able to produce a list of the paths of orphans on both sides and of newer files.
I completed these actions to create a manifest of the delta which could be used for troubleshooting and sanity checks later on.
✅ umount the compared filesystems and clean up any smbd cfg etc.
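For the mount step above, the XFS partition inside a zvol snapshot first has to be exposed as a block device; a sketch, assuming store5/data is a zvol and the filesystem sits on partition 1:
# make zvol snapshot devices visible under /dev/zvol (they are hidden by default)
zfs set snapdev=visible store5/data
# the snapshot's partitions then appear as /dev/zvol/store5/data@<snapshot>-part1 etc.
mkdir -p /mnt/store5-snap
mount -t xfs -o ro,norecovery /dev/zvol/store5/data@2023-10-08-ioerrors-part1 /mnt/store5-snap
# ... share /mnt/store5-snap via smbd and run the compare tool, then clean up:
umount /mnt/store5-snap
zfs inherit snapdev store5/data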
✅ Then I will attempt to syncoid the latest snapshot delta between store5/data@2023-07-24-summer-hol ... store5/data@2023-10-08-ioerrors to store5b and see what errors come up (a raw send/receive sketch follows the outcome below). This zfs send block-level approach is preferred over the snapraid filesystem-level approach: it should be more efficient and will throw errors if any block checksums fail as they are read.
Outcome: 103.4 GiB replicated to store5b.
🟢 No errors detected on the zpool, or the syncoid job or the kernel log.
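The raw equivalent of that syncoid job, as a sketch (syncoid wraps roughly this, plus resume and buffering handling):
# stream only the blocks that changed between the last known good snapshot and the post-failure snapshot
zfs send -i store5/data@2023-07-24-summer-hol store5/data@2023-10-08-ioerrors \
  | zfs receive store5b/data
# an unreadable source block aborts the send with an I/O error, which is the desired failure mode here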
✅ Using the mount -o ro,norecovery of the relevant store5 and store5b zfs snapshots for the related XFS partitions on viper, shared via smbd, with a compare tool to check whether the delta was resolved.
🟢 The delta was resolved, at least on the filesystem tree/object level.
✅ umount the compared filesystems and clean up any smbd cfg etc.
✅ If the amount of data corruption is NOT too great, then I will introduce store5b to proxmox storage and configure the kvm to use store5b storage.
It is important to consider that the zfs block stream restored to the store5b zpool means the XFS sub-volumes (a raw GUID partition table in my case) will have UUIDs identical to the data from zpool store5. This means it is important to remove the failing store5 disk from the kvm at the same time as adding the new store5b storage, which avoids having to regenerate the UUIDs. When the kvm boots, because the UUIDs are identical (in theory the GUID partition table and XFS partition(s) are binary identical to the source), no changes will be required in the mounts or snapraid.conf.
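Before booting the kvm against store5b, the matching identifiers can be confirmed from the host; a sketch, assuming the zvol partitions are exposed under /dev/zvol:
# the GPT PARTUUIDs and the XFS filesystem UUID should be identical between the old and new zvols
blkid /dev/zvol/store5/data-part1
blkid /dev/zvol/store5b/data-part1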
✅ Finally: snapraid will be used to verify the integrity (checksums) of the store5b data at the filesystem level: time snapraid check -a -d store5
🟢 The snapraid check operation completed without errors.
✅ As a best practice: remember to delete snapraid.content on the restored disk, so it can be regenerated by snapraid during the next sync. This assumes this is not the only disk that is configured to contain the snapraid.content file.
⏩ If it looks bad, i.e. the send won't work or the majority of the blocks cannot be sent, then the blocks from the failing disk must be ignored.
In that case, introduce store5b to proxmox storage and configure the kvm to use store5b storage, swap store5b for the failing store5 disk in snapraid, use snapraid to recover the missing data, and finally use snapraid to verify the integrity of the restored data (a sketch follows the note below).
⚠ Noting that some files are excluded from snapraid and such files will not be recovered.
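A sketch of that fallback path, assuming the data disk is named store5 in snapraid.conf and its mount point is simply re-pointed at the store5b-backed filesystem:
# after re-pointing the 'store5' data disk at the new (incomplete) filesystem:
snapraid fix -d store5       # rebuild missing/damaged files on that disk from parity and the other disks
snapraid check -a -d store5  # then verify the checksums of the rebuilt files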
⏩ In the worst case, if I have to start store5 from zero, store5 can be recreated from snapraid parity.
⚠ Noting that some files are excluded from snapraid and such files will not be recovered.
✅ Whichever process is used... snapraid will be used to verify the integrity of the restored disk (for the files not excluded by snapraid.conf).
There are also the standalone sfv checksums available in the root of the XFS filesystem, but those shouldn't be required as the checksums are older than the latest known good store5-backup/data@2023-07-24-summer-hol snapshot, i.e. OpenZFS integrity logic should negate the need to rely on standalone checksums. For glacial content it doesn't hurt to run them.
There is also zfs-check (part of zfs-autobackup) which does a zfs block content comparison of two snapshots. It looks like it still doesn't handle sparse data efficiently and would compare all blocks in a given dataset. This method might be worth running as a sanity check, for example between:
store5-backup/data@2023-07-24-summer-hol and store5b/data@2023-07-24-summer-hol, which should be binary identical.
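A possible sanity check along those lines, assuming zfs-check accepts a snapshot as its target and prints per-chunk hashes to stdout (worth confirming against zfs-check --help):
# hash both snapshots and diff the results; no output from diff means the block contents match
zfs-check store5-backup/data@2023-07-24-summer-hol > /tmp/store5-backup.hashes
zfs-check store5b/data@2023-07-24-summer-hol > /tmp/store5b.hashes
diff /tmp/store5-backup.hashes /tmp/store5b.hashes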
💡 I could update the code to handle sparse files/blocks and submit a pull request. I created an issue to track this:
🔴 There is a chance that scrubbing store5 (or resuming the paused scrub) will resolve the issues (make the pool healthy again) until the hardware faults again, but I find this unlikely given that SMART on the disk now reports 760 Offline_Uncorrectable sectors/LBAs. The failing disk also seemed to respond more slowly during my diagnosis of the issue, and was notably slower (bursty) during the syncoid job to recover data. It's possible another scrub will cause the drive to fail completely. The previous monthly ZFS scrub (which uncovered the hardware fault) suspended the vdev/pool at 98.93% of the scrub.
Outcome: resuming the 98% completed scrub didn’t heal the errors - which suggests the errors were not transient.
scrub repaired 0B in 1 days 23:05:09 with 49 errors on Mon Oct 9 23:29:47 2023
It is documented that a pool may require multiple scrubs to clear errors, but for this dying disk it isn't worth it, now that the data has been recovered.
✅ Update backups - run syncoid to update the backup pools.
Observations
zfs send from a pool with corrupt data
I was wondering what happens when zfs send encounters corruption. It would be logical for the send to be aborted and a fatal error thrown. A little research uncovered a module parameter called zfs_send_corrupt_data which toggles this self-describing tunable. It could have been very useful here: corrupt blocks would be marked as zfs bad blocks but the send would not abort, and the corrupt blocks could then be recovered/handled by other means like snapraid parity.
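For reference, the tunable lives under the usual OpenZFS module parameter path and can be flipped at runtime (1 = substitute a marker pattern for unreadable blocks and keep sending, 0 = default abort behaviour):
# check the current value, then enable it before running the zfs send
cat /sys/module/zfs/parameters/zfs_send_corrupt_data
echo 1 > /sys/module/zfs/parameters/zfs_send_corrupt_data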
🟢 snapraid sync and scrubs were healthy
OK snapraid-manualrun-20231007-212450-sync.log
OK snapraid-manualrun-20231007-213926-scrub.log
OK snapraid-manualrun-20231008-023002-scrub.log
# The high-level snapraid status was OK
The oldest block was scrubbed 34 days ago, the median 23, the newest 0.
No sync is in progress.
The full array was scrubbed at least one time.
No rehash is in progress or needed.
No error detected.
🔴 Pool status after resuming the store5 scrub
root@viper:/etc/pve/qemu-server# zpool status store5
  pool: store5
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
  scan: scrub repaired 0B in 1 days 23:05:09 with 49 errors on Mon Oct 9 23:29:47 2023
config:

        NAME                               STATE   READ WRITE CKSUM
        store5                             ONLINE     0     0     0
          ata-ST5000LM000-2AN170_WCXXXXD6  ONLINE     0     0     0

errors: 10 data errors, use '-v' for a list
🟢 snapraid check operation result on the restored disk
Summary: This check operation computes and compares the checksums of all files on store5 vs. the known checksums in the snapraid.content file. Any errors or missing files would be reported.
root@node:~# time snapraid --log ">>/var/log/snapraid/snapraid-manualrun-%D-%T-check.log" check -a -d store5
Self test...
Loading state from /srv/dev-disk-by-uuid-1d5722e2-a5ac-4f64-9161-392290480c23/snapraid.content...