
Recovery of store5 corruption

Date 2023-10-08
store5 was a single-vdev, single-drive zpool on a proxmox hypervisor, used to provision kvm volumes. The drive was a 5TB SMR Seagate Barracuda 2.5” 5600 rpm (ST5000LM000). OpenZFS detected uncorrectable I/O failures on the drive and suspended the zpool. The following journals the recovery process.

How the issue was detected - a kvm using the disk hung

Under normal kvm usage there were kernel errors for store5, and the related issues caused the kvm to hang waiting for I/O from that zpool:
Oct 08 15:57:03 viper kernel: WARNING: Pool 'store5' has encountered an uncorrectable I/O failure and has been suspended.
Oct 08 15:59:34 viper kernel: INFO: task kvm:405360 blocked for more than 120 seconds.
During the monthly scrub, store5 ran into issues towards the end of the scrub:
Oct 08 07:15:20 viper kernel: sd 11:0:5:0: attempting task abort!scmd(0x000000006f922384), outstanding for 31500 ms & timeout 30000 ms
Oct 08 07:15:20 viper kernel: sd 11:0:5:0: [sdk] tag#257 CDB: Read(16) 88 00 00 00 00 02 3e f2 c6 f8 00 00 08 00 00 00
Oct 08 07:15:20 viper kernel: scsi target11:0:5: handle(0x001f), sas_address(0x4433221110000000), phy(16)
Oct 08 07:15:20 viper kernel: scsi target11:0:5: enclosure logical id(0x500062b200f20aa0), slot(11)
Oct 08 07:15:20 viper kernel: scsi target11:0:5: enclosure level(0x0000), connector name( )
Oct 08 07:15:24 viper kernel: sd 11:0:5:0: task abort: SUCCESS scmd(0x000000006f922384)
Oct 08 07:15:24 viper kernel: sd 11:0:5:0: [sdk] tag#257 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=35s
Oct 08 07:15:24 viper kernel: sd 11:0:5:0: [sdk] tag#257 CDB: Read(16) 88 00 00 00 00 02 3e f2 c6 f8 00 00 08 00 00 00
Oct 08 07:15:24 viper kernel: I/O error, dev sdk, sector 9646032632 op 0x0:(READ) flags 0x700 phys_seg 8 prio class 2
Oct 08 07:15:24 viper kernel: zio pool=store5 vdev=/dev/disk/by-id/ata-ST5000LM000-2AN170_WCXXXXD6-part1 error=5 type=1 offset=4938767659008 size=1048576 flags=40080cb0
Oct 08 07:15:24 viper kernel: sd 11:0:5:0: attempting task abort!scmd(0x00000000b44a3020), outstanding for 33520 ms & timeout 30000 ms
Oct 08 07:15:24 viper kernel: sd 11:0:5:0: [sdk] tag#271 CDB: Read(16) 88 00 00 00 00 02 3e f2 ce f8 00 00 08 00 00 00
Oct 08 07:15:24 viper kernel: scsi target11:0:5: handle(0x001f), sas_address(0x4433221110000000), phy(16)
Oct 08 07:15:24 viper kernel: scsi target11:0:5: enclosure logical id(0x500062b200f20aa0), slot(11)
Oct 08 07:15:24 viper kernel: scsi target11:0:5: enclosure level(0x0000), connector name( )
Oct 08 07:15:24 viper kernel: sd 11:0:5:0: No reference found at driver, assuming scmd(0x00000000b44a3020) might have completed
Oct 08 07:15:24 viper kernel: sd 11:0:5:0: task abort: SUCCESS scmd(0x00000000b44a3020)
Oct 08 07:15:24 viper kernel: sd 11:0:5:0: [sdk] tag#271 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=33s
Oct 08 07:15:24 viper kernel: sd 11:0:5:0: [sdk] tag#271 CDB: Read(16) 88 00 00 00 00 02 3e f2 ce f8 00 00 08 00 00 00
...
...
Oct 08 07:15:25 viper kernel: I/O error, dev sdk, sector 9646040824 op 0x0:(READ) flags 0x700 phys_seg 8 prio class 2
Oct 08 07:15:25 viper kernel: zio pool=store5 vdev=/dev/disk/by-id/ata-ST5000LM000-2AN170_WCXXXXD6-part1 error=5 type=1 offset=4938771853312 size=1048576 flags=40080cb0
Oct 08 07:15:25 viper kernel: WARNING: Pool 'store5' has encountered an uncorrectable I/O failure and has been suspended.
Oct 08 07:16:01 viper kernel: sd 11:0:5:0: device_block, handle(0x001f)
Oct 08 07:16:03 viper kernel: sd 11:0:5:0: device_unblock and setting to running, handle(0x001f)
Oct 08 07:16:03 viper kernel: sd 11:0:5:0: [sdk] Synchronizing SCSI cache
Oct 08 07:16:03 viper kernel: sd 11:0:5:0: [sdk] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Oct 08 07:16:03 viper kernel: mpt3sas_cm1: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221110000000)
Oct 08 07:16:03 viper kernel: mpt3sas_cm1: removing handle(0x001f), sas_addr(0x4433221110000000)
Oct 08 07:16:03 viper kernel: mpt3sas_cm1: enclosure logical id(0x500062b200f20aa0), slot(11)
Oct 08 07:16:03 viper kernel: mpt3sas_cm1: enclosure level(0x0000), connector name( )
Oct 08 07:17:58 viper kernel: INFO: task txg_sync:10274 blocked for more than 120 seconds.
...
...
Oct 08 07:28:42 viper kernel: mpt3sas_cm1: handle(0x1f) sas_address(0x4433221110000000) port_type(0x1)
Oct 08 07:28:47 viper kernel: scsi 11:0:8:0: Direct-Access ATA ST5000LM000-2AN1 0001 PQ: 0 ANSI: 6
Oct 08 07:28:47 viper kernel: scsi 11:0:8:0: SATA: handle(0x001f), sas_addr(0x4433221110000000), phy(16), device_name(0x0000000000000000)
Oct 08 07:28:47 viper kernel: scsi 11:0:8:0: enclosure logical id (0x500062b200f20aa0), slot(11)
Oct 08 07:28:47 viper kernel: scsi 11:0:8:0: enclosure level(0x0000), connector name( )
Oct 08 07:28:47 viper kernel: scsi 11:0:8:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
Oct 08 07:28:47 viper kernel: scsi 11:0:8:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
Oct 08 07:28:47 viper kernel: sd 11:0:8:0: Attached scsi generic sg10 type 0
Oct 08 07:28:47 viper kernel: sd 11:0:8:0: Power-on or device reset occurred
Oct 08 07:28:47 viper kernel: end_device-11:8: add: handle(0x001f), sas_addr(0x4433221110000000)
Oct 08 07:28:52 viper kernel: sd 11:0:8:0: [sdo] 9767541168 512-byte logical blocks: (5.00 TB/4.55 TiB)
Oct 08 07:28:52 viper kernel: sd 11:0:8:0: [sdo] 4096-byte physical blocks
Oct 08 07:28:53 viper kernel: sd 11:0:8:0: [sdo] Write Protect is off
Oct 08 07:28:53 viper kernel: sd 11:0:8:0: [sdo] Mode Sense: 9b 00 10 08
Oct 08 07:28:53 viper kernel: sd 11:0:8:0: [sdo] Write cache: enabled, read cache: enabled, supports DPO and FUA
Oct 08 07:28:54 viper kernel: sdo: sdo1 sdo9
Oct 08 07:28:54 viper kernel: sd 11:0:8:0: [sdo] Attached SCSI disk
The log shows when zfs encountered read errors during the scrub, and also when the disk disappeared and reappeared, suggesting a hardware fault.
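For reference, these kernel messages can be pulled from the journal with something like the following (the device name and time window are the ones from this incident and would differ elsewhere):
journalctl -k --since "2023-10-08 07:00" --until "2023-10-08 08:00" | grep -E "sdk|store5"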
At this point no alerts had been sent from the node.
This is how I found the zpool when inspecting it around 16:10 UTC:
root@viper:~# zpool status store5 store5-backup
pool: store5
state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-HC
scan: scrub in progress since Sun Oct 8 00:24:38 2023
2.29T scanned at 0B/s, 2.27T issued at 41.3M/s, 2.29T total
0B repaired, 98.93% done, 00:10:23 to go
config:

NAME STATE READ WRITE CKSUM
store5 ONLINE 0 0 0
ata-ST5000LM000-2AN170_WCXXXXD6 ONLINE 10 0 48
errors: List of errors unavailable: pool I/O is currently suspended

errors: 67 data errors, use '-v' for a list

# store5-backup as comparison

pool: store5-backup
state: ONLINE
status: Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: scrub repaired 0B in 04:59:44 with 0 errors on Mon Mar 13 02:42:57 2023
config:

NAME STATE READ WRITE CKSUM
store5-backup ONLINE 0 0 0
ata-ST5000LM000-2U8170_WCXXXX30 ONLINE 0 0 0

errors: No known data errors

Why didn’t zed raise the alert on the suspended pool?

Only once I cleared the errors on the store5 pool did zed send its first email:
2023-10-08 16:33:33 zpool clear store5
(screenshot: the zed alert email)
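To follow up on why no alert fired earlier while the pool was suspended, a first check (assuming a stock zed setup, with the paths used on Debian/Proxmox) would be the zed notification settings and the service state:
grep -E "ZED_EMAIL_ADDR|ZED_NOTIFY_INTERVAL_SECS|ZED_NOTIFY_VERBOSE" /etc/zfs/zed.d/zed.rc
systemctl status zfs-zed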
After the clear, the pool entered the following state:
root@viper:~# zpool status store5
pool: store5
state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-HC
scan: scrub in progress since Sun Oct 8 00:24:38 2023
2.29T scanned at 0B/s, 2.27T issued at 41.0M/s, 2.29T total
0B repaired, 98.93% done, 00:10:28 to go
config:

NAME STATE READ WRITE CKSUM
store5 UNAVAIL 0 0 0 insufficient replicas
ata-ST5000LM000-2AN170_WCXXXXD6 FAULTED 1 2 0 too many errors
errors: List of errors unavailable: pool I/O is currently suspended

errors: 67 data errors, use '-v' for a list
Around the same time the pool's disk became unresponsive and disappeared from the kernel partition list. This may or may not have been triggered by some smartctl --all requests to the disk, but I think it more likely happened when clearing the zpool errors.
I went to the physical machine and used sas3ircu LIST and sas3ircu 1 LOCATE to illuminate the drive bay holding the faulting disk (example invocations follow the status output below). I pulled the disk, verified the serial number and re-inserted it. Then I cleared the zpool errors again, which gave the following status:
root@viper:~# zpool status -v store5
pool: store5
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub paused since Sun Oct 8 16:54:23 2023
scrub started on Sun Oct 8 00:24:38 2023
2.29T scanned, 2.27T issued, 2.29T total
0B repaired, 98.95% done
config:

NAME STATE READ WRITE CKSUM
store5 ONLINE 0 0 0
ata-ST5000LM000-2AN170_WCXXXXD6 ONLINE 0 0 0

errors: Permanent errors have been detected in the following files:

<metadata>:<0x0>
<metadata>:<0x1>
<metadata>:<0x34>
<metadata>:<0x84>
<metadata>:<0x87>
<metadata>:<0xb2>
<metadata>:<0x12c8>
/store5/data/vm/raw/
/store5/data/vm/raw/images/102/vm-102-disk-0.raw
store5/data/vm/raw@manual_viper_2023-07-24-summer-hol:/images/102/vm-102-disk-0.raw
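For reference, the sas3ircu invocations for the drive-bay locate step above look roughly like this; the enclosure:slot value is a placeholder and is read from the DISPLAY output, which also lists drive serial numbers:
sas3ircu LIST
sas3ircu 1 DISPLAY
sas3ircu 1 LOCATE <enclosure:slot> ON
sas3ircu 1 LOCATE <enclosure:slot> OFF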
After re-plugging the drive, smartctl sent some alerts via email:
ID# ATTRIBUTE_NAME RAW_VALUE
197 Current_Pending_Sector 760
198 Offline_Uncorrectable 760

How old is the failing disk?

The failure occurred 2023-Oct-08. The DOM is 2016-DEC and the drive has recorded ~5.12 years of Power_On_Hours (44895 hours) and ~2.99 years of Head_Flying_Hours (26220 hours).
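For reference, the conversions: 44895 h ÷ 8766 h/year ≈ 5.12 years powered on, and 26220 h ÷ 8766 ≈ 2.99 years of head-flying time, against roughly 6.8 years of calendar age from the 2016-DEC date of manufacture to the failure.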

SMART Attributes for disk WCXXXXD6

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 079 057 006 Pre-fail Always 71312192
3 Spin_Up_Time 0x0003 096 096 000 Pre-fail Always 0
4 Start_Stop_Count 0x0032 098 098 020 Old_age Always 2619
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always 0
7 Seek_Error_Rate 0x000f 083 060 045 Pre-fail Always 213397488
9 Power_On_Hours 0x0032 049 049 000 Old_age Always 44920 (239 215 0)
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always 75
183 SATA_Downshift_Count 0x0032 100 100 000 Old_age Always 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always 4295032833
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always 0
190 Airflow_Temperature_Cel 0x0022 070 060 040 Old_age Always 30 (Min/Max 28/31)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always 1692
193 Load_Cycle_Count 0x0032 099 099 000 Old_age Always 3954
194 Temperature_Celsius 0x0022 030 040 000 Old_age Always 30 (0 7 0 0 0)
195 Hardware_ECC_Recovered 0x001a 079 064 000 Old_age Always 71312192
197 Current_Pending_Sector 0x0012 098 098 000 Old_age Always 760
198 Offline_Uncorrectable 0x0010 098 098 000 Old_age Offline 760
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline 26246 (114 146 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline 15153884803
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline 341519508374
254 Free_Fall_Sensor 0x0032 100 100 000 Old_age Always 0

Recovery strategy

syncoid reported a ~103.4 GB delta between the last known good snapshot store5-backup/data@2023-07-24-summer-hol and the failing disk's store5/data@2023-10-08-ioerrors. This means the ~100 GiB data delta must be either:
recovered from the failing drive, for example with syncoid, OR
recovered from snapraid parity.
After recovery, snapraid check -a -d store5 can be used to verify the checksums of the recovered data/disk.
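For reference, that delta size can also be estimated with a dry-run incremental send (snapshot names as above; add -R for a recursive replication stream):
zfs send -n -v -I store5/data@2023-07-24-summer-hol store5/data@2023-10-08-ioerrors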
Prerequisite: I need to interrogate the most recent snapraid sync and scrub outcomes on the omv kvm. Assuming they are healthy, I will attempt the following:
✅ Boot the omv kvm and put the failing disk into read-only mode, then check the relevant logs. Alt: if booting is not possible, the kvm's root filesystem could be mounted and the logs checked that way.
✅ The latest snapraid sync and scrub were healthy, which means the snapraid parity is healthy and up to date.
✅ Check snapraid status and snapraid diff and take any required actions, e.g. move any added files to different storage and revert any changes so there are no deltas.
✅ Create a new zpool store5b and seed it with data from store5-backup (a command sketch for this and the later replication/compare steps follows this checklist).
✅ Attempt to create a new snapshot on the failing store5 disk: zfs snapshot -r store5/data@manual_viper_$(date +%Y-%m-%d)-ioerrors
✅ Attempt to mount -o ro,norecovery the store5 vs. store5b zfs snapshots for the related XFS partitions on viper and share them via smbd, then use a compare tool to see the delta. I was able to produce an HTML report of the filesystem deltas, including the paths of orphans on both sides and of newer files. This gave me a manifest of the delta to use for troubleshooting and sanity checks later on.
(screenshot: filesystem delta comparison report)
umount the compared filesystems and clean up any smbd cfg etc.
✅ Then I will attempt to syncoid the latest snapshot delta between store5/data@2023-07-24-summer-hol ... store5/data@2023-10-08-ioerrors to store5b and see what errors come up. This zfs send block-level approach is preferred over the snapraid filesystem-level approach: it should be more efficient, and it should throw errors if any block checksums fail as they are read. Outcome: 103.4 GiB replicated to store5b. 🟢 No errors detected on the zpool, in the syncoid job or in the kernel log.
✅ Again mount -o ro,norecovery the relevant store5 vs. store5b zfs snapshots for the related XFS partitions on viper and share them via smbd, then use a compare tool to see if the delta was resolved. 🟢 The delta was resolved, at least on the filesystem tree/object level.
✅ umount the compared filesystems and clean up any smbd cfg etc.
✅ If the amount of data corruption is NOT too much, I will then introduce store5b to proxmox storage and configure the kvm to use store5b storage. It is important to consider that the restored zfs block stream on the store5b zpool means the XFS sub-volumes (a raw GUID partition table in my case) will have UUIDs identical to the data from zpool store5. So it is important to remove the failing store5 disk from the kvm at the same time as adding the new store5b storage; this avoids having to regenerate the UUIDs. When the kvm boots, because the UUIDs are identical (in theory the GUID partition table and XFS partition(s) are binary-identical to the source), no changes will be required in the mounts or in snapraid.conf.
✅ Finally: snapraid will be used to verify the integrity (checksums) of the store5b data at the filesystem level: time snapraid check -a -d store5. 🟢 The snapraid check operation completed without errors.
✅ As a best practice: remember to delete snapraid.content on the restored disk so it can be regenerated by snapraid during the next sync. This assumes this is not the only disk configured to contain the snapraid.content file.
If it looks bad, i.e. the send won't work or the majority of the blocks cannot be sent, the blocks from the failing disk must be ignored. In that case: introduce store5b to proxmox storage and configure the kvm to use store5b storage, swap store5b in for the failing store5 disk in snapraid, use snapraid to recover the missing data, and finally use snapraid to verify the integrity of the restored data. ⚠ Note that some files are excluded from snapraid and such files will not be recovered.
In the worst case, if I have to start store5 from zero, store5 can be recreated from snapraid parity. ⚠ Note that some files are excluded from snapraid and such files will not be recovered.
✅ Whichever process is used, snapraid will be used to verify the integrity of the restored disk (for the files not excluded by snapraid.conf).
There are also standalone sfv checksums available in the root of the XFS filesystem, but those shouldn't be required as they are older than the latest known good store5-backup/data@2023-07-24-summer-hol snapshot; i.e. OpenZFS integrity logic should negate the need to rely on standalone checksums. For glacial content it doesn't hurt to run them.
There is also zfs-check (part of zfs-autobackup), which does a zfs block content comparison of two snapshots. It looks like it still doesn't handle sparse data efficiently and would compare all blocks in a given dataset. This method might be worth a sanity-check verification, for example between store5-backup/data@2023-07-24-summer-hol and store5b/data@2023-07-24-summer-hol, which should be binary identical. 💡 I could update the code to handle sparse files/blocks and submit a pull request. I created an issue to track this:
🔴 There is a chance that scrubbing store5 (or resuming the paused scrub) will resolve the issues (make the pool healthy again) until the hardware faults again, but I find this unlikely given that SMART on the disk now reports 760 Offline_Uncorrectable sectors/LBAs. The failing disk also seemed to respond more slowly during my diagnosis of the issue, and was notably slower (bursty) during the syncoid job to recover the data. It's possible another scrub will cause the drive to fail completely. The previous monthly ZFS scrub (which detected the hardware fault) suspended the vdev/pool at 98.93% of the scrub. Outcome: resuming the 98% completed scrub didn't heal the errors, which suggests the errors were not transient: scrub repaired 0B in 1 days 23:05:09 with 49 errors on Mon Oct 9 23:29:47 2023. A pool may require multiple scrubs to clear its error list, but for this dying disk it isn't worth it now that the data has been recovered.
✅ Update backups - run syncoid to update the backup pools.
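For reference, a rough sketch of the main commands behind the seeding, delta-replication and read-only comparison steps above. The dataset and snapshot names are the ones used in this incident, but the device path, loop device, mount point and syncoid options are illustrative and depend on the local setup:
# create the replacement single-disk pool on the new drive (example device path)
zpool create -o ashift=12 store5b /dev/disk/by-id/ata-NEW_DISK_SERIAL
# seed store5b from the known-good backup pool, then replicate the newer snapshots
# (the ~103 GiB delta) from the failing pool
syncoid --recursive --no-sync-snap store5-backup/data store5b/data
syncoid --recursive --no-sync-snap store5/data store5b/data
# to compare: expose a raw kvm image from a read-only snapshot (via the dataset's
# .zfs/snapshot directory) and mount its XFS partition without log replay
losetup --find --show -P /store5/data/vm/raw/.zfs/snapshot/manual_viper_2023-10-08-ioerrors/images/102/vm-102-disk-0.raw
mount -o ro,norecovery /dev/loop0p1 /mnt/store5-compare
# ... run the comparison, then clean up
umount /mnt/store5-compare
losetup -d /dev/loop0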

Observations

zfs send from a pool with corrupt data

I was wondering what happens when zfs send encounters corruption. It would be logical for the send to abort with a fatal error. A little research uncovered a module parameter called zfs_send_corrupt_data, which does what the name suggests: with it enabled, blocks that are corrupt are marked as bad blocks in the stream but the send does not abort. This could have been very useful, as the corrupt blocks could then be recovered/handled by other means like snapraid parity.
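For reference, on OpenZFS on Linux this parameter is exposed via sysfs and can be toggled at runtime; since it is system-wide it would only be enabled temporarily for a rescue send:
# show the current value (0 = abort the send when unreadable blocks are hit)
cat /sys/module/zfs/parameters/zfs_send_corrupt_data
# enable for a rescue send, then set back to 0 afterwards
echo 1 > /sys/module/zfs/parameters/zfs_send_corrupt_data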

🟢 snapraid sync and scrubs were healthy

OK snapraid-manualrun-20231007-212450-sync.log
OK snapraid-manualrun-20231007-213926-scrub.log
OK snapraid-manualrun-20231008-023002-scrub.log

# The highlevel snapraid status was OK
The oldest block was scrubbed 34 days ago, the median 23, the newest 0.

No sync is in progress.
The full array was scrubbed at least one time.
No rehash is in progress or needed.
No error detected.

🔴 Pool status after resuming the store5 scrub
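For reference, a paused scrub is resumed by issuing zpool scrub again; it continues from where it was paused rather than starting over:
zpool scrub store5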

root@viper:/etc/pve/qemu-server# zpool status store5
pool: store5
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub repaired 0B in 1 days 23:05:09 with 49 errors on Mon Oct 9 23:29:47 2023
config:

NAME STATE READ WRITE CKSUM
store5 ONLINE 0 0 0
ata-ST5000LM000-2AN170_WCXXXXD6 ONLINE 0 0 0

errors: 10 data errors, use '-v' for a list

🟢 snapraid check operation result on the restored disk

Summary: This check operation computes and compares the checksums of all files on store5 vs. the known checksums in the snapraid.content file. Any errors or missing files would be reported.
root@node:~# time snapraid --log ">>/var/log/snapraid/snapraid-manualrun-%D-%T-check.log" check -a -d store5
Self test...
Loading state from /srv/dev-disk-by-uuid-1d5722e2-a5ac-4f64-9161-392290480c23/snapraid.content...
Filtering...
Using 1117 MiB of memory for the file-system.
Initializing...
Hashing...
100% completed, 2083621 MB accessed in 5:12