Home lab & data vault

Explore

Pages

Data recovery stories

Data loss/corruption of CnMemory archive

Date 2021-07-16

System spec

The

filer⁠

is a kvm acting as a NAS data vault (

Debian⁠

omv⁠

) using XFS formatted virtual volumes for storage. Those storage partitions are merged into a single FUSE’ed hierarchy via

mergerfs⁠

with a category.create: lus policy. lus stands for least used, so writes happen on the partition (branch) with the least used space.

snapraid⁠

is used to maintain a triple parity of the store volumes.

CnMemory was an archive taken of portal USB drive.

The event

During an rsync of data between disks store4 > store4-backup (~2.7TiB), rsync logged an input/output error for file CnMemory-USB.tar.gz.aa.

The archive total size is 166GiB containing 53,072 file/dirs, and split into 1GiB chunks to distribute the data across multiple disks avoiding uneven disk fill levels (see system spec for more info).

rsync: read errors mapping "/srv/dev-disk-by-label-store4/!backups/.../CnMemory-USB-compressed/CnMemory-USB.tar.gz.aa": Input/output error (5)

WARNING: !backups/.../CnMemory-USB-compressed/CnMemory-USB.tar.gz.aa failed verification -- update discarded (will try again).

rsync: read errors mapping "/srv/dev-disk-by-label-store4/!backups/.../CnMemory-USB-compressed/CnMemory-USB.tar.gz.aa": Input/output error (5)

ERROR: !backups/.../CnMemory-USB-compressed/CnMemory-USB.tar.gz.aa failed verification -- update discarded.

rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1207) [sender=3.1.3]

Closer inspection determined that one 4k block was not readable in CnMemory-USB.tar.gz.aa.

4k is the bsize=4096 of the underlying XFS filesystem, verified with xfs_info.

Using dd to read the file in 512 blocks (sectors), the error was found to start at file block 4544 and ended at 4552. The eight 512 sectors would failed to read with dd (one 4k filesystem block).

With this info it was possible to read the file and skip the bad 4k block, writing two parts to another disk to be concatenated later.

Corrupt file file info based on 512 sector/block size:

CnMemory-USB.tar.gz.aa bad blocks:

sector byte kib mib

start 4544 2326528 2272 2.2

end 4552 2330624 2276 2.2

delta 8 4096 4

time dd conv=noerror bs=512 count=4544 status=progress iflag=direct if=CnMemory-USB.tar.gz.aa oflag=direct of=/srv/dev-disk-by-label-store5/\!backups/.../CnMemory-USB-compressed/CnMemory-USB.tar.gz.aa.dd-part1

time dd conv=noerror bs=2276k skip=1 status=progress iflag=direct if=CnMemory-USB.tar.gz.aa oflag=direct of=/srv/dev-disk-by-label-store5/\!backups/.../CnMemory-USB-compressed/CnMemory-USB.tar.gz.aa.dd-part2

in hindsight, the final output might of been easier using conv=noerror to skip the read errors:

time dd conv=noerror bs=4k status=progress iflag=direct if=CnMemory-USB.tar.gz.aa oflag=direct of=/srv/dev-disk-by-label-store5/\!backups/.../CnMemory-USB-compressed/recovery/CnMemory-USB.tar.gz.aa.recovery

⁉ question: with this approach, what is written for the unreadable 4k block? I would guess nothing or zero bytes. Something to consider checking next time.

I was successful in using the gzrecover util to recover the tar archive.

One expects at least 4k damage to one or more files because of the 4k hole in the gzip compression stream. At that moment I was under the impression I did not have backup/parity of this file (excluded from snapraid). It was an interesting experiment to see what was possible.

I combined the parts from the dd commands in the relevant dir and moved the corrupt file out of the way, the idea being the byte stream would be complete aside from the 4k hole:

pv *.tar.gz* | gzrecover -v -o /mnt/store4-backup/\!backups/.../CnMemory-USB-compressed/CnMemory-USB.tar.gz.aa.recovery

with tar option --ignore-zeros I was then able to skip the tar error(s) and extract the files:

pv /mnt/store4-backup/\!backups/.../CnMemory-USB-compressed/CnMemory-USB.tar.gz.recovery | tar vxf - --ignore-zeros

All the file/dirs appeared to be present.

Currently a recursive checksum file does not exist for the full archive content, so it was not possible at that moment to see which file(s) were corrupt.

TODO: extract archive and make a recursive checksum for the files. Given I know the exact byte offset of the missing 4k block, I could even reproduce the scenario to see what file inside the archive would of been corrupt.

smartd alerts

Fri, 16 Jul, 02:41 CEST the following warning/error was logged and alerted by the smartd daemon on the hypervisor:

Device: /dev/sdl [SAT], 8 Currently unreadable (pending) sectors

Device: /dev/sdl [SAT], 8 Offline uncorrectable sectors

Device: /dev/sdl [SAT], ATA error count increased from 0 to 3

Device info:

ST5000LM000-2AN170, S/N:WCJ_____, WWN:5-000c50-0aa2ba9ee, FW:0001, 5.00 TB

Examining smart details for this device revealed there had been a number of UNCorrectable errors logged.

link:

useful reading on UNC errors⁠

root@viper:~# smartctl --all /dev/disk/by-id/ata-ST5000LM000-2AN170_WCJ_____ | less

Error: UNC at LBA = 0x0fffffff = 268435455

Note that the number of pending sectors 8 in the alert matched the number of 512 sectors in the file that could not be read. Probably not a coincidence.

snapraid parity to the rescue

The corrupt file was fortunately included in snapraid parity (initially I believed it was excluded) and it was possible to use snapraid to fully recover the file.

snapraid --log ">>/var/log/snapraid/snapraid-manualrun-%D-%T-fix.log" fix -f '/!backups/.../CnMemory-USB-compressed/CnMemory-USB.tar.gz.aa'

Afterwards I performed a full data extraction without errors and created a file list to provide to the archive data owner, in hindsight I should of made a recursive checksum file list at the same time. 😲😊

health checks on the disk

First of all I rsynced all data from store4 to a new disk, there were no errors logged, and followed the procedure to replace store4 disk in the snapraid array. This included verifying all the files bytes copied to the new disk matched the snapraid checksums.

Once the disk was out, I ran the short and long SMART self tests. These tests were clean and the errors logged by smartd were not committed to the “SMART Attributes Data Structure”.

⚠ This is actually a curious point, it seems like smartd is reading an online/live set of attributes, and smartctl is reading stored/committed set of attributes? Here are the corresponding attributes per the email alerts after the self tests:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED RAW_VALUE

197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always 0

198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline 0

(omitted WHEN_FAILED column for readability)

Here we can see that the value, worst, thresh look nominal, and don’t reflect the email alerts, even after short and long self tests.

a normalized value, which ranges from 1 to 253 (with 1 representing the worst case and 253 representing the best) and a worst value, which represents the lowest recorded normalized value. The initial default value of attributes is 100 but can vary between manufacturer.

cite:

S.M.A.R.T attributes⁠

Two more attributes to study, the Reported_Uncorrect doesn’t look so good:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED RAW_VALUE

187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always 120

195 Hardware_ECC_Recovered 0x001a 077 064 000 Old_age Always 45615178

Observation ATA Error Count was equal to 120, I’m assuming that is a correlation to the Reported_Uncorrect.

Hypothesis on the raw value 120 and ATA Error Count: I was trying different ways to diagnose and read the file with the error, producing more errors as a result.

To cite a 3rd party,

BackBlaze⁠

blog on SMART stats in this case:

SMART 187: Reported_Uncorrect:

Backblaze uses this one. Number 187 reports the number of reads that could not be corrected using hardware ECC. Drives with zero uncorrectable errors hardly ever fail. This is one of the SMART stats we use to determine hard drive failure; once SMART 187 goes above zero, we schedule the drive for replacement.

This first chart shows the failure rates by number of errors. Because this is one of the attributes we use to decide whether a drive has failed, there has to be a strong correlation:

⁠
⁠
⁠

Backblaze replace disks that have a non zero for 187 Reported_Uncorrect which is the action I took. However I’m still really curious about this disk real health status because once I fixed the file and did a full bytes rsync of the disk content the disk seemed fine and the long SMART self test was also fine...

As suggested in

this insightful reddit comment⁠

, I performed a ddrescue to check for issues, reading the full disk and writing to /dev/null and this ~14 hours of stress didn’t reveal anything:

root@viper:~# ddrescue -f /dev/disk/by-uuid/ef51ecbd-b57e-42c3-8e73-dc56fdab454b /dev/null /root/ef51ecbd-b57e-42c3-8e73-dc56fdab454b.ddrescue.log

GNU ddrescue 1.23

Press Ctrl-C to interrupt

ipos: 5000 GB, non-trimmed: 0 B, current rate: 12869 kB/s

opos: 5000 GB, non-scraped: 0 B, average rate: 103 MB/s

non-tried: 0 B, bad-sector: 0 B, error rate: 0 B/s

rescued: 5000 GB, bad areas: 0, run time: 13h 27m 34s

pct rescued: 100.00%, read errors: 0, remaining time: n/a

time since last successful read: n/a

Finished

root@viper:~# less /root/ef51ecbd-b57e-42c3-8e73-dc56fdab454b.ddrescue.log

I use dd to skip to the LBA with the reported UNC error and read the bytes:

Error 120 occurred at disk power-on lifetime: 25726 hours (1071 days + 22 hours)

When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:

ER ST SC SN CL CH DH

-- -- -- -- -- -- --

40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

root@viper:~# dd iflag=direct oflag=nocache if=$device bs=512 skip=268435455 count=8 |stdbuf -o0 xxd | less

Note: The LBA 268435455 of 9767541168 is located at the start of the disk approx 2.7% of 100%

There were no issues, no new errors logged.

Next I wanted to test writing to the LBA with issues. i.e. overwriting the LBA with new data. Using dd I wrote a 4k block starting at the LBA reporting issues, and then checked the results were as expected:

root@viper:~# dd iflag=nocache oflag=direct status=progress if=/dev/zero of=$device bs=512 seek=268435455 count=8

8+0 records in

8+0 records out

4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00267102 s, 1.5 MB/s

root@viper:~# dd iflag=direct oflag=nocache status=progress if=$device bs=512 skip=268435455 count=8 |stdbuf -o0 od | head

0000000 000000 000000 000000 000000 000000 000000 000000 000000

8+0 records in

8+0 records out

0010000

4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00188757 s, 2.2 MB/s

Note: the od command writes * when it encounters repeated bytes, therefore its possible to pipe it to head to monitor that all bytes in the stream are zero. If more than the default 10 lines of output are received by head, the pipeline will be aborted.

More info⁠

I repeated the dd write a handful of times, no errors.

Next I wanted to stress write IO on the drive and check for any issues, I used dd to zero the drive. The dd block size was a calculated factor of the disks 9767541168 sectors close to 1MiB. I did this to ensure that the last block written is a full block. I’m not sure of the behaviour of dd if there is a remainder/partial write on the last block, if it really zeros every byte in that case? I would assume there could be trailing bytes left unzeroed.

⚠⚠⚠ WARNING / ACHTUNG: This is a destructive command and fills the provided $device with zeros.

root@viper:~# fdisk -l $device

Disk /dev/disk/by-id/ata-ST5000LM000-2AN170_WCJ_____: 4.6 TiB, 5000981078016 bytes, 9767541168 sectors

Disk model: ST5000LM000-2AN1

Units: sectors of 1 * 512 = 512 bytes

Sector size (logical/physical): 512 bytes / 4096 bytes

I/O size (minimum/optimal): 4096 bytes / 4096 bytes

root@viper:~# dd iflag=nocache oflag=direct status=progress if=/dev/zero of=$device bs=1064448

5000922533376 bytes (5.0 TB, 4.5 TiB) copied, 47318 s, 106 MB/s

dd: error writing '/dev/disk/by-id/ata-ST5000LM000-2AN170_WCXXXX32': No space left on device

4698193+0 records in

4698192+0 records out

5000981078016 bytes (5.0 TB, 4.5 TiB) copied, 47318.9 s, 106 MB/s

bs=1064448 * records=4698192 = bytes=5000981078016 which is equal with the device total bytes.

The No space left on device error is expected as I did not limit the dd on purpose.

checking the start and end of the disk, it has been zeroed:

root@viper:~# pv $device | stdbuf -o0 od | head

0000000 000000 000000 000000 000000 000000 000000 000000 000000

...

root@viper:~# tail -c 4096 $device | stdbuf -o0 od | head

0000000 000000 000000 000000 000000 000000 000000 000000 000000

0010000

# alternative with dd

root@viper:~# dd iflag=direct oflag=nocache if=$device bs=512 skip=$(( $(blockdev --getsz $device) - 8)) |stdbuf -o0 od | head

0000000 000000 000000 000000 000000 000000 000000 000000 000000

8+0 records in

8+0 records out

0010000

4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00187961 s, 2.2 MB/s

conclusions

The disk has been stressed with:

a long smart self test

a full ddrescue read

and a full zero wipe write pass

No further errors detected or logged. Drive performance seemed nominal.

I think its safe to use the drive, I wouldn’t recommend to a friend or customer, but for my use case the risk is OK.

The disk will be monitored by its internal error checking and correcting capabilities, and OpenZFS. Alerting via smartd and zed. Any data stored will be backed up in two ways:

snapraid triple parity. Can scrub and fix corruption and up to 3 disk failures.

DR full backup online/offline. Can restore files or disks as needed.

I expect the disk to fail in the coming months, and can replaced as needed.

TODO’s

⁠

order some spare disks.

⁠

ensure zfs is scrubbing. (September scrubs ran as expected)

⁠

ensure zed monitoring is working and alerting. (didn’t get Septembers alerts?)

⁠

implement

sanoid⁠

(or similar) for zfs snapshots (already using synciod).

misc notes

⁠

Here⁠

and

here⁠

mention that the current default LUKS header is 16MiB. Here is the example from one of the the store drives, only 1028KiB.

root@node:/dev/shm# cryptsetup luksHeaderBackup /dev/vdd --header-backup-file header.test

root@node:/dev/shm# du -BK header.test

1028K header.test

root@node:/dev/shm# rm header.test

useful reading / references

⁠

https://www.smartmontools.org/wiki/BadBlockHowto⁠

⁠

https://wiki.archlinux.org/title/badblocks⁠

⁠

Update 2021-09-13

I had reused the drive as a single drive zfs pool for store3 and all seemed well until Septembers scheduled zpool scrub:

⁠

According to netdata disk utilisation alerts the zfs scrubs for September started circa Sun Sep 12 00:34:21 UTC 2021.

I did get a zed email with this too BUT it was delayed

⁠

I checked the netdata graphs for the following time range:

⁠

a normal disk during scrub:

⁠

the problem disk during scrub, note how the data collection ends nearly straight away:

⁠

sensors

⁠

ups

⁠

netdata observations

the scrub ran in parallel on all (12?) zpools? 😡 [i wrote a

patch⁠

to control scrub concurrency]

this makes things hotter 🥵 >14.5° on the hottest CPU at peak. >13.5° on the cooler CPU.

🌡3.8° was the delta between the hottest and cooler CPU at peak. Suspect the difference in age/type of thermal compound.

and louder, and used ~13% more power at peak? ⚡

Peak disk read was 1.65GiB/s (~14 Gigabit/s) and dropped off as more blocks/disks were completed.

the problem disk data collection stop early on in the scrub

something stopped netdata collection at ~04:00 UTC on the 13th. This was me preventing netdata trying to monitoring the unhealthy disk.

iowait remained after the healthy disk scrubs (assume because of the unhealthy disk).

tty kernel errors

note /dev/sdn

⁠

This outcome means the drive is not fit for purpose and should be physically destroyed. I ran zpool destory store3, made a new store3 pool and have started to restore the data from store3-backup pool.

Note its not possible to offline a disk in a pool with a single disk, which makes sense:

cannot offline ata-ST5000LM000-2AN170_WCXXXX32: no valid replicas

Note to self: I could of saved some time/efforts here if I’d created this higher risk pool as a mirror.

smartctl now shows the errors for the disk

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED RAW_VALUE

1 Raw_Read_Error_Rate 0x000f 100 064 006 Pre-fail Always 0

3 Spin_Up_Time 0x0003 096 096 000 Pre-fail Always 0

4 Start_Stop_Count 0x0032 100 100 020 Old_age Always 798

5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always 0

7 Seek_Error_Rate 0x000f 080 060 045 Pre-fail Always 92801920

9 Power_On_Hours 0x0032 070 070 000 Old_age Always 26545 (6 71 0)

10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always 0

12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always 40

183 SATA_Downshift_Count 0x0032 100 100 000 Old_age Always 0

184 End-to-End_Error 0x0032 100 100 099 Old_age Always 0

187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always 120

188 Command_Timeout 0x0032 100 099 000 Old_age Always 4295032833

189 High_Fly_Writes 0x003a 100 100 000 Old_age Always 0

190 Airflow_Temperature_Cel 0x0022 070 060 040 Old_age Always 30 (Min/Max 23/39)

191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always 0

192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always 137

193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always 1025

194 Temperature_Celsius 0x0022 030 040 000 Old_age Always 30 (0 8 0 0 0)

195 Hardware_ECC_Recovered 0x001a 100 064 000 Old_age Always 0

197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always 8

198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline 8

199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always 0

240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline 19296 (70 171 0)

241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline 24263598112

242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline 122160393414

254 Free_Fall_Sensor 0x0032 100 100 000 Old_age Always 0

some related kern.log which I’m guessing occurred during the zfs scrub of store3:

Sep 12 00:00:01 viper kernel: [1424329.110315] Process accounting resumed

Sep 12 01:03:38 viper kernel: [1428146.578577] sd 11:0:4:0: [sdn] tag#2274 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00

Sep 12 01:03:38 viper kernel: [1428146.578584] scsi target11:0:4: enclosure level(0x0000), connector name( )

Sep 12 01:03:43 viper kernel: [1428151.311105] sd 11:0:4:0: [sdn] tag#2259 Sense Key : Not Ready [current]

Sep 12 01:03:43 viper kernel: [1428151.311129] blk_update_request: I/O error, dev sdn, sector 684081864 op 0x0:(READ) flags 0x700 phys_seg 8 prio class 0

Sep 12 01:03:43 viper kernel: [1428151.311295] sd 11:0:4:0: [sdn] tag#2260 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE

Sep 12 01:03:43 viper kernel: [1428151.311300] sd 11:0:4:0: [sdn] tag#2260 Add. Sense: Logical unit not ready, cause not reportable

Sep 12 01:03:43 viper kernel: [1428151.311304] blk_update_request: I/O error, dev sdn, sector 2576 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0

Sep 12 01:03:43 viper kernel: [1428151.311417] sd 11:0:4:0: [sdn] tag#2261 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE

Sep 12 01:03:43 viper kernel: [1428151.311420] sd 11:0:4:0: [sdn] tag#2261 Add. Sense: Logical unit not ready, cause not reportable

Sep 12 01:03:43 viper kernel: [1428151.311429] blk_update_request: I/O error, dev sdn, sector 9767523344 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0

Sep 12 01:03:48 viper kernel: [1428156.164876] WARNING: Pool 'store3' has encountered an uncorrectable I/O failure and has been suspended.

System spec

The event

smartd alerts

snapraid parity to the rescue

health checks on the disk

conclusions

TODO’s

misc notes

useful reading / references

Want to print your doc?
This is not the way.

Try clicking the ⋯ next to your doc name or using a keyboard shortcut (

CtrlP

) instead.