
Data loss/corruption of CnMemory archive

Date 2021-07-16

System spec

This is a KVM guest acting as a NAS data vault, using XFS-formatted virtual volumes for storage. Those storage partitions are merged into a single FUSE'd hierarchy via mergerfs with a category.create: lus policy; lus stands for least used (space), so writes happen on the partition (branch) with the least used space. snapraid is used to maintain triple parity of the store volumes.
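For illustration only, a rough sketch of what this layout looks like in config terms (the pool mountpoint, parity paths and content path below are made up; the store labels follow the paths used later in this doc):

# /etc/fstab - pool the store branches with the lus (least used space) create policy
/srv/dev-disk-by-label-store* /srv/pool fuse.mergerfs defaults,allow_other,category.create=lus 0 0

# /etc/snapraid.conf - triple parity over the store volumes
parity   /srv/dev-disk-by-label-parity1/snapraid.parity
2-parity /srv/dev-disk-by-label-parity2/snapraid.2-parity
3-parity /srv/dev-disk-by-label-parity3/snapraid.3-parity
content  /var/snapraid/snapraid.content
data store4 /srv/dev-disk-by-label-store4/
data store5 /srv/dev-disk-by-label-store5/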
CnMemory was an archive taken of a portable USB drive.

The event

During an rsync of data between disks store4 > store4-backup (~2.7TiB), rsync logged an input/output error for file CnMemory-USB.tar.gz.aa.
The archive's total size is 166GiB containing 53,072 files/dirs, split into 1GiB chunks to distribute the data across multiple disks and avoid uneven disk fill levels (see system spec for more info).
rsync: read errors mapping "/srv/dev-disk-by-label-store4/!backups/.../CnMemory-USB-compressed/CnMemory-USB.tar.gz.aa": Input/output error (5)
WARNING: !backups/.../CnMemory-USB-compressed/CnMemory-USB.tar.gz.aa failed verification -- update discarded (will try again).
rsync: read errors mapping "/srv/dev-disk-by-label-store4/!backups/.../CnMemory-USB-compressed/CnMemory-USB.tar.gz.aa": Input/output error (5)
ERROR: !backups/.../CnMemory-USB-compressed/CnMemory-USB.tar.gz.aa failed verification -- update discarded.
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1207) [sender=3.1.3]
Closer inspection determined that one 4k block was not readable in CnMemory-USB.tar.gz.aa.
4k is the bsize=4096 of the underlying XFS filesystem, verified with xfs_info.
Using dd to read the file in 512-byte blocks (sectors), the error was found to start at file block 4544 and end at 4552. These eight 512-byte sectors failed to read with dd (one 4k filesystem block).
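For reference, a scan like this can be done with a small loop (a sketch, not the exact commands used): read the file one 512-byte block at a time with dd and report the first block that errors out.

# hedged sketch: find the first unreadable 512-byte block of a file
f=CnMemory-USB.tar.gz.aa
blocks=$(( ( $(stat -c%s "$f") + 511 ) / 512 ))
for ((i=0; i<blocks; i++)); do
  dd if="$f" bs=512 skip=$i count=1 iflag=direct of=/dev/null 2>/dev/null \
    || { echo "read error at 512-byte block $i"; break; }
done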
With this info it was possible to read the file and skip the bad 4k block, writing two parts to another disk to be concatenated later.
Corrupt file info based on a 512-byte sector/block size:
CnMemory-USB.tar.gz.aa bad blocks:

        sector      byte    KiB   MiB
start     4544   2326528   2272   2.2
end       4552   2330624   2276   2.2
delta        8      4096      4

time dd conv=noerror bs=512 count=4544 status=progress iflag=direct if=CnMemory-USB.tar.gz.aa oflag=direct of=/srv/dev-disk-by-label-store5/\!backups/.../CnMemory-USB-compressed/CnMemory-USB.tar.gz.aa.dd-part1

time dd conv=noerror bs=2276k skip=1 status=progress iflag=direct if=CnMemory-USB.tar.gz.aa oflag=direct of=/srv/dev-disk-by-label-store5/\!backups/.../CnMemory-USB-compressed/CnMemory-USB.tar.gz.aa.dd-part2
In hindsight, producing the final output might have been easier using conv=noerror to skip the read errors:
time dd conv=noerror bs=4k status=progress iflag=direct if=CnMemory-USB.tar.gz.aa oflag=direct of=/srv/dev-disk-by-label-store5/\!backups/.../CnMemory-USB-compressed/recovery/CnMemory-USB.tar.gz.aa.recovery
question: with this approach, what is written for the unreadable 4k block? I would guess nothing or zero bytes. Something to consider checking next time.
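My understanding of dd's behaviour here (worth verifying): with conv=noerror alone the failed read contributes nothing, so the output ends up 4k shorter and later offsets shift; adding sync pads the failed block with zeros to the full block size, keeping offsets intact. A padded variant would look like:

time dd conv=noerror,sync bs=4k status=progress iflag=direct if=CnMemory-USB.tar.gz.aa oflag=direct of=/srv/dev-disk-by-label-store5/\!backups/.../CnMemory-USB-compressed/recovery/CnMemory-USB.tar.gz.aa.recovery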
I was successful in using the gzrecover util to recover the tar archive.
One expects at least 4k of damage to one or more files because of the 4k hole in the gzip compression stream. At that moment I was under the impression I did not have backup/parity of this file (excluded from snapraid). It was an interesting experiment to see what was possible.
I combined the parts from the dd commands in the relevant dir and moved the corrupt file out of the way, the idea being the byte stream would be complete aside from the 4k hole:
pv *.tar.gz* | gzrecover -v -o /mnt/store4-backup/\!backups/.../CnMemory-USB-compressed/CnMemory-USB.tar.gz.aa.recovery
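For completeness, the "combined the parts" step above would have been a plain byte-level concatenation of the two dd outputs into a replacement chunk, roughly (assuming the corrupt original had already been moved aside):

cat CnMemory-USB.tar.gz.aa.dd-part1 CnMemory-USB.tar.gz.aa.dd-part2 > CnMemory-USB.tar.gz.aa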
With the tar option --ignore-zeros I was then able to skip the tar error(s) and extract the files:
pv /mnt/store4-backup/\!backups/.../CnMemory-USB-compressed/CnMemory-USB.tar.gz.recovery | tar vxf - --ignore-zeros
All the files/dirs appeared to be present.
Currently a recursive checksum file does not exist for the full archive content, so it was not possible at that moment to see which file(s) were corrupt.
TODO: extract the archive and make a recursive checksum of the files. Given I know the exact byte offset of the missing 4k block, I could even reproduce the scenario to see which file inside the archive would have been corrupt.
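For the record, a recursive checksum list can be produced with something along these lines (paths here are hypothetical):

cd /mnt/store4-backup/CnMemory-USB-extracted   # hypothetical extraction dir
find . -type f -print0 | sort -z | xargs -0 sha256sum > ../CnMemory-USB.sha256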

smartd alerts

On Fri, 16 Jul, 02:41 CEST, the following warning/error was logged and alerted by the smartd daemon on the hypervisor:
Device: /dev/sdl [SAT], 8 Currently unreadable (pending) sectors
Device: /dev/sdl [SAT], 8 Offline uncorrectable sectors
Device: /dev/sdl [SAT], ATA error count increased from 0 to 3

Device info:
ST5000LM000-2AN170, S/N:WCJ_____, WWN:5-000c50-0aa2ba9ee, FW:0001, 5.00 TB
Examining the SMART details for this device revealed there had been a number of UNCorrectable (UNC) errors logged.
root@viper:~# smartctl --all /dev/disk/by-id/ata-ST5000LM000-2AN170_WCJ_____ | less

Error: UNC at LBA = 0x0fffffff = 268435455
Note that the number of pending sectors (8) in the alert matched the number of 512-byte sectors in the file that could not be read. Probably not a coincidence.

snapraid parity to the rescue

The corrupt file was fortunately included in snapraid parity (initially I believed it was excluded) and it was possible to use snapraid to fully recover the file.
snapraid --log ">>/var/log/snapraid/snapraid-manualrun-%D-%T-fix.log" fix -f '/!backups/.../CnMemory-USB-compressed/CnMemory-USB.tar.gz.aa'
Afterwards I performed a full data extraction without errors and created a file list to provide to the archive data owner; in hindsight I should have made a recursive checksum file list at the same time. 😲😊

health checks on the disk

First of all I rsynced all data from store4 to a new disk (no errors were logged) and followed the procedure to replace the store4 disk in the snapraid array. This included verifying that all the file bytes copied to the new disk matched the snapraid checksums.
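Roughly speaking, the copy-and-verify step looked like the following (a sketch; the snapraid disk name "store4" and the destination path are assumptions, not the exact commands used):

# copy the data to the new disk, preserving hardlinks/ACLs/xattrs
rsync -aHAX --info=progress2 /srv/dev-disk-by-label-store4/ /srv/dev-disk-by-label-store4-new/
# after pointing the store4 entry in snapraid.conf at the new disk,
# verify the file data against snapraid's stored hashes without touching parity
snapraid check -d store4 -a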
Once the disk was out, I ran the short and long SMART self tests. These tests were clean and the errors logged by smartd were not committed to the “SMART Attributes Data Structure”.
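The self tests themselves are the standard smartctl invocations, roughly:

smartctl -t short /dev/disk/by-id/ata-ST5000LM000-2AN170_WCJ_____      # short test, a couple of minutes
smartctl -t long  /dev/disk/by-id/ata-ST5000LM000-2AN170_WCJ_____      # extended test, several hours on a 5TB drive
smartctl -l selftest /dev/disk/by-id/ata-ST5000LM000-2AN170_WCJ_____   # view the results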
⚠ This is actually a curious point: it seems like smartd is reading an online/live set of attributes, and smartctl is reading a stored/committed set? Here are the attributes corresponding to the email alerts, as read after the self tests:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED RAW_VALUE
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline 0
(omitted WHEN_FAILED column for readability)
Here we can see that the value, worst, thresh look nominal, and don’t reflect the email alerts, even after short and long self tests.
Each attribute has a normalized value, which ranges from 1 to 253 (with 1 representing the worst case and 253 representing the best), and a worst value, which represents the lowest recorded normalized value. The initial default value of attributes is 100 but can vary between manufacturers.
Two more attributes to study; Reported_Uncorrect doesn’t look so good:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED RAW_VALUE
187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always 120
195 Hardware_ECC_Recovered 0x001a 077 064 000 Old_age Always 45615178
Observation: the ATA Error Count was equal to 120; I’m assuming that correlates with Reported_Uncorrect.
Hypothesis on the raw value 120 and ATA Error Count: I was trying different ways to diagnose and read the file with the error, producing more errors as a result.
To cite a third party, in this case the Backblaze blog on SMART stats:
SMART 187: Reported_Uncorrect:
Backblaze uses this one. Number 187 reports the number of reads that could not be corrected using hardware ECC. Drives with zero uncorrectable errors hardly ever fail. This is one of the SMART stats we use to determine hard drive failure; once SMART 187 goes above zero, we schedule the drive for replacement.
This first chart shows the failure rates by number of errors. Because this is one of the attributes we use to decide whether a drive has failed, there has to be a strong correlation:
[Backblaze chart omitted: failure rates by number of SMART 187 errors]
Backblaze replaces disks that have a non-zero value for 187 Reported_Uncorrect, which is the action I took. However, I’m still really curious about this disk’s real health status, because once I fixed the file and did a full byte-for-byte rsync of the disk content, the disk seemed fine and the long SMART self test was also fine...
As suggested elsewhere, I performed a ddrescue to check for issues, reading the full disk and writing to /dev/null; this ~14 hours of stress didn’t reveal anything:
root@viper:~# ddrescue -f /dev/disk/by-uuid/ef51ecbd-b57e-42c3-8e73-dc56fdab454b /dev/null /root/ef51ecbd-b57e-42c3-8e73-dc56fdab454b.ddrescue.log

GNU ddrescue 1.23
Press Ctrl-C to interrupt
ipos: 5000 GB, non-trimmed: 0 B, current rate: 12869 kB/s
opos: 5000 GB, non-scraped: 0 B, average rate: 103 MB/s
non-tried: 0 B, bad-sector: 0 B, error rate: 0 B/s
rescued: 5000 GB, bad areas: 0, run time: 13h 27m 34s
pct rescued: 100.00%, read errors: 0, remaining time: n/a
time since last successful read: n/a
Finished

root@viper:~# less /root/ef51ecbd-b57e-42c3-8e73-dc56fdab454b.ddrescue.log

I used dd to skip to the LBA with the reported UNC error and read those bytes:
Error 120 occurred at disk power-on lifetime: 25726 hours (1071 days + 22 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

root@viper:~# dd iflag=direct oflag=nocache if=$device bs=512 skip=268435455 count=8 |stdbuf -o0 xxd | less
Note: LBA 268435455 of 9767541168 is located near the start of the disk, approx. 2.7% of the way in. (0x0fffffff is also the maximum 28-bit LBA value, so it’s possible this address is a reporting placeholder rather than the exact failing sector.)
There were no issues, no new errors logged.
Next I wanted to test writing to the LBA with issues, i.e. overwriting it with new data. Using dd I wrote a 4k block starting at the LBA reporting issues, and then checked the results were as expected:
root@viper:~# dd iflag=nocache oflag=direct status=progress if=/dev/zero of=$device bs=512 seek=268435455 count=8

8+0 records in
8+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00267102 s, 1.5 MB/s

root@viper:~# dd iflag=direct oflag=nocache status=progress if=$device bs=512 skip=268435455 count=8 |stdbuf -o0 od | head
0000000 000000 000000 000000 000000 000000 000000 000000 000000
*
8+0 records in
8+0 records out
0010000
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00188757 s, 2.2 MB/s
Note: the od command writes * when it encounters repeated output lines, therefore it’s possible to pipe it to head to check that every byte in the stream is zero. If more than the default 10 lines of output reach head, head exits and the pipeline is cut short.
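An alternative I might try next time (not used here): compare the read-back bytes directly against /dev/zero with cmp, which exits non-zero at the first difference:

dd iflag=direct if=$device bs=512 skip=268435455 count=8 2>/dev/null | cmp -n 4096 - /dev/zero && echo "all zero"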
I repeated the dd write a handful of times, no errors.
Next I wanted to stress write IO on the drive and check for any issues, so I used dd to zero the drive. The dd block size was a calculated factor of the disk’s 9767541168 sectors, close to 1MiB. I did this to ensure that the last block written is a full block; I’m not sure of dd’s behaviour if there is a remainder/partial write on the last block, i.e. whether it really zeros every byte in that case. I would assume there could be trailing bytes left unzeroed.
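A hedged sketch of how such a block size can be derived (I’m assuming this method; the point is simply that bs must divide the sector count exactly): find the divisor of the disk’s 512-byte sector count closest to 2048 sectors (1 MiB).

sectors=$(blockdev --getsz "$device")   # 9767541168 for this disk
for ((d=0; ; d++)); do
  for n in $((2048 - d)) $((2048 + d)); do
    if (( n > 0 && sectors % n == 0 )); then
      echo "bs=$(( n * 512 )) count=$(( sectors / n ))"   # bs=1064448 count=4698192 here
      break 2
    fi
  done
done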
⚠⚠⚠ WARNING / ACHTUNG: This is a destructive command and fills the provided $device with zeros.
root@viper:~# fdisk -l $device
Disk /dev/disk/by-id/ata-ST5000LM000-2AN170_WCJ_____: 4.6 TiB, 5000981078016 bytes, 9767541168 sectors
Disk model: ST5000LM000-2AN1
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


root@viper:~# dd iflag=nocache oflag=direct status=progress if=/dev/zero of=$device bs=1064448
5000922533376 bytes (5.0 TB, 4.5 TiB) copied, 47318 s, 106 MB/s
dd: error writing '/dev/disk/by-id/ata-ST5000LM000-2AN170_WCXXXX32': No space left on device
4698193+0 records in
4698192+0 records out
5000981078016 bytes (5.0 TB, 4.5 TiB) copied, 47318.9 s, 106 MB/s
bs=1064448 * records=4698192 = bytes=5000981078016, which is equal to the device’s total byte count.
The No space left on device error is expected, as I deliberately did not limit the dd with a count.
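The arithmetic can be double-checked in the shell:

echo $(( 1064448 * 4698192 ))   # 5000981078016, matching fdisk's byte count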
Checking the start and end of the disk, it has been zeroed:
root@viper:~# pv $device | stdbuf -o0 od | head
0000000 000000 000000 000000 000000 000000 000000 000000 000000
*
...

root@viper:~# tail -c 4096 $device | stdbuf -o0 od | head
0000000 000000 000000 000000 000000 000000 000000 000000 000000
*
0010000

# alternative with dd
root@viper:~# dd iflag=direct oflag=nocache if=$device bs=512 skip=$(( $(blockdev --getsz $device) - 8)) |stdbuf -o0 od | head
0000000 000000 000000 000000 000000 000000 000000 000000 000000
*
8+0 records in
8+0 records out
0010000
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00187961 s, 2.2 MB/s

conclusions

The disk has been stressed with:
a long smart self test
a full ddrescue read
and a full zero wipe write pass
No further errors detected or logged. Drive performance seemed nominal.
I think it’s safe to reuse the drive; I wouldn’t recommend that to a friend or customer, but for my use case the risk is OK.
The disk will be monitored by its internal error checking and correcting capabilities and by OpenZFS, with alerting via smartd and zed. Any data stored on it will be backed up in two ways:
snapraid triple parity. Can scrub and fix corruption and up to 3 disk failures.
DR full backup online/offline. Can restore files or disks as needed.
I expect the disk to fail in the coming months, and it can be replaced as needed.

TODOs

order some spare disks.
ensure zfs is scrubbing. (September scrubs ran as expected; a cron sketch follows this list)
ensure zed monitoring is working and alerting. (didn’t get September’s alerts?)
implement a snapshot management tool (or similar) for zfs snapshots (already using syncoid).
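A minimal cron sketch for the scrub TODOs (assumed schedule, log path and pool name; not my actual crontab):

# /etc/cron.d/scrubs (sketch)
# snapraid: scrub ~12% of the array, oldest blocks first, every Sunday
30 3 * * 0  root  snapraid scrub -p 12 -o 10 >> /var/log/snapraid/scrub.log 2>&1
# zfs: scrub the backup pool monthly
0 2 1 * *   root  zpool scrub store3-backup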

misc notes

Multiple references mention that the current default LUKS header is 16MiB. Here is an example from one of the store drives, only 1028KiB:
root@node:/dev/shm# cryptsetup luksHeaderBackup /dev/vdd --header-backup-file header.test

root@node:/dev/shm# du -BK header.test
1028K header.test

root@node:/dev/shm# rm header.test
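The header version (and hence which default applies) can be checked with luksDump, e.g.:

cryptsetup luksDump /dev/vdd | grep -i -m1 version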


Update 2021-09-13

I had reused the drive as a single-drive zfs pool for store3 and all seemed well until September’s scheduled zpool scrub:
[screenshot of the September scrub result omitted]
According to netdata disk utilisation alerts the zfs scrubs for September started circa Sun Sep 12 00:34:21 UTC 2021.
I did get a zed email about this too, BUT it was delayed:
[screenshot of the delayed zed email omitted]
I checked the netdata graphs for this time range:
[netdata graphs omitted]
a normal disk during scrub:
[netdata graph omitted]
the problem disk during scrub, note how the data collection ends nearly straight away:
[netdata graph omitted]
sensors:
[netdata sensors graph omitted]
ups:
[netdata UPS graph omitted]

netdata observations

The scrub ran in parallel on all (12?) zpools? 😡 [I wrote a script to control scrub concurrency; see the sketch after this list]
This made things hotter 🥵: >14.5° on the hottest CPU at peak, >13.5° on the cooler CPU.
🌡 3.8° was the delta between the hottest and cooler CPU at peak; I suspect the difference is down to the age/type of thermal compound.
It was also louder, and used ~13% more power at peak? ⚡
Peak disk read was 1.65GiB/s (~14 Gigabit/s) and dropped off as more blocks/disks were completed.
The problem disk’s data collection stopped early on in the scrub.
Something stopped netdata collection at ~04:00 UTC on the 13th; this was me preventing netdata from trying to monitor the unhealthy disk.
iowait remained after the healthy disk scrubs finished (I assume because of the unhealthy disk).
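The concurrency script boils down to something like this (a sketch of the idea, not the actual script):

#!/bin/bash
# scrub pools one at a time instead of letting them all run in parallel
for pool in $(zpool list -H -o name); do
  zpool scrub "$pool"
  # wait for this pool's scrub to finish before starting the next
  while zpool status "$pool" | grep -q "scrub in progress"; do
    sleep 300
  done
done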

tty kernel errors

note: the drive appears as /dev/sdn here
[screenshot of the tty kernel errors omitted]
This outcome means the drive is not fit for purpose and should be physically destroyed. I ran zpool destroy store3, made a new store3 pool, and have started to restore the data from the store3-backup pool.
Note it’s not possible to offline a disk in a pool with a single disk, which makes sense:
cannot offline ata-ST5000LM000-2AN170_WCXXXX32: no valid replicas
Note to self: I could have saved some time/effort here if I’d created this higher-risk pool as a mirror.
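For reference, creating it as a mirror is a one-liner (the disk ids below are placeholders):

zpool create store3 mirror /dev/disk/by-id/ata-DISK_A /dev/disk/by-id/ata-DISK_B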
smartctl now shows the errors for the disk:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 100 064 006 Pre-fail Always 0
3 Spin_Up_Time 0x0003 096 096 000 Pre-fail Always 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always 798
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always 0
7 Seek_Error_Rate 0x000f 080 060 045 Pre-fail Always 92801920
9 Power_On_Hours 0x0032 070 070 000 Old_age Always 26545 (6 71 0)
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always 40
183 SATA_Downshift_Count 0x0032 100 100 000 Old_age Always 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always 0
187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always 120
188 Command_Timeout 0x0032 100 099 000 Old_age Always 4295032833
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always 0
190 Airflow_Temperature_Cel 0x0022 070 060 040 Old_age Always 30 (Min/Max 23/39)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always 137
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always 1025
194 Temperature_Celsius 0x0022 030 040 000 Old_age Always 30 (0 8 0 0 0)
195 Hardware_ECC_Recovered 0x001a 100 064 000 Old_age Always 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always 8
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline 8
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline 19296 (70 171 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline 24263598112
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline 122160393414