Proxmox: Switching from legacy boot when there is no space for ESP partition

Clearing old ZFS labels

Published: 2024-05-06 · Last updated: N/A

The problem

A healthy rpool vdev child disk was reporting the existence of a corrupted rpool in other areas of its GPT (other partitions). 😲
The “actual”, healthy vdev child partition seemed fine ($disk1-part3), and the system didn’t appear to be impacted during boot or runtime by this cosmetic issue.
Nonetheless it could lead to future issues or confusion, so it’s best to proactively clear it up.
Here is an example that I tracked down to an “old” ZFS label located on $disk1:
# get a list of importable zpools from devices in /dev/disk/by-id
zpool import -d /dev/disk/by-id/

pool: rpool
id: 7184717139914799043
state: UNAVAIL
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-5E
config:

rpool UNAVAIL insufficient replicas
mirror-0 UNAVAIL insufficient replicas
wwn-0x500a0751____ba65-part1 UNAVAIL corrupted data <<< $disk1
sdj2 UNAVAIL

# -----

# on the same disk the healthy "actual" rpool
zpool status rpool
pool: rpool
state: ONLINE
scan: resilvered 96.9G in 00:05:53 with 0 errors on Sun May 5 16:10:37 2024
config:

NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
wwn-0x500a0751____ba65-part3 ONLINE 0 0 0 <<< $disk1
wwn-0x500a0751____6f6c-part3 ONLINE 0 0 0 <<< $disk2

errors: No known data errors

The clean-up

Why did this happen?

I’m not sure whether the old labels were a result of the legacy boot switch procedure, BUT they were certainly old and undesired, and zpool import -d was reporting at least one of them as an unavailable/corrupted rpool. I also spotted that one partition’s ZFS label was “rpool-old”, which came from the legacy bootloader switch procedure.
So it looked like I had two older ZFS labels hanging around on $disk1 and $disk2.

Could zpool labelclear be performed while the disk is active in the pool?

Perhaps, but why risk the command doing something unexpected or causing a problem for the active pool? I did a bit of research to see what others had experienced, and decided that detaching was the safe way to go.
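For reference, a minimal sketch of the general detach → labelclear → re-attach flow (the device names here are placeholders; labelclear should only ever be pointed at a device that is no longer part of any pool):

# detach the stale child from the mirror (the pool keeps running on the remaining child)
zpool detach rpool wwn-0xEXAMPLE-part3

# clear any ZFS labels on the now-unused partition
zpool labelclear -f /dev/disk/by-id/wwn-0xEXAMPLE-part3

# re-attach it to the surviving child and let it resilver
zpool attach rpool wwn-0xSURVIVOR-part3 wwn-0xEXAMPLE-part3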

🏆 Lessons learned

Details of the behaviour of sgdisk --zap-all

How to use zpool labelclear, and experience with it

It's good practice to run zpool labelclear once a disk/partition is no longer in use

It sounds like an obvious statement, but for my own notes: the partition table points to the sectors on the disk where the data for a given partition resides. When the partition table is zapped or wiped using tools such as wipefs or sgdisk, the partition's sector data is generally untouched and intact. This is why one can zap the GPT and recreate the previous table with the same geometry, or restore the table from a backup, and still find the partition data intact, AND why ZFS labels survive tools like sgdisk and wipefs.
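This is easy to demonstrate with a throwaway file-backed loop device. A sketch (the scratch file, loop device name and pool name labtest are illustrative only, and the loop device you actually get back may differ):

truncate -s 1G /var/tmp/zfslab.img
losetup -fP --show /var/tmp/zfslab.img   # prints the loop device, e.g. /dev/loop0
sgdisk -n1:0:0 /dev/loop0                # create a single partition spanning the device
zpool create labtest /dev/loop0p1
zpool export labtest
sgdisk --zap-all /dev/loop0              # destroy only the MBR/GPT structures
sgdisk -n1:0:0 /dev/loop0                # recreate the same geometry
partprobe /dev/loop0                     # make sure the kernel re-reads the table
zdb -l /dev/loop0p1                      # the old ZFS label is still there
zpool labelclear -f /dev/loop0p1         # now it is gone
losetup -d /dev/loop0 && rm /var/tmp/zfslab.img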
I’m not a fan of zeroing disks. Zeroing a whole disk just because it's no longer in use feels wasteful to me, especially for SSDs, which have limited write endurance. Zeroing or obfuscating key parts of a disk is a good practice though, for example the GPT or LUKS data areas, or the first N MiB of each partition.
Typically I use encrypted filesystems for sensitive data, so unless there is a concern about the private key being compromised, wiping encrypted partitions has limited security benefit. It's basically random data without the decryption key.
I’m going to adopt a best practice of running zpool labelclear once a disk is no longer in use AND prior to zapping the GPT or using wipefs. This should avoid a repeat of the problem described here.
I suspect that one of the reasons this situation arose in the first place is that these disks and/or partitions were at some point in use in a zpool and were either detached manually or via zpool detach. They weren't fully sanitised when they could have been (before being reused), and what remains is cruft on the disks, which was detected by zpool import -d.

Details of the size and location of the GPT

Note that for CT500MX500 SSDs the wwn suffix is the last N characters of the disk serial number (lower case)

CT500MX500SSD1_XXXXE21ABA65 e21aba65 - in this example 8 chars.
This appears to vary per manufacturer: a Seagate ST5000LM000-2AN170's wwn had no visible relation to the disk serial.
I’ve read in the past that the kernel's wwn, at least for spinning rust, is incremented slightly from the actual on-disk-label wwn. I cannot remember why right now, but it might be to do with multipath logic? (edit: yes, transport address and individual port identifiers)
💡 My research on wwn in the past highlighted that the main reason for choosing wwn is that the path *should* be portable on any system, any controller, any connection type, any sub-system etc. Consider for example a disk moving from a USB enclosure to a SATA enclosure. Most /dev/disk/by-id paths would change but the wwn *should* remain static and portable. A good reason to use it when referring to vdev child devices.
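As a quick sanity check, the ata- and wwn- aliases for the same disk should resolve to the same kernel device node (paths taken from the disks below):

readlink -f /dev/disk/by-id/ata-CT500MX500SSD1_1934E21ABA65
readlink -f /dev/disk/by-id/wwn-0x500a0751e21aba65
# both print the same /dev/sdX node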

Collection of background intel

# current HEALTHY rpool UUID 969344613226376403 (zdb lists this as pool_guid)


# $disk1
❌ BAD: zdb -l /dev/disk/by-id/ata-CT500MX500SSD1_1934E21ABA65 pool_guid: 7184717139914799043 / guid: 15370042588629395110 aka wwn-0x500a0751e21aba65
❌ BAD: zdb -l /dev/disk/by-id/ata-CT500MX500SSD1_1934E21ABA65-part1 pool_guid: 7184717139914799043 / guid: 14814223971765428024 aka wwn-0x500a0751e21aba65
part2 is boot partition
OK part3 is the latest/correct rpool pool_guid: 969344613226376403

# $disk2
❌ BAD: zdb -l /dev/disk/by-id/ata-CT500MX500SSD1_2045E4C56F6C pool_guid: 7184717139914799043 / guid: 18039295927987431032 aka wwn-0x500a0751e4c56f6c
OK part1 is boot partition
OK part2
OK part3 is the latest/correct rpool pool_guid: 969344613226376403
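These notes were gathered with zdb -l; a sweep along these lines (a sketch, reusing the by-id paths above) reproduces them by checking the whole-disk node plus each partition:

# check every label location on both rpool disks
for dev in /dev/disk/by-id/ata-CT500MX500SSD1_1934E21ABA65 \
           /dev/disk/by-id/ata-CT500MX500SSD1_2045E4C56F6C; do
  for part in "$dev" "$dev"-part{1..3}; do
    echo "== $part"
    zdb -l "$part" | grep -E "name:|pool_guid" || echo "no ZFS label"
  done
done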


root@viper:/var/tmp# proxmox-boot-tool status
Re-executing '/usr/sbin/proxmox-boot-tool' in new private mount namespace..
System currently booted with legacy bios
11AA-090A is configured with: grub (versions: 6.2.16-20-pve, 6.5.11-7-pve)
65FF-3295 is configured with: grub (versions: 6.2.16-20-pve, 6.5.11-7-pve)


root@viper:/var/tmp# zpool status rpool
pool: rpool
state: ONLINE
scan: scrub repaired 0B in 00:05:06 with 0 errors on Sun Apr 14 00:29:07 2024
config:

NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
wwn-0x500a0751e21aba65-part3 ONLINE 0 0 0
wwn-0x500a0751e4c56f6c-part3 ONLINE 0 0 0

errors: No known data errors

# disks

# disk 1 - rpool mirror vdev child device
ata-CT500MX500SSD1_1934E21ABA65 wwn-0x500a0751e21aba65 # 83% wear
# disk 2 - rpool mirror vdev child device
ata-CT500MX500SSD1_2045E4C56F6C wwn-0x500a0751e4c56f6c # 34% wear
# disk 3 - spare
ata-CT500MX500SSD1_2045E4C5724A wwn-0x500a0751e4c5724a # 0% wear
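The wear figures are presumably from the drives' SMART data; something like this reads them back (a sketch, as the exact name of the lifetime/wear attribute varies by vendor and model):

smartctl -A /dev/disk/by-id/ata-CT500MX500SSD1_1934E21ABA65 | grep -iE 'wear|percent|life'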

Prepare $disk3

disk3=/dev/disk/by-id/ata-CT500MX500SSD1_2045E4C5724A

lsblk --output NAME,MODEL,SERIAL,WWN -D $disk3
NAME MODEL SERIAL WWN
sdu CT500MX500SSD1 2045E4C5724A 0x500a0751e4c5724a
├─sdu1 0x500a0751e4c5724a
├─sdu2 0x500a0751e4c5724a
└─sdu3 0x500a0751e4c5724a

disk3_wwn=wwn-0x500a0751e4c5724a

sgdisk --print $disk3

Disk /dev/disk/by-id/ata-CT500MX500SSD1_2045E4C5724A: 976773168 sectors, 465.8 GiB
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): DE43CD39-B54D-42C5-AA26-055B989D5520
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 976773134
Partitions will be aligned on 2048-sector boundaries
Total free space is 968386540 sectors (461.8 GiB)

Number Start (sector) End (sector) Size Code Name
1 2048 4095 1024.0 KiB EF02 BIOS boot partition
2 4096 4194304 2.0 GiB EF00 EFI system partition
3 4196352 8390655 2.0 GiB EF00 EFI system partition


# check for ZFS labels on the drive being added
for part in $disk3 ${disk3}-part{1..3}; do echo $part; zdb -l $part; done
✅ clear, there were none

# preview wipefs for $disk3

wipefs --no-act --all $disk3-part3
wipefs --no-act --all $disk3-part2 # vfat
wipefs --no-act --all $disk3-part1
wipefs --no-act --all $disk3 # GPT

# $disk3 ZAP dry run safety check
printf $disk3 | grep 'C5724A$' && { echo match; echo zapping...; echo; echo sgdisk --zap-all $disk3; printf "\nzapped. exit code: %s\n" "$?"; }
/dev/disk/by-id/ata-CT500MX500SSD1_2045E4C5724A
match
zapping...

sgdisk --zap-all /dev/disk/by-id/ata-CT500MX500SSD1_2045E4C5724A

zapped. exit code: 0

# 💥 ZAP $disk3
printf $disk3 | grep 'C5724A$' && { echo match; echo zapping...; echo; sgdisk --zap-all $disk3; printf "\nzapped. exit code: %s\n" "$?"; }
/dev/disk/by-id/ata-CT500MX500SSD1_2045E4C5724A
match
zapping...

GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.

zapped. exit code: 0

# 👆 this zeroed at least the first 20480 bytes / 20 KiB of the drive and reloaded the partition table
# The GPT partition table (128 entries x 128 bytes) is 16,384 bytes / 16 KiB, or 32 (512-byte) sectors, starting at sector 2
# The protective MBR is stored in sector 0 (the first 512 bytes of the disk), and the GPT header in sector 1
# The GPT header and partition table are written at both the beginning and end of the disk.
# The last 1 MiB of the disk was also zeroed; it's unclear how much of that was performed by sgdisk.
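# 💡 aside: since these MBR/GPT structures are all that gets zapped, sgdisk can also back them up
# and restore them (a sketch; the backup file path is just an example):
#   sgdisk --backup=/var/tmp/disk3-gpt.backup $disk3       # save protective MBR, GPT headers and table
#   sgdisk --load-backup=/var/tmp/disk3-gpt.backup $disk3  # restore them later if needed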

# how I verified this

# get endianness
root@viper:/var/tmp# lscpu |grep Endian
Byte Order: Little Endian

endianness=little

# dump the first 1MiB of $disk3 to file
dd iflag=direct if=$disk3 bs=1M count=1 of=/var/tmp/CT500MX500SSD1_2045E4C5724A-1M-after.device

# hex dump first 1MiB
od -A x -t x2z --endian=$endianness CT500MX500SSD1_2045E4C5724A-1M-after.device |head
000000 0000 0000 0000 0000 0000 0000 0000 0000 >................<
*
005000 4241 4333 0000 0200 ffff ffff ffff ffff >AB3C............<

# 👆 this od format is very close to xxd format and has the advantage of detecting and skipping duplicate lines.
# format: hex byte offset, followed by 8 pairs of hex bytes, followed by the ASCII representation
# the * means that everything between offset 000000 and 005000 duplicated the previous line (all zeroes)
# hex 005000 is the byte offset of the first byte displayed on that line of the dump,
# i.e. the offset immediately after the previous (zeroed) run
# hex 005000 = 20480 decimal, so bytes from offset 0 up to 20480 were zeroed
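# (aside: a quick way to convert the hex offset to decimal in the shell)
printf '%d\n' 0x5000
20480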

# dump the last 1MiB of $disk3 to file (976773168 - 2048 = 976771120 sectors to skip; 2048 * 512 bytes = 1 MiB)
dd iflag=direct if=$disk3 bs=512 skip=976771120 of=/var/tmp/CT500MX500SSD1_2045E4C5724A-1M-END-after.device
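# (aside: the skip value can be derived rather than hard-coded; blockdev --getsz reports the size in 512-byte sectors)
# skip=$(( $(blockdev --getsz $disk3) - 2048 ))   # 976773168 - 2048 = 976771120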

# hex dump - we see the full 1 MiB has been zeroed
root@viper:/var/tmp# od -A x -t x2z --endian=$endianness CT500MX500SSD1_2045E4C5724A-1M-END-after.device
000000 0000 0000 0000 0000 0000 0000 0000 0000 >................<
*
100000

# the disk is ready to be used :)

Make a zpool checkpoint

⚠ Note that a checkpoint is not a backup, but it does provide the ability to undo/rewind the pool to a previous state. It will not undo changes made to the disks underneath ZFS, such as partition table changes or other changes made directly to the disks/partitions.
In case something goes wrong we have a point we can undo/rewind the rpool:
zpool checkpoint rpool
💡 read about zpool checkpoints in my cheatsheet
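For completeness (a sketch, not something run here): the checkpoint can later be discarded, or the pool rewound to it. Rewinding only happens at import time, which for a root pool means booting a rescue/live environment first:

zpool checkpoint --discard rpool            # drop the checkpoint once the work is done
zpool import --rewind-to-checkpoint rpool   # rewind a pool that is currently exported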

Attach $disk3 to rpool mirror

# attach $disk3 to rpool, wait for resilver

disk1=/dev/disk/by-id/ata-CT500MX500SSD1_1934E21ABA65

lsblk --output NAME,MODEL,SERIAL,WWN -D $disk1
NAME MODEL SERIAL WWN
sdf CT500MX500SSD1 1934E21ABA65 0x500a0751e21aba65
├─sdf1 0x500a0751e21aba65
├─sdf2 0x500a0751e21aba65
└─sdf3 0x500a0751e21aba65

disk1_wwn=wwn-0x500a0751e21aba65

# we can copy the GPT from $disk1
# 💡 echo first to dry-run!
sgdisk $disk1 --replicate=$disk3

# randomise $disk3 GUID
# 💡 echo first to dry-run!
sgdisk -G $disk3

# dry run attach
echo zpool attach rpool ${disk1_wwn}-part3 ${disk3_wwn}-part3
zpool attach rpool wwn-0x500a0751e21aba65-part3 wwn-0x500a0751e4c5724a-part3

# 💥 attach $disk3-part3 to rpool mirror
zpool attach rpool $disk1-part3 $disk3-part3

# ... resilvering ~5 mins
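# keep an eye on resilver progress, e.g.
# watch -n 10 zpool status rpool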

zpool status rpool
pool: rpool
state: ONLINE