Proxmox: Switching from legacy boot when there is no space for ESP partition

Clearing old ZFS labels

Published: 2024-05-06 · Last updated: N/A

The problem

A healthy rpool vdev child disk was reporting the existence of a corrupted rpool in other areas of its GPT (other partitions). 😲
The “actual”, healthy vdev child partition seemed fine ($disk1-part3), and the system didn’t appear to be impacted during boot or runtime by this cosmetic issue.
Nonetheless it could lead to future issues or confusion, so it’s best to proactively clear it up.
Here is an example that I tracked down to an “old” ZFS label located on $disk1:
# get a list of importable zpools from devices in /dev/disk/by-id
zpool import -d /dev/disk/by-id/

pool: rpool
id: 7184717139914799043
state: UNAVAIL
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-5E
config:

rpool UNAVAIL insufficient replicas
mirror-0 UNAVAIL insufficient replicas
wwn-0x500a0751____ba65-part1 UNAVAIL corrupted data <<< $disk1
sdj2 UNAVAIL

# -----

# on the same disk the healthy "actual" rpool
zpool status rpool
pool: rpool
state: ONLINE
scan: resilvered 96.9G in 00:05:53 with 0 errors on Sun May 5 16:10:37 2024
config:

NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
wwn-0x500a0751____ba65-part3 ONLINE 0 0 0 <<< $disk1
wwn-0x500a0751____6f6c-part3 ONLINE 0 0 0 <<< $disk2

errors: No known data errors

The clean-up

Why did this happen?

I’m not sure whether the old labels were a result of the legacy boot switch procedure, BUT they were certainly old and undesired, and zpool import -d was reporting at least one of them as an unavailable/corrupted rpool. I also spotted that one partition’s ZFS label was “rpool-old”, which came from the legacy bootloader switch procedure.
So it looked like I had two older ZFS labels hanging around on $disk1 and $disk2.

Could zpool labelclear be performed while the disk is active in the pool?

Perhaps, but why risk the command doing something unexpected or causing a problem for the active pool? I did a bit of research to see what others had experienced, and decided that detaching was the safe way to go.
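For reference, a minimal sketch of the general detach → labelclear → re-attach flow (the device names here are placeholders; labelclear should only ever be pointed at a device that is no longer part of any pool):

# detach the stale child from the mirror (the pool keeps running on the remaining child)
zpool detach rpool wwn-0xEXAMPLE-part3

# clear any ZFS labels on the now-unused partition
zpool labelclear -f /dev/disk/by-id/wwn-0xEXAMPLE-part3

# re-attach it to the surviving child and let it resilver
zpool attach rpool wwn-0xSURVIVOR-part3 wwn-0xEXAMPLE-part3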

🏆 Lessons learned

Details of the behaviour of sgdisk --zap-all

How to use zpool labelclear, and experience with it

It's good practice to run zpool labelclear once a disk/partition is no longer in use

It sounds like an obvious statement, but for my own notes: the partition table points to the sectors on the disk where the data for a given partition resides. When the partition table is zapped or wiped using tools such as wipefs or sgdisk, the partition's sector data is generally untouched and intact. This is why one can zap the GPT and recreate the previous table with the same geometry, or restore the table from a backup, and still find the partition data intact, AND why ZFS labels survive tools like sgdisk and wipefs.
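This is easy to demonstrate with a throwaway file-backed loop device. A sketch (the scratch file, loop device name and pool name labtest are illustrative only, and the loop device you actually get back may differ):

truncate -s 1G /var/tmp/zfslab.img
losetup -fP --show /var/tmp/zfslab.img   # prints the loop device, e.g. /dev/loop0
sgdisk -n1:0:0 /dev/loop0                # create a single partition spanning the device
zpool create labtest /dev/loop0p1
zpool export labtest
sgdisk --zap-all /dev/loop0              # destroy only the MBR/GPT structures
sgdisk -n1:0:0 /dev/loop0                # recreate the same geometry
partprobe /dev/loop0                     # make sure the kernel re-reads the table
zdb -l /dev/loop0p1                      # the old ZFS label is still there
zpool labelclear -f /dev/loop0p1         # now it is gone
losetup -d /dev/loop0 && rm /var/tmp/zfslab.img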
I’m not a fan of zeroing disks. Zeroing a whole disk just because it's no longer in use feels wasteful to me, especially for SSDs, which have limited write endurance. Zeroing or obfuscating key parts of a disk is a good practice though, for example the GPT or LUKS data areas, or the first N MiB of each partition.
Typically I use encrypted filesystems for sensitive data, so unless there is a concern about the private key being compromised, wiping encrypted partitions has limited security benefit. It's basically random data without the decryption key.
I’m going to adopt a best practice of running zpool labelclear once a disk is no longer in use AND prior to zapping the GPT or using wipefs. This should avoid a repeat of the problem described here.
I suspect that one of the reasons this situation arose in the first place is that these disks and/or partitions were at some point in use in a zpool and were either detached manually or via zpool detach. They weren't fully sanitised when they could have been (before being reused), and what remains is cruft on the disks, which was detected by zpool import -d.

Details of the size and location of the GPT

Note that for CT500MX500 SSDs the wwn suffix is the last N characters of the disk serial number (lower case)

CT500MX500SSD1_XXXXE21ABA65 e21aba65 - in this example 8 chars.
This appears to vary per manufacturer: a Seagate ST5000LM000-2AN170's wwn had no visible relation to the disk serial.
I’ve read in the past that the kernel's wwn, at least for spinning rust, is incremented slightly from the actual on-disk-label wwn. I cannot remember why right now, but it might be to do with multipath logic? (edit: yes, transport address and individual port identifiers)
💡 My research on wwn in the past highlighted that the main reason for choosing wwn is that the path *should* be portable on any system, any controller, any connection type, any sub-system etc. Consider for example a disk moving from a USB enclosure to a SATA enclosure. Most /dev/disk/by-id paths would change but the wwn *should* remain static and portable. A good reason to use it when referring to vdev child devices.
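As a quick sanity check, the ata- and wwn- aliases for the same disk should resolve to the same kernel device node (paths taken from the disks below):

readlink -f /dev/disk/by-id/ata-CT500MX500SSD1_1934E21ABA65
readlink -f /dev/disk/by-id/wwn-0x500a0751e21aba65
# both print the same /dev/sdX node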

Collection of background intel

# current HEALTHY rpool UUID 969344613226376403 (zdb lists this as pool_guid)


# $disk1
❌ BAD: zdb -l /dev/disk/by-id/ata-CT500MX500SSD1_1934E21ABA65 pool_guid: 7184717139914799043 / guid: 15370042588629395110 aka wwn-0x500a0751e21aba65
❌ BAD: zdb -l /dev/disk/by-id/ata-CT500MX500SSD1_1934E21ABA65-part1 pool_guid: 7184717139914799043 / guid: 14814223971765428024 aka wwn-0x500a0751e21aba65
part2 is boot partition
OK part3 is the latest/correct rpool pool_guid: 969344613226376403

# $disk2
❌ BAD: zdb -l /dev/disk/by-id/ata-CT500MX500SSD1_2045E4C56F6C pool_guid: 7184717139914799043 / guid: 18039295927987431032 aka wwn-0x500a0751e4c56f6c
OK part1 is boot partition
OK part2
OK part3 is the latest/correct rpool pool_guid: 969344613226376403
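These notes were gathered with zdb -l; a sweep along these lines (a sketch, reusing the by-id paths above) reproduces them by checking the whole-disk node plus each partition:

# check every label location on both rpool disks
for dev in /dev/disk/by-id/ata-CT500MX500SSD1_1934E21ABA65 \
           /dev/disk/by-id/ata-CT500MX500SSD1_2045E4C56F6C; do
  for part in "$dev" "$dev"-part{1..3}; do
    echo "== $part"
    zdb -l "$part" | grep -E "name:|pool_guid" || echo "no ZFS label"
  done
done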


root@viper:/var/tmp# proxmox-boot-tool status
Re-executing '/usr/sbin/proxmox-boot-tool' in new private mount namespace..
System currently booted with legacy bios
11AA-090A is configured with: grub (versions: 6.2.16-20-pve, 6.5.11-7-pve)
65FF-3295 is configured with: grub (versions: 6.2.16-20-pve, 6.5.11-7-pve)


root@viper:/var/tmp# zpool status rpool
pool: rpool
state: ONLINE
scan: scrub repaired 0B in 00:05:06 with 0 errors on Sun Apr 14 00:29:07 2024
config:

NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
wwn-0x500a0751e21aba65-part3 ONLINE 0 0 0
wwn-0x500a0751e4c56f6c-part3 ONLINE 0 0 0

errors: No known data errors

# disks

# disk 1 - rpool mirror vdev child device
ata-CT500MX500SSD1_1934E21ABA65 wwn-0x500a0751e21aba65 # 83% wear
# disk 2 - rpool mirror vdev child device
ata-CT500MX500SSD1_2045E4C56F6C wwn-0x500a0751e4c56f6c # 34% wear
# disk 3 - spare
ata-CT500MX500SSD1_2045E4C5724A wwn-0x500a0751e4c5724a # 0% wear
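The wear figures are presumably from the drives' SMART data; something like this reads them back (a sketch, as the exact name of the lifetime/wear attribute varies by vendor and model):

smartctl -A /dev/disk/by-id/ata-CT500MX500SSD1_1934E21ABA65 | grep -iE 'wear|percent|life'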

Prepare $disk3

disk3=/dev/disk/by-id/ata-CT500MX500SSD1_2045E4C5724A

lsblk --output NAME,MODEL,SERIAL,WWN -D $disk3
NAME MODEL SERIAL WWN
sdu CT500MX500SSD1 2045E4C5724A 0x500a0751e4c5724a
├─sdu1 0x500a0751e4c5724a
├─sdu2 0x500a0751e4c5724a
└─sdu3 0x500a0751e4c5724a

disk3_wwn=wwn-0x500a0751e4c5724a

sgdisk --print $disk3

Disk /dev/disk/by-id/ata-CT500MX500SSD1_2045E4C5724A: 976773168 sectors, 465.8 GiB
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): DE43CD39-B54D-42C5-AA26-055B989D5520
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 976773134
Partitions will be aligned on 2048-sector boundaries
Total free space is 968386540 sectors (461.8 GiB)

Number Start (sector) End (sector) Size Code Name
1 2048 4095 1024.0 KiB EF02 BIOS boot partition
2 4096 4194304 2.0 GiB EF00 EFI system partition
3 4196352 8390655 2.0 GiB EF00 EFI system partition


# check for ZFS labels on the drive being added
for part in $disk3 ${disk3}-part{1..3}; do echo $part; zdb -l $part; done
✅ clear, there were none

# preview wipefs for $disk3

wipefs --no-act --all $disk3-part3
wipefs --no-act --all $disk3-part2 # vfat
wipefs --no-act --all $disk3-part1
wipefs --no-act --all $disk3 # GPT

# $disk3 ZAP dry run safety check
printf $disk3 | grep 'C5724A$' && { echo match; echo zapping...; echo; echo sgdisk --zap-all $disk3; printf "\nzapped. exit code: %s\n" "$?"; }
/dev/disk/by-id/ata-CT500MX500SSD1_2045E4C5724A
match
zapping...

sgdisk --zap-all /dev/disk/by-id/ata-CT500MX500SSD1_2045E4C5724A

zapped. exit code: 0

# 💥 ZAP $disk3
printf $disk3 | grep 'C5724A$' && { echo match; echo zapping...; echo; sgdisk --zap-all $disk3; printf "\nzapped. exit code: %s\n" "$?"; }
/dev/disk/by-id/ata-CT500MX500SSD1_2045E4C5724A
match
zapping...

GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.

zapped. exit code: 0

# 👆 this zeroed at least the first 20480 bytes / 20 KiB of the drive and reloaded the partition table
# The GPT partition table (128 entries x 128 bytes) is 16,384 bytes / 16 KiB, or 32 (512-byte) sectors, starting at sector 2
# The protective MBR is stored in sector 0 (the first 512 bytes of the disk), and the GPT header in sector 1
# The GPT header and partition table are written at both the beginning and end of the disk.
# The last 1 MiB of the disk was also zeroed; it's unclear how much of that was performed by sgdisk.
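# 💡 aside: since these MBR/GPT structures are all that gets zapped, sgdisk can also back them up
# and restore them (a sketch; the backup file path is just an example):
#   sgdisk --backup=/var/tmp/disk3-gpt.backup $disk3       # save protective MBR, GPT headers and table
#   sgdisk --load-backup=/var/tmp/disk3-gpt.backup $disk3  # restore them later if needed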

# how I verified this

# get endianness
root@viper:/var/tmp# lscpu |grep Endian
Byte Order: Little Endian

endianness=little

# dump the first 1MiB of $disk3 to file
dd iflag=direct if=$disk3 bs=1M count=1 of=/var/tmp/CT500MX500SSD1_2045E4C5724A-1M-after.device

# hex dump first 1MiB
od -A x -t x2z --endian=$endianness CT500MX500SSD1_2045E4C5724A-1M-after.device |head
000000 0000 0000 0000 0000 0000 0000 0000 0000 >................<
*
005000 4241 4333 0000 0200 ffff ffff ffff ffff >AB3C............<

# 👆 this od format is very close to xxd format and has the advantage of detecting and skipping duplicate lines.
# format: hex byte offset, followed by 8 pairs of hex bytes, followed by the ASCII representation
# the * means that everything between offset 000000 and 005000 duplicated the previous line (all zeroes)
# hex 005000 is the byte offset of the first byte displayed on that line of the dump,
# i.e. the offset immediately after the previous (zeroed) run
# hex 005000 = 20480 decimal, so bytes from offset 0 up to 20480 were zeroed
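# (aside: a quick way to convert the hex offset to decimal in the shell)
printf '%d\n' 0x5000
20480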

# dump the last 1MiB of $disk3 to file (976773168 - 2048 = 976771120 sectors to skip; 2048 * 512 bytes = 1 MiB)
dd iflag=direct if=$disk3 bs=512 skip=976771120 of=/var/tmp/CT500MX500SSD1_2045E4C5724A-1M-END-after.device
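# (aside: the skip value can be derived rather than hard-coded; blockdev --getsz reports the size in 512-byte sectors)
# skip=$(( $(blockdev --getsz $disk3) - 2048 ))   # 976773168 - 2048 = 976771120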

# hex dump - we see the full 1 MiB has been zeroed
root@viper:/var/tmp# od -A x -t x2z --endian=$endianness CT500MX500SSD1_2045E4C5724A-1M-END-after.device
000000 0000 0000 0000 0000 0000 0000 0000 0000 >................<
*
100000

# the disk is ready to be used :)

Make a zpool checkpoint

⚠ Note that a checkpoint is not a backup, but it does provide the ability to undo/rewind the pool to a previous state. It will not undo changes made to the disks underneath ZFS, such as partition table changes or other changes made directly to the disks/partitions.
In case something goes wrong we have a point we can undo/rewind the rpool:
zpool checkpoint rpool
💡 read about zpool checkpoints in my cheatsheet
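For completeness (a sketch, not something run here): the checkpoint can later be discarded, or the pool rewound to it. Rewinding only happens at import time, which for a root pool means booting a rescue/live environment first:

zpool checkpoint --discard rpool            # drop the checkpoint once the work is done
zpool import --rewind-to-checkpoint rpool   # rewind a pool that is currently exported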

Attach $disk3 to rpool mirror

# attach $disk3 to rpool, wait for resilver

disk1=/dev/disk/by-id/ata-CT500MX500SSD1_1934E21ABA65

lsblk --output NAME,MODEL,SERIAL,WWN -D $disk1
NAME MODEL SERIAL WWN
sdf CT500MX500SSD1 1934E21ABA65 0x500a0751e21aba65
├─sdf1 0x500a0751e21aba65
├─sdf2 0x500a0751e21aba65
└─sdf3 0x500a0751e21aba65

disk1_wwn=wwn-0x500a0751e21aba65

# we can copy the GPT from $disk1
# 💡 echo first to dry-run!
sgdisk $disk1 --replicate=$disk3

# randomise $disk3 GUID
# 💡 echo first to dry-run!
sgdisk -G $disk3

# dry run attach
echo zpool attach rpool ${disk1_wwn}-part3 ${disk3_wwn}-part3
zpool attach rpool wwn-0x500a0751e21aba65-part3 wwn-0x500a0751e4c5724a-part3

# 💥 attach $disk3-part3 to rpool mirror
zpool attach rpool $disk1-part3 $disk3-part3

# ... resilvering ~5 mins
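# keep an eye on resilver progress, e.g.
# watch -n 10 zpool status rpool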

zpool status rpool
pool: rpool
state: ONLINE