Home lab & data vault

Snapraid parity migration observations - LUKS XFS to ZFS XFS raw image

Hypervisor and kvm info

Scenario

These tests primarily measure the transfer (seq write) and checksumming (seq read) of a 2.61TiB snapraid parity file between two disks. It's a single large file stored on an XFS file system. The tests run inside a kvm. The physical disks are 5TB 2.5” 5400 rpm SMR Seagate Barracudas.

Overall conclusion(s)

Ironically, Test #1 was the best-performing OpenZFS result, and all attempts to improve on it were unsuccessful. 😢
There is a fairly remarkable and repeatable performance degradation pattern visible in the netdata graphs for the OpenZFS tests. IO size remains constant but IOPS and IO bandwidth drop off significantly when compared to test #3 without ZFS.
For write bw, the test #10 2x striped zpool was still slower than the single-disk non-ZFS test #3.
Intel 900P slog doesn’t help to stabilise or mitigate the issue.
Test #3 demonstrates the kvm virtio_blk can handle at least 121 and 125 MiB/s seq writes and reads on these disks, i.e. kvm and virtio_blk overhead is unlikely to be causing the performance degradation or bottlenecks.
Test #15 demonstrates that virtio_scsi is not faster than virtio_blk and likely has more overhead.
These sequential IO tests have demonstrated that for this hardware and workload OpenZFS has an overhead/cost of ~30% IO bw performance vs. the non-ZFS tests. This ~30% degradation applies to ZFS single disk, mirror or striped pool.
The known OpenZFS issue “task txg_sync blocked for more than 120 seconds” was reproduced in Test #4. Having met this issue in the past, I’m suspicious the root cause of #9130 may be related to and/or causing the IO degradation observed in these tests.
The striped pool test #10 demonstrated fairly consistent IO bandwidth on both disks during the read and write tests. The obvious degradation in the single vdev tests was not visible in the graphs - however overall the performance was still ~30% under the physical maximums of the disks. ​Question: why does the IO degradation pattern seem to disappear in this test but not others? ​Question: why is there a ~30% overhead for ZFS?
Dataset compression on or off doesn’t appear to have a negative impact on performance. In fact testing suggests for this sequential workload compression probably helps performance.
Dataset encryption on or off doesn’t appear to have a negative impact on performance.
Dataset checksum on or off doesn’t appear to have a significant impact on performance.
zvol performance shared a familiar symmetry with the dataset tests AND incurred a ~4.6x multiplier / ~360% increase in system load avg. This is consistent with my previous testing of zvols and the known OpenZFS issue.
For this workload/test there doesn’t appear to have been any significant change/benefit in upgrading the hypervisor to the latest packages/kernels: pve 7.1-10 with kernel 5.13.19-4-pve and zfs-2.1.2-pve1 vs. pve 7.3-3 with kernel 5.15.74-1-pve and zfs-2.1.6-pve1.
There doesn’t appear to have been any significant benefit to changing the zfs recordsize from the default 128K to 256K to match the snapraid default parity block size. For this workload it seems ZFS performs better when recordsize is left at the default.
Test #1 (parity file 1 of 3) and Test #17 (parity file 2 of 3) suggest the issue is not specific to one file.
seq write overview
Name                                          test        avg MiB/s
test #1 dst transfer migration to zfs         seq write   102
test #2 dst transfer compression=off          seq write   89
test #3 dst transfer no zfs                   seq write   121
test #6 dst transfer recordsize=256K          seq write   88
test #8 dst transfer slog and sync=standard   seq write   97
test #9 dst transfer slog and sync=always     seq write   89
test #10 dst transfer 2x stripe               seq write   112
test #13 dst transfer xfs external logdev     seq write   83
test #16 dst transfer checksum=off            seq write   89
test #17 dst transfer 2nd parity              seq write   79.8
seq read overview
Name                                   test       avg MiB/s
test #1 src checksum no zfs            seq read   125
test #1 dst checksum zfs dataset       seq read   99
test #3 dst checksum no zfs            seq read   125
test #6 dst checksum recordsize=256K   seq read   100
test #10 dst checksum 2x stripe        seq read   177


Test Results Table

Scenario details

primarycache=all was set on all the OpenZFS tests.
Test #1 the src disk is P1 (/dev/sds) in Figure 1 and dst disk is P1 (/dev/sdr) in Figure 2. The way they are provisioned to the kvm is different (without and with OpenZFS).
rsync is being used for the seq write jobs, and pv plus xxh128sum for the seq read jobs.
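As a rough sketch (paths are placeholders, not the node's real mount points), the jobs look something like this:
# seq write: copy the parity file onto the new dst filesystem
rsync -a --progress /srv/parity-src/snapraid.parity /srv/parity-dst/
# seq read: stream the file through pv (for a live MiB/s readout) into the checksum tool
pv /srv/parity-dst/snapraid.parity | xxh128sum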
Here is a logical view of the node Figure 1:
image.png
In Figure 1 you can see there is a 6d+3p array. Each d zpool has a single disk - that's right, OpenZFS is not being used for real-time parity. Snapraid is employed in the kvm to provide the near-time parity (the p disks). The 6 d disks/zpools have been running happily for ~1y and were migrated from LUKS+XFS (they used to be like the p disks in Figure 1). Now it is time for the p parity disk migration to ZFS. A minimal sketch of this layout is shown below.
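A minimal sketch of the snapraid side of that layout, with placeholder mount points rather than the node's real ones:
# /etc/snapraid.conf (sketch) - 6 data disks, 3 parity disks
parity   /srv/parity1/snapraid.parity
2-parity /srv/parity2/snapraid.2-parity
3-parity /srv/parity3/snapraid.3-parity
data d1 /srv/data1/
data d2 /srv/data2/
# ... d3 to d6 follow the same pattern
content /var/snapraid/snapraid.content
content /srv/data1/snapraid.content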
So the target logical view is as follows Figure 2:
image.png
Before: the p disks are provisioned via virtio-blk via full physical disk utilising LUKS+XFS in the kvm. E.g.:
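The original snippet isn't reproduced here, but on Proxmox this style of provisioning is roughly as follows (vmid and disk ID are placeholders):
# Before (sketch): whole physical disk attached to the kvm as a virtio-blk device;
# LUKS + XFS are then layered on top of it inside the guest
qm set <vmid> -virtio2 /dev/disk/by-id/ata-ST5000LM000-XXXXXXXX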
After: the p disks are provisioned via virtio-blk via OpenZFS dataset raw XFS disk images. E.g.:
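Again a sketch only (pool, dataset, image file name and vmid are all placeholders):
# After (sketch): a compressed + encrypted dataset holds a sparse raw image,
# which is attached to the kvm as a virtio-blk drive; XFS lives inside the image
zfs create -o compression=zstd -o encryption=aes-256-gcm -o keyformat=passphrase tank/parity1
truncate -s 5T /tank/parity1/parity1.raw
qm set <vmid> -virtio2 /tank/parity1/parity1.raw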
What does the difference look like in IO flow terms?
The left side represents the IO flow of the p disks in Figure 1 (and test #3) (sans LUKS/dm-crypt). The right side represents the IO flow of the d disks in Figure 1 & 2 (and test #1).
Figure 3
image.png
Why migrate? ~1y ago I’d completed a very similar migration for the 6 data disks. This follow-up activity to migrate the parity disks was primarily to take advantage of coordinated snapshots across the snapraid data+parity datasets, plus all the other benefits of OpenZFS features, including moving the parity encryption workload out of the kvm.
Why XFS raw images and not zvols? I previously conducted performance testing on this node of raw images on datasets vs. zvols. The raw-image-on-dataset approach was performant and utilised significantly fewer hypervisor resources, i.e. load/cpu, when comparing like-for-like io workloads. In Test #14 you can see the zvol costs ~4.6 times more load / a ~360% increase in load, and obviously that means more heat generation and cooling demand, both of which increase power consumption too. It's worth mentioning that if a zvol costs ~4.6 times more load, this will reduce how many kvms/containers and how much workload you can run on the node. ​Summary: zvol is not viable on this node setup.
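For comparison, the zvol alternative that was ruled out would look something like this (names are placeholders):
# zvol approach (rejected): thin-provisioned volume exposed as a block device
zfs create -s -V 5T tank/parity1-zvol
qm set <vmid> -virtio2 /dev/zvol/tank/parity1-zvol
Functionally both approaches end up as a virtio-blk disk in the guest; the difference is which ZFS object backs it and, per the testing above, the load it generates on the hypervisor.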

A note on near-time vs. real-time parity

For this glacial data archiving use case I chose the snapraid near-time parity approach to keep things simple. Prior to implementation I did a fairly comprehensive pros and cons analysis of snapraid+zfs near-time parity vs. zfs real-time parity, and so far I’m still happy with the decision, especially the easy and cost-effective extension of the data storage (just keep adding d pools).

Data redundancy and backups

The 3p (triple parity) setup can sustain up to 3 d zpool failures without data loss. For DR each d disk zpool has a 1:1 replicated zpool utilising syncoid dataset replication. So files and/or zpools can be rebuilt from parity and/or the 1:1 DR backups. Parity can be rebuilt from data should a p disk fail. All of these approaches have been tested and work as designed.
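The replication itself is plain syncoid; a minimal sketch with placeholder pool names:
# Replicate all datasets under a d pool to its 1:1 DR backup pool
syncoid --recursive d1 d1-backup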

The future

Maybe I’ll re-evaluate ZFS raid real-time parity and/or striped mirrors but for now I’m very happy with the way the setup works and how easy it is to extend, recover and replicate. The only real downside of the current setup is the direct io bandwidth is limited to single disk speeds.
When I have workloads that need more performance and/or sync io - those workloads typically don’t require the redundancy and other benefits of ZFS, so I can utilise a partition on the PCIe INTEL 280GB 900P NVMe present in the node.
The majority of the omv kvm workload is sequential async io. At the moment I do not notice the disk bottleneck. Once something is written to cache then io is very snappy. Most of the io workload and files the node handles fit within the cache. In Figure 4 you can see a windows smb client src NVMe read io is able to utilise the 10 GbE network, and the dst cache on the smb server (kvm) is able to handle this ~3GB write to cache very fast:
Figure 4
image.png

Test #1 - Initial approach

Initial dst transfer

Transferring the 1st of 3 parity files (disks) went OK. ~7h 26m for ~2.9 TB (2.61TiB) of data. 26811 seconds to be precise to transfer 2737112 MiB is ~102 MiB/s.
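As a sanity check, the throughput figures quoted here and in the later tests are simply size divided by wall-clock duration:
# 2737112 MiB written in 26811 s  ->  ~102 MiB/s
echo "scale=1; 2737112 / 26811" | bc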
src read (hypervisor graph)
image.png
dst write (hypervisor graph)
image.png
The outcome of the rsync job:
image.png
rsync does a “during-the-transfer” checksum by default, but I wanted to be certain such a large file was not impacted by any random issues. So to be on the safe side, I started an “after-the-transfer” checksum of the src and dst parity files for peace of mind.
src checksum:
image.png
dst checksum:
image.png

Conditions

During all of these jobs/tests the hypervisor and guest kvm were not running any other specific workloads - only background load and IO.
Assumption: hard drive data tracks are filled from the outer platter edge towards the inner, and sectors are interleaved across multiple platters. Given that both drives and file systems in question started from a blank disk and are solely used for snapraid parity (a single large file), and that physical/logical sector numbers increase from the outer edge of the disk towards the inner edge, one would assume that the very large majority of data is written sequentially, starting at the lowest outermost sector and progressing towards the inner sectors, i.e. fragmentation should be very low on these disks. ​Unknown: the impact/relation of sectors to multiple platters. Are sectors interleaved or chunked per platter? I’m assuming interleaved for now.
Checking filefrag -v for the src and dst parity files, the logical and physical offsets are nearly all contiguous. src had 5 extents and dst had 4.
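For reference, the check was of this form (path is a placeholder):
# One row per extent, with logical and physical offsets; a handful of large
# extents means the file is effectively contiguous on disk
filefrag -v /srv/parity-dst/snapraid.parity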

src checksum

Here is the 7 hour graph of the disk bandwidth of the physical device on the hypervisor during the checksum of the src parity file.
image.png

Observations

The job took 6 hours and 4 minutes. 21840 secs to be precise to read 2737112 MiB (2.61TiB) is an avg of 125 MiB/s.
The decrease in speed over the transfer time would seem to align with expected decrease in read speed as the disk heads move closer to the inner area of the disk platters.
Everything seems nominal as expected for bulk sequential read IO on this kind of disk.

dst checksum

Here is the 8 hour graph of the disk bandwidth of the physical device on the hypervisor during the checksum of the dst parity file.
image.png

Observations

The job took 7 hours and 41 minutes. 27660 secs to be precise to read 2737112 MiB (2.61TiB) is an avg of 99 MiB/s. 🔴⚠️ This read job was 15 minutes slower than the dst transfer/write job. 😲
At ~6hrs ~79% completion the dst checksum job still had 0.53 TiB of 2.61TiB remaining. The src checksum job was finished. 🔴 The src checksum was 97 minutes faster than the dst zfs based checksum. 🔴 This means the dst zfs job was ~27% slower (increased duration). The same applies for the bw rate.
There does seem to be a lot of symmetry between this dst read graph (green) and the dst write graph (red) from the initial transfer. Here is a blended image of the write bw graph (vertically flipped) vs. the read bw graph: ​
image.png
This suggests that the zfs block handling logic and/or the block content is having an impact on performance.

Q&A

Why does the ZFS read performance degrade for ~4 hours and then pick back up? Same question can be asked about write performance.
Is this a result of compression and/or encryption? ​Answer: test #2 result: unlikely a compression issue.
Is this a result of a slow spindle based disk and/or SMR disk? ​Answer: test #3 result: unlikely due to physical disk.
Why does adjusting primarycache=none|metadata cause a massive negative performance impact? One would assume that bypassing the cache-related code paths in zfs would make sustained sequential IO transfer perform around the same as physical disk IO?
Can adjusting the zfs dataset recordsize to match the snapraid parity block size improve things? ​Answer: tests #4, #5 and #6 suggest no significant benefit.
Could this be related to physical disk sector layout / geometry somehow? ​Answer: test #3 result: physical disk performance in virtio pass-through seems fine.
Could results be impacted by how nocache or pv work? ​Answer: based on testing, unlikely.
Could sparse raw files be causing an issue? Could test this by creating a new drive with proxmox and then replacing the sparse file with a non-sparse file. This is likely to be very slow to prepare because of the time needed to create the non-sparse disk.
Could the hypervisor or guest io scheduler cfg be causing some issues? My experience tells me the cfg is OK, BUT in test #4 the kernel logged that ZFS txg_sync caused an io_schedule_timeout(), which at least suggests ZFS is experiencing io scheduling issues. What I don’t currently know is whether this is related to Linux kernel code or zfs code issues.
Why does the dst checksum (read) job take longer than the dst transfer (write)?
What is the performance ceiling of virtio-blk?
What is the performance ceiling of OpenZFS?

Test #2 with compression=off

I’d not used compression=zstd before, so I wanted to turn off compression to eliminate this factor.
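The property change itself is a one-liner (dataset name is a placeholder); compression only applies to blocks written after the change, which is fine here because the parity file is removed and re-transferred:
zfs set compression=off tank/parity1
zfs get compression,compressratio tank/parity1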
I removed the parity file with rm on the kvm and then trimmed the xfs filesystem. The raw disk image on the hypervisor reported:

dst transfer

image.png

rsync result:
image.png

Observations

8h 32m compression=off vs. 7h 26m with compression=zstd. 🔴 ~15% increase in time taken. 🔴 ~89 MiB/s vs. ~102 MiB/s a ~12% decrease in disk bandwidth.
The slowdowns have a close symmetry to test #1 dst transfer.

dst checksum

After ~5 hours the job was only 75% completed and had close symmetry to test #1. I aborted the job and moved on to the 3rd test. Here is the disk bw graph for the job:
image.png

Test #3 VirtIO Block full dst disk - no LUKS etc

I wanted to eliminate the physical dst disk as a possible factor impacting the IO degradation, so I ejected the dst disk from the kvm, destroyed the pool, ran wipefs on the disk and provisioned the same disk as a pass-through virtio device (the same approach as the src parity disk, sans the LUKS overhead). So this is about as direct and raw as IO gets in virtualisation.
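The reprovisioning steps were along these lines (vmid, pool name and disk ID are placeholders):
qm set <vmid> -delete virtio2                                     # detach the dst disk from the kvm
zpool destroy parity1                                             # drop the single-disk pool
wipefs -a /dev/disk/by-id/ata-ST5000LM000-XXXXXXXX                # clear remaining on-disk signatures
qm set <vmid> -virtio2 /dev/disk/by-id/ata-ST5000LM000-XXXXXXXX   # pass the whole disk straight through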

dst transfer

image.png

rsync result:
image.png

dst checksum

image.png

dst checksum result:
image.png

Observations

The performance and graphs match the src disk. This would suggest there is nothing wrong with the physical dst disk or virtio paravirtualisation in kvm - at least for pass-through disks.
The dst transfer job took 6h 16m. 22632 seconds to be precise to transfer 2737112 MiB (2.61TiB) - an avg of ~121 MiB/s.
The dst checksum job took 6h 5m. 21905 seconds to be precise to read 2737112 MiB (2.61TiB) - an avg of ~125 MiB/s.

Test #4 virtio zfs raw image with 256K recordsize

256KiB is the default internal block size of snapraid parity, so it might help to suggest this recordsize to ZFS. That said, ZFS is supposed to auto-tune block sizes, so I don’t expect this will change much, but it’s worth a test:
recordsize: Specifies a suggested block size for files in the file system. This property is designed solely for use with database workloads that access files in fixed-size records. ZFS automatically tunes block sizes according to internal algorithms optimized for typical access patterns.
In this test I decided to set compression=off and encryption=off to eliminate those code paths, or some related aspect, causing an issue, e.g. entropy starvation.
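A sketch of the dataset for this test (name is a placeholder); note that encryption is a create-time property, so “encryption=off” simply means not enabling it when the dataset is created:
zfs create -o recordsize=256K -o compression=off tank/parity1-256k
zfs get recordsize,compression,encryption tank/parity1-256k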
On entropy starvation: given that the hypervisor is running a kernel >=5.6, starvation should be a thing of the past. The hypervisor is actually still running the haveged entropy daemon, so this aspect should really be a non-issue:
The haveged service is now obsolete (starting from kernel 5.6).
I have checked this on the hypervisor and subsequently purged haveged related packages from the system. See quick verification tests:
image.png
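The verification amounts to something like the following (an approximation, since the exact commands are only in the screenshot above):
cat /proc/sys/kernel/random/entropy_avail   # kernel's entropy estimate
pv /dev/urandom > /dev/null                 # sustained MiB/s readout from the kernel CSPRNG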
So the hypervisor is able to produce ~200 MiB of entropy per second, which should be more than enough for encryption=aes-256-gcm and these 2.5” 5400-RPM spindle drives. Nonetheless this test will run with encryption=off to eliminate the encryption code path as a source of issues.

dst transfer

(yellow mark for the first kernel INFO msg):
image.png
I stopped the rsync at ~75% because of an issue which shows up in the kernel logs with:
txg_sync:XXXXXX blocked for more than 120 seconds
image.png
A summary of the INFO messages without the call traces:

Observations

This test reproduced a known issue.