Home lab & data vault

Snapraid parity migration observations - LUKS XFS to ZFS XFS raw image

Hypervisor and kvm info

Scenario

These tests primarily measure the transfer (seq write) and checksumming (seq read) of a 2.61TiB snapraid parity file between two disks. It's a single large file stored on an XFS file system. The tests run inside a kvm. The physical disks are 5TB 2.5” 5400 rpm SMR Seagate Barracudas.

Overall conclusion(s)

Ironically, Test #1 was the best-performing OpenZFS result, and all attempts to improve on it were unsuccessful. 😢
There is a fairly remarkable and repeatable performance degradation pattern visible in the netdata graphs for the OpenZFS tests. IO size remains constant but IOPS and IO bandwidth drop off significantly when compared to test #3 without ZFS.
For write bw, the test #10 2x striped zpool was still slower than the single-disk non-ZFS test #3.
Intel 900P slog doesn’t help to stabilise or mitigate the issue.
Test #3 demonstrates the kvm virtio_blk can handle at least 121 and 125 MiB/s seq writes and reads on these disks, i.e. kvm and virtio_blk overhead is unlikely to be causing the performance degradation or bottlenecks.
Test #15 demonstrates that virtio_scsi is not faster than virtio_blk and likely has more overhead.
These sequential IO tests have demonstrated that for this hardware and workload OpenZFS has an overhead/cost of ~30% IO bw performance vs. the non-ZFS tests. This ~30% degradation applies to ZFS single disk, mirror or striped pool.
The known OpenZFS issue “task txg_sync blocked for more than 120 seconds” was reproduced in Test #4. Having met this issue in the past, I’m suspicious the root cause of #9130 may be related to and/or causing the IO degradation observed in these tests.
The striped pool test #10 demonstrated fairly consistent IO bandwidth on both disks during the read and write tests. The obvious degradation in the single vdev tests was not visible in the graphs - however overall the performance was still ~30% under the physical maximums of the disks. ​Question: why does the IO degradation pattern seem to disappear in this test but not others? ​Question: why is there a ~30% overhead for ZFS?
Dataset compression on or off doesn’t appear to have a negative impact on performance. In fact testing suggests for this sequential workload compression probably helps performance.
Dataset encryption on or off doesn’t appear to have a negative impact on performance.
Dataset checksum on or off doesn’t appear to have a significant impact on performance.
zvol performance shared a familiar symmetry with the dataset tests AND incurred a ~4.6x multiplier / ~360% increase in system load avg. This is consistent with my previous testing of zvols and the known OpenZFS issue.
For this workload/test there doesn’t appear to have been any significant change/benefit in upgrading the hypervisor to the latest packages/kernels: pve 7.1-10 with kernel 5.13.19-4-pve and zfs-2.1.2-pve1 vs. pve 7.3-3 with kernel 5.15.74-1-pve and zfs-2.1.6-pve1.
There doesn’t appear to have been any significant benefit to changing the zfs recordsize from the default 128K to 256K to match the snapraid default parity block size. For this workload it seems ZFS performs better when recordsize is left at the default.
Test #1 (parity file 1 of 3) and Test #17 (parity file 2 of 3) suggest the issue is not specific to one file.
seq write overview
Name                                          test        avg MiB/s
test #1 dst transfer migration to zfs         seq write   102
test #2 dst transfer compression=off          seq write   89
test #3 dst transfer no zfs                   seq write   121
test #6 dst transfer recordsize=256K          seq write   88
test #8 dst transfer slog and sync=standard   seq write   97
test #9 dst transfer slog and sync=always     seq write   89
test #10 dst transfer 2x stripe               seq write   112
test #13 dst transfer xfs external logdev     seq write   83
test #16 dst transfer checksum=off            seq write   89
test #17 dst transfer 2nd parity              seq write   79.8
seq read overview
Name                                   test       avg MiB/s
test #1 src checksum no zfs            seq read   125
test #1 dst checksum zfs dataset       seq read   99
test #3 dst checksum no zfs            seq read   125
test #6 dst checksum recordsize=256K   seq read   100
test #10 dst checksum 2x stripe        seq read   177


Test Results Table

Scenario details

primarycache=all was set on all the OpenZFS tests.
Test #1 the src disk is P1 (/dev/sds) in Figure 1 and dst disk is P1 (/dev/sdr) in Figure 2. The way they are provisioned to the kvm is different (without and with OpenZFS).
rsync is being used for the seq write jobs, and pv plus xxh128sum for the seq read jobs.
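As a rough sketch (paths are placeholders, not the node's real mount points), the jobs look something like this:
# seq write: copy the parity file onto the new dst filesystem
rsync -a --progress /srv/parity-src/snapraid.parity /srv/parity-dst/
# seq read: stream the file through pv (for a live MiB/s readout) into the checksum tool
pv /srv/parity-dst/snapraid.parity | xxh128sum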
Here is a logical view of the node Figure 1:
image.png
In Figure 1 you can see there is a 6d+3p array. Each d zpool has a single disk - that's right, OpenZFS is not being used for real-time parity. Snapraid is employed in the kvm to provide the near-time parity (the p disks). The 6 d disks/zpools have been running happily for ~1y and were migrated from LUKS+XFS (they used to be like the p disks in Figure 1). Now it is time for the p parity disk migration to ZFS. A minimal sketch of this layout is shown below.
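A minimal sketch of the snapraid side of that layout, with placeholder mount points rather than the node's real ones:
# /etc/snapraid.conf (sketch) - 6 data disks, 3 parity disks
parity   /srv/parity1/snapraid.parity
2-parity /srv/parity2/snapraid.2-parity
3-parity /srv/parity3/snapraid.3-parity
data d1 /srv/data1/
data d2 /srv/data2/
# ... d3 to d6 follow the same pattern
content /var/snapraid/snapraid.content
content /srv/data1/snapraid.content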
So the target logical view is as follows Figure 2:
image.png
Before: the p disks are provisioned via virtio-blk via full physical disk utilising LUKS+XFS in the kvm. E.g.:
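The original snippet isn't reproduced here, but on Proxmox this style of provisioning is roughly as follows (vmid and disk ID are placeholders):
# Before (sketch): whole physical disk attached to the kvm as a virtio-blk device;
# LUKS + XFS are then layered on top of it inside the guest
qm set <vmid> -virtio2 /dev/disk/by-id/ata-ST5000LM000-XXXXXXXX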
After: the p disks are provisioned via virtio-blk via OpenZFS dataset raw XFS disk images. E.g.:
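Again a sketch only (pool, dataset, image file name and vmid are all placeholders):
# After (sketch): a compressed + encrypted dataset holds a sparse raw image,
# which is attached to the kvm as a virtio-blk drive; XFS lives inside the image
zfs create -o compression=zstd -o encryption=aes-256-gcm -o keyformat=passphrase tank/parity1
truncate -s 5T /tank/parity1/parity1.raw
qm set <vmid> -virtio2 /tank/parity1/parity1.raw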
What does the difference look like in IO flow terms?
The left side represents the IO flow of the p disks in Figure 1 (and test #3) (sans LUKS/dm-crypt). The right side represents the IO flow of the d disks in Figure 1 & 2 (and test #1).
Figure 3
image.png
Why migrate? ~1y ago I’d completed a very similar migration for the 6 data disks. This follow-up activity to migrate the parity disks was primarily to take advantage of coordinated snapshots across the snapraid data+parity datasets, plus all the other benefits of OpenZFS features, including moving the parity encryption workload out of the kvm.
Why XFS raw images and not zvols? I previously conducted performance testing on this node of raw images on datasets vs. zvols. The raw-image-on-dataset approach was performant and utilised significantly fewer hypervisor resources, i.e. load/cpu, when comparing like-for-like io workloads. In Test #14 you can see the zvol costs ~4.6 times more load / a ~360% increase in load, and obviously that means more heat generation and cooling demand, both of which increase power consumption too. It's worth mentioning that if a zvol costs ~4.6 times more load, this will reduce how many kvms/containers and how much workload you can run on the node. ​Summary: zvol is not viable on this node setup.
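For comparison, the zvol alternative that was ruled out would look something like this (names are placeholders):
# zvol approach (rejected): thin-provisioned volume exposed as a block device
zfs create -s -V 5T tank/parity1-zvol
qm set <vmid> -virtio2 /dev/zvol/tank/parity1-zvol
Functionally both approaches end up as a virtio-blk disk in the guest; the difference is which ZFS object backs it and, per the testing above, the load it generates on the hypervisor.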

A note on near-time vs. real-time parity

For this glacial data archiving use case I chose the snapraid near-time parity approach to keep things simple. Prior to implementation I did a fairly comprehensive pros and cons analysis of snapraid+zfs near-time parity vs. zfs real-time parity, and so far I’m still happy with the decision, especially the easy and cost-effective extension of the data storage (just keep adding d pools).

Data redundancy and backups

The 3p (triple parity) setup can sustain up to 3 d zpool failures without data loss. For DR each d disk zpool has a 1:1 replicated zpool utilising syncoid dataset replication. So files and/or zpools can be rebuilt from parity and/or the 1:1 DR backups. Parity can be rebuilt from data should a p disk fail. All of these approaches have been tested and work as designed.
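The replication itself is plain syncoid; a minimal sketch with placeholder pool names:
# Replicate all datasets under a d pool to its 1:1 DR backup pool
syncoid --recursive d1 d1-backup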

The future

Maybe I’ll re-evaluate ZFS raid real-time parity and/or striped mirrors but for now I’m very happy with the way the setup works and how easy it is to extend, recover and replicate. The only real downside of the current setup is the direct io bandwidth is limited to single disk speeds.
When I have workloads that need more performance and/or sync io - those workloads typically don’t require the redundancy and other benefits of ZFS, so I can utilise a partition on the PCIe INTEL 280GB 900P NVMe present in the node.
The majority of the omv kvm workload is sequential async io. At the moment I do not notice the disk bottleneck. Once something is written to cache then io is very snappy. Most of the io workload and files the node handles fit within the cache. In Figure 4 you can see a windows smb client src NVMe read io is able to utilise the 10 GbE network, and the dst cache on the smb server (kvm) is able to handle this ~3GB write to cache very fast:
Figure 4
image.png

Test #1 - Initial approach

Initial dst transfer

Transferring the 1st of 3 parity files (disks) went OK. ~7h 26m for ~2.9 TB (2.61TiB) of data. 26811 seconds to be precise to transfer 2737112 MiB is ~102 MiB/s.
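As a sanity check, the throughput figures quoted here and in the later tests are simply size divided by wall-clock duration:
# 2737112 MiB written in 26811 s  ->  ~102 MiB/s
echo "scale=1; 2737112 / 26811" | bc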
src read (hypervisor graph)
image.png
dst write (hypervisor graph)
image.png
The outcome of the rsync job:
image.png
rsync does a “during-the-transfer” checksum by default, but I wanted to be certain such a large file was not impacted by any random issues. So to be on the safe side, I started an “after-the-transfer” checksum of the src and dst parity files for peace of mind.
src checksum:
image.png
dst checksum:
image.png

Conditions

During all of these jobs/tests the hypervisor and guest kvm were not running any other specific workloads - only background load and IO.
Assumption: hard drive data tracks are filled from the outer platter edge towards the inner, and sectors are interleaved across multiple platters. Given that both drives and file systems in question started from a blank disk and are solely used for snapraid parity (a single large file), and that physical/logical sector numbers increase from the outer edge of the disk towards the inner edge, one would assume that the very large majority of data is written sequentially, starting at the lowest outermost sector and progressing towards the inner sectors, i.e. fragmentation should be very low on these disks. ​Unknown: the impact/relation of sectors to multiple platters. Are sectors interleaved or chunked per platter? I’m assuming interleaved for now.
Checking filefrag -v for the src and dst parity files, the logical and physical offsets are nearly all contiguous. src had 5 extents and dst had 4.
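For reference, the check was of this form (path is a placeholder):
# One row per extent, with logical and physical offsets; a handful of large
# extents means the file is effectively contiguous on disk
filefrag -v /srv/parity-dst/snapraid.parity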

src checksum

Here is the 7 hour graph of the disk bandwidth of the physical device on the hypervisor during the checksum of the src parity file.
image.png

Observations

The job took 6 hours and 4 minutes. 21840 secs to be precise to read 2737112 MiB (2.61TiB) is an avg of 125 MiB/s.
The decrease in speed over the transfer time would seem to align with expected decrease in read speed as the disk heads move closer to the inner area of the disk platters.
Everything seems nominal as expected for bulk sequential read IO on this kind of disk.

dst checksum

Here is the 8 hour graph of the disk bandwidth of the physical device on the hypervisor during the checksum of the dst parity file.
image.png

Observations

The job took 7 hours and 41 minutes. 27660 secs to be precise to read 2737112 MiB (2.61TiB) is an avg of 99 MiB/s. 🔴⚠️ This read job was 15 minutes slower than the dst transfer/write job. 😲
At ~6hrs ~79% completion the dst checksum job still had 0.53 TiB of 2.61TiB remaining. The src checksum job was finished. 🔴 The src checksum was 97 minutes faster than the dst zfs based checksum. 🔴 This means the dst zfs job was ~27% slower (increased duration). The same applies for the bw rate.
There does seem to be a lot of symmetry between this dst read graph (green) and the dst write graph (red) from the initial transfer. Here is a blended image of the write bw graph (vertically flipped) vs. the read bw graph: ​
image.png
This suggests that the zfs block handling logic and/or the block content is having an impact on performance.

Q&A

Why does the ZFS read performance degrade for ~4 hours and then pick back up? Same question can be asked about write performance.
Is this a result of compression and/or encryption? ​Answer: test #2 result: unlikely a compression issue.
Is this a result of a slow spindle based disk and/or SMR disk? ​Answer: test #3 result: unlikely due to physical disk.
Why does adjusting primarycache=none|metadata cause a massive negative performance impact? One would assume that bypassing the cache-related code paths in zfs would make sustained sequential IO transfer perform around the same as physical disk IO?
Can adjusting the zfs dataset recordsize to match the snapraid parity block size improve things? ​Answer: tests #4, #5 and #6 suggest no significant benefit.
Could this be related to physical disk sector layout / geometry somehow? ​Answer: test #3 result: physical disk performance in virtio pass-through seems fine.
Could results be impacted by how nocache or pv work? ​Answer: based on testing, unlikely.
Could sparse raw files be causing an issue? Could test this by creating a new drive with proxmox and then replacing the sparse file with a non-sparse file. This is likely to be very slow to prepare because of the time needed to create the non-sparse disk.
Could the hypervisor or guest io scheduler cfg be causing some issues? My experience tells me the cfg is OK, BUT in test #4 the kernel logged that ZFS txg_sync caused an io_schedule_timeout(), which at least suggests ZFS is experiencing io scheduling issues. What I don’t currently know is whether this is related to Linux kernel code or zfs code issues.
Why does the dst checksum (read) job take longer than the dst transfer (write)?
What is the performance ceiling of virtio-blk?
What is the performance ceiling of OpenZFS?

Test #2 with compression=off

I’d not used compression=zstd before, so I wanted to turn off compression to eliminate this factor.
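The property change itself is a one-liner (dataset name is a placeholder); compression only applies to blocks written after the change, which is fine here because the parity file is removed and re-transferred:
zfs set compression=off tank/parity1
zfs get compression,compressratio tank/parity1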
I removed the parity file with rm on the kvm and then trimmed the xfs filesystem. The raw disk image on the hypervisor reported:

dst transfer

image.png

rsync result:
image.png

Observations

8h 32m compression=off vs. 7h 26m with compression=zstd. 🔴 ~15% increase in time taken. 🔴 ~89 MiB/s vs. ~102 MiB/s a ~12% decrease in disk bandwidth.
The slowdowns have a close symmetry to test #1 dst transfer.

dst checksum

After ~5 hours the job was only 75% completed and had close symmetry to test #1. I aborted the job and moved on to the 3rd test. Here is the disk bw graph for the job:
image.png

Test #3 VirtIO Block full dst disk - no LUKS etc

I wanted to eliminate the physical dst disk as a possible factor impacting the IO degradation, so I ejected the dst disk from the kvm, destroyed the pool, ran wipefs on the disk and provisioned the same disk as a pass-through virtio device (the same approach as the src parity disk, sans the LUKS overhead). So this is about as direct and raw as IO gets in virtualisation.
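The reprovisioning steps were along these lines (vmid, pool name and disk ID are placeholders):
qm set <vmid> -delete virtio2                                     # detach the dst disk from the kvm
zpool destroy parity1                                             # drop the single-disk pool
wipefs -a /dev/disk/by-id/ata-ST5000LM000-XXXXXXXX                # clear remaining on-disk signatures
qm set <vmid> -virtio2 /dev/disk/by-id/ata-ST5000LM000-XXXXXXXX   # pass the whole disk straight through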

dst transfer

image.png

rsync result:
image.png

dst checksum

image.png

dst checksum result:
image.png

Observations

The performance and graphs match the src disk. This would suggest there is nothing wrong with the physical dst disk or virtio paravirtualisation in kvm - at least for pass-through disks.
The dst transfer job took 6h 16m. 22632 seconds to be precise to transfer 2737112 MiB (2.61TiB) - an avg of ~121 MiB/s.
The dst checksum job took 6h 5m. 21905 seconds to be precise to read 2737112 MiB (2.61TiB) - an avg of ~125 MiB/s.

Test #4 virtio zfs raw image with 256K recordsize

256KiB is the default internal block size of snapraid parity, so it might help to suggest this recordsize to ZFS. That said, ZFS is supposed to auto-tune block sizes, so I don’t expect this will change much, but it’s worth a test:
recordsize: Specifies a suggested block size for files in the file system. This property is designed solely for use with database workloads that access files in fixed-size records. ZFS automatically tunes block sizes according to internal algorithms optimized for typical access patterns.
In this test I decided to set compression=off and encryption=off to eliminate those code paths, or some related aspect, causing an issue, e.g. entropy starvation.
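A sketch of the dataset for this test (name is a placeholder); note that encryption is a create-time property, so “encryption=off” simply means not enabling it when the dataset is created:
zfs create -o recordsize=256K -o compression=off tank/parity1-256k
zfs get recordsize,compression,encryption tank/parity1-256k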
On entropy starvation: given that the hypervisor is running a kernel >=5.6, starvation should be a thing of the past. The hypervisor is actually still running the haveged entropy daemon, so this aspect should really be a non-issue:
The haveged service is now obsolete (starting from kernel 5.6).
I have checked this on the hypervisor and subsequently purged haveged related packages from the system. See quick verification tests:
image.png
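The verification amounts to something like the following (an approximation, since the exact commands are only in the screenshot above):
cat /proc/sys/kernel/random/entropy_avail   # kernel's entropy estimate
pv /dev/urandom > /dev/null                 # sustained MiB/s readout from the kernel CSPRNG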
So the hypervisor is able to produce ~200 MiB of entropy per second, which should be more than enough for encryption=aes-256-gcm and these 2.5” 5400-RPM spindle drives. Nonetheless this test will run with encryption=off to eliminate the encryption code path as a source of issues.

dst transfer

(yellow mark for the first kernel INFO msg):
image.png
I stopped the rsync at ~75% because of an issue which shows up in the kernel logs with:
txg_sync:XXXXXX blocked for more than 120 seconds
image.png
A summary of the INFO messages without the call traces:

Observations

This test reproduced a known issue.