OpenZFS zvol performance issues

zvol performance issues

Author: ..
Page born on: 2021-03-08 Updated: 2022-12-22 Published: 2022-12-31
TL;DR
This topic started out as a fairly simple question but it soon snowballed into deeper research and in my opinion deserved a detailed breakdown and write up.
It’s difficult to summarise 10,000 - it’s a fairly complex topic. If you want minimum reading and maximum understanding, then check out
.
Some high level stats
During this research I generated a batch of tests which had a cumulative runtime of ~8 days (TODO: scripts here).
The batch tests performed ~10 TB of write IO and ~7 Million IOPS, and 9.5 TB of read IO and ~3.7 Million IOPS.
There were supposed to be 2448 tests but the OpenZFS zvol code (zfs-2.0.3-pve2) was too unstable for certain tests so 201 tests were skipped to avoid consistently crashing the system/kernel.
Preface
In April 2021 I identified an OpenZFS zvol performance issue on my home lab proxmox hypervisor, which is Debian buster based. I was provisioning a new zpool for usage within my private data vault and stumbled across some unexpected problems with zvols.
Why am I making this post? I’m looking for community feedback and collaboration in the hope of finding an explanation, and perhaps help to uncover and fix an issue with my setup and/or zvols. In my opinion I’ve collected too much info to organise neatly on a GitHub issue or forum post, so this post can be used as a reference to cherry-pick relevant details from.
“I’m 100% ready to accept this is a problem with my setup or config.”
I hope that by sharing some of my empirical testing results, I might inspire someone to at least check if they experience similar issues or if it’s just me.
Goals
Figure out why zvols are performing so badly on my setup, and look for fixes, workarounds or alternatives.
Figure out if there is something wrong with my setup.
Run some tests (including base-lining), and use the results as a guide for optimal configuration.
Learn more about OpenZFS and its benefits, pitfalls, costs, and overheads.
Share my findings to benefit others, and give back to open source contributors and users.
Make my testing approach understandable.
WIP: Share scripts and make testing reproducible.
An important note. This is a SINGLE hardware sample. I’ve done a wide range of test variations BUT only a single node, single disk etc, so this doesn’t rule out the possibility that there is an anomaly with my setup. I didn’t get any anomalous feelings from the results (aside from zvol buffered write tests crashing my kernel) but keep in mind that if others cherry-pick some of my tests to repeat on their hardware and setups, it would validate or invalidate my findings.
I asked myself - does it even matter?
When investing the hours I asked myself “Does it even matter?”, “Am I wasting my time?”, “Will anyone care?”.
It does feel like I’ve stumbled across a problem, at least with my setup. I had expectations of how zvol block devices would perform and the results went against those expectations.
If I find a solution or answers and share that to benefit others... that is great. If it triggers discussion with the OpenZFS project team or proxmox team, maybe good things can come from that.
I must admit, I was sceptical about wasting time when the large batch of tests was running, but the end results have been enlightening for me on how current ZFS performs (and crashes) with various workloads and io depths on my setup. One of my original goals was to understand how different workloads performed on ZFS, and thanks to the spreadsheet results I can use them as a guide to get the most out of my setup and disks. This is not only useful for my home lab and private data vault but also for my storage topics in production where I deal with petabytes of data, so overall it’s been a worthwhile investment, and I hope the results are enlightening for others too.
Story time
In the last decade my storage strategy has been “keep it simple”. In the past I’ve had my fingers burnt and watched colleagues experience similar issues with various RAID levels and storage technologies, not to mention my own mistakes, and the many horror posts you can read online. So I'm an advocate of keeping it simple.
I’ve developed a healthy respect of how much one can trust RAID, storage systems, and filesystems. This leads me to be patient and observant of developments for example of btrfs and OpenZFS. I test and contribute where I can, and utilise such projects because they have great features but I keep verified backups for anything critical.
In a 25 year career one makes enough mistakes and becomes aware of topics like bit-rot and write holes, and in summary one comes to realise with experience:
This is basically what I’m trying to achieve with my updated private data vault strategy. Keep it simple but also replicated and verified off-site. My current strategy is OK but it is cost prohibitive to keep an up-to-date replica off-site. So in a nutshell:
Keep it simple AND reduce the costs of maintaining a near-time off-site replica of TiB’s of data.
The year is 2017
Until now I’d had good levels of success with OpenZFS on Linux, and I’m a proxmox user and advocate. I have been a part of provisioning dozens of production hypervisors for mission critical systems where the root file systems were running OpenZFS (and ceph for block store). After this good experience I took a decision in 2017 to build my current gen private data vault based on proxmox with OpenZFS root file system and VM root file systems.
In 2017 I didn’t trust OpenZFS enough to use it for my historical private data vault storage, so the data vault itself is straightforward.
Single disks with LUKS+XFS, and mergerfs to create a single folder hierarchy from multiple disk hierarchies. Thanks to SnapRAID I have near-time triple disk parity, so I can handle up to 3 concurrent disk failures without data loss. SnapRAID also provides a near-time data scrubbing capability to detect bit rot and co. The large majority of my vault data change rate is glacial, so real-time parity and checksums are nice to have but not a must-have. SnapRAID suits my needs and since 2017 this setup has been reliable and easy to maintain.
It is also reassuring that any of the storage disks can be mounted on any system supporting LUKS+XFS. No array complexity or vendor or technology lock-ins.
The year is 2021 🤘
With the new zpool I provisioned an existing kvm with an additional zvol disk and went about testing.
If testing went OK I planned to migrate away from single disk LUKS+XFS+SnapRAID ➡ to single disk ZFS zpools+XFS+SnapRAID to keep it simple but also take advantage of ZFS’s many features. e.g. CoW, checksums, encryption, compression, dataset snapshots, pool checkpoints, easy and cost effective off-site replication and differential backups.
You can find full details of my existing node setup and config, including a history and version info.
Important note: all my testing was on a zpool and vdev on spinning rust, as mentioned a 2.5” 5400rpm drive, not SSDs.
What suggested there was a problem?
So when I started a copy job from an existing vm disk to the new vm zvol, not all was fine and dandy... time to dig into the details and figure out what is going on.
I observed that the new zvol was hitting performance issues during sequential write workloads. I was watching an rsync --progress from one physical disk (source was a simple virtio device pass-through on a kvm) to the new zvol and I could see the sustained write transfer for large binary files was choking somehow and the hypervisor load1 avg was skyrocketing. I happened to observe this during the copy of Samsung TV recording backups, so encoded video/binary data.
At this point my observability level was poor, so I installed netdata on both the hypervisor and kvm guest, which also checked off a long term TODO ✅. netdata is a fantastic sysops/SRE tool and project which I can highly recommend.
Deeper research - prerequisites
researching existing answers and solutions
In early March I started with the search term . I got some interesting hits:
. This seemed like a very close match but was from 2013 which is relatively speaking a very long time ago. Post #2 was good info but I suspect somewhat outdated. zfsprops today supports sync=standard|always|disabled and the default on my node is sync=standard.
. A recent post with lots of info and data but not really a conclusion; the author was happy enough with performance after adding an slog vdev: “slog helped a bit and that is enough”.

A bit later I updated the search to and that opened up lots of new info:
proxmox forum: . Hinting at potential issues.
/r/zfs: posted by Jim Salter the author of sanoid and syncoid, and he opened an OpenZFS bug but that was a false positive highlighting an error in testing cfg for O_SYNC.
OpenZFS#9: by Brian Behlendorf. It’s an interesting indicator that this was captured 11 years ago; it covers some interesting details and some possible solutions. It’s unclear if/what has been worked on since then but I imagine there have been some improvements.
/r/zfs: with results at . This is a fairly concise result that zvol performs worse than disk images stored on zfs filesystems.
/r/zfs: . Not so much info on performance differences but interesting views/info.
proxmox forum: . Anecdotal and highlights some of the negatives of using raw images specific to proxmox and snapshots and volume replication.

EDIT: Eventually April 3rd 2021
I stumbled across the following OpenZFS issue looking for answers to related topics:
The good news was this thread was fresh and described nearly 1:1 my experiences too.
Update 2022-12-21: zvol performance issues are still happening with zfs-2.1.6-pve1 on pve 7.3-3/c3928077 (running kernel: 5.15.74-1-pve) as documented in .
Memtest86+
At the time of writing the system had 64GB ECC RAM. Memtest86+ 5.01 performed 1.8 passes with ~17 hours runtime (SMP: disabled). No errors.
update+upgrade
To ensure I had the latest and greatest packages and co I ran the usual update+upgrade process for the two relevant nodes:
✅ Hypervisor update+upgrade
✅ KVM update+upgrade
✅ zpool upgrade
This gave me:
pveversion: pve-manager/6.3-4/0a38c56f (running kernel: 5.4.101-1-pve) (debian 10.7 buster)
zfs version: zfs-2.0.3-pve2 / zfs-kmod-2.0.3-pve2
At Thu 04 Mar 2021 07:57:46 AM UTC I did a zpool checkpoint rpool followed by zpool upgrade rpool and rebooted, and so far there have been no issues detected. I’ll wait a few more weeks before discarding the checkpoint.
establishing a baseline
To get a baseline of the physical disk in its raw and native state I batched 128 fio tests (including warmups) to create a baseline of various workloads, read, random read, write, random writes for 4k and 1M block sizes and iodepth 1, 4 and 24, buffered, direct and various sync settings. All the tests were run directly against the drive without a filesystem i.e. /dev/disk/by-id/ata-ST5000LM000-2AN170_WCXXXXGQ
TODO some results summary of the baseline.
ZIL and slog and related observations of synchronous IO
I’ve placed these topics separately because they could distract from the main topic of zvol performance issues. Utilisation of ZIL and slog is workload dependent and the batch test results do a good job of presenting this. In my opinion it makes sense to check the batch results observations first.
A simple cp command within a kvm sends hypervisor load1 avg skyrocketing
Interestingly a simple cp of an 8GiB file filled with random data previously generated with fio, run from within a kvm, had different outcomes on different storage types, with a zvol having a very high sustained load1 avg of ~30 on my 12 core 24 thread node.
I’ve included some select hypervisor graphs from netdata (thanks print mode) comparing the two cp operations.
Please note the cp tests were run inside a kvm and the graphs are from the hypervisor.
In the following graphs you’re looking at two cp operations with a pause in-between.
The source of the cp file was a separate non zfs disk virtio_blk full disk to the kvm.
The left (earlier) cp destination was a zvol virtio_blk provisioned to the kvm.
The right (later) cp destination was a xfs formatted raw disk image stored on a zfs filesystem virtio_blk provisioned to the kvm.
Both the zvol and raw disk were stored on the same zpool and vdev.
Non-synthetic tests within the kvm
The netdata graph snapshots are great for showing behaviour during the test but not great for direct results comparison, so I captured the runtime for cp, rsync and dd operations. This makes it easy to calculate avg MiB/s.
The test setup: a 30 GiB test file of random data generated previously by fio.
The left hand tests are a virtio_blk raw xfs disk volume stored on zfs filesystem.
The right hand tests are virtio_blk zfs zvol.
image.png
From these results I copy+pasted the relevant data to a spreadsheet and did some basic maths, and got the following graphs.
image.png
image.png
image.png
Conclusions
kvm virtio zfs filesystem stored raw disk image vs. zfs zvol:
Raw disk image operation is 1.26 times faster MiB/s with a 25.9% increase in speed on the rsync test.
Raw disk image operation is 1.12 times faster MiB/s with a 11.9% increase in speed on the cp test.
Raw disk image operation is 1.11 times faster MiB/s with a 11.3% increase in speed on the dd test.
The CPU time graph shows zvol requiring more kvm vCPU time in each test.
These results confirm what I witnessed when I first inspected zvol performance issues when observing an rsync --progress.
rerun of the cp test on the hypervisor
I thought it would be a good sanity check to eliminate the virtualisation aspect, and test how the hypervisor handles the simple cp test. Source file is a random 8GiB file produced earlier by fio.
The difference between the kvm cp test and this one:
its on the hypervisor rather than inside a kvm.
the cp target for the zfs filesystem is direct to zfs filesystem rather than writing into a raw disk image stored on the zfs filesystem.
The cp source is a zfs SSD mirror rather than a kvm virtio raw disk pass-through, this should make zero difference as neither are a bottleneck.
Note that the cp commands complete faster than the actual writes to disk due to buffers and co. It takes a few minutes for the blocks to actually reach the disk as seen in the graphs.
cp destination zvol (left) and zfs filesystem (right).
image.png
image.png
image.png
image.png
image.png
image.png
observations
From the graphs, note how the zvol test takes much longer (~3m, ~46 MiB/s) to write its io’s to disk than the zfs filesystem (~2.2m, ~62 MiB/s). So the zfs filesystem exhibits a ~25% decrease in runtime and the MiB/s is ~1.34 times faster than the zvol test.
The load avg is also much higher for the zvol test: ~20 peak vs. ~3.5 peak. That is ~5.7 times the peak load1 avg for the zvol test (a ~470% increase); put the other way round, the zfs filesystem peak is ~82% lower. iowait and CPU usage also look better for the zfs filesystem.
The avg completed i/o bandwidth aka avg io block size is much greater and nearly consistent for the test duration for the zfs filesystem test, which equates to less IOPS required to do the work, showing the zfs filesystem appears to be optimising the io block size in its io queue. zvol has much higher and inconsistent IOPS and lower io bandwidth.
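The comparisons here mix three different figures: a ratio (“times faster”), a percentage increase and a percentage decrease. The following is a minimal shell sketch of that arithmetic, using the rounded values quoted in these observations (8 GiB written in ~3m vs. ~2.2m, peak load1 avg ~20 vs. ~3.5). The helper names are made up for this sketch, and because the inputs are rounded the outputs only approximately match the graphs.

```shell
#!/bin/sh
# Hypothetical helpers for comparing two measurements; not part of the test scripts.

# avg_mibs SIZE_GIB SECONDS -> average MiB/s for that much data in that time
avg_mibs() {
    awk -v gib="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", gib * 1024 / secs }'
}

# times_of A B -> how many times larger A is than B
times_of() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# pct_increase LOW HIGH -> percentage increase going from LOW up to HIGH
pct_increase() {
    awk -v low="$1" -v high="$2" 'BEGIN { printf "%.1f\n", (high - low) / low * 100 }'
}

# pct_decrease HIGH LOW -> percentage decrease going from HIGH down to LOW
pct_decrease() {
    awk -v high="$1" -v low="$2" 'BEGIN { printf "%.1f\n", (high - low) / high * 100 }'
}

avg_mibs 8 180        # zvol, 8 GiB in ~3m       -> 45.5
avg_mibs 8 132        # zfs-fs, 8 GiB in ~2.2m   -> 62.1
times_of 20 3.5       # peak load1, zvol vs fs   -> 5.71
pct_increase 3.5 20   # fs peak up to zvol peak  -> 471.4
pct_decrease 20 3.5   # zvol peak down to fs     -> 82.5
```

Note that the same pair of load values gives “5.7 times”, “~471% higher” or “~82% lower” depending on which direction and which base is used, so it is worth stating the base whenever a percentage is quoted.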
slog being utilised for zvol but not for zfs filesystem?
Here you can see the disk used for the zpool slog, in the zvol cp test you can see a burst of writes shortly after the cp starts.
Note the cp to the zfs filesystem appears not to use the slog at all. This seems related to the default sync behaviour for zfs filesystems vs zfs zvols being different.
image.png
ZFS graphs that display activity
image.png
image.png
conclusion of cp test inside kvm vs. hypervisor
At this point I can say that the issue that I observed with zvols inside the kvm is reproducible directly on the hypervisor. As expected, the hypervisor performs a bit better than the kvm because there is no para-virtualisation layer.
Hypervisor graphs comparing two fio runs.
These results were the best of a few runs, repeated to warm up caches/buffers etc.
Full fio command run on both tests:
fio --time_based=1 --runtime 180 --filename=test --rw=write --bs=4k --numjobs=1 --iodepth=32 --group_reporting --name=test --size=8G --loops=50 --ioengine=libaio --direct=1 --ramp_time=5
As before, in the following graphs you’re looking at two tests with a pause in-between.
The left (earlier) fio was run on a zvol formatted with xfs.
The right (later) fio was run directly on the zfs filesystem.
Both the zvol and filesystem were in the same zpool and vdev. Here are some select hypervisor graphs from netdata:
Please note the fio tests were run directly on the hypervisor.
cp test vs. fio test
A note on synthetic benchmarks. fio is a great tool in the belt and allows for repeatable and detailed comparison of io workloads. It does exactly what it says on the tin... flexible io’s. That being said, one has to keep in mind that fio tests are typically not real world tests, and this is why I started with a cp example for comparison. It is possible to capture block traces with blktrace and have fio repeat real world io workloads but I want to keep that out of scope for now.
With a bit of luck and intuition, I’m quite pleased with myself that the resulting graphs from the cp and fio tests presented above are relatively comparable. I only noticed this fully during the write up.
Observations
Note the disk graphs are from the sdk device, which is the underlying physical disk.
Why are there lots more zfs arc operations (zfs.reads and zfs.list_hits) for zvol vs. zfs filesystem?
load1 avg skyrockets for zvol tests reaching ~40 on a 3 minute fio test.
fio IOPS and BW are markedly worse for zvol tests.
For zvol’s fio tuning the iodepth setting seems to dramatically impact load1 avg. Higher depths producing higher load.
Why are there many more zfs important operations for the zvol?
Expectedly the power usage and CPU temps are higher for the zvol tests.
It would seem that the zfs filesystem is optimising submitted io’s into larger io’s, therefore the actual io size being written to underlying disk is much larger, so the physical result is lower IOPS but higher bandwidth throughput. The zvol doesn’t seem to be able to achieve the same, the graphs show much higher IOPS, smaller avg io size and less throughput, and much more system load.
CPU context switches, new processes, and blocked processes are much higher for zvol tests.
CPU contention and stalling is much higher for zvol tests.
fio synthetic batch testing
During this research I generated a batch of tests (scripts here) which had a cumulative runtime of ~8 days.
The batch tests performed ~10 TB of write IO and ~7 Million IOPS, and 9.5 TB of read IO and ~3.7 Million IOPS.
zvol instability
!!! Due to instability of zvol buffered tests on my setup causing kernel crashes, a chunk of zvol tests were skipped. 201 of 2448 tests were skipped. The skipped tests were 0456 to 0656.
Testing setup
Hardware setup is described in detail. All the synthetic tests in this batch were performed on the hypervisor.
Historically the node was born with proxmox 4.x installed from official pve install media and has subsequently been upgraded to 5.x and at the time of writing upgraded to 6.x.
During the test the versions were as follows:
pveversion: pve-manager/6.3-4/0a38c56f (running kernel: 5.4.101-1-pve) (debian 10.7 buster)
zfs version: zfs-2.0.3-pve2 / zfs-kmod-2.0.3-pve2
The data vdev was a: 2.5” SEAGATE Barracuda 5TB SMR HDD SATA 6Gb/s 5400rpm.
For tests with slog, the vdev was a: CRUCIAL MX300 SATA 6Gb/s SSD.
Each test has a reference in column V, the bash script number, aimed to make it easy to discuss and/or run a specific test for comparison/validation of results.
All tests were scheduled for 3 mins with 10 seconds ramp time.
In-between each test a 2 minute break was added to allow system loadavg to recover, and writes to be flushed to the physical device(s).
All tests used fio version 3.12 and the libaio io library.
All tests ran numjobs=1 (single process) with varying iodepth.
ZFS datasets used encryption set to aes-256-gcm and lz4 compression.
The unique filenames in the fio tests were as follows:
# baseline tests, directly on the block device
--filename=/dev/disk/by-id/ata-ST5000LM000-2AN170_WCXXXXGQ

# zvol tests, directly on the zvol block devices
--filename=/dev/zvol/fiotestpool/testenc/zvol1M
--filename=/dev/zvol/fiotestpool/testenc/zvol4k

# zfs dataset tests, on the zfs file system with various zfsprops
--filename=/fiotestpool/testenc/1M/test-16G.fiodata
--filename=/fiotestpool/testenc/4k/test-16G.fiodata
--filename=/fiotestpool/testenc/def/test-16G.fiodata
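To make the batch structure easier to picture, here is a hedged sketch of how such a test matrix could be generated. It is a dry run that only prints fio command lines, it is not the actual batch scripts, and it does not reproduce the full 2448-test matrix. The fixed options come from the setup described above (libaio, numjobs=1, 180 seconds runtime, 10 seconds ramp time); the TARGET placeholder, the 4-digit numbering and --size=16G (guessed from the test-16G.fiodata file names) are assumptions.

```shell
#!/bin/sh
# Dry run: print one fio command line per combination, nothing is executed.
# TARGET is a placeholder; point it at a scratch file or device you can overwrite.
TARGET="${TARGET:-/path/to/scratch-target}"

gen_matrix() {
    n=0
    for rw in read randread write randwrite; do
        for bs in 4k 1M; do
            for depth in 1 4 24; do
                for direct in 0 1; do          # 0 = buffered, 1 = direct IO
                    n=$((n + 1))
                    printf '%04d fio --name=%s-%s --filename=%s --rw=%s --bs=%s --iodepth=%s --direct=%s --ioengine=libaio --numjobs=1 --time_based=1 --runtime=180 --ramp_time=10 --size=16G --group_reporting\n' \
                        "$n" "$rw" "$bs" "$TARGET" "$rw" "$bs" "$depth" "$direct"
                done
            done
        done
    done
}

gen_matrix
```

This prints 48 lines (4 workloads x 2 block sizes x 3 io depths x 2 IO types). Repeating it per target (raw disk, zvol, zfs dataset), per sync setting and per warm-up run is what multiplies the count into the thousands.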
Warm ups
At the start of testing, or when the batches make a significant change in testing config, a warm-up run=1 was performed to warm up caches and co. All the results I review and observe herein are run=2 test results. The run=1 tests are filtered out but available in the raw data; check column CL name_full for that detail if you are interested in those test results. It is expected for run=1 results to be slower and/or inconsistent.
Raw results data
I’ve shared a read-only copy of the batch results on Google Sheets
.
Highlighting legend
The burgundy text rows highlight zvol tests.
The colour scales use red for worst, green for best and sometimes cream-ish for midpoint.
In most cases I left low or unremarkable results with default/white to avoid colour flooding which can distract from more interesting results.
Notes on when write IO is physically written to a device.
This topic is perhaps not something we think about in day to day computing, but for certain real-world workloads like databases, and in this kind of testing, it’s important to be aware of when write IO is physically written to a device and how it relates to performance. One also has to consider there are many components in writing IO to a device.
Under Linux at least some of these stages will be involved depending on the IO settings:
the writing program → kernel write system calls → buffered IO populating page cache OR direct IO bypassing page cache → the filesystem → the volume manager → the IO scheduler and queue → the block layer → the upper, mid and low level drivers → the physical device.
If we think of this sequence as “the writing pipeline”, only when all parts of the pipeline are configured for synchronous write IO will each write IO wait until the physical device acknowledges the operation; in this case there is a round-trip penalty per IO. If the pipeline isn’t fully synchronous then IO could have a mixture of sync and async modes in different parts of the pipeline.
During my simple testing prior to batch testing, there are graphs visualising the time taken for a write workload to be written
.
Note that the cp commands I tested complete much faster than the actual writes to disk due to buffers and co. It takes a few minutes for the blocks to actually reach the disk as seen in the graphs.
This tells us a few things are happening for the given cp operation 1) the IO is asynchronous and 2) it is at least partially utilising page/file system cache aka buffered writes.
To summarise
Async IO hands off the workload as fast as possible to the sub system. The writing application entrusts the IO workload to the subsystem and hopes for the best.
Async buffered write IO pushes its IO workload into the page cache and IO subsystem as fast as it can, and reports the result of that hand-over to the requester.
Async direct IO behaves much the same but bypasses the page cache writing directly to the subsystem, which for certain scenarios is preferred and in most cases, especially for sustained or long workloads should perform better than buffered IO as it bypasses the “double write” cost of the page cache.
The drawback of async IO is that there is no guarantee that all writes that the subsystem should perform really got acknowledged by the physical device.
Sync IO direct or buffered is therefore designed to provide a guarantee that each write IO is acknowledged as physically written to the target device. The writing application waits for each IO to make the round-trip down and back up the pipeline, acknowledging each write IO. The guarantee is great for data integrity but inherently slower than async IO.
Sync IO workload performance is therefore limited by the speed and number of concurrent write acknowledgements the vdev(s) can make.
I ran fio test batches for async buffered and directio, and sync IO, including buffered, directio, sync, fsync, and fdatasync.
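As a rough guide to how these IO modes can be expressed as fio options, here is a small sketch. The options themselves (--direct, --sync, --fsync, --fdatasync) are standard fio options, but the mode names and the exact mapping are assumptions made for illustration and may differ from the flags used in the batch scripts.

```shell
#!/bin/sh
# io_mode_flags MODE -> fio options for one write mode (assumed mapping).
io_mode_flags() {
    case "$1" in
        async-buffered) echo "--direct=0" ;;                 # via page cache
        async-direct)   echo "--direct=1" ;;                 # O_DIRECT, bypass page cache
        sync-buffered)  echo "--direct=0 --sync=1" ;;        # synchronous (O_SYNC) buffered writes
        sync-direct)    echo "--direct=1 --sync=1" ;;        # synchronous (O_SYNC) direct writes
        fsync)          echo "--direct=0 --fsync=1" ;;       # fsync() after every write
        fdatasync)      echo "--direct=0 --fdatasync=1" ;;   # fdatasync() after every write
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

for mode in async-buffered async-direct sync-buffered sync-direct fsync fdatasync; do
    printf '%-15s %s\n' "$mode" "$(io_mode_flags "$mode")"
done
```

The printed options would be appended to an otherwise identical fio command line, so that only the sync behaviour changes between two tests being compared.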
General test case observations
Test runtime and zvol direct IO vs. buffered IO
During testing, results showed that while each fio test was scheduled for 180+10 seconds, in some test configurations the overall runtime was longer. This was measured by timing the bash script invoking fio. Sometimes the runtime was much longer than configured, and this wasn’t something fio was able to internally provide as a test result metric.
Why is this important?
If it wasn’t measured the extra time taken for a test would be invisible.
Results where fio reports 190 seconds runtime but doesn’t finish until later, sometimes much later, are providing a false result.
This measure can be considered a KPI for workload+config stability and reliability: if the total runtime does not match the fio reported runtime then potentially something is wrong or unexpected.
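A minimal sketch of how such a wall-clock check could look in shell. This is not the wrapper used for the batch; the function names, the ok/nok wording of the verdict and the TOLERANCE threshold are assumptions. In real use the command would be the fio invocation; `sleep 1` stands in for it here so the sketch runs anywhere.

```shell
#!/bin/sh
# run_timed EXPECTED_SECONDS COMMAND [ARGS...]
# Runs the command, measures wall-clock time and prints ok/nok.
# TOLERANCE (seconds, default 5) is a hypothetical slack allowance.
run_timed() {
    expected="$1"; shift
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    wall=$((end - start))
    if [ "$wall" -le $((expected + ${TOLERANCE:-5})) ]; then
        verdict=ok
    else
        verdict=nok
    fi
    echo "expected=${expected}s wall=${wall}s verdict=${verdict}"
    return "$status"
}

# overrun_pct EXTRA_SECONDS REPORTED_SECONDS -> extra time as % of reported runtime
overrun_pct() {
    awk -v extra="$1" -v reported="$2" 'BEGIN { printf "%.0f\n", extra / reported * 100 }'
}

run_timed 1 sleep 1       # stand-in for: run_timed 190 fio ...
overrun_pct 13970 180     # 13970 s extra on a 180 s test -> 7761
```

Measuring outside fio matters because the time fio spends waiting for outstanding IO to drain after the timed phase is not part of its own reported runtime.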
Measuring this KPI has highlighted a concerning issue with buffered write IO, especially for zvol tests. Take this example for write-seq-4k:
image.png
Observe that these tests are identical except buffered vs. direct IO. The buffered test should have taken a total of ~190 seconds including the 10s ramp up. fio reported the test took ~180 seconds but in reality fio took another 13970 seconds, i.e. ~3.9 hours, to finish, which is a ~7761% increase on the expected runtime.
Where the issue lies is not clear but this test configuration makes the writing subsystem very unhappy.
Let’s compare the same problem test with an slog device in the zpool:
image.png
The slog seems to reduce the dramatic impact: the test is only 7.8 minutes longer than the expected 3 minutes, with an increased runtime of 268%. This is still very poor performance but better than the 7761% increase of the non slog test.
Let’s compare the same problem test with sync IO:
image.png
Here we observe the runtime is normal for the sync test.
Conclusions
It would be cool if the fio project could take this “total runtime” factor into consideration as a native measure/KPI in test results. I’ll try to remember to raise an issue on the project to check my findings.
Why does this problem/behaviour happen especially with zvol tests? The exact mechanics are unclear but it does seem related and compounds the overall zvol performance issues.
What was clear is that there was a strong correlation with zvol + async workload being a factor, and when combined with buffered IO, massive performance and system loadavg problems were recorded.
I feel strongly that the root cause of this issue is responsible for crashing my kernel and forcing me to skip 201 zvol tests, and maybe it relates to poor zvol performance.
If you filter the raw results tab column AT opt_async_or_sync eq async and then look at the difference between column AP opt_io_type eq buffered vs. direct you’ll clearly see what I’m getting at. Lots of nok in column N, Q and S.
💡 Recommendation: do not combine zvol + buffered async write IO, you’re going to have a bad time.
Evidence:
image.png
Test cases
Test case: read seq 4k - results ordered by MiB/s descending
image.png
Baseline: the fastest baseline read ~24 GiB IO at ~139 MiB/s and ~35K IOPS are as expected for the disk.
Noteworthy is that directio with low iodepth performed poorly with the workload e.g. row 24/25 vs 29.
There were no timing issues.
Buffered io performance was nearly identical for all iodepths. There were no timing issues.
🥇 Outright winner: row 2: zfs-fs-read-4k default recordsize, directio, iodepth=1.
Read ~159 GiB at ~906 MiB/s and ~232K IOPS.
zvol: first I’m going to compare row 8 zfs-zvol-read-4k vs. row 7 zfs-fs-read-4k-def both tests ran async directio iodepth=24
This is the fastest zvol test in this test type, and it cost 7.9 times the resources of its competitor, which is a 690% increase in resources for ~9% less performance. The 1m load avg (as reported by uptime) at the end of the test was 9.41 vs. 1.19. This is a horrible negative cost/penalty for zvol.
All the other zvol tests fall off a performance cliff.
Next I would draw attention to column BI read bandwidth standard deviation which is ~50 times greater than the competitor zfs-fs test, that is ~5000% performance deviation increase, which is another negative for zvol.
iodepth: greater depth doesn’t always mean more performance, row 2, 6, 7 show this, iodepth=1 performed better than iodepth 4 and 24 and 4 performed better than 24. At least this was true for this read seq 4k batch.
Test case summary
Setting recordsize to 4k for the zfs-fs-read-4k-4k was circa 50% slower vs. zfs-fs-read-4k-def with default recordsize.
zvol was much more expensive in terms of system resources and always slower.
The fastest zfs-fs-read-4k-def vs. the fastest baseline-read-4k-raw-disk was 6.5 times faster than the baseline, which is ~550% increase in performance.
💡 Recommendation: for this disk type and 4k blocksize workload don’t use zvol and use zfs-fs with default recordsize.
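A quick way to sanity check figures like these is to confirm that bandwidth, block size, IOPS and total data agree with each other. A minimal sketch, assuming a fixed 4 KiB block size and the 180 second measured runtime; the helper names are made up for illustration.

```shell
#!/bin/sh
# iops_from_bw MIB_PER_S BLOCK_KIB -> IOPS implied by a bandwidth and block size
iops_from_bw() {
    awk -v bw="$1" -v bs="$2" 'BEGIN { printf "%.0f\n", bw * 1024 / bs }'
}

# gib_total MIB_PER_S SECONDS -> GiB transferred at that average rate
gib_total() {
    awk -v bw="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", bw * secs / 1024 }'
}

iops_from_bw 139 4    # baseline: 139 MiB/s at 4k -> 35584  (~35K IOPS)
iops_from_bw 906 4    # winner:   906 MiB/s at 4k -> 231936 (~232K IOPS)
gib_total 139 180     # baseline over 180 s       -> 24.4   (~24 GiB)
gib_total 906 180     # winner over 180 s         -> 159.3  (~159 GiB)
```

The reported numbers for the baseline and the winning test are consistent with each other, which is a useful check before comparing rows in the spreadsheet.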
Test case: read seq 1M - results ordered by MiB/s descending
ℹ Unfortunately this test case is sparse because of skipped tests due to zvol instabilities. If there are improvements to zvol in OpenZFS in the future, this test case is a strong candidate for a retest so zvol results can be included.
image.png
Baseline: the fastest baseline read ~24 GiB IO at ~139 MiB/s and ~139 IOPS are as expected for the disk.
Baseline buffered vs. direct io performance was nearly identical for all iodepths.
There were no timing issues.
Buffered load avg was higher by ~11 times with an increase of ~1000%.
🥇 Outright winner: row 2: zfs-fs 1M recordsize, buffered, iodepth=1.
Read ~619.4 GiB at ~3523 MiB/s and ~3523 IOPS.
Direct vs. buffered io: for this 1M blocksize workload the tests show buffered io is more performant and has the same sort of 1m load avg.
zfs-fs-read-1M-1M iodepth=1 buffered (row 2) vs. directio (row 7): buffered read 2.2 times more GiB, a 118% increase in performance (1616 MiB/s directio vs. 3523 MiB/s buffered). Directio with iodepth=4 (row 3) closed the gap (3150 MiB/s vs. 3523 MiB/s), with the difference dropping to ~12%.
Test case summary
The winning test (row 2) vs the fastest baseline test (row 8) was 25.3 times faster than baseline with a ~2431% increase in performance. This is an incredible performance increase by zfs for this disk type. This is likely due to compression of the 1M blocksize workload and ARC cache optimisations.
!!! The zvol tests are missing because buffered zvol tests were unstable and caused the batches to crash the kernel.
ℹ this test category is ripe for re-testing if/when zvol is more stable.
ℹ It doesn’t make sense to test 4k blocksize workload on 1M recordsize, so those tests are skipped.
💡 Recommendation: for this disk type and 1M blocksize workload don’t use zvol and use zfs-fs with 1M recordsize.
💡 Recommendation: If using directio then increasing iodepth/jobs to ~4 could yield up to double throughput gains.
Test case: read random 4k - results ordered by MiB/s descending
image.png
Baseline: the fastest baseline read ~0.2 GiB IO at ~0.9 MiB/s and ~240 IOPS which is as expected for the disk.
Baseline directio tests performed better than buffered.
Increasing iodepth from 1 to 24 performed ~2.4 times faster with a ~137% increase in performance, but note that the performance order of magnitude is small/slow for the baseline tests.
There were no timing issues.
🥇 Outright winner: row 2: zfs-zvol-randread-4k, direct, iodepth=24.
Read ~161.9 GiB at ~921.1 MiB/s and ~235.8K IOPS.
Direct vs. buffered io: for the fastest zvol tests in this 4k blocksize randread category, the tests show direct io is more performant and has a 1m load avg that scales with performance.
zfs-zvol-randread-4k iodepth=24 directio (row 2) vs. buffered (row 6), directio performed 5.2 times more IOPS with a 418% increase in performance 45.5k IOPS buffered vs. 235.8K IOPS direct.
Test case summary
This is the first test category where zvol outperforms zfs-fs, and by an impressive margin. In fact the slowest zvol test beats the fastest zfs-fs test.
zfs-zvol-randread-4k, directio, iodepth=24 (row 2) vs. zfs-fs-randread-4k-def (row 12), the zvol test is 8.3 times faster with a 736.6% increase in performance.
The winning test in this category (row 2) was ~983 times faster with a ~98248% increase in performance than the fastest baseline test (row 19).
The winning test for zfs-fs (row 7) was ~127 times faster with a ~12126% increase in performance than the fastest baseline test (row 19).
ℹ This test category demonstrates zfs in general is able to make some very impressive performance gains for random read workloads compared to the baseline disk performance, and zvol is able to make ludicrous performance gains.
💡 Recommendation: test your random workload with zvol to see if it beats zfs-fs.
LUDICROUS SPEED.gif
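As a sanity check on the numbers in this test case, throughput is simply IOPS multiplied by the block size. A small sketch, assuming the 4k blocksize means 4 KiB (4096 bytes); the function name is illustrative:

```python
KIB = 1024
MIB = 1024 * KIB


def mib_per_sec(iops: float, block_bytes: int) -> float:
    """Throughput in MiB/s for a given IOPS rate and block size."""
    return iops * block_bytes / MIB


# Winning zvol test: ~235.8K IOPS at 4k blocks.
print(round(mib_per_sec(235_800, 4 * KIB), 1))  # 921.1
# Fastest baseline test: ~240 IOPS at 4k blocks.
print(round(mib_per_sec(240, 4 * KIB), 1))      # 0.9
```

Both values line up with the MiB/s figures quoted for the winner and the baseline, which suggests the IOPS and MiB/s columns are consistent with each other for this test case.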
Edit - Retest
When writing up these results I wanted to know more about why row 2 (0389-job.sh) was so ridiculously fast. So I ran 0389-job.sh without any warm-up or such, and got terrible performance compared to row 2, similar to the baseline. I was a bit puzzled, but knowing the extremely high performance had to be related to RAM/ARC and optimisations, I did a bit more checking.
First I zeroed the /dev/zvol/fiotestpool/testenc/zvol4k with /dev/zero:
zfs set sync=disabled fiotestpool/testenc/zvol4k
dd if=/dev/zero of=/dev/zvol/fiotestpool/testenc/zvol4k bs=1M count=20480 oflag=direct status=progress
Then I ran 038{6..9}-job.sh and as you would expect reading zeros is very fast. IOPS: 134.5k, 11.7k, 35k, 138.5k respectively.
There were no reads from the physical device in these tests, so everything was happening in the ARC.
Then I ran the previous write test to that zvol followed by the read tests:
RO= bash # write
for job in 038{6..9}-job.sh; do bash $job; done
The performance was terrible. My conclusion is that the data written by fio in the original test and cached by ARC could be optimised; however, it's hard to reproduce this result with random data and random reads, which seems logical.
Retest summary: Read performance beyond baseline is greatly impacted by the data being read. Reading zeros was easily optimisable and fast; reading random data was not easily optimisable and slow.
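One way to see why zeros are "easily optimisable" is to compare how compressible the two kinds of data are. This sketch uses Python's zlib purely as a stand-in: ZFS has its own compression algorithms, and whether compression was the deciding factor on this dataset is an assumption here, so treat it as an illustration of the principle rather than of what ZFS actually does internally.

```python
import os
import zlib

BLOCK = 1024 * 1024  # 1 MiB of test data

zeros = bytes(BLOCK)             # comparable to the dd from /dev/zero
random_data = os.urandom(BLOCK)  # comparable to a random payload

# Compressed size as a fraction of the original size.
zeros_ratio = len(zlib.compress(zeros)) / BLOCK
random_ratio = len(zlib.compress(random_data)) / BLOCK

print(f"zeros compress to  {zeros_ratio:.2%} of original size")
print(f"random compress to {random_ratio:.2%} of original size")
```

The zero-filled block shrinks to a tiny fraction of its size, while the random block does not shrink at all, so far less real data has to be stored, cached or read back in the first case.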
Test case: read random 1M - results ordered by MiB/s descending
ℹ Unfortunately this test case is sparse because of skipped tests due to zvol instabilities. If there are improvements to zvol in OpenZFS in the future, this test case is a strong candidate for a retest so zvol results can be included.
image.png
Baseline: the fastest baseline read ~13.1 GiB IO at ~74.7 MiB/s and ~75 IOPS which is expected for the disk.
Baseline directio tests performed better than buffered.
Increasing iodepth from 1 to 4 performed ~1.3 times faster with a ~26% increase in performance. iodepth=24 performed marginally better than iodepth=1 but not better than iodepth=4.
There were no timing issues.
🥇 Outright winner: row 2: zfs-fs-randread-1M-1M, directio, iodepth=1.
Read ~635.5 GiB at ~3615.3 MiB/s and ~3615 IOPS.
Direct vs. buffered io: directio performed marginally better than buffered with this workload.
iodepth: for this test case workload zfs performed better when iodepth=1, performance dropped off with iodepth=4 and 24.
Test case summary
The winning test in this category (row 2) was ~48 times faster with a ~4751% increase in performance than the fastest baseline test (row 8).
ℹ This test category demonstrates zfs in general is able to make some very impressive performance gains for random read workloads compared to the baseline disk performance.
Test case: write seq 4k - results ordered by MiB/s descending
image.png
Screenshot of the top 30 results. There are a total of 672 results in this test case, so I will try to break down the most relevant sub cases.
Baseline async: the fastest test wrote ~23.8 GiB IO at ~135.5 MiB/s at ~34.7K IOPS which is expected for the disk.
Directio tests performed better than buffered.
Increasing iodepth from 1 to 4 performed ~1.5 times faster with a ~51% increase in performance. iodepth=24 performed marginally better than iodepth=4.
The fastest buffered tests took ~65 seconds longer to complete than scheduled.
Baseline sync: the fastest test wrote ~0.3 GiB IO at 1.6 MiB/s at 413 IOPS which is expected for the disk.
Directio tests performed much better than buffered.
Tests performed better with higher iodepth.
Sync tests outperformed fsync and fdatasync tests.
image.png
🥇 Outright winner: row 2: zfs-fs-write-4k-def-sync-disabled, buffered, iodepth=1.
Wrote ~88 GiB at ~502 MiB/s and ~128K IOPS.
🥉 3rd place (row 4) with a 3% decrease in performance vs. 1st place was zfs-fs-write-4k-def-sync-disabled directio, iodepth=1.
5th place (row 6) with a 4.3% decrease in performance vs. 1st place was zfs-slog-fs-write-4k-def-sync-standard-logbias-throughput buffered, iodepth=1.
Direct vs. buffered io: at the top of the leader board buffered performed marginally better than directio with this workload.
iodepth: at the top of the leader board zfs performed better with iodepth=1, performance dropped off with iodepth=4 and 24.
zvol: overall results are terrible for zvol. The fastest zvol test, zfs-zvol-write-4k-sync-disabled with buffered, iodepth=1, was in 45th position (row 46), a ~84% decrease in performance vs. 1st place. This test also had a timing issue of ~30 seconds.
The first zvol test without timing issues was in 154th position (row 155) and suffered a 97.7% performance decrease at 11.4 MiB/s, along with a horrible load1 average of 22.2.
zfs-fs 4k-def vs. 4k-4k: comparing the fastest comparable zfs-fs tests, 4k-def is the clear winner. Leaving the recordsize default is 6.4 times faster (~544% ⬆) than lowering to 4k which incurs an ~84% ⬇ performance penalty.
image.png
sync vs. async io: if sync=disabled tests are excluded, and we compare async vs sync tests:
top 20 async results:
image.png
At the top of the leader board slog tests are marginally faster, no major observations detected.
top 20 sync results:
image.png
Interestingly zvol beats zfs-fs in sync io on raw numbers.
zvol fastest sync test wrote 2.0 GiB at 11.4 MiB/s at 2922 IOPS (row 155), iodepth=24, load1 avg 22.2.
zfs-fs fastest sync test wrote 0.3 GiB at 1.7 MiB/s at 427 IOPS (row 177), an 85.3% decrease, iodepth=4, load1 avg 1.16.
baseline fastest sync test wrote 0.3 GiB at 1.6 MiB/s at 413 IOPS (row 177), an 85.8% decrease, iodepth=24, load1 avg 0.13.
⚠⚠⚠ zvol fastest test performed ~6.8 times faster (~584%) than zfs-fs fastest but created ~19 times (~1814%) more load on the system.
When comparing the performance traits of sync vs. fsync vs. fdatasync there seems to be a correlation that sync tests cause significantly higher load1 avg on zvol tests. zfs-fs does not appear to be impacted by this. Check this view of iodepth=24, directio, sync tests, and look at the correlation between columns Z and AT.
It looks like sync provides more IOPS for zvol tests but at the expense of system resources.
image.png
For zfs-fs tests, the performance traits of sync vs. fsync vs. fdatasync show relatively little difference:
image.png
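For readers unfamiliar with the three sync flavours compared above, here is a rough sketch of what each one means at the system call level. It assumes a Linux/Unix host, and assumes the tests map to opening the file with O_SYNC (sync) or calling fsync()/fdatasync() after each write. The file names, block count and function name are made up for illustration, and the numbers it prints say nothing about zvol or zfs-fs; it is no substitute for the fio jobs.

```python
import os
import tempfile
import time

BLOCK = b"\0" * 4096  # one 4k block
COUNT = 64            # small count so the sketch runs quickly


def timed_writes(path: str, mode: str) -> float:
    """Write COUNT blocks using the given sync mode; return elapsed seconds."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if mode == "sync":
        flags |= os.O_SYNC  # each write() returns only once the data is stable
    fd = os.open(path, flags, 0o600)
    start = time.perf_counter()
    try:
        for _ in range(COUNT):
            os.write(fd, BLOCK)
            if mode == "fsync":
                os.fsync(fd)      # flush file data and metadata
            elif mode == "fdatasync":
                os.fdatasync(fd)  # flush data, skip non-essential metadata
    finally:
        os.close(fd)
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    for mode in ("sync", "fsync", "fdatasync"):
        elapsed = timed_writes(os.path.join(tmp, mode), mode)
        print(f"{mode:9s} {COUNT / elapsed:8.0f} IOPS")
```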
sync zvol slog vs. zvol:
image.png
Here we can see that slog helps sync performance: 15.8 times faster, a 1488% increase in performance. The performance orders of magnitude are small but nonetheless it's a noteworthy difference.
sync zfs-fs slog vs zfs-fs:
image.png
Here we can see that slog helps sync performance: ~20 times faster, a 1933% increase in performance. The performance orders of magnitude are small but nonetheless it's a noteworthy difference.
One would expect that if using a modern fast slog device like optane, that the slog test results would be orders of magnitude faster. This was tested and observed by napp-it, of the write up.
Test case summary
The winning test in this category (row 2) was ~3.7 times faster with a ~271% increase in performance than the fastest baseline test (row 31).
ℹ zfs-fs performs much better than zvol with async workloads.
ℹ zvol outperforms zfs-fs for sync workloads but at significant system load/resource costs.
💡 Recommendation: async workloads: do not use zvol.
💡 Recommendation: sync workloads: consider very carefully if zvol system load overhead is worth the extra performance.
💡 Recommendation: use slog to significantly boost sync workload performance.
💡 Recommendation: leave zfs-fs recordsize default. Lowering it from the default makes performance worse for this 4k test case.
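To put a rough number on the "is the extra load worth it" question for sync workloads, one option is to divide IOPS by the load1 average. This is only an illustrative metric, not something fio reports, and load1 is a coarse measure of system pressure. The figures are the fastest sync results quoted above:

```python
# (IOPS, load1 avg) for the fastest sync test of each type, as quoted above.
results = {
    "zvol": (2922, 22.2),
    "zfs-fs": (427, 1.16),
    "baseline": (413, 0.13),
}

for name, (iops, load1) in results.items():
    print(f"{name:9s} {iops / load1:7.0f} IOPS per unit of load1")

# How much more throughput zvol delivers, and at what load cost.
iops_ratio = results["zvol"][0] / results["zfs-fs"][0]
load_ratio = results["zvol"][1] / results["zfs-fs"][1]
print(f"zvol: {iops_ratio:.1f}x the IOPS for {load_ratio:.0f}x the load")
```

On these figures zvol delivers more IOPS in absolute terms but fewer IOPS per unit of load than zfs-fs, which is the trade-off the sync recommendation above asks you to weigh.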
Test case: write seq 1M - results ordered by MiB/s descending
image.png
Screenshot of the top 30 results. There are a total of 372 results in this test case, so I will try to break down the most relevant sub cases.
Baseline async: the fastest test (row 2) wrote ~24.2 GiB IO at ~137.5 MiB/s at 137 IOPS which is expected for the disk.
Directio tests performed marginally better MiB/s than buffered.
Buffered tests suffered from a high IOPS standard deviation and from high write latencies: comparing row 2 vs. row 40, mean latency increased by 41% and peak latency by 12,541%.
Increasing iodepth did not increase performance but did increase write latency.
The fastest buffered tests took ~62 seconds longer to complete than scheduled.
Baseline sync: the fastest test wrote ~10.4 GiB IO at 59.1 MiB/s at 59 IOPS which is expected for the disk.
Directio tests performed much better than buffered.
Tests performed better with higher iodepth.
Sync tests outperformed fsync and fdatasync tests.
🥇 Outright winner: row 2: baseline-write-1M-raw-disk, directio, iodepth=24.
Wrote ~24.2 GiB IO at ~137.5 MiB/s at 137 IOPS.
4th place (row 5) with a ~12% decrease in performance vs. 1st place was zfs-slog-fs-write-1M-1M-sync-standard-logbias-throughput directio, iodepth=24.
8th place (row 9) with a 16.5% decrease in performance vs. 1st place was zfs-fs-write-1M-1M-sync-disabled directio, iodepth=24.
Direct vs. buffered io: no clear winner. Different test settings had different results, some were marginally better buffered and vice versa.
iodepth: overall it looks like higher iodepth provided marginal performance gains.
zvol: overall results are terrible for zvol. The fastest zvol test, zfs-slog-zvol-write-1M-sync-standard-logbias-throughput with directio, iodepth=4, was in 37th position (row 38), a ~31% decrease in performance vs. 1st place. This test also had a timing issue of ~114 seconds.
The first zvol test without timing issues was in 117th position (row 118) and suffered a 56.6% performance decrease at 59.9 MiB/s, along with a bad load1 average of 11.36.
sync vs. async io: if sync=disabled tests are excluded, and we compare async vs sync tests:
top 20 async results:
image.png
⚠⚠⚠ Interestingly this is the first test case where zfs is not able to outperform the async baseline.
At the top of the leader board the fastest slog test is marginally faster than the comparable non-slog test.
top 20 sync results:
image.png
ℹ Interestingly zvol and zfs-fs are tied for 1st and 2nd place.
The top 20 tests are all very close with only 4 IOPS between the 1st and 20th place.
⚠⚠⚠ All of the zvol tests in the top 20 suffer timing issues (columns N and S).
zfs-fs fastest sync test wrote 14.9 GiB at 84.9 MiB/s at 85 IOPS (row 57), iodepth=1, load1 avg 9.25.
zvol fastest sync test wrote 14.9 GiB at 84.8 MiB/s at 85 IOPS (row 58), iodepth=24, load1 avg 4.84, noteworthy are the higher write latencies.
baseline fastest sync test wrote 10.4 GiB at 59.1 MiB/s at 59 IOPS (row 122), a ~31% ⬇, iodepth=24, load1 avg 0.13.
⚠⚠⚠ The fastest zvol and zfs-fs sync tests were tied but zfs-fs created ~1.9 times (~91%) more load on the system for the same performance.
When comparing the performance traits of sync vs. fsync vs. fdatasync there seems to be a correlation that sync tests cause significantly higher load1 avg on zvol tests. zfs-fs does not appear to be impacted by this. Check this top 20 view of iodepth=24, directio, sync tests, and look at the correlation between columns Z and AT:
image.png
Have a look at the zfs-slog-zvol-write-1M-sync-standard-logbias-latency and zfs-slog-fs-write-1M-1M-sync-standard-logbias-latency tests for example:
image.png
For the zfs-fs tests it seems fdatasync provides the best performance: 1.09 times faster than fsync (8.6% ⬇) and 1.4 times faster than sync (29% ⬇).
For the zvol tests it seems fsync provides the best performance: 1.58 times faster than fdatasync (37% ⬇) and 2.95 times faster than sync (66% ⬇). Note the huge load1 avg for the sync test.
My tests show that different test settings and devices have different sync characteristics. It's not easy to pick a clear winner between sync, fsync and fdatasync for this test case.
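The ⬇ percentages above follow from the "times faster" figures: if A is N times faster than B, then B is (1 - 1/N) x 100% slower than A. A small sketch of that conversion using the zvol figures (the function name is illustrative):

```python
def pct_decrease(times_faster: float) -> float:
    """Percentage by which the slower result trails the faster one."""
    return (1 - 1 / times_faster) * 100


# zvol: fsync vs. fdatasync (1.58x) and fsync vs. sync (2.95x).
print(round(pct_decrease(1.58)))  # 37
print(round(pct_decrease(2.95)))  # 66
```

Note that the decrease is always smaller than the matching increase: 2.95 times faster is a 195% increase in one direction but only a ~66% decrease in the other.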
sync zvol slog vs. zvol:
image.png
Here we can see that slog helps sync performance: 8.76 times faster, a 776% increase in performance. The performance orders of magnitude are small but nonetheless it's a noteworthy difference.
I spotted one test config where slog performed worse than non-slog:
image.png
Studying the bottom of the leader-board, it was dominated by slog-zvol tests with very high load1 averages. A pattern there is that they are buffered tests.
image.png
sync zfs-fs slog vs zfs-fs:
image.png
Here we can see that slog helps sync performance: ~9.3 times faster, an 831% increase in performance. The performance orders of magnitude are small but nonetheless it's a noteworthy difference.
One would expect that if using a modern fast slog device like optane, that the slog test results would be orders of magnitude faster. This was tested and observed by napp-it, of the write up.
Test case summary
Noteworthy is that the baseline tests beat zfs in this test case. The winning test in this category (row 2) was ~1.13 times faster with a ~13% increase in performance than the fastest zfs test (row 5).
ℹ it might be interesting to run a few 1M-4k and 1M-def tests, just to see if the baseline is still the fastest.
zfs-fs performs much better than zvol with async workloads.
zvol ties zfs-fs at the top of the sync leader-board.
💡 Recommendation: for async workloads: do not use zvol.
💡 Recommendation: for sync workloads: zfs-fs and zvol tied for 1st place in their fastest test configs. zfs-fs produced more system load than zvol, but only in the fastest config; other tests recorded nominal load1 avg.
💡 Recommendation: use slog to significantly boost sync workload performance.
Test case: write random 4k - results ordered by MiB/s descending
image.png
Screenshot of the top 30 results. There are a total of 672 results in this test case, so I will try to break down the most relevant sub cases.
Baseline async: the fastest test (row 97) wrote ~3.5 GiB IO at ~20.2 MiB/s at 5161 IOPS with buffered IO.
BUT this is a false result because the test took 1547 seconds aka ~26 minutes longer to complete than fio recorded. So 20 MiB/s becomes 2.1 MiB/s.
This means we can ignore rows 97, 99 and 106 as invalid, and the fastest test becomes row 122, which wrote ~0.6 GiB IO at ~3.4 MiB/s at 865 IOPS with directio, iodepth=24.
Directio tests performed faster MiB/s than buffered, and without timing issues.
Increasing iodepth increased performance and also mean write latency.
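The correction applied to row 97 can be sketched as follows. The runtime fio measured is inferred here from the data written and the reported rate (an assumption, since the scheduled runtime is not restated in this section), and the overrun is then added to it. The function name is illustrative:

```python
MIB_PER_GIB = 1024


def effective_mib_per_sec(gib_written: float, reported_mib_s: float,
                          overrun_s: float) -> float:
    """Throughput once the time fio did not account for is included."""
    mib_written = gib_written * MIB_PER_GIB
    fio_runtime_s = mib_written / reported_mib_s  # inferred, see lead-in
    return mib_written / (fio_runtime_s + overrun_s)


# Row 97: ~3.5 GiB at a reported ~20.2 MiB/s, finishing 1547 s late.
print(round(effective_mib_per_sec(3.5, 20.2, 1547), 1))  # 2.1
```

The same calculation can be applied to any row with a timing issue before comparing it against rows that finished on schedule.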
image.png
Baseline sync: the fastest test wrote ~0.3 GiB IO at 1.9 MiB/s at 475 IOPS which is expected for the disk.