OpenZFS zvol performance issues

zvol performance issues

Author:

/u/kyle0r⁠

github/kyle0r⁠

⁠

Page born on: 2021-03-08 Updated: 2025-03-16 Published: 2022-12-31

2025-Q1 update

zvols

As I shared in my original research, back in 2021-Apr I discovered the discussion OpenZFS Issue #11407, entitled: Extreme performance penalty, holdups and write amplification when writing to ZVOLs. I contributed to the discussion, and the last comment before closure was from me in 2023-Jan (here). To the best of my understanding the zvol performance issues have not been resolved, at least not the fundamental design issues.

The sensationalist quote from the #11407 discussion is from sempervictus (here):

“ZVOLs are pretty pathologically broken by design.”

⁠

SMR vs. CMR drives

In 2023-Jan I opened issue #14346 on the OpenZFS project, entitled: ~30% performance degradation for ZFS vs. non-ZFS for large file transfer

There was some useful community contribution and the high level conclusion was: avoid SMR drives with OpenZFS, and at the very least be informed about the downside and side-effects if SMR drives are the only choice, for example in a high density 2.5” chassis.

Citing myself from issue #14346:

Relating to SEQ WRITE TESTS: My SATA SMR disk suffers ~30% performance degradation vs. the baseline test without ZFS 🤬 My SAS CMR disk is within ~4% of the baseline test without ZFS. 🥳

⁠

Summary

OK, so issue #11407 relates directly to this research, what about #14346? The SMR performance degradation is relevant because the drive used in this research was SMR. i.e. the research hit a double-whammy of zvol and SMR performance issues with ZFS.

There are also other troubling behaviours displayed in #14346 related to IO drop off:

Why does ZFS suffer catastrophic write IO drop off without recovery on my SMR drives?

I suspect this is related to OpenZFS issue #9130, entitled: task txg_sync blocked for more than 120 seconds, which I’ve also contributed to (most recent post).

Citing myself from #9130:

I have a strong hunch that the root cause of this issue (#9130) and #14346 are the same, and task txg_sync blocked for more than 120 seconds is a symptom of whatever the issue might be.

Citing myself from #14346:

The ZFS read I/O pattern on the SAS CMR is worrying, and really worrying on the SATA SMR disks. What worries me the most is the pattern seems to be the same - more pronounced on the SMR disks.

Visualisation from a CMR drive read test run from inside a kvm: Left is a read test from an XFS raw volume, the right a read test from XFS raw volume stored on ZFS:

⁠

Citing myself from #14346:

There is a clear correlation: as the 'average time for I/O requests issued to the device being served' increases, IOPS decrease and throughput also decreases. The average I/O size is constant.

This issue is more pronounced on SMR drives, which also suffer the catastrophic write IO drop off without recovery.

Future re-test topics

The main comparisons would be: SMR vs. CMR, raw xfs on zfs vs. xfs on zvol, and ZFS defaults vs. directio.

Cherry-pick some key tests and do a general re-run with the latest kernel/zfs versions to see if anything significant has changed in recent years.

Cherry-pick tests as above, check if the zvol_use_blk_mq (#13148) feature is now stable, and test zvols again. From memory it had issues, including a corruption bug, when it was first released?

Cherry-pick tests as above and test the recently added directio (#10018) feature.

⁠

zvol: Support blk-mq for better performance (updated) by tonyhutter · Pull Request #13148 · openzfs/zfs⁠

⁠

Direct IO Support by bwatkinson · Pull Request #10018 · openzfs/zfs⁠

⁠

TL;DR

This topic started out as a fairly simple question but it soon snowballed into deeper research and in my opinion deserved a detailed breakdown and write up.

It’s difficult to summarise 10,000 words; it’s a fairly complex topic. If you want to do minimum reading and maximum understanding, then check out:

⁠

Preface⁠

⁠

A simple cp command within a kvm sends hypervisor load1 avg skyrocketing⁠

⁠

Non-synthetic tests within the kvm⁠

⁠

Re-run of the cp test on the hypervisor⁠

⁠

Overall conclusions⁠

⁠

Some high level stats

During this research I generated a batch of tests which had a cumulative runtime of ~8 days (TODO: scripts here).

The batch tests performed ~10 TB of write IO and ~7 Million IOPS, and 9.5 TB of read IO and ~3.7 Million IOPS.

There were supposed to be 2448 tests but the OpenZFS zvol code (zfs-2.0.3-pve2) was too unstable for certain tests so 201 tests were skipped to avoid consistently crashing the system/kernel.

Preface

In April 2021 I identified an OpenZFS zvol performance issue on my home lab proxmox hypervisor, which is Debian buster based. I was provisioning a new zpool for usage within my private data vault and stumbled across some unexpected problems with zvols.

Why am I making this post? I’m looking for community feedback and collaboration in the hope of finding an explanation, and perhaps help to uncover and fix an issue with my setup and/or zvols. In my opinion I’ve collected too much info to organise neatly on a GitHub issue or forum post, so this post can be used as a reference to cherry-pick relevant details from.

“I’m 100% ready to accept this is a problem with my setup or config.”

I hope by sharing some of my empirical testing results, I might inspire someone to at least check if they experience similar issues or if it’s just me.

Goals

Figure out why zvols are performing so badly on my setup, and look for fixes, workarounds or alternatives.

Figure out if there is something wrong with my setup.

Run some tests (including base-lining), and use the results as a guide for optimal configuration.

Learn more about OpenZFS and its benefits, pitfalls, costs, and overheads.

Share my findings to benefit others, and give back to open source contributors and users.

Make my testing approach understandable.

WIP: Share scripts and make testing reproducible.

An important note. This is a SINGLE hardware sample. I’ve done a wide range of test variations BUT only a single node, single disk etc, so this doesn’t rule out the possibility that there is an anomaly with my setup. I didn’t get any anomalous feelings from the results (aside from zvol buffered write tests crashing my kernel) but keep in mind that if others cherry-pick some of my tests to repeat on their hardware and setups, it would validate or invalidate my findings.

I asked myself - does it even matter?

When investing the hours I asked myself “Does it even matter?”, “Am I wasting my time?”, “Will anyone care?”.

It does feel like I’ve stumbled across a problem, at least with my setup. I had expectations on how zvol block devices would perform and the results went against those expectations.

If I find a solution or answers and share that to benefit others... that is great. If it triggers discussion with the OpenZFS project team or proxmox team, maybe good things can come from that.

I must admit, I was sceptical about wasting time when the large batch of tests was running, but the end results have been enlightening for me on how current ZFS performs (and crashes) with various workloads and io depths on my setup. One of my original goals was to understand how different workloads performed on ZFS, and thanks to the spreadsheet results I can use them as a guide to get the most out of my setup and disks. This is not only useful for my home lab and private data vault but also for my storage topics in production, where I deal with petabytes of data. So overall it’s been a worthwhile investment, and I hope the results are enlightening for others too.

Story time

In the last decade my storage strategy has been “keep it simple”. In the past I’ve had my fingers burnt and watched colleagues experience similar issues with various RAID levels and storage technologies, not to mention my own mistakes, and the many horror posts you can read online. So I'm an advocate of keeping it simple.

I’ve developed a healthy respect for how much one can trust RAID, storage systems, and filesystems. This leads me to be patient and observant of developments, for example in btrfs and OpenZFS. I test and contribute where I can, and utilise such projects because they have great features, but I keep verified backups for anything critical.

In a 25 year career one makes enough mistakes and becomes aware of topics like bit-rot and write holes, and in summary one comes to realise with experience:

This is basically what I’m trying to achieve with my updated private data vault strategy. Keep it simple but also replicated and verified off-site. My current strategy is OK but it is cost prohibitive to keep an up-to-date replica off-site. So in a nutshell:

Keep it simple AND reduce the costs of maintaining a near-time off-site replica of TiB’s of data.

The year is 2017

Until now I’d had good levels of success with OpenZFS on Linux, and I’m a proxmox user and advocate. I have been a part of provisioning dozens of production hypervisors for mission critical systems where the root file systems were running OpenZFS (and ceph for block store). After this good experience I took a decision in 2017 to build my current gen private data vault based on proxmox with OpenZFS root file system and VM root file systems.

In 2017 I didn’t trust OpenZFS enough to use it for my historical private data vault storage, so the data vault itself is straightforward:

Single disks with LUKS+XFS and mergerfs to create a single folder hierarchy from multiple disk hierarchies. Thanks to SnapRAID I have near-time triple disk parity, so I can handle up to 3 concurrent disk failures without data loss. SnapRAID also provides a near-time data scrubbing capability to detect bit rot and co. The large majority of my vault data change rate is glacial, so real-time parity and checksums are nice to have but not a must-have. SnapRAID suits my needs and since 2017 this setup has been reliable and easy to maintain.

It is also reassuring that any of the storage disks can be mounted on any system supporting LUKS+XFS. No array complexity or vendor or technology lock-ins.

The year is 2021 🤘

With the new zpool I provisioned an existing kvm with an additional zvol disk and went about testing.

If testing went OK I planned to migrate away from single disk LUKS+XFS+SnapRAID ➡ to single disk ZFS zpools+XFS+SnapRAID to keep it simple but also take advantage of ZFS’s many features. e.g. CoW, checksums, encryption, compression, dataset snapshots, pool checkpoints, easy and cost effective off-site replication and differential backups.

You can find full details of my existing node setup and config here, including history and version info.

Important note: all my testing was on a zpool and vdev on spinning rust (as mentioned, a 2.5” 5400rpm drive), not SSDs.

What suggested there was a problem?

So when I started a copy job from an existing vm disk to the new vm zvol, not all was fine and dandy... time to dig into the details and figure out what is going on.

I observed that the new zvol was hitting performance issues during sequential write workloads. I was watching an rsync --progress from one physical disk (source was a simple virtio device pass-through on a kvm) to the new zvol and I could see the sustained write transfer for large binary files was choking somehow and the hypervisor load1 avg was skyrocketing. I happened to observe this during the copy of Samsung TV recording backups, so encoded video/binary data.

At this point my observability level was poor, so I installed netdata on both the hypervisor and kvm guest, which also checked off a long term TODO ✅. netdata is a fantastic sysops/SRE tool and project which I can highly recommend.

Deeper research - prerequisites

researching existing answers and solutions

In early March I started with the search term "slow zvol". I got some interesting hits:

ZFS Slow performance in zvol block device, but normal performance with normal files: this seemed like a very close match but was from 2013, which is relatively speaking a very long time ago. Post #2 was good info but I suspect somewhat outdated. zfsprops today supports sync=standard|always|disabled and the default on my node is sync=standard.

zfs, zvol, kvm settings/tuning (to fix SLOW VM sync-access): a recent post with lots of info and data but not really a conclusion; the author was happy enough with performance after adding an slog vdev: “slog helped a bit and that is enough”.

A bit later I updated the search to “zvol vs. dataset”, and that opened up lots of new info:

proxmox forum: zvol vs image on top of dataset. Hinting at potential issues.

/r/zfs: Benchmarking ZVOL vs QCOW2 with KVM, posted by Jim Salter, the author of sanoid and syncoid. He opened OpenZFS bug #7297, but that was a false positive highlighting an error in the testing config for O_SYNC.

OpenZFS#9: ZVOL Performance, by Brian Behlendorf. It’s an interesting indicator that this was captured 11 years ago; it covers some interesting details and captures some possible solutions. It’s unclear if/what has been worked on since then, but I imagine there have been some improvements.

/r/zfs: Benchmarking RAW IMAGE vs QCOW2 vs ZVOL with KVM, with results at openbenchmarking.org. This is a fairly concise result that zvol performs worse than disk images stored on zfs filesystems.

/r/zfs: libvirt - zfs datasets or zvols. Not so much info on performance differences but interesting views/info.

proxmox forum: ZFS, file or block level storage. Anecdotal, and highlights some of the negatives of using raw images specific to proxmox, snapshots and volume replication.

EDIT: Eventually, on April 3rd 2021, I stumbled across the following OpenZFS issue while looking for answers to related topics:

⁠

[Performance] Extreme performance penalty, holdups and write amplification when writing to ZVOLs · Issue #11407 · openzfs/zfs⁠

⁠

The good news was this thread was fresh and described nearly 1:1 my experiences too.

Update 2022-12-21: zvol performance issues are still happening with zfs-2.1.6-pve1 on pve 7.3-3/c3928077 (running kernel: 5.15.74-1-pve) as documented in

test #14 here⁠

Memtest86+

At the time of writing the system had 64GB ECC RAM. Memtest86+ 5.01 performed 1.8 passes with ~17 hours runtime (SMP: disabled). No errors.

update+upgrade

To ensure I had the latest and greatest packages and co I ran the usual update+upgrade process for the two relevant nodes:

✅ Hypervisor update+upgrade

✅ KVM update+upgrade

✅ zpool upgrade

This gave me:

At Thu 04 Mar 2021 07:57:46 AM UTC I did a zpool checkpoint rpool followed by zpool upgrade rpool and rebooted, and so far there have been no issues detected. I’ll wait a few more weeks before discarding the checkpoint.

Establishing a baseline

To get a baseline of the physical disk in its raw and native state I batched 128 fio tests (including warmups) to create a baseline of various workloads, read, random read, write, random writes for 4k and 1M block sizes and iodepth 1, 4 and 24, buffered, direct and various sync settings. All the tests were run directly against the drive without a filesystem i.e. /dev/disk/by-id/ata-ST5000LM000-2AN170_WCXXXXGQ
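A batch like this is essentially the cross product of workload, block size, queue depth and IO mode. The following Python sketch only builds fio command lines and does not run them. The helper name build_fio_cmds, the job naming scheme, the placeholder device path and the reduced set of IO modes are hypothetical and are not the scripts used for this research, so the number of commands it yields will not match the 128 tests mentioned above. The runtime and ramp time are assumed to match the later batch settings, and the fio options shown (--rw, --bs, --iodepth, --direct, --ioengine, --numjobs, --runtime, --ramp_time, --time_based) are standard fio options.

```python
from itertools import product

WORKLOADS = ["read", "randread", "write", "randwrite"]  # fio --rw values
BLOCK_SIZES = ["4k", "1M"]
IO_DEPTHS = [1, 4, 24]
# Hypothetical subset of IO modes; the real batch also covered sync variants.
IO_MODES = {"buffered": ["--direct=0"], "direct": ["--direct=1"]}


def build_fio_cmds(target):
    """Return one fio argument list per workload/bs/iodepth/mode combination."""
    cmds = []
    for rw, bs, depth, (mode, flags) in product(
        WORKLOADS, BLOCK_SIZES, IO_DEPTHS, IO_MODES.items()
    ):
        name = f"baseline-{rw}-{bs}-qd{depth}-{mode}"  # hypothetical naming
        cmds.append(
            ["fio", f"--name={name}", f"--filename={target}",
             f"--rw={rw}", f"--bs={bs}", f"--iodepth={depth}",
             "--ioengine=libaio", "--numjobs=1",
             "--runtime=180", "--ramp_time=10", "--time_based", *flags]
        )
    return cmds


if __name__ == "__main__":
    # WARNING: write workloads against a raw device destroy its contents.
    cmds = build_fio_cmds("/dev/disk/by-id/EXAMPLE-DISK")  # placeholder path
    print(len(cmds), "commands")
    print(" ".join(cmds[0]))
```

Building the commands as data first makes it easy to number each test and to review the whole matrix before anything touches a disk.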

TODO some results summary of the baseline.

ZIL and slog and related observations of synchronous IO

I’ve placed these topics after the batch testing section because they could distract from the main topic of zvol performance issues. Utilisation of ZIL and slog is workload dependent and the batch test results do a good job of presenting this. It makes sense to check the batch results observations first in my opinion.

A simple cp command within a kvm sends hypervisor load1 avg skyrocketing

Interestingly, a simple cp of an 8GiB file filled with random data previously generated with fio, run from within a kvm, had different outcomes on different storage types, with a zvol causing a very high sustained load1 avg of ~30 on my 12 core 24 thread node.

I’ve included some select hypervisor graphs from netdata (thanks print mode) comparing the two cp operations.

Please note the cp tests were run inside a kvm and the graphs are from the hypervisor.

In the following graphs you’re looking at two cp operations with a pause in-between.

The source of the cp file was a separate non-zfs disk, provisioned to the kvm as a full virtio_blk disk.

The left (earlier) cp destination was a zvol, provisioned to the kvm as virtio_blk.

The right (later) cp destination was an xfs formatted raw disk image stored on a zfs filesystem, provisioned to the kvm as virtio_blk.

Both the zvol and raw disk were stored on the same zpool and vdev.

⁠

Non-synthetic tests within the kvm

The netdata graph snapshots are great for showing behaviour during the test but not great for direct results comparison, so I captured the runtime for cp, rsync and dd operations. This makes it easy to calculate avg MiB/s.

The test setup: a 30 GiB random data test file generated previously by fio.

The left hand tests are a virtio_blk raw xfs disk volume stored on zfs filesystem.

The right hand tests are virtio_blk zfs zvol.

⁠

From these results I copy+pasted the relevant data to a spreadsheet and did some basic maths, and got the following graphs.

⁠

Conclusions

kvm virtio zfs filesystem stored raw disk image vs. zfs zvol:

Raw disk image operation is 1.26 times faster MiB/s with a 25.9% increase in speed on the rsync test.

Raw disk image operation is 1.12 times faster MiB/s with a 11.9% increase in speed on the cp test.

Raw disk image operation is 1.11 times faster MiB/s with a 11.3% increase in speed on the dd test.

The CPU time graph shows zvol requiring more kvm vCPU time in each test.

These results confirm what I witnessed when I first inspected zvol performance issues when observing an rsync --progress.

Re-run of the cp test on the hypervisor

I thought it would be a good sanity check to eliminate the virtualisation aspect, and test how the hypervisor handles the simple cp test. Source file is a random 8GiB file produced earlier by fio.

The differences between the kvm cp test and this one:

It’s on the hypervisor rather than inside a kvm.

The cp target for the zfs filesystem is direct to the zfs filesystem rather than writing into a raw disk image stored on the zfs filesystem.

The cp source is a zfs SSD mirror rather than a kvm virtio raw disk pass-through; this should make zero difference as neither is a bottleneck.

Note that the cp commands complete faster than the actual writes to disk due to async io. It takes a few minutes for the blocks to actually reach the disk as seen in the graphs.

cp destination zvol (left) and zfs filesystem (right).

⁠

observations

From the graphs, note how the zvol test takes much longer (~3m at ~46 MiB/s) to write its io’s to disk than the zfs filesystem (~2.2m at ~62 MiB/s). So the zfs filesystem exhibits a ~25% decrease in runtime and the MiB/s is ~1.34 times faster than the zvol test.

The load avg is also much higher for the zvol test: ~20 peak vs. ~3.5 peak. That is ~5.7 times more peak load1 avg for the zvol test (a ~470% increase); put the other way, the zfs filesystem peak is ~82% lower. iowait and CPU usage also look better for the zfs filesystem.

The avg completed i/o bandwidth, aka avg io block size, is much greater and nearly consistent for the test duration for the zfs filesystem test, which equates to fewer IOPS required to do the work, showing the zfs filesystem appears to be optimising the io block size in its io queue. zvol has much higher and inconsistent IOPS and lower io bandwidth.
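The figures in these observations can be recomputed with a few lines of Python. This is a minimal sketch: the helper names mib_per_s and pct_change are hypothetical, and the inputs are the rounded values read off the graphs (8 GiB file, ~3m vs. ~2.2m, load1 peaks of ~20 vs. ~3.5), so the outputs only approximately match the numbers quoted. Note that the percentage increase and the percentage decrease between the same two values are different numbers: going from ~3.5 to ~20 is a much larger percentage than going from ~20 to ~3.5, and the ~82% figure corresponds to the latter.

```python
def mib_per_s(size_gib, seconds):
    """Average throughput in MiB/s for size_gib written in `seconds`."""
    return size_gib * 1024 / seconds


def pct_change(old, new):
    """Signed percentage change going from old to new."""
    return (new - old) / old * 100


if __name__ == "__main__":
    size_gib = 8                            # size of the cp source file
    zvol_s, zfs_fs_s = 3 * 60, 2.2 * 60     # approximate times from the graphs
    zvol_bw = mib_per_s(size_gib, zvol_s)
    fs_bw = mib_per_s(size_gib, zfs_fs_s)
    print(f"zvol {zvol_bw:.1f} MiB/s, zfs filesystem {fs_bw:.1f} MiB/s")
    print(f"throughput ratio: {fs_bw / zvol_bw:.2f}x")
    print(f"runtime change: {pct_change(zvol_s, zfs_fs_s):.1f}%")
    # Peak load1 avg: ~3.5 (zfs filesystem) vs. ~20 (zvol)
    print(f"load ratio: {20 / 3.5:.1f}x")
    print(f"increase 3.5 -> 20: {pct_change(3.5, 20):.0f}%")
    print(f"decrease 20 -> 3.5: {pct_change(20, 3.5):.1f}%")
```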

slog being utilised for zvol but not for zfs filesystem?

Here you can see the disk used for the zpool slog. In the zvol cp test you can see a burst of writes shortly after the cp starts.

Note the cp to the zfs filesystem appears not to use the slog at all. This seems related to the default sync behaviour for zfs filesystems vs zfs zvols being different.

⁠

ZFS graphs that display activity

⁠

conclusion of cp test inside kvm vs. hypervisor

At this point I can say that the issue that I observed with zvols inside the kvm is reproducible directly on the hypervisor. As expected, the hypervisor performs a bit better than the kvm, without the para-virtualisation layer.

Hypervisor graphs comparing two fio runs.

⁠

Beyond Compare comparison of the two fio runs. These results were the best of a few runs to warm up caches/buffers etc.

⁠

Full fio command run on both tests:

As before, in the following graphs you’re looking at two tests with a pause in-between.

The left (earlier) fio was run on a zvol formatted with xfs.

The right (later) fio was run directly on the zfs filesystem.

Both the zvol and filesystem were in the same zpool and vdev. Here are some select hypervisor graphs from netdata:

Please note the fio tests were run directly on the hypervisor.

⁠

cp test vs. fio test

A note on synthetic benchmarks. fio is a great tool in the belt and allows for repeatable and detailed comparison of io workloads. It does exactly what it says on the tin... flexible io’s. That being said, one has to keep in mind that fio tests are typically not real world tests, and this is why I started with a cp example for comparison. It is possible to capture block traces with blktrace and have fio repeat real world io workloads but I want to keep that out of scope for now.

With a bit of luck and intuition, I’m actually quite pleased with myself that the resulting graphs from the cp and fio tests presented above are relatively comparable. I only noticed this fully during the write up.

Observations

Note the disk graphs are from the sdk device, which is the underlying physical disk.

Why are there lots more zfs arc operations (zfs.reads and zfs.list_hits) for zvol vs. zfs filesystem?

load1 avg skyrockets for zvol tests reaching ~40 on a 3 minute fio test.

fio IOPS and BW are markedly worse for zvol tests.

For zvols, tuning the fio iodepth setting seems to dramatically impact load1 avg, with higher depths producing higher load.

Why are there many more zfs important operations for the zvol?

Expectedly the power usage and CPU temps are higher for the zvol tests.

It would seem that the zfs filesystem is optimising submitted io’s into larger io’s, therefore the actual io size being written to underlying disk is much larger, so the physical result is lower IOPS but higher bandwidth throughput. The zvol doesn’t seem to be able to achieve the same, the graphs show much higher IOPS, smaller avg io size and less throughput, and much more system load.

CPU context switches, new processes, and blocked processes are much higher for zvol tests.

CPU contention and stalling are much higher for zvol tests.
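The relationship behind the IOPS vs. bandwidth observation is simply throughput = IOPS × average IO size. As a minimal Python sketch (the helper names throughput_mib_s and avg_io_size_kib are hypothetical), using the raw disk baseline read figures reported in the test cases of this write-up (~139 MiB/s at ~35K IOPS for 4k blocks, and ~139 MiB/s at ~139 IOPS for 1M blocks):

```python
KIB = 1024
MIB = 1024 * KIB


def throughput_mib_s(iops, io_size_bytes):
    """Throughput in MiB/s given IOPS and average IO size in bytes."""
    return iops * io_size_bytes / MIB


def avg_io_size_kib(throughput_mib, iops):
    """Average IO size in KiB given throughput (MiB/s) and IOPS."""
    return throughput_mib * MIB / iops / KIB


if __name__ == "__main__":
    # Same bandwidth, very different IOPS, depending on the IO size.
    print(f"{avg_io_size_kib(139, 35_000):.2f} KiB avg io size at ~35K IOPS")
    print(f"{avg_io_size_kib(139, 139):.0f} KiB avg io size at ~139 IOPS")
    print(f"{throughput_mib_s(139, 1 * MIB):.0f} MiB/s for 139 IOPS of 1 MiB")
```

The same bandwidth can therefore be delivered with a few large IOs or many small ones, which is why a smaller average IO size on the physical disk shows up as higher IOPS for less throughput.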

fio synthetic batch testing

During this research I generated a batch of tests (scripts here) which had a cumulative runtime of ~8 days.

The batch tests performed ~10 TB of write IO and ~7 Million IOPS, and 9.5 TB of read IO and ~3.7 Million IOPS.

zvol instability

!!! Due to instability of zvol buffered tests on my setup causing kernel crashes, a chunk of zvol tests were skipped. 201 of 2448 tests were skipped. The skipped tests were 0456 to 0656.

Testing setup

Hardware setup is described

here⁠

in detail. All the synthetic tests in this batch were performed on the hypervisor.

Historically the node was born with proxmox 4.x installed from official pve install media and has subsequently been upgraded to 5.x and at the time of writing upgraded to 6.x.

During the test the versions were as follows:

The data vdev was a: 2.5” SEAGATE Barracuda 5TB SMR HDD SATA 6Gb/s 5400rpm.

For tests with slog, the vdev was a: CRUCIAL MX300 SATA 6Gb/s SSD.

Each test has a reference in column V, the bash script number, aimed to make it easy to discuss and/or run a specific test for comparison/validation of results.

All tests were scheduled for 3 mins with 10 seconds ramp time.

In-between each test a 2 minute break was added to allow system loadavg to recover, and writes to be flushed to the physical device(s).

All tests used fio version 3.12 and the libaio io library.

All tests ran numjobs=1 (single process) with varying iodepth.

ZFS datasets used encryption set to aes-256-gcm and lz4 compression.

The unique filenames in the fio tests were as follows:

Warm ups

At the start of testing, or when the batches make a significant change in testing config, a warmup run=1 was performed to warm up caches and co. All the results I review and observe herein are run=2 test results. The run=1 tests are filtered out but still available in the raw data; check column CL name_full for that detail if you are interested in those test results. It is expected that run=1 results are slower and/or inconsistent.

Raw results data

I’ve shared a read-only copy of the batch results on Google Sheets

here⁠

Highlighting legend

The burgundy text rows highlight zvol tests.

The colour scales use red for worst, green for best and sometimes cream-ish for midpoint.

In most cases I left low or unremarkable results with default/white to avoid colour flooding which can distract from more interesting results.

Notes on when write IO is physically written to a device.

This topic is perhaps not something we think about in day to day computing, but for certain real-world workloads like databases, and in this kind of testing, it’s important to be aware of when write io is physically written to a device and how it relates to performance. One also has to consider there are many components in writing IO to a device.

Under Linux at least some of these stages will be involved depending on the IO settings:

the writing program → kernel write system calls → buffered IO populating page cache OR direct IO bypassing page cache → the filesystem → the volume manager → the IO scheduler and queue → the block layer → the upper, mid and low level drivers → the physical device.

⁠

Thomas Krenn has a nice visual of the Linux storage stack for kernel 4.10 here⁠

If we think of this sequence as “the writing pipeline”, only when all parts of the pipeline are configured for synchronous write IO will each write IO wait until the physical device acknowledges the operation; in this case there is a round-trip penalty per IO. If the pipeline isn’t fully synchronous then IO could have a mixture of sync and async modes in different parts of the pipeline.

During my simple testing prior to batch testing, there are graphs visualising the time taken for a write workload to be written

here⁠

Note that the cp commands I tested complete much faster than the actual writes to disk due to buffers and co. It takes a few minutes for the blocks to actually reach the disk as seen in the graphs.

This tells us a few things are happening for the given cp operation: 1) the IO is asynchronous and 2) it is at least partially utilising page/file system cache, aka buffered writes.

To summarise

Async IO hands off the workload as fast as possible to the subsystem.

Async buffered write IO pushes its IO workload into the page cache and IO subsystem as fast as it can, and reports the result of that handover to the requester.

Async direct IO behaves much the same but bypasses the page cache, writing directly to the subsystem. For certain scenarios this is preferred, and in most cases, especially for sustained or long workloads, it should perform better than buffered IO as it avoids the “double write” cost of the page cache.

The drawback of async IO is that there is no guarantee that all writes that the subsystem should perform really got acknowledged by the physical device. The writing application entrusts the IO workload to the subsystem and hopes for the best.

Sync IO, direct or buffered, is therefore designed to provide a guarantee that each write IO is acknowledged as physically written to the target device. The writing application waits for each IO to make the round-trip down and back up the pipeline, acknowledging each write IO. The guarantee is great for data integrity but inherently slower than async IO.

Sync IO workload performance is therefore limited by the speed and number of concurrent write acknowledgements the vdev(s) can make.

I ran fio test batches for async buffered and directio, and sync IO, including buffered, directio, sync, fsync, and fdatasync.
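To make the round-trip cost concrete, here is a minimal Python sketch that contrasts plain buffered writes with writes that call fsync after every block. It is not one of the fio tests from this research; the function name timed_write and the block count are hypothetical, and the absolute timings depend entirely on where the temporary directory lives (on a RAM-backed tmpfs the difference can be negligible).

```python
import os
import tempfile
import time

BLOCK = b"\0" * 4096
COUNT = 256  # 1 MiB in total, deliberately small


def timed_write(path, sync_each_write):
    """Write COUNT blocks to path; optionally fsync after every write."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    start = time.perf_counter()
    try:
        for _ in range(COUNT):
            os.write(fd, BLOCK)       # lands in the page cache
            if sync_each_write:
                os.fsync(fd)          # block until the kernel reports it flushed
    finally:
        os.close(fd)
    return time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        buffered = timed_write(os.path.join(tmp, "buffered.bin"), False)
        synced = timed_write(os.path.join(tmp, "synced.bin"), True)
    print(f"buffered: {buffered:.4f}s, fsync per write: {synced:.4f}s")
```

On a spinning disk the fsync variant is typically far slower, because every block waits for its own acknowledgement instead of being handed to the page cache and flushed later.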

General test case observations

Test runtime and zvol direct IO vs. buffered IO

During testing, results showed that while each fio test was scheduled for 180+10 seconds, in some test configurations the overall runtime would take longer. This was measured by timing the bash script invoking fio. Sometimes the runtime was much longer than configured, and this wasn’t something fio was able to internally provide as a test result metric.

Why is this important?

If it wasn’t measured the extra time taken for a test would be invisible.

A result where fio reports a 190 second runtime but doesn’t finish until later, sometimes much later, is a false result.

This measure can be considered a KPI for workload+config stability and reliability: if total runtime != fio reported runtime then potentially something is wrong or unexpected.

Measuring this KPI has highlighted a concerning issue with buffered write IO, especially for zvol tests. Take this example for write-seq-4k:

⁠

Observe that these tests are identical except for buffered vs. direct IO. The buffered test should have taken a total of ~190 seconds including the 10s ramp up. fio reported the test took ~180 seconds but in reality fio took another 13970 seconds, i.e. ~3.9 hours, to finish, which is a ~7761% increase over the expected runtime.

Where the issue lies is not clear but this test configuration makes the writing subsystem very unhappy.
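The “total runtime” KPI can be approximated by timing the whole process from the outside and comparing that with the runtime the tool reports. A minimal Python sketch, assuming hypothetical helper names run_timed and overrun_pct and using a trivial stand-in command rather than fio (my tests timed the bash script invoking fio):

```python
import subprocess
import sys
import time


def run_timed(cmd):
    """Run cmd and return its wall-clock duration in seconds."""
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    return time.monotonic() - start


def overrun_pct(reported_s, wall_s):
    """Extra wall-clock time as a percentage of the reported runtime."""
    return (wall_s - reported_s) / reported_s * 100


if __name__ == "__main__":
    # Stand-in command; in practice this would be the fio invocation.
    wall = run_timed([sys.executable, "-c", "pass"])
    print(f"wall clock: {wall:.2f}s")
    # Figures from the write-seq-4k example: ~180 s reported, 13970 s extra.
    print(f"overrun: {overrun_pct(180, 180 + 13970):.0f}%")
```

Any overrun well above zero is a signal that the reported throughput does not describe the full cost of the workload.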

Let’s compare the same problem test with an slog device in the zpool:

⁠

The slog seems to reduce the dramatic impact: the test is only 7.8 minutes longer than the expected 3 minutes, an increased runtime of 268%. This is still very poor performance but better than the 7761% increase of the non-slog test.

Let’s compare the same problem test with sync IO:

⁠

Here we observe the runtime is normal for the sync test.

Conclusions

It would be cool if the fio project could take this “total runtime” factor into consideration as a native measure/KPI in test results. I’ll try to remember to raise an issue on the project to check my findings.

Why does this problem/behaviour happen especially with zvol tests? The exact mechanics are unclear but it does seem related and compounds the overall zvol performance issues.

What was clear is that there was a strong correlation with zvol + async workloads, and when combined with buffered IO, massive performance and system loadavg problems were recorded.

I feel strongly that the root cause of this issue is responsible for crashing my kernel and forcing me to skip 201 zvol tests, and maybe it relates to poor zvol performance.

If you filter the raw results tab write-4k on column AT opt_async_or_sync eq async, and then look at the difference between column AP opt_io_type eq buffered vs. direct, you’ll clearly see what I’m getting at: lots of nok in columns N, Q and S.

💡 Recommendation: do not combine zvol + buffered async write IO, you’re going to have a bad time.

Evidence:

⁠

Test cases

Test case: read seq 4k - results ordered by MiB/s descending

⁠
⁠
⁠

Baseline: the fastest baseline test read ~24 GiB of IO at ~139 MiB/s and ~35K IOPS, which is as expected for the disk.

Noteworthy is that directio with low iodepth performed poorly with the workload e.g. row 24/25 vs 29.

There were no timing issues.

Buffered io performance was nearly identical for all iodepths. There were no timing issues.

🥇 Outright winner: row 2: zfs-fs-read-4k default recordsize, directio, iodepth=1.

Read ~159 GiB at ~906 MIiB/s and ~232K IOPS.

zvol: first I’m going to compare row 8 zfs-zvol-read-4k vs. row 7 zfs-fs-read-4k-def. Both tests ran async directio with iodepth=24.

This is the fastest zvol test in this test type, and it cost 7.9 times more resources than its competitor, that is a 690% increase in resources for ~9% less performance. The 1m load avg at the end of the test was 9.41 vs. 1.19. This is a horrible negative cost/penalty for zvol.
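The “N times” and “% increase” figures used throughout these test cases are two views of the same ratio. A minimal sketch of the arithmetic, using the two load figures quoted above (9.41 vs. 1.19); the helper names are hypothetical:

```python
def times_and_pct_increase(value, reference):
    """Return (ratio, percent increase) of value relative to reference."""
    ratio = value / reference
    return ratio, (ratio - 1.0) * 100.0


def pct_decrease_from_ratio(ratio):
    """Percent decrease seen from the faster side of an 'N times faster' ratio."""
    return (1.0 - 1.0 / ratio) * 100.0


ratio, pct = times_and_pct_increase(9.41, 1.19)
print(f"{ratio:.1f} times, {pct:.0f}% increase")
```

The second helper covers the reverse direction: something that is 6.4 times faster makes the slower side roughly 84% slower, which is why a large “times faster” figure and a sub-100% penalty can describe the same pair of results.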

All the other zvol tests fall off a performance cliff.

Next I would draw attention to column BI read bandwidth standard deviation which is ~50 times greater than the competitor zfs-fs test, that is ~5000% performance deviation increase, which is another negative for zvol.

iodepth: greater depth doesn’t always mean more performance. Rows 2, 6 and 7 show this: iodepth=1 performed better than iodepth=4 and 24, and iodepth=4 performed better than 24. At least this was true for this read seq 4k batch.

Test case summary

Setting recordsize to 4k for the zfs-fs-read-4k-4k was circa 50% slower vs. zfs-fs-read-4k-def with default recordsize.

zvol was much more expensive in terms of system resources and always slower.

The fastest zfs-fs-read-4k-def test was 6.5 times faster than the fastest baseline-read-4k-raw-disk test, which is a ~550% increase in performance.

💡 Recommendation: for this disk type and 4k blocksize workload don’t use zvol and use zfs-fs with default recordsize.

Test case: read seq 1M - results ordered by MiB/s descending

ℹ Unfortunately this test case is sparse because of skipped tests due to zvol instabilities. If there are improvements to zvol in OpenZFS in the future, this test case is a strong candidate for a retest so zvol results can be included.

⁠

Baseline: the fastest baseline test read ~24 GiB IO at ~139 MiB/s and ~139 IOPS, which is as expected for the disk.

Baseline buffered vs. direct io performance was nearly identical for all iodepths.

There were no timing issues.

Buffered load avg was higher by ~11 times with an increase of ~1000%.

🥇 Outright winner: row 2: zfs-fs 1M recordsize, buffered, iodepth=1.

Read ~619.4 GiB at ~3523 MiB/s and ~3523 IOPS.

Direct vs. buffered io: for this 1M blocksize workload the tests show buffered io is more performant and has the same sort of 1m load avg.

zfs-fs-read-1M-1M iodepth=1 buffered (row 2) vs. directio (row 7): buffered read 2.2 times more GiB with a 118% increase in performance, 1616 MiB/s directio vs. 3523 MiB/s buffered. Directio with iodepth=4 (row 3) closed the gap, 3150 MiB/s vs. 3523 MiB/s, with the performance gap dropping to ~12%.

Test case summary

The winning test (row 2) vs the fastest baseline test (row 8) was 25.3 times faster than baseline with a ~2431% increase in performance. This is an incredible performance increase by zfs for this disk type. This is likely due to compression of the 1M blocksize workload and ARC cache optimisations.

!!! The zvol tests are missing because buffered zvol tests were unstable and caused the batches to crash the kernel.

ℹ this test category is ripe for re-testing if/when zvol is more stable.

ℹ It doesn’t make sense to test a 4k blocksize workload on 1M recordsize, so those tests were skipped.

💡 Recommendation: for this disk type and 1M blocksize workload don’t use zvol and use zfs-fs with 1M recordsize.

💡 Recommendation: If using directio then increasing iodepth/jobs to ~4 could yield up to double throughput gains.

Test case: read random 4k - results ordered by MiB/s descending

⁠

Baseline: the fastest baseline read ~0.2 GiB IO at ~0.9 MiB/s and ~240 IOPS which is as expected for the disk.

Baseline directio tests performed better than buffered.

Increasing iodepth from 1 to 24 performed ~2.4 times faster with a ~137% increase in performance, but note that the performance order of magnitude is small/slow for the baseline tests.

There were no timing issues.

🥇 Outright winner: row 2: zfs-zvol-randread-4k, direct, iodepth=24.

Read ~161.9 GiB at ~921.1 MiB/s and ~235.8K IOPS.

Direct vs. buffered io: for the fastest zvol tests in this 4k blocksize randread category, the tests show direct io is more performant and has a 1m load avg that scales with performance.

zfs-zvol-randread-4k iodepth=24 directio (row 2) vs. buffered (row 6), directio performed 5.2 times more IOPS with a 418% increase in performance 45.5k IOPS buffered vs. 235.8K IOPS direct.

Test case summary

This is the first test category where zvol outperforms zfs-fs, and by an impressive margin. In fact the slowest zvol test beats the fastest zfs-fs test.

zfs-zvol-randread-4k, directio, iodepth=24 (row 2) vs. zfs-fs-randread-4k-def (row 12), the zvol test is 8.3 times faster with a 736.6% increase in performance.

The winning test in this category (row 2) was ~983 times faster with a ~98248% increase in performance than the fastest baseline test (row 19).

The winning test for zfs-fs (row 7) was ~127 times faster with a ~12126% increase in performance than the fastest baseline test (row 19).

ℹ This test category demonstrates zfs in general is able to make some very impressive performance gains for random read workloads compared to the baseline disk performance, and zvol is able to make ludicrous performance gains.

💡 Recommendation: test your random workload with zvol to see if it beats zfs-fs.

⁠

⁠

Edit - Retest

When writing up these results I wanted to know more about why row 2 0389-job.sh was so ridiculously fast. So I ran 0389-job.sh without any warm-up or such, and got terrible performance compared to row 2, similar to the baseline. I was a bit puzzled but knowing the extreme high performance has to be related to RAM/ARC and optimisations, I did a bit more checking.

First I zeroed the /dev/zvol/fiotestpool/testenc/zvol4k with /dev/zero:

Then I ran 038{6..9}-job.sh and as you would expect reading zeros is very fast. IOPS: 134.5k, 11.7k, 35k, 138.5k respectively.

There were no reads from the physical device in these tests, so everything was happening in the ARC.

Then I ran the previous write test to that zvol followed by the read tests:

The performance was terrible. My conclusion is that the data written by fio in the original test and cached by ARC was able to be optimised, however it’s hard to reproduce this result with random data and random reads, which seems logical.

Retest summary: Reading performance beyond baseline is greatly impacted by the data being read. Reading zeros was easily optimisable and fast, reading random data not easily optimisable and slow.
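The gap between zeros and random data is easy to demonstrate outside ZFS. A minimal sketch, assuming compressibility is a reasonable proxy for how “optimisable” the data is; `zlib` is only a stand-in here and is not the compressor ZFS uses, and the 1 MiB block size is an arbitrary choice:

```python
import os
import zlib

BLOCK = 1024 * 1024  # 1 MiB of test data (arbitrary size for illustration)


def compression_ratio(data):
    """Uncompressed size divided by compressed size; higher means more compressible."""
    return len(data) / len(zlib.compress(data))


zero_ratio = compression_ratio(bytes(BLOCK))         # all zeros
random_ratio = compression_ratio(os.urandom(BLOCK))  # random bytes

print(f"zeros:  {zero_ratio:.0f}:1")
print(f"random: {random_ratio:.2f}:1")
```

Zeros collapse to almost nothing while random bytes do not shrink at all, which fits the observation that the zeroed zvol read fast and the randomly written one did not.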

Test case: read random 1M - results ordered by MiB/s descending

⁠

Baseline: the fastest baseline read ~13.1 GiB IO at ~74.7 MiB/s and ~75 IOPS which is expected for the disk.

Baseline directio tests performed better than buffered.

Increasing iodepth from 1 to 4 performed ~1.3 times faster with a ~26% increase in performance. iodepth=24 performed marginally better than iodepth=1 but not better than iodepth=4.

There were no timing issues.

🥇 Outright winner: row 2: zfs-fs-randread-1M-1M, directio, iodepth=1.

Read ~635.5 GiB at ~3615.3 MiB/s and ~3615 IOPS.

Direct vs. buffered io: directio performed marginally better than buffered with this workload.

iodepth: for this test case workload zfs performed better when iodepth=1, performance dropped off with iodepth=4 and 24.

Test case summary

The winning test in this category (row 2) was ~48 times faster with a ~4751% increase in performance than the fastest baseline test (row 8).

ℹ This test category demonstrates zfs in general is able to make some very impressive performance gains for random read workloads compared to the baseline disk performance.

Test case: write seq 4k - results ordered by MiB/s descending

⁠

Screenshot of the top 30 results,

see the raw data for more results.⁠

There are a total of 672 results in this test case, so I will try to break down the most relevant sub cases.

Baseline async: the fastest test wrote ~23.8 GiB IO at ~135.5 MiB/s at ~34.7K IOPS which is expected for the disk.

Directio tests performed better than buffered.

Increasing iodepth from 1 to 4 performed ~1.5 times faster with a ~51% increase in performance. iodepth=24 performed marginally better than iodepth=4.

The fastest buffered tests took ~65 seconds longer to complete than scheduled.

Baseline sync: the fastest test wrote ~0.3 GiB IO at 1.6 MiB/s at 413 IOPS which is expected for the disk.

Directio tests performed much better than buffered.

Tests performed better with higher iodepth.

Sync tests outperformed fsync and fdatasync tests.

⁠

🥇 Outright winner: row 2: zfs-fs-write-4k-def-sync-disabled, buffered, iodepth=1.

Wrote ~88 GiB at ~502 MiB/s and ~128K IOPS.

🥉 3rd place (row 4) with a 3% decrease in performance vs. 1st place was zfs-fs-write-4k-def-sync-disabled directio, iodepth=1.

5th place (row 6), with a 4.3% decrease in performance vs. 1st place, was zfs-slog-fs-write-4k-def-sync-standard-logbias-throughput buffered, iodepth=1.

Direct vs. buffered io: at the top of the leader board buffered performed marginally better than directio with this workload.

iodepth: at the top of the leader board zfs performed better with iodepth=1, performance dropped off with iodepth=4 and 24.

zvol: overall results are terrible for zvol. The fastest zvol test was in 45th position (row 46): zfs-zvol-write-4k-sync-disabled with buffered, iodepth=1, a ~84% decrease in performance vs. 1st place. This test also had a timing issue of ~30 seconds.

The first zvol test without timing issues was in 154th position (row 155) and suffered a 97.7% performance decrease at 11.4 MiB/s, with a horrible load1 average of 22.2.

zfs-fs 4k-def vs. 4k-4k: comparing the fastest comparable zfs-fs tests, 4k-def is the clear winner. Leaving the recordsize default is 6.4 times faster (~544% ⬆) than lowering to 4k which incurs an ~84% ⬇ performance penalty.

⁠

sync vs. async io: if sync=disabled tests are excluded, and we compare async vs sync tests:

top 20 async results:

⁠

At the top of the leader board slog tests are marginally faster, no major observations detected.

top 20 sync results:

⁠

Interestingly zvol beats zfs-fs in sync io on raw numbers.

zvol fastest sync test wrote 2.0 GiB at 11.4 MiB/s at 2922 IOPS (row 155), iodepth=24, load1 avg 22.2.

zfs-fs fastest sync test wrote 0.3 GiB at 1.7 MiB/s at 427 IOPS (row 177), an 85.3% decrease, iodepth=4, load1 avg 1.16.

baseline fastest sync test wrote 0.3 GiB at 1.6 MiB/s at 413 IOPS (row 177), an 85.8% decrease, iodepth=24, load1 avg 0.13.

ℹ ⚠⚠⚠ zvol fastest test performed ~6.8 times faster (~584%) than zfs-fs fastest but created ~19 times (~1814%) more load on the system.
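One way to weigh raw speed against system cost is to divide IOPS by the load1 avg. A minimal sketch using the three fastest sync results quoted above; “IOPS per unit of load” is a hypothetical metric for illustration, not something fio or the results sheet reports, and load1 avg can include load that was not caused by the test:

```python
# Fastest sync results for this test case, as quoted above.
results = {
    "zvol": {"iops": 2922, "load1": 22.2},
    "zfs-fs": {"iops": 427, "load1": 1.16},
    "baseline": {"iops": 413, "load1": 0.13},
}


def iops_per_load(result):
    """IOPS delivered per unit of 1 minute load average (hypothetical metric)."""
    return result["iops"] / result["load1"]


# Rank from most to least efficient.
ranking = sorted(results, key=lambda name: iops_per_load(results[name]), reverse=True)

for name in ranking:
    print(f"{name:9s} {iops_per_load(results[name]):8.1f} IOPS per unit of load1")
```

On this measure the order is the reverse of the raw IOPS leader board, which is the trade-off the recommendation below asks you to consider.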

When comparing the performance traits of sync vs. fsync vs. fdatasync there seems to be a correlation that sync tests cause significantly higher load1 avg on zvol tests. zfs-fs does not appear to be impacted by this. Check this view of iodepth=24, directio, sync tests, and look at the correlation between columns Z and AT.

It looks like sync provides more IOPS for zvol tests but at the expense of system resources.

⁠

For zfs-fs tests, the performance traits of sync vs. fsync vs. fdatasync seem to be relatively indifferent:

⁠

sync zvol slog vs. zvol:

⁠

Here we can see that slog helps sync performance: 15.8 times faster, a 1488% increase in performance. The performance orders of magnitude are small but nonetheless it’s a noteworthy difference.

sync zfs-fs slog vs zfs-fs:

⁠

Here we can see that slog helps sync performance: ~20 times faster, a 1933% increase in performance. The performance orders of magnitude are small but nonetheless it’s a noteworthy difference.

One would expect that if using a modern fast slog device like optane, that the slog test results would be orders of magnitude faster. This was tested and observed by napp-it,

check the "napp-it observations about ARC and Optane slogs"⁠

of the write up.

Test case summary

The winning test in this category (row 2) was ~3.7 times faster with a ~271% increase in performance than the fastest baseline test (row 31).

ℹ zfs-fs performs much better than zvol with async workloads.

ℹ zvol outperforms zfs-fs for sync workloads but at significant system load/resource costs.

💡 Recommendation: async workloads: do not use zvol.

💡 Recommendation: sync workloads: consider very carefully if zvol system load overhead is worth the extra performance.

💡 Recommendation: use slog to significantly boost sync workload performance.

💡 Recommendation: leave zfs-fs recordsize default. Lowering it from the default makes performance worse for this 4k test case.

Test case: write seq 1M - results ordered by MiB/s descending

⁠

Screenshot of the top 30 results,

see the raw data for more results.⁠

There are a total of 372 results in this test case, so I will try to break down the most relevant sub cases.

Baseline async: the fastest test (row 2) wrote ~24.2 GiB IO at ~137.5 MiB/s at 137 IOPS which is expected for the disk.

Directio tests performed marginally better MiB/s than buffered.

Buffered tests suffered from a high IOPS standard deviation, and from high write latencies, mean latency was a 41% increase and peak was a 12,541% increase. This was comparing row 2 vs. row 40.

Increasing iodepth did not increase performance but did increase write latency.

The fastest buffered tests took ~62 seconds longer to complete than scheduled.

Baseline sync: the fastest test wrote ~10.4 GiB IO at 59.1 MiB/s at 59 IOPS which is expected for the disk.

Directio tests performed much better than buffered.

Tests performed better with higher iodepth.

Sync tests outperformed fsync and fdatasync tests.

🥇 Outright winner: row 2: baseline-write-1M-raw-disk, directio, iodepth=24.

Wrote ~24.2 GiB IO at ~137.5 MiB/s at 137 IOPS.

4th place (row 5) with a ~12% decrease in performance vs. 1st place was zfs-slog-fs-write-1M-1M-sync-standard-logbias-throughput directio, iodepth=24.

8th place (row 9), with a 16.5% decrease in performance vs. 1st place, was zfs-fs-write-1M-1M-sync-disabled directio, iodepth=24.

Direct vs. buffered io: no clear winner. Different test settings had different results, some were marginally better buffered and vice versa.

iodepth: overall it looks like higher iodepth provided marginal performance gains.

zvol: overall results are terrible for zvol. The fastest zvol test was in 37th position (row 38) zfs-slog-zvol-write-1M-sync-standard-logbias-throughput with directio, iodepth=4 was a ~31% decrease in performance vs. 1st place. This test also had a timing issue of ~114 seconds.

The first zvol test without timing issues was position 117th (row 118) and suffered a 56.6% performance decrease at 59.9 MiB/s and suffered a bad load1 average of 11.36.

sync vs. async io: if sync=disabled tests are excluded, and we compare async vs sync tests:

top 20 async results:

⁠

ℹ ⚠⚠⚠ interestingly this is the first test case where zfs is not able to outperform the async baseline.

At the top of the leader board the fastest slog test is marginally faster than the comparable non-slog test.

top 20 sync results:

⁠

ℹ Interestingly zvol and zfs-fs are tied for 1st and 2nd place.

The top 20 tests are all very close with only 4 IOPS between the 1st and 20th place.

ℹ ⚠⚠⚠ all of the zvol tests in the top 20 suffer timing issues (column N and S).

zfs-fs fastest sync test wrote 14.9 GiB at 84.9 MiB/s at 85 IOPS (row 57), iodepth=1, load1 avg 9.25.

zvol fastest sync test wrote 14.9 GiB at 84.8 MiB/s at 85 IOPS (row 58), iodepth=24, load1 avg 4.84, noteworthy are the higher write latencies.

baseline fastest sync test wrote 10.4 GiB at 59.1 MiB/s at 59 IOPS (row 122), a ~31% ⬇, iodepth=24, load1 avg 0.13.

ℹ ⚠⚠⚠ The fastest zvol and zfs-fs sync tests were tied but zfs-fs created ~1.9 times (~91%) more load on the system for the same performance.

When comparing the performance traits of sync vs. fsync vs. fdatasync there seems to be a correlation that sync tests cause significantly higher load1 avg on zvol tests. zfs-fs does not appear to be impacted by this. Check this top 20 view of iodepth=24, directio, sync tests, and look at the correlation between columns Z and AT:

⁠

Have a look at zfs-slog-zvol-write-1M-sync-standard-logbias-latency and zfs-slog-fs-write-1M-1M-sync-standard-logbias-latency tests by example:

⁠

For the zfs-fs tests it seems fdatasync provides the best performance, 1.09 times faster than fsync (8.6% ⬇) and 1.4 times faster than sync (29% ⬇).

For the zvol tests it seems fsync provides the best performance, 1.58 times faster than fdatasync (37% ⬇) and 2.95 times faster than sync (66% ⬇). Note the huge load1 avg for the sync test.

My tests show that different test settings and devices have different sync characteristics, it’s not easy to pick a clear winner between sync, fsync and fdatasync for this test case.

sync zvol slog vs. zvol:

⁠

Here we can see that slog helps sync performance: 8.76 times faster, a 776% increase in performance. The performance orders of magnitude are small but nonetheless it’s a noteworthy difference.

I spotted one test config where slog performed worse than non-slog:

⁠

Studying the bottom of the leader-board, it was dominated by slog-zvol tests with very high load1 averages. A pattern there is buffered tests.

⁠

sync zfs-fs slog vs zfs-fs:

⁠

Here we can see that slog helps sync performance: ~9.3 times faster, a 831% increase in performance. The performance orders of magnitude are small but nonetheless it’s a noteworthy difference.

One would expect that if using a modern fast slog device like optane, that the slog test results would be orders of magnitude faster. This was tested and observed by napp-it,

check the "napp-it observations about ARC and Optane slogs"⁠

of the write up.

Test case summary

Noteworthy is that the baseline tests beat zfs in this test case. The winning test in this category (row 2) was ~1.13 times faster with a ~13% increase in performance than the fastest zfs test (row 5).

ℹ it might be interesting to run a few 1M-4k and 1M-def tests, just to see if the baseline is still the fastest.

ℹ zfs-fs performs much better than zvol with async workloads.

ℹ zvol ties zfs-fs at the top of the sync leader-board.

💡 Recommendation: for async workloads: do not use zvol.

💡 Recommendation: for sync workloads: zfs-fs and zvol were tied for 1st place in their fastest test configs. zfs-fs produced more system load than zvol, but only in the fastest config; other tests recorded nominal load1 avg.

💡 Recommendation: use slog to significantly boost sync workload performance.

Test case: write random 4k - results ordered by MiB/s descending

⁠

Screenshot of the top 30 results,

see the raw data for more results.⁠

There are a total of 672 results in this test case, so I will try to break down the most relevant sub cases.

Baseline async: the fastest test (row 97) wrote ~3.5 GiB IO at ~20.2 MiB/s at 5161 IOPS with buffered IO.

BUT this is a false result because the test took 1547 seconds aka ~26 minutes longer to complete than fio recorded. So 20 MiB/s becomes 2.1 MiB/s.

This means we can ignore row 97, 99 and 106 as invalid, and the fastest test becomes row 122 which wrote ~0.6 GiB IO at ~3.4 MiB/s at 865 IOPS with directio, iodepth=24.
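The correction from ~20 MiB/s to ~2.1 MiB/s follows from spreading the same amount of data over the real wall-clock time. A minimal sketch, assuming the job was scheduled for 180 seconds (the 3 minute runtime mentioned earlier) and that the reported rate covers only that scheduled window; `effective_mib_s` is a hypothetical helper name:

```python
def effective_mib_s(reported_mib_s, scheduled_s, overrun_s):
    """Throughput once the time spent beyond the scheduled runtime is included."""
    total_mib = reported_mib_s * scheduled_s       # data moved during the job
    return total_mib / (scheduled_s + overrun_s)   # spread over real wall-clock time


# Row 97: reported ~20.2 MiB/s, but finished 1547 seconds late.
effective = effective_mib_s(reported_mib_s=20.2, scheduled_s=180, overrun_s=1547)
print(f"effective throughput: {effective:.1f} MiB/s")
```

The same adjustment can be applied to any row flagged with a timing issue before comparing it against rows that finished on schedule.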

Directio tests performed faster MiB/s than buffered, and without timing issues.

Increasing iodepth increased performance and also mean write latency.

⁠

Baseline sync: the fastest test wrote ~0.3 GiB IO at 1.9 MiB/s at 475 IOPS which is expected for the disk.

Directio tests performed much better than buffered.

Tests performed better with higher iodepth.

Sync tests outperformed fsync and fdatasync tests.

⁠

🥇 Outright winner: row 2: zfs-fs-randwrite-4k-4k-sync-disabled, buffered, iodepth=24, fsync.

Wrote ~13.5 GiB IO at ~76.6 MiB/s at 19.6K IOPS.

🥈 2nd place (row 3) with a 4.4% decrease in performance vs. 1st place was zfs-fs-randwrite-4k-4k-sync-standard-logbias-latency directio, iodepth=4, async.

🥉 3rd place (row 4), with a 5.1% decrease in performance vs. 1st place, was zfs-fs-randwrite-4k-4k-sync-standard-logbias-latency directio, iodepth=1, async.

Direct vs. buffered io: no clear winner. Different test settings had different results, some were marginally better buffered and vice versa.

iodepth: overall it looks like higher iodepth provided marginal performance gains.

zvol: overall results are not good for zvol. The fastest zvol test was in 11th position (row 12) zfs-zvol-randwrite-4k-sync-standard-logbias-latency with directio, iodepth=1 was a 14% ⬇ in performance vs. 1st place. This test also had a timing issue of ~48 seconds.

This timing issue means the performance is at least ~32% ⬇ in performance which moves the position in the leader-board from 11th to ~70th place.

The first zvol test without timing issues was position 170th (row 171) and suffered a ~97% performance ⬇ at 2 MiB/s.

zfs-fs 4k-def vs. 4k-4k: comparing the fastest comparable zfs-fs tests, The 4k-4k recordsize significantly outperforms the 4k-def recordsize. This is the opposite outcome vs. the 4k sequential write test case.

The fastest 4k-4k test is overall 2nd place and the fastest 4k-def is overall 119th place with a ~95% ⬇ performance drop.

That means the fastest 4k-4k test is ~21.5 times faster than 4k-def which is a huge 2,050% increase in performance.

⁠

sync vs. async io: if sync=disabled tests are excluded, and we compare async vs sync tests:

top 20 async results:

⁠

At the top of the leader-board the fastest non-slog test is marginally faster than the comparable slog test. This is the opposite of the 4k sequential write test case. This is probably due to the legacy SSD not being well suited as an slog device.

top 20 sync results:

⁠

ℹ Interestingly zvol beats the baseline and zfs-fs, and baseline beats zfs-fs.

ℹ ⚠⚠⚠ all of the zvol tests in the top 20 suffer timing issues (column N and S), although for this test case the timing issues are minor.

🥇 zvol fastest sync test wrote 2.1 GiB at 12.2 MiB/s at 3112 IOPS (row 103), iodepth=24, directio, sync, load1 avg 21.46, 9 seconds over test schedule.

🥈 baseline fastest sync test wrote 0.3 GiB at 1.9 MiB/s at 475 IOPS (row 172), 84.7% ⬇, iodepth=24, directio, sync, load1 avg 0.1.

🥉 zfs-fs fastest sync test wrote 0.3 GiB at 1.6 MiB/s at 405 IOPS (row 179), 87% ⬇, iodepth=1, directio, fsync, load1 avg 1.8.

ℹ ⚠⚠⚠ The fastest zvol sync test was 6.6 times faster than the baseline but zvol created ~214 times (~21,360%) ⬆ more load on the system.

When comparing the performance traits of sync vs. fsync vs. fdatasync there seems to be a correlation that sync tests cause significantly higher load1 avg on zvol tests. zfs-fs does not appear to be impacted by this. Check this top 35 view of iodepth=24, directio, sync tests, and look at the correlation between columns Z and AT:

⁠

🥇 zvol row 103 sync test wins the leader-board on pure performance stats but at huge system load1 avg costs.

zvol row 168 is 5.5 times slower (82% ⬇) vs. row 103, producing 4.9 times less (79% ⬇) system load1 avg.

zvol row 168 is 1.2 times faster (21% ⬆) vs. the fastest sync baseline test (row 172).

baseline row 172 produces ~98% ⬇ in system load1 avg vs. row 168, and produces 99.5% ⬇ in system load1 avg vs. row 103.

My tests show that different test settings and devices have different sync characteristics, it’s not easy to pick a clear winner between sync, fsync and fdatasync for this test case. Sync IO does appear to be a strong contender for outright winner but suffers from significant system load avg issues on the fastest tests.

sync zvol slog vs. zvol:

⁠

Here we can see that slog helps sync performance: ~23 times faster, a ~2,240% increase in performance. The performance orders of magnitude are small but nonetheless it’s a noteworthy difference.

sync zfs-fs slog vs zfs-fs:

⁠

Here we can see that slog helps sync performance: ~21 times faster, a ~2,032% increase in performance. The performance orders of magnitude are small but nonetheless it’s a noteworthy difference.

One would expect that if using a modern fast slog device like optane, that the slog test results would be orders of magnitude faster. This was tested and observed by napp-it,

check the "napp-it observations about ARC and Optane slogs"⁠

of the write up.

Test case summary

ERROR: were the zvol sync=disabled tests missed?

ℹ zfs-fs performs much better than zvol for async workloads.

ℹ Overall zvol wins for sync workloads (massive load penalty) and baseline performs marginally better than zfs-fs.

💡 Recommendation: for zfs-fs use 4k recordsize which performed ~21.5 times faster (2,050% ⬆) than default recordsize.

💡 Recommendation: for async workloads: do not use zvol, the fastest zvol test was ~32% ⬇ slower than zfs-fs.

ℹ For sync workloads: zvol was fastest but incurred ~214 times (~21,360%) ⬆ more load on the system than the baseline.

💡 Recommendation: For 4k random sync workloads on my hardware it is hard to recommend a zfs setup. Do your own testing with your hardware.

💡 Recommendation: use slog to significantly boost sync workload performance.

Test case: write random 1M - results ordered by MiB/s descending

ERROR?: were the zvol sync=disabled tests missed or skipped?

TODO: add results

Overall conclusions

The synthetic test cases demonstrate what is possible with the fio workload. Real-world workload results may vary.

One can see from the synthetic test case results that a one size fits all configuration isn’t really possible - one must understand the aspects of their IO workload, run tests with various settings and tune accordingly. Some general OpenZFS recommendations as follows:

💡 Recommendation: use slog to significantly boost sync IO workload performance.

In 2022-Q3 I re-tested with an optane NVMe⁠

and re-confirmed this point. A fast slog gives very significant sync workload gains. This was also confirmed independently by napp-it

as documented here⁠

💡 Recommendation: avoid using zvol - in nearly all test cases it was slower, less stable and cost a lot more resources for the same workload. If you do use zvol then benchmark it vs. alternatives to check what works best and what is the zvol “cost”.

💡 Recommendation: do not combine zvol + buffered async write IO, you’re going to have a bad time -

see here⁠

💡 Recommendation: Take time to understand sync vs. async IO and the risk and rewards of both as

described in detail here⁠

. Sync IO aims to guarantee write IO, which is typically much slower but very reliable, whereas async IO makes no guarantee and submits the IO as fast as possible to the IO subsystem, which is much faster but less reliable in certain scenarios like power cuts.
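To make the difference concrete at the system call level, here is a minimal sketch of four ways an application can write the same blocks on Linux: plain async writes, writes on a descriptor opened with `O_SYNC`, and writes followed by `fsync` or `fdatasync`. I am assuming these roughly correspond to the async, sync, fsync and fdatasync test variants in this write-up; the block size, block count and file names are arbitrary choices for illustration.

```python
import os
import tempfile
import time

BLOCK = b"\0" * 4096  # one 4k block
COUNT = 64            # blocks per run (small, illustration only)


def write_blocks(path, mode):
    """Write COUNT blocks to path using the given mode, return elapsed seconds."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if mode == "sync":
        flags |= os.O_SYNC  # every write returns only after reaching stable storage
    fd = os.open(path, flags, 0o600)
    start = time.monotonic()
    try:
        for _ in range(COUNT):
            os.write(fd, BLOCK)
            if mode == "fsync":
                os.fsync(fd)      # flush file data and metadata
            elif mode == "fdatasync":
                os.fdatasync(fd)  # flush file data, skip non-essential metadata
            # mode == "async": no flush, the kernel writes back later
    finally:
        os.close(fd)
    return time.monotonic() - start


timings = {}
sizes = {}
with tempfile.TemporaryDirectory() as tmp:
    for mode in ("async", "sync", "fsync", "fdatasync"):
        path = os.path.join(tmp, mode + ".bin")
        timings[mode] = write_blocks(path, mode)
        sizes[mode] = os.path.getsize(path)

for mode, seconds in timings.items():
    print(f"{mode:9s} {seconds * 1000:8.2f} ms")
```

All four modes end up with identical file contents; what differs is when the data is guaranteed to be on stable storage, and therefore how long the writer has to wait.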

I wonder if the recommendations in the test case summaries will still be valid for SSD and NVMe devices?

ARC performance/optimisation depends on the workload and cache warmth. fio uses random bytes for tests, and my observations are that sometimes the workload is optimisable but in the majority of cases it’s not. This creates inconsistency in read test results. A retest with a fixed/static/consistent set of bytes/files might be interesting and increase the consistency of the results. ❓ i.e. fio writes some random files and then we always read those for read tests?

It would be interesting to repeat certain tests with different primarycache settings to measure the impact of different ARC settings.
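For such a comparison it helps to record how much of a read test was served from the ARC. A minimal sketch, assuming OpenZFS on Linux where the ARC counters are exposed as `name type data` rows in `/proc/spl/kstat/zfs/arcstats`; the sample text below is made up for illustration, and the helper names are hypothetical:

```python
def parse_arcstats(text):
    """Parse kstat-style 'name type data' rows into a dict of integer counters."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly three fields with a numeric value; skip headers.
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def arc_hit_ratio(stats):
    """Fraction of ARC lookups that were hits."""
    total = stats["hits"] + stats["misses"]
    return stats["hits"] / total if total else 0.0


# Made-up sample in the arcstats layout (values are not from the tests).
SAMPLE = """\
13 1 0x01 123 33456 7614301583 1234567890123
name                            type data
hits                            4    900
misses                          4    100
"""

stats = parse_arcstats(SAMPLE)
print(f"ARC hit ratio: {arc_hit_ratio(stats):.0%}")
```

Reading the real file before and after a test and taking the difference of the counters would give the hit ratio for that test alone.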

TODO sort by latency and see what patterns emerge

TODO filter main results by script id for the fastest tests from read and write test case

ZIL and slog and related observations of synchronous IO

Some of my early online research suggested adding a dedicated slog vdev into the zpool. On the surface that suggestion seems to make a lot of sense. As my testing expanded and knowledge grew I realised the ZIL and slog are important for synchronous write IO, and not important when synchronous IO is disabled or the write workload is asynchronous.

I understood early on that adding an slog should mitigate issues where write IO to the ZIL and normal data writes are competing for the same vdevs, that is to say when zfs is trying to concurrently write data and log IO to the same physical vdev(s), because by default the intent log exists on the same vdevs as the data. What I didn’t understand at the start was under what conditions ZFS utilises its intent log.

Let’s demystify the terms ZIL and slog.

ZIL stands for

ZFS Intent Log⁠

and is a mechanism that “satisfies POSIX requirements for synchronous transactions.” But what does that even mean? The OpenZFS project could improve that section of their concepts docs.

ixsystems wrote an article “

The ZFS ZIL and SLOG Demystified⁠

” which was reviewed by Matthew Ahrens of the OpenZFS project, so I’m going to cite that here because it’s rather excellent in putting the ZIL and slog into lay terms:

When synchronous writes are requested, the ZIL is the short-term place where the data lands prior to being formally spread across the pool for long-term storage at the configured level of redundancy.

There are however two special cases when the ZIL is not used despite being requested: If large blocks are used or the logbias=throughput zfs property is set.

By default, the short-term ZIL storage exists on the same vdevs as the long-term pool storage at the expense of all data being written to disk twice: once to the short-term ZIL and again across the long-term pool.

This gives us strong understanding of the ZIL but what about slog then?

Because each disk can only perform one operation at a time, the performance penalty of this duplicated effort can be alleviated by sending the ZIL writes to a Separate ZFS Intent Log or “SLOG”, or simply “log”.

ZIL and slog only play a major role if IO is synchronous

For my batch testing I repurposed an older CRUCIAL MX300 SSD to answer “is slog a factor with my workload?”. I know this drive is considered a legacy SSD and suboptimal for the slog workload, but I hope for my test cases it is enough to eliminate the concurrent data and log writes issue as a major factor.

There is a zfs property that modifies zfs data synchronous behaviour called sync, which can be one of standard (the default), always, or disabled. Each has a different performance characteristic as one would expect. I included these modes in my batch testing for write tests.

A second zfs property that modifies zfs slog behaviour is logbias, which can be either latency (the default) or throughput. I included these modes in my batch testing for write tests. This is mentioned in the aforementioned ixsystems article.
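The two properties combine into a small test matrix. A minimal sketch that only builds the `zfs set` command lines for every sync and logbias combination without executing them; the dataset name `fiotestpool/example` is hypothetical, and this is not the script used for the batch tests:

```python
import itertools

SYNC_MODES = ("standard", "always", "disabled")
LOGBIAS_MODES = ("latency", "throughput")


def zfs_set_commands(dataset):
    """Yield, per combination, the pair of `zfs set` commands that would apply it."""
    for sync, logbias in itertools.product(SYNC_MODES, LOGBIAS_MODES):
        yield [
            ["zfs", "set", f"sync={sync}", dataset],
            ["zfs", "set", f"logbias={logbias}", dataset],
        ]


# Hypothetical dataset name; nothing is executed here.
combos = list(zfs_set_commands("fiotestpool/example"))
for commands in combos:
    print(" && ".join(" ".join(cmd) for cmd in commands))
```

Three sync modes times two logbias modes gives six property combinations per dataset, before multiplying by io type, iodepth and sync method.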

The behaviour I observed when testing is that zfs-fs datasets have what I’d call relaxed sync characteristics, and the majority of writes are not synced by default. In fact this is defined in the ixsystems article as “Asynchronous unless requested otherwise” write behaviour.

I observed that zfs-zvol datasets even when using the default sync=standard behave differently to zfs-fs datasets, probably related to being presented as a block device.

Co-play of IO between the data vdev and log vdev

After adding the slog I monitored the co-play of IO between the data vdev and log vdev and everything seemed to be working as expected. Unfortunately the dedicated log vdev didn’t seem to make a meaningful difference for my setup and workload, and I was still observing what felt like really significant “off a cliff” performance drop-offs and worryingly high hypervisor load1 avg during the workload.

This “off a cliff edge” wasn’t a good start for my future storage strategy, so I set out to turn the feeling into facts and benchmark various setups. zfs native filesystem vs. zvol vs. raw disk images stored on the zfs native filesystem, and similar tests from within a kvm with some surprising results.

I understood this much better later on in testing: cp and the other tests that I performed do not request O_SYNC IO by default, i.e. they are asynchronous, so the ZIL actually had little to no work to perform in those workloads.

Procuring a faster slog vdev and testing in the future

From my research one thing that seems pretty clear, if you workload is synchronous, then the faster your slog vdev can write, the faster you can write to your data vdevs.

I do plan to get a more appropriate device, per the research here:

https://openzfs.github.io/openzfs-docs/Performance and Tuning/Hardware.html

https://www.servethehome.com/exploring-best-zfs-zil-slog-ssd-intel-optane-nand/

https://www.napp-it.org/doc/downloads/optane_slog_pool_performane.pdf

TODO: mention the power loss issues surrounding the 900p; maybe put this section in a dedicated page.

The Intel SSD Optane 900p 280GB PCIe looks like a great option for home labs and private data vaults with UPS protection.

napp-it observations about ARC and Optane based slogs

Kudos to napp-it for publishing their detailed and comprehensive research on Optane slog zpool performance. napp-it noted in benchmarking that ARC cache can make spinning rust perform like fast NVMe SSDs.

cite

napp-it⁠

1. The most important factor is RAM.

Whenever your workload can be mainly processed within your RAM, even a slow HD pool is nearly as fast as an ultimate Optane pool.

and

2. Even a pure HD pool can be nearly as fast as a NVMe pool.

In my tests I used a pool from 4 x HGST HE8 disks with a combined raw sequential read/write performance of more than 1000 MB/s. As long as you can process your workload mainly from RAM, it is tremendously fast. The huge fallback when using sync-write can be nearly eliminated by a fast Optane Slog like the 900P. Such a combination can be nearly as fast as a pure SSD pool at a fraction of the cost with higher capacity. Even an SMB filer with a secure write behaviour (sync-write=always) is now possible as a 4 x HGST HE8 pool (Raid-0) and an Optane 900P Slog offered around 500-700 MB/s (needed for 10G networks) on OmniOS. Solaris with native ZFS was even faster.

Open questions / Open research topics

netgraph data and/or stats for sync=disabled

Retesting with an SSD?

Why are there read_short_ios for fdatasync tests?

Why do fsync tests have stats for sync_total_ios but other sync types do not?

Does the napp-it appliance use Oracle ZFS or OpenZFS? Does that use an older version of zfs that doesn’t suffer the same zvol fate?

When does L2ARC make sense?

What determines when a program utilises io depth (concurrent io issued to the io subsystem)?

What determines whether cp, mv, rsync, dd or something else utilises iodepth? Is it the program code? Does the kernel change the io depth?

kernel crash/reboot issue

netdata alerts during zvol tests

These alerts show how certain zvol tests were making the system unhappy/slow.

⁠

VirtioFS support

⁠

https://forum.proxmox.com/threads/virtiofs-support.77889/⁠

appears to use virtiofsd and args: -chardev in the vm .conf

LXC with bind mounts

⁠

https://forums.servethehome.com/index.php?threads/napp-it-for-proxmox.32368/#post-299369⁠

⁠

No harder than everything else up to this point. I imagine there is a way to do it directly on the host but I decided to move my shares to an LXC container to keep things separated.

Setup a container with your preferred flavor of Linux and install the packages for an NFS server. There is also a template for a Turnkey Fileserver in Proxmox but I have not tried it.

You will have to go into the shell and add your storage to the container manually. Look into bind mounts, but mostly you are adding a line "mp0: /mypool/storage,mp=/storage" to the end of the config file for the container. You might need to do multiple mounts if you created different sub volumes in the zfs pool. When you go to setup your shares the path will be something like /mnt/storage

⁠

https://forum.proxmox.com/threads/single-disk-proxmox-setup-how-to-encrypt-and-share-a-larger-chunk-of-the-disk-with-multiple-vms-and-other-devices-on-the-network.84674/#post-371953⁠

⁠

For containers, you can do a bind mount to access certain directories only, though keep in mind that you might have to fiddle with the user id mapping to get useable permissions then. This does not work for VMs, and while there is virtio-fs floating around for a while now, it's not natively supported by PVE - so your best bet is probably NFS or some other network protocol as you suspected.

ZFS + NFS or SMB ?

Probably fast enough for bulk IO? zfs share can share via nfs or smb:

https://openzfs.github.io/openzfs-docs/man/8/zfs-share.8.html

Update 2022-03-05: I’ve tested a zfs dataset shared over nfs (host to kvm guest) and everything seemed to work nominally for rsync workloads. I didn’t do any benchmarking.

Research zol performance again, in light of comments from `janetcampbell` / `taratarabobara`:

related reddit post

related github issue

my tests in fio-results-v2.tsv only used numjobs=1

should consider trying multiple numjobs counts as a comparison to iodepth (which is limited to a single process/thread).

my tests in fio-results-v2.tsv only used primarycache=all

should consider testing the difference between all|metadata|none?
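A possible retest plan for these two items can be sketched as follows. It pairs each primarycache value with a range of numjobs counts and builds the matching `zfs set` and `fio` command strings without running them. Everything specific here is an assumption for illustration: the dataset name, the target file, the job counts, and the fio workload parameters (rw, bs, size, ioengine) are placeholders and not the values behind fio-results-v2.tsv.

```python
from itertools import product

NUMJOBS = (1, 2, 4, 8)                      # assumed job counts to compare
PRIMARYCACHE = ("all", "metadata", "none")  # values of the zfs primarycache property


def retest_plan(dataset, target):
    """Return (zfs_command, fio_command) pairs, one per combination."""
    plan = []
    for cache, jobs in product(PRIMARYCACHE, NUMJOBS):
        zfs_cmd = f"zfs set primarycache={cache} {dataset}"
        # iodepth stays at 1 so that only numjobs varies between runs.
        fio_cmd = (
            f"fio --name=retest-{cache}-j{jobs} --filename={target} "
            f"--rw=randwrite --bs=4k --size=1g --ioengine=libaio "
            f"--iodepth=1 --numjobs={jobs} --group_reporting"
        )
        plan.append((zfs_cmd, fio_cmd))
    return plan


if __name__ == "__main__":
    # Hypothetical dataset and target file (assumptions).
    for zfs_cmd, fio_cmd in retest_plan("tank/fio-test", "/tank/fio-test/retest.bin"):
        print(zfs_cmd)
        print(fio_cmd)
```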

OpenZFS #12483 Better performance with O_DIRECT on zvols

Looks like some progress has been made on the zvol performance issues

committed to OpenZFS:master in June 2022⁠

Need to cherry-pick some zvol tests and see how they compare to previous tests, especially the system load.

I have not yet been able to determine which, if any, zfs release includes this code. It may be that it can only be tested right now by checking out master and compiling OpenZFS manually.
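One way to answer the release question, sketched here under assumptions, is to ask git which tags contain the commit. The helper below wraps `git tag --contains` and expects a local clone of the OpenZFS repository; the clone path and the commit hash are placeholders that have to be filled in, as neither is stated above. Note that a change cherry-picked onto a release branch gets a different commit hash, so an empty result does not prove the code is missing from every release.

```python
import subprocess


def tag_contains_command(repo_dir, commit):
    """Build the git command that lists tags whose history includes commit."""
    return ["git", "-C", repo_dir, "tag", "--contains", commit]


def tags_containing(repo_dir, commit):
    """Run the command in a local clone and return the matching tag names."""
    result = subprocess.run(
        tag_contains_command(repo_dir, commit),
        check=True,
        capture_output=True,
        text=True,
    )
    return sorted(result.stdout.split())


if __name__ == "__main__":
    # Both values are hypothetical placeholders (assumptions).
    print(tags_containing("/path/to/zfs-clone", "COMMIT_HASH_HERE"))
```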

Future re-test topics

TL;DR

Some high level stats

Preface

Goals

I asked myself - does it even matter?

Story time

The year is 2017