Home lab & data vault

ZFS Concepts and Considerations

Author: Last updated: 2023-12-30
The goal of this content is to give the reader insights into ZFS and OpenZFS, and how ZFS works at a high to medium level. It raises considerations for planning and using OpenZFS.
If you are looking to answer “How do I do x with ZFS?” or are looking for a specific command check out my .

ZFS intro

ZFS is a software-implemented data storage system that runs alongside a computer's kernel. It provides a high-integrity filesystem with advanced features such as snapshots, replication, encryption, data redundancy, and self-healing of data corruption. ZFS can be highly performant, offering various read caching strategies and the ability to dedicate low-latency storage to synchronous write workloads, which can significantly improve write performance on slower pools.
As the name suggests, OpenZFS is an open community project implementation of ZFS hosted on GitHub with over 500 contributors. Many operating system distributions include support for OpenZFS. OpenZFS's origins stem from the porting of Sun Microsystems' ZFS, which was distributed with OpenSolaris (under the CDDL licence) before Sun was acquired by Oracle Corp.
Unlike most other storage systems, ZFS unifies both physical volume management and logical block devices, acting as both a volume manager and a filesystem. As a result, ZFS has complete knowledge of both the physical disks and volumes (including their status, health, and logical arrangement in volumes) and all the files stored on them. ZFS is designed to ensure (with suitable hardware) that data stored on disks cannot be lost due to physical errors, hardware or operating system mis-processing, or bit-rot events and data corruption that can occur over time. ZFS's complete control of the storage system ensures that every step, whether related to file management or disk management, is verified, confirmed, corrected if necessary, and optimised in a way that storage controller cards and separate volume and file managers cannot achieve.
Headline features are documented here: and .

Core concepts

CoW - Copy on Write

To cite a reddit user's description of ZFS in a post:
ZFS is a Copy-on-Write (COW) filesystem. This basically means that the data written to the pool is immutable. When time comes to add or change the data, blocks that are marked as committed won't be modified in any way but copied to a new block with the additions and/or changes needed and then the new sector committed to the pool.
This means there's cost (ZFS overhead) when writing to the pool, it might take multiple disk operations before the data is committed to the pool. But this gives us the transactional behaviour the data block level and not just metadata as journaling filesystems do. Other advanced features such as snapshots make use of this design very efficiently.

Objects and object sets

In ZFS, objects are grouped together in object sets. A dataset is an “object set” object; files and directories are objects grouped into a dataset.
To cite robn, an OpenZFS contributor, posting about “OpenZFS objects and object sets”:
Objects are OpenZFS’ conception of an arbitrarily-large chunk of data. Objects are what the “upper” levels of OpenZFS deal with, and use them to assemble more complicated constructs, like filesystems, volumes, snapshots, property lists, and so on.
... objects can be grouped together into a named collection; that collection is called an “object set”, and is itself an object. An object set forms a unique namespace for its object ... So, the full “path” of an object is the “object set id”, then the “object id”.
A typical OpenZFS pool has lots of object sets. Some of these are visible to the user, and are called “datasets”. You might know these better as “filesystems”, “snapshots” and “volumes”. There’s nothing special about those, they’re just collections of objects assembled in such a way as to provide useful and familiar facilities to the user.

Blocks and block pointers

In ZFS, objects are sequences of blocks. A ZFS plain file object records information about its sequence of data blocks. Typically, when writing a file, ZFS will write n blocks of at most recordsize to fill the space required for a given file. A file object therefore stores a list of block pointers, which hold DVAs (Data Virtual Addresses). A DVA contains the coordinates needed to access a block: vdev, address, and size. So, a block coordinate in ZFS is object set id → object id → DVA.
With the CoW semantic, when a ZFS plain file object is being modified, only the blocks to be modified are copied and then modified. Unmodified blocks remain unchanged. The object records the changes. When the original blocks that were copied to make the modification are no longer referred to, for example by a snapshot, they are marked for garbage collection.
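The copy-on-write behaviour described above can be sketched with a toy model. This is only an illustration of the idea, not how OpenZFS lays out data on disk; ToyPool, allocate and cow_write are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPool:
    """Toy block store: blocks are only ever added, never overwritten."""
    blocks: dict = field(default_factory=dict)  # block address -> bytes
    next_addr: int = 0

    def allocate(self, data: bytes) -> int:
        """Store data in a fresh block and return its address."""
        addr = self.next_addr
        self.blocks[addr] = data
        self.next_addr += 1
        return addr


def cow_write(pool: ToyPool, block_list: list, index: int, new_data: bytes) -> list:
    """Return a new block list; the old block is left untouched in the pool."""
    new_list = list(block_list)
    new_list[index] = pool.allocate(new_data)
    return new_list


pool = ToyPool()
file_v1 = [pool.allocate(b"AAAA"), pool.allocate(b"BBBB"), pool.allocate(b"CCCC")]
file_v2 = cow_write(pool, file_v1, 1, b"bbbb")

# Unmodified blocks are shared between both versions of the file.
shared = set(file_v1) & set(file_v2)
```

In this sketch file_v1 plays the role of a snapshot: as long as something still refers to the old block list, the original block stays readable, and only once nothing refers to it could it be reclaimed.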

Snapshots and transactions

Modifications to a ZFS filesystem are recorded in transactions. In simple terms, a dataset snapshot is an explicitly named transaction reference. Blocks will remain in the filesystem until no snapshot refers to them. Snapshots provide the ability to look back at the block state at a given point in time, à la time machine. If desired, transactions can be discarded, i.e. a dataset can be rolled back to a given snapshot.

Dataset replication

ZFS can calculate the difference or delta between transaction X and Y, i.e. what objects and blocks changed between state X and Y. With this information ZFS is able to offer very efficient file system replication, only needing to send the delta of new and/or modified blocks.
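As a rough sketch of the delta idea, assume every block records the transaction in which it was written. The Block type, the birth_txg field name and incremental_delta are hypothetical names for this toy example and are not the OpenZFS send implementation.

```python
from typing import NamedTuple


class Block(NamedTuple):
    object_id: int
    offset: int
    birth_txg: int  # transaction in which this block was written (toy field)


def incremental_delta(blocks, from_txg, to_txg):
    """Blocks written after state X (from_txg) up to and including state Y (to_txg)."""
    return [b for b in blocks if from_txg < b.birth_txg <= to_txg]


# Blocks visible at state Y; only two were written after state X (txg 200).
blocks_at_y = [
    Block(object_id=1, offset=0, birth_txg=100),
    Block(object_id=1, offset=1, birth_txg=250),
    Block(object_id=2, offset=0, birth_txg=90),
    Block(object_id=3, offset=0, birth_txg=300),
]

delta = incremental_delta(blocks_at_y, from_txg=200, to_txg=300)
```

Only the blocks in delta would need to be sent to a replica that already holds state X, which is why incremental replication can be so much cheaper than copying the whole dataset.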
If you’d like to learn more details, have a look at my ZFS internals page.

Resilvering

TODO - the process of rebuilding one or more vdev children and restoring a pool to a healthy state. There are a number of ZFS module parameters that control resilvering. With mirror vdevs resilvering is a fairly straightforward computation; for other vdev types things can be more complex.

raidz

for raidz details, and
info on raidz, vdevs and IOPS scaling.

My high level ZFS glossary

The zpool command manages ZFS storage pools and their virtual devices aka vdevs. A ZFS zpool and the zpool command have similar characteristics to a physical volume manager.
zfs command manages ZFS datasets, which are children of pools. zfs defines and manages dataset hierarchy. A ZFS dataset has similar characteristics to a logical volume. So, the zfs command is similar to a logical volume manager.
Dataset types include: filesystem, snapshot, volume (zvol), bookmark.
vdev = virtual device - made up of one or more drives or partitions or files.
zpools are made up of vdevs; datasets live in the pool, and their data is spread across the pool’s vdevs.
zvol = volume storage i.e. virtual block device. >> (avoid using them - ) <<
ZIL stands for ZFS Intent Log, which temporarily holds synchronous writes until they are written to the zpool. The ZIL is not used/written in async or sync=disabled workloads. By default the ZFS Intent Log is written to the same vdevs as data.
SLOG is a dedicated separate log vdev aka slog which can be added to a zpool with the aim of making synchronous write workloads go as fast as possible. The ZIL is moved to the slog. The slog device should be as fast as possible at sync workloads.
(L1)ARC is ZFS’s level 1 cache. An Adaptive Replacement Cache located in RAM.
cite :
Level 1 ARC based in RAM. OpenZFS will always have an L1ARC, whereas the L2ARC is optional. An L1ARC is often referred to by users as ARC.
L2ARC is ZFS’s level 2 Adaptive Replacement Cache and should be on a fast device (like SSD or NVMe).
cite :
Level 2 ARC. A persistent and non-RAM ARC. When cached data overflows RAM and an L2ARC is present, it will be used to complement the L1ARC.
cite :
Devices can be added to a storage pool as “cache devices”. These devices provide an additional layer of caching between main memory and disk. For read-heavy workloads, where the working set size is much larger than what can be cached in main memory, using cache devices allows much more of this working set to be served from low latency media. Using cache devices provides the greatest performance improvement for random read-workloads of mostly static content.
mirror - a mirror vdev consists of two or more child devices; each child is an exact copy (or mirror) of its siblings. A classic RAID 1 mirrored pair contains two disks.
spare TODO
RAIDZ citing the OpenZFS docs:
RAIDZ is a variation on RAID-5 that allows for better distribution of parity and eliminates the RAID-5 “write hole” (in which data and parity become inconsistent after a power loss). Data and parity is striped across all disks within a raidz group.
A raidz group can have single, double, or triple parity, meaning that the raidz group can sustain one, two, or three failures, respectively, without losing any data.
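To make the parity levels concrete, here is a small back-of-the-envelope helper. It is a simplification under stated assumptions: raidz_summary is a hypothetical name, and the capacity figure ignores padding, allocation overhead and reserved space, so real usable capacity will be lower.

```python
def raidz_summary(disks: int, parity: int, disk_tb: float) -> dict:
    """Failures tolerated and approximate usable capacity of one raidz group."""
    if parity not in (1, 2, 3):
        raise ValueError("raidz parity must be 1, 2 or 3")
    if disks <= parity:
        raise ValueError("a raidz group needs more disks than parity devices")
    return {
        "failures_tolerated": parity,
        # Roughly one disk's worth of space per parity level goes to parity.
        "approx_usable_tb": (disks - parity) * disk_tb,
    }


raidz1 = raidz_summary(disks=4, parity=1, disk_tb=8)
raidz2 = raidz_summary(disks=10, parity=2, disk_tb=8)
```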

ZFS PSA’s - Public Service Announcements

AVOID SMR drives with ZFS - use CMR drives where possible

See the issue I authored: ~30% performance degradation for ZFS vs. non-ZFS for large file transfer. This covers things in detail. In summary, ZFS performs significantly worse on SMR drives because of how ZFS’s fundamental internals interact with the physical storage mechanics used in SMR drives. . . of HGST Inc. about SMR and ZFS.

AVOID USING ZVOL’s

Avoid using zvol’s unless you absolutely understand . From my experience one needs to take a number of things into account and make tweaks to have zvol’s operate in a performant manner. zvol’s are very attractive for provisioning virtual block devices, especially for kvm or persistent volumes on k8s 💪, but unfortunately the zvol implementation in ZoL OpenZFS has a number of issues which I’ve verified first hand (on my hardware). Further reading: Extreme performance penalty, holdups and write amplification when writing to ZVOLs.
As of 2021-09-15 AFAIK there is no roadmap to resolve these issues.
As of 2022-06 it looks like some commits went into OpenZFS:master related to Issue that require testing and a specific flag to be enabled zvol_use_blk_mq - . I doubt the commits fundamentally change the way zvols are implemented; rather, they may provide more efficient IO scheduling. I could imagine this could reduce system loadavg when using zvols. Needs further testing and investigation.
As of 2023-11 there was a bug reported in zvol_use_blk_mq as documented in - so ensure your version of ZFS includes this change if you plan to try zvol_use_blk_mq.

HOWTO

gives a great overview of concepts and command lines.

Official Docs

and . These are great docs to get one’s head around ZFS concepts.

Things to consider

Understanding that RAID is not a backup

💡 Do not rely on a single logical copy of your important data - there is no redundancy in that approach.
The strategy of RAID and data mirroring is higher data availability, NOT data redundancy!
RAID and ZFS alone will not protect you from data loss. In data redundancy terms: one copy of data is none, two is one and so on.
A single ZFS pool, even with multiple copies of data or parity levels, is still a single point (pool) of failure and logically a single copy of your data.
ZFS is great at providing highly available data pools AND ZFS can form part of your data redundancy and backup strategy. ZFS has many useful features that can be utilised to achieve data redundancy objectives, including tested and integrity verified backups.
I work with the principle that data is not safe until backups exist, and backups do not exist unless they have been restored and their integrity has been verified. ~kyle0r
To quote myself :
In the event of a disk/drive/device failure, real-time RAID parity like in RAIDZ provides enough data redundancy to allow a zpool/array to continue to operate and prevents data from becoming unavailable. After a drive failure a sysop must replace the failed device and the array is rebuilt (resilvered) via parity calculations of the available blocks across the remaining drives. Keep in mind that during the rebuild/resilver process, a time window exists where the data array is typically in a high risk DEGRADED state, having lost its higher data availability because one or more drives became unavailable. With each further loss of pool redundancy, the risk of data loss increases, and that loss could be the entire array/pool. It’s also important to remember that the drive rebuild/resilver process is typically a very heavy workload for the drives (high stress - especially in raidz) and this can increase the risk of additional drive failures during the rebuild/resilver process.
This is why it’s critical to maintain at least a second copy of your important data, and ideally you’d have >2 copies. You can read my in-depth notes and strategies:

Understand the difference between async and sync IO

When it comes to write IO:
sync IO promises to write an IO to physical media before writing the next IO. Sync IO workload performance is therefore limited by the speed and number of concurrent write acknowledgements the vdev(s) can make.
async IO makes no promises and will try and write the IO as fast as it can, eventually flushing all IO to physical media.
So sync=immediate (slower and safer) vs. async=eventual consistency (faster and less safe).
With async IO there is a risk that not all IO reaches the physical disk. Consider an rsync process that finishes after its IO was issued as async to the io subsystem. The command has finished but the IO is still in-flight in the io subsystem waiting to be flushed to the physical media. If something like a power failure happened before the io was fully flushed to the physical media, then data loss would occur and rsync (and you) would be none the wiser.
Mission critical and highly sensitive applications like databases typically require (want) sync IO. It depends on the business, non-functional and criticality requirements of the data being stored.
One has to choose the right IO characteristics and trade offs based on requirements and situation.
If the workload does not require sync IO guarantees then stick with async and consider dataset property sync=disabled to maximise disk throughput. A battery backup/UPS/USV can help to mitigate the risks of async IO during power outages.
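From an application's point of view, the difference can be sketched as follows. This is a minimal illustration using Python's standard library; write_async and write_sync are hypothetical helper names. Note that on a dataset with sync=disabled, ZFS acknowledges sync requests without waiting for stable storage, so the second helper would not give its usual guarantee there.

```python
import os
import tempfile


def write_async(path: str, data: bytes) -> None:
    """Buffered write: returns once the data is handed to the OS, not when it is on disk."""
    with open(path, "wb") as f:
        f.write(data)


def write_sync(path: str, data: bytes) -> None:
    """Write, then block until the OS reports the data as flushed to the device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())  # ask the kernel to flush to stable storage


workdir = tempfile.mkdtemp()
async_path = os.path.join(workdir, "async.bin")
sync_path = os.path.join(workdir, "sync.bin")

write_async(async_path, b"x" * 4096)
write_sync(sync_path, b"x" * 4096)
```

Both files end up with the same content; the difference is only in what has been promised at the moment each function returns.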

Understand the difference between buffered and direct IO.

buffered IO uses the page cache, direct bypasses page cache.
In some scenarios the page cache can make applications perform faster than the underlying physical media.
One downside of buffered IO is “2x double write” cost: 1x to write into the page cache, and 1x to write to the physical media. Bypassing the 2x cost for large bulk IO is typically more performant.
There are pros and cons to both, workload and use case dependent.
My fio benchmarking of zfs suggests that buffered IO, especially for high throughput/IO workloads, uses more system resources and in reality performs much worse than directio. zvols were unstable with buffered io and should be avoided.
If your workload involves writing a lot of unique and bulk data, like backups, then why would you want to write that in the page cache because you probably won’t be reading it again any time soon, plus the negative of the double write cost?
If you are reading some large data file, buffered reads would push existing pages out of the cache. This might have a negative impact on running applications that were making use of the page cache, aka cache poisoning.
As a general rule, use directio for bulk write workloads. For other use cases you have to consider all the variables and run benchmarks.
Not related to zfs, but SAN’s (or storage controllers) with cache pools also have a role to play in buffered vs. direct io. Typically it’s best to use direct io and let the SAN perform the caching (buffering).

RAM... lots of RAM

Kudos to napp-it for publishing their . napp-it noted in benchmarking that ARC (read) cache can make spinning rust perform like fast NVMe SSD’s.
cite napp-it:
1. The most important factor is RAM.
Whenever your workload can be mainly processed within your RAM, even a slow HD pool is nearly as fast as an ultimate Optane pool.
and

The benefits of a fast slog for sync workloads

2. Even a pure HD pool can be nearly as fast as a NVMe pool.
In my tests I used a pool from 4 x HGST HE8 disks with a combined raw sequential read/write performance of more than 1000 MB/s. As long as you can process your workload mainly from RAM, it is tremendously fast. The huge fallback when using sync-write can be nearly eliminated by a fast Optane Slog like the 900P. Such a combination can be nearly as fast as a pure SSD pool at a fraction of the cost with higher capacity. Even an SMB filer with a secure write behaviour (sync-write=always) is now possible as a 4 x HGST HE8 pool (Raid-0) and an Optane 900P Slog offered around 500-700 MB/s (needed for 10G networks) on OmniOS. Solaris with native ZFS was even faster.

Use ECC RAM whenever possible

Why? Mitigate bit flipping in memory which could result in corrupt data. Like hard drives have hardware ECC:
SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 78 to 83
ECC RAM helps to mitigate data (bits) becoming corrupt when passing through RAM. This is an important aspect especially for databases, vaults and archives.
Ordinary background radiation will randomly flip bits in computer memory, which causes undefined behaviour. These are known as “bit flips”.
Bit flips can have fairly dramatic consequences for all computer filesystems and ZFS is no exception. No technique used in ZFS (or any other filesystem) is capable of protecting against bit flips. Consequently, ECC Memory is highly recommended.

checksum and nopwrite

In certain (non default) configurations OpenZFS supports no-op writes.
ZFS supports end-to-end checksumming of every data block. When a cryptographically secure checksum is being used (and compression is enabled) OpenZFS will compare the checksums of incoming writes to checksum of the existing on-disk data and avoid issuing any write i/o for data that has not changed. This can help performance and snapshot space usage in situations where the same files are regularly overwritten with almost-identical data (e.g. regular full-backups of large random-access files).
The docs include a table of which checksum algorithms support nopwrite.
Matt Ahrens (ZFS co-author) mentioned that he was using the ZFS edonr feature and checksum algorithm in production (because it is the fastest nopwrite-capable checksum (at least for the C implementations that are in illumos)).
For example checksum=sha512 would enable nopwrite on a dataset (for all newly written data). One would need to consider the performance implications vs. the default fletcher4 algorithm. It would make sense to benchmark the various algorithms for a given system to see if there are any noteworthy performance gains.
The Illumos project implemented new algorithms in which included the info:
This feature submission implements new hash algorithms into ZFS with improved performance:
* SHA-512/256: 50% higher performance than SHA-256 on 64-bit hardware with minimum code changes.
* Skein: 80% higher performance than SHA-256 with new and highly secure algorithm. Includes a KCF SW provider interface.
* Edon-R: >350% higher performance than SHA-256. Lower security margin than Skein, but much higher throughput.
💡 Note that Edon-R requires the edonr pool feature flag to be enabled.
💡 Consider that the chosen checksum algorithm will impact not only normal IOPS but also scrub IOPS. It makes sense to do your own research and testing to find the right settings for your requirements.
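The no-op write idea can be illustrated with a toy key/value store. This is only a sketch of the concept: maybe_write and the store layout are hypothetical, hashlib's SHA-512 stands in for a cryptographically secure checksum, and the real feature has further conditions (such as compression being enabled) that are not modelled here.

```python
import hashlib


def maybe_write(store: dict, key: str, data: bytes) -> bool:
    """Write data under key unless its checksum matches what is already stored.

    Returns True if a write was issued, False if it was skipped (toy nopwrite).
    """
    checksum = hashlib.sha512(data).digest()
    existing = store.get(key)
    if existing is not None and existing[0] == checksum:
        return False  # unchanged data: no write issued
    store[key] = (checksum, data)
    return True


store = {}
results = [
    maybe_write(store, "block0", b"backup data"),     # first write
    maybe_write(store, "block0", b"backup data"),     # identical overwrite
    maybe_write(store, "block0", b"backup data v2"),  # changed data
]
```

Skipping the second write is only safe because a collision between different data is considered practically impossible for a cryptographically secure checksum, which is why the choice of checksum algorithm matters for this feature.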

raidz, vdevs and IOPS scaling

to quote Jim Salter aka / :
IOPS scales per vdev, not per disk.
A 10-wide RAIDz2 is a single vdev, with IOPS that approximate that of a single disk.
A 10-wide pool of mirrors is five vdevs, with write IOPS approximating five times that of a single disk, and read IOPS (assuming parallel requests) approximating ten times that of a single disk.
It is not a subtle difference.
...
There's not a whole lot of difference between two RAIDz2 vdevs and two mirror vdevs... but there's a lot of difference between 20 disks in two 10-wide RAIDz2, and 20 disks in ten 2-wide mirrors!
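The rule of thumb in the quote can be turned into rough arithmetic. This is a sketch only: pool_iops is a hypothetical helper, the 200 IOPS per disk figure is an assumption for illustration, and real results depend heavily on the workload.

```python
def pool_iops(vdev_count: int, disks_per_vdev: int, disk_iops: int, layout: str) -> dict:
    """Approximate pool IOPS, assuming IOPS scales per vdev rather than per disk."""
    write_iops = vdev_count * disk_iops
    if layout == "mirror":
        # With parallel requests, every mirror child can serve reads.
        read_iops = vdev_count * disks_per_vdev * disk_iops
    else:
        # raidz: roughly one disk's worth of IOPS per vdev.
        read_iops = vdev_count * disk_iops
    return {"read": read_iops, "write": write_iops}


DISK_IOPS = 200  # assumed figure for one spinning disk, for illustration only

raidz2_20 = pool_iops(vdev_count=2, disks_per_vdev=10, disk_iops=DISK_IOPS, layout="raidz2")
mirrors_20 = pool_iops(vdev_count=10, disks_per_vdev=2, disk_iops=DISK_IOPS, layout="mirror")
```

With these assumptions, the same 20 disks give about 400 read and write IOPS as two 10-wide RAIDz2 vdevs, versus about 4000 read and 2000 write IOPS as ten 2-wide mirrors.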

Resilvering disk stress in mirrors vs. raidz

💡 In OpenZFS when a disk is being rebuilt (resilvered) from parity in raidz it stresses the majority/full array. This can lead to more concurrent drive failures.
One advantage of striped mirrors is that resilvering a failed drive only stresses the mirror not the full array.

dataset property sync=disabled

If a given workload doesn’t require sync IO, then datasets with sync=disabled will perform the best, especially for spinning rust vdevs. It is important to understand the difference between async and sync IO when configuring the sync property.

Single drive zpools and property copies>1

You will still get the benefit of identifying errors in your data. Unfortunately you get none of the self-correction since there is no redundancy with single-disk pools.
On a single disk pool, setting copies=2 will reduce disk capacity by half but provide block redundancy. Still doesn’t help if the device fails... but would protect from blocks/sectors going bad on a disk, which certainly does happen, especially when a disk is nearing the end of its life.
ZFS on a single disk is a perfectly valid way to make sure your files are perfect and will indeed notify you during a scrub if something happens to specific files so you could restore them from a backup.

>> use. awesome <<

Before starting critical procedures that include destructive actions (e.g zfs destroy or adding new vdevs), an administrator can checkpoint the pool's state and in the case of a mistake or failure, rewind the entire pool back to the checkpoint. Otherwise, the checkpoint can be discarded when the procedure has completed successfully.
Pool checkpoints can also be used before upgrading zfs and/or an operating system using a root zpool (rpool). If something goes wrong, rewinding to the checkpoint reverts all changes.

Further reading

Alignment shift ashift - critical to get this right

💡 NOTE: ashift is a per vdev setting, not per pool which is often misunderstood.
Please read: , an excerpt:
ashift is actually a bit shift value, so an ashift value for 512 bytes is 9 (2^9 = 512) while the ashift value for 4,096 bytes is 12 (2^12 = 4,096).
I recommend to read the section, an excerpt:
Configuring ashift correctly is important because partial sector writes incur a penalty where the sector must be read into a buffer before it can be written. ZFS makes the implicit assumption that the sector size reported by drives is correct and calculates ashift based on that.
Sometimes drives report or zfs detects incorrect sector size. This can cause significant issues.
ashift=12 is the current recommendation for 4k native drives but it does depend on the physical devices for the given vdev. It is important that a vdev has the correct setting for the physical devices in use. Research the right ashift when creating pools or adding vdevs to existing pools.
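Since ashift is just the base-2 logarithm of the sector size, the relationship can be checked with a few lines. ashift_for is a hypothetical helper name used for illustration; it does not query any drive.

```python
def ashift_for(sector_size: int) -> int:
    """Return the ashift (bit shift) value for a power-of-two sector size in bytes."""
    if sector_size <= 0 or sector_size & (sector_size - 1):
        raise ValueError("sector size must be a power of two")
    return sector_size.bit_length() - 1


ashift_512 = ashift_for(512)   # 2^9  = 512
ashift_4k = ashift_for(4096)   # 2^12 = 4096
ashift_8k = ashift_for(8192)   # 2^13 = 8192
```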
What about a pool with multiple vdevs and mixed ashift? .
vdev removal isn't possible unless the vdev to be removed is single disk or mirror, [there are no top-level RAIDZ vdevs], AND all vdevs on the pool have the same ashift setting. If you add a vdev with mismatched ashift, you can neither remove that vdev NOR any other. You're hosed until you destroy the whole pool and restore from backup.
...
Unless something has changed very recently, no it does not—adding a vdev without specifying ashift defaults to default ashift, which depends on what the drive itself reports as its native sector size—which many drives, including Samsung SSDs, lie about (claiming they're 512n when they're actually 512e/4kn).
This isn't just theoretical for me; I've ruined pools that way accidentally in the past by adding a new mirror of Samsung SSDs to a pool with existing ashift=13 Samsung SSD mirrors, and gotten ashift=9 on the newly added vdev.
And, yeah, the only fix for that is destroy the pool and restore from backup.

Striped pool / RAID0 >> (fast but not fault tolerant) <<

A ZFS Striped Vdev pool is very similar to RAID0. You get to keep all of the available storage that your drives offer, but you have no resiliency to hard drive failure. If one drive in a Striped Vdev Zpool fails you will lose all of your data. You do still have checksum data to prevent silent data loss, but any physical failure of a drive will result in data loss. We strongly recommend never using this level of ZFS, as there is no resiliency to drive failure.
RAID-0 is not RAID. There is no redundancy. If a single disk fails, you lose the whole array. You can only replace drives on redundant arrays.

Reserved slop - probably want to tune this

Especially for large capacity drives, consider tuning this. Be aware of reserved ("slop") space:
Edit /etc/modprobe.d/zfs.conf, appending:
options zfs spa_slop_shift=6
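To get a feel for what the setting changes, the reservation can be approximated as the pool size shifted right by spa_slop_shift (default 5, i.e. 1/32 of the pool). This is a simplified sketch: slop_bytes is a hypothetical helper, and AFAIK OpenZFS also clamps the result to minimum and (in newer releases) maximum values, which are ignored here.

```python
TIB = 2**40
GIB = 2**30


def slop_bytes(pool_bytes: int, spa_slop_shift: int = 5) -> int:
    """Approximate reserved slop space: pool size divided by 2^spa_slop_shift."""
    return pool_bytes >> spa_slop_shift


pool_size = 16 * TIB
default_slop_gib = slop_bytes(pool_size) // GIB   # shift 5 -> 1/32 of the pool
tuned_slop_gib = slop_bytes(pool_size, 6) // GIB  # shift 6 -> 1/64 of the pool
```

For a 16 TiB pool this approximation gives 512 GiB reserved at the default and 256 GiB with spa_slop_shift=6.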

dataset recordsize

Be aware of the dataset property recordsize. Each dataset can have its own recordsize; the default is 128K.
Be aware that for zfs file systems “ZFS automatically tunes block sizes according to internal algorithms optimized for typical access patterns.” and “recordsize affects only files created afterward; existing files are unaffected.”

zvol volblocksize

Be aware of zvol property volblocksize, each zvol can have its own volblocksize. “The default 8k is the size of a page on the SPARC architecture.” and “Workloads that use smaller sized IOs (such as swap on x86 which use 4096-byte pages) will benefit from a smaller volblocksize.”
Be aware of kvm virtio logical_block_size=4096,physical_block_size=4096 for drives. As of writing this must be set manually on proxmox if you have mixed device block sizes. At least if you’re aiming for optimal block usage. Or alternatively as a global for a vm in the vm .conf
args: -global virtio-blk-device.physical_block_size=4096 -global virtio-blk-device.logical_block_size=4096

sync IO workloads - procure an Optane / 3D XPoint mirror for SLOG:

If your workload requires fast synchronous writes:
If your workload involves fsync or O_SYNC and your pool is backed by mechanical storage, consider adding one or more SLOG devices. Pools that have multiple SLOG devices will distribute ZIL operations across them. The best choice for SLOG device(s) are likely Optane / 3D XPoint SSDs. See for a description of them. If an Optane / 3D XPoint SSD is an option, the rest of this section on synchronous I/O need not be read. If Optane / 3D XPoint SSDs is not an option, see for suggestions for NAND flash SSDs and also read the information below.
I’ve procured and tested a 3D XPoint Intel SSD Optane 900p 280GB PCIe 3.0 x4, NVMe (SSDPED1D280GA) and it works very well as an slog.
I’ve read it’s recommended for enterprise grade ZFS that the slog vdev should be mirrored on two identical, physically separate devices, and there is potentially a write IOPS multiplier:
Pools that have multiple SLOG devices will distribute ZIL operations across them.

L2ARC

TODO

Fragmentation

zpool list FRAG stat is related to free space fragmentation? Good write up:
A pool with a low FRAG percent has most of its remaining free space in large contiguous segments, while a pool with a high FRAG percentage has most of its free space broken up into small pieces. The FRAG percentage tells you nothing about how fragmented (or not fragmented) your data is, and thus how many seeks it will take to read it back. Instead it is part of how hard ZFS will have to work to find space for large chunks of new data (and how fragmented they may be forced to be when they get written out).
(How hard ZFS has to work to find space is also influenced by how much total free space is left in your pool. There's likely to be some correlation between low free space and higher FRAG numbers, but I wouldn't assume that they're inextricably yoked together.)
The way to defrag is to do a zfs send then zfs recv the pool but at 26% fragmentation, you probably don't have any performance impact due to fragmentation so why bother.