Last updated: 2023-12-30

The goal of this content is to give the reader insights into ZFS and OpenZFS, and how ZFS works at a high to medium level. It raises considerations for planning and using OpenZFS. If you are looking to answer “How do I do x with ZFS?” or are looking for a specific command, check out my .

ZFS intro
ZFS is a software-implemented data storage system that runs inside the operating system kernel, typically as a loadable kernel module. It provides a filesystem with strong data-integrity guarantees and advanced features such as snapshots, replication, encryption and data redundancy, and it can self-heal data corruption.
ZFS can be highly performant, offering various read caching strategies and the ability to dedicate low-latency storage to synchronous write workloads, which can significantly improve write performance on slower pools.
As the name suggests, OpenZFS is an open community project implementation of ZFS, hosted on GitHub with over 500 contributors. Many operating system distributions include support for OpenZFS. OpenZFS’s origins stem from the porting of Sun Microsystems’ ZFS, which was distributed with OpenSolaris (under the CDDL licence) before Sun was acquired by Oracle Corp.

Unlike most other storage systems, ZFS unifies both physical volume management and logical block devices, acting as both a volume manager and a filesystem. As a result, ZFS has complete knowledge of both the physical disks and volumes (including their status, health, and logical arrangement in volumes) and all the files stored on them. ZFS is designed to ensure (with suitable hardware) that data stored on disks cannot be lost due to physical errors, hardware or operating system mis-processing, or bit-rot events and data corruption that can occur over time. ZFS’s complete control of the storage system ensures that every step, whether related to file management or disk management, is verified, confirmed, corrected if necessary, and optimised in a way that storage controller cards and separate volume and file managers cannot achieve.
Headline features are documented here: and .

Core concepts
CoW - Copy on Write
To cite a reddit user’s description of ZFS from a post: ZFS is a Copy-on-Write (COW) filesystem. This basically means that the data written to the pool is immutable. When the time comes to add or change data, blocks that are marked as committed won’t be modified in any way but copied to a new block with the additions and/or changes needed, and then the new block is committed to the pool.
This means there’s a cost (ZFS overhead) when writing to the pool: it might take multiple disk operations before the data is committed to the pool. But this gives us transactional behaviour at the data block level, not just for metadata as journaling filesystems do. Other advanced features such as snapshots make use of this design very efficiently.
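A hedged sketch of observing the CoW behaviour via snapshot space accounting; the pool, dataset and file names are illustrative, not taken from the text above:

```
# Take a snapshot before modifying data
zfs snapshot tank/data@before

# Overwrite part of an existing file "in place" from the application's view
dd if=/dev/urandom of=/tank/data/file.bin bs=1M count=100 conv=notrunc

# CoW means the old blocks were not modified; the snapshot still references
# them, so the space they occupy now shows up as USED by the snapshot
zfs list -t snapshot -o name,used,referenced tank/data@before
```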
Objects and object sets
In ZFS, objects are grouped together in object sets. A dataset is an “object set” object; files and directories are objects grouped into a dataset. To cite robn, an OpenZFS contributor, posting about “OpenZFS objects and object sets”: Objects are OpenZFS’ conception of an arbitrarily-large chunk of data. Objects are what the “upper” levels of OpenZFS deal with, and use them to assemble more complicated constructs, like filesystems, volumes, snapshots, property lists, and so on.
... objects can be grouped together into a named collection; that collection is called an “object set”, and is itself an object. An object set forms a unique namespace for its object ... So, the full “path” of an object is the “object set id”, then the “object id”.
A typical OpenZFS pool has lots of object sets. Some of these are visible to the user, and are called “datasets”. You might know these better as “filesystems”, “snapshots” and “volumes”. There’s nothing special about those, they’re just collections of objects assembled in such a way as to provide useful and familiar facilities to the user.
Blocks and block pointers
In ZFS, objects are sequences of blocks. A ZFS plain file object records information about its sequence of data blocks. Typically, when writing a file, ZFS will write n blocks of up to recordsize to fill the space required for a given file. A file object therefore stores a list of block pointers; each block pointer contains one or more DVAs (Data Virtual Addresses). A DVA contains the coordinates needed to access a block: vdev, offset and size. So, a block coordinate in ZFS is object set id → object id → DVA.
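To make this concrete, a hedged sketch of inspecting a file object’s block pointers with zdb; the dataset name tank/data, the file name and the object id 128 are illustrative values, and the exact output format varies between OpenZFS versions:

```
# Find the object id of a file (on ZFS the inode number is the object id)
ls -i /tank/data/bigfile.bin

# Dump the object, including its block pointers / DVAs
zdb -ddddd tank/data 128
# Expect lines roughly like:
#   0  L0  DVA[0]=<0:3a9f2000:20000> ...
# i.e. vdev 0, offset 0x3a9f2000, allocated size - the block's "coordinates".
```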
With the CoW semantic, when a ZFS plain file object is being modified, only the blocks to be modified are copied and then modified; unmodified blocks remain unchanged. The object records the changes. When the old versions of the modified blocks are no longer referred to, for example by a snapshot, they can be freed.
Snapshots and transactions
Modifications to a ZFS filesystem are recorded in transactions. In simple terms, a dataset snapshot is an explicitly named transaction reference. Blocks will remain in the filesystem until no snapshot refers to them. Snapshots provide the ability to look back at the block state at a given point in time, à la time machine. If desired, newer transactions can be discarded, i.e. a dataset can be rolled back to a given snapshot.
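For example, a minimal sketch of the snapshot/rollback workflow (dataset and snapshot names are illustrative):

```
# Take a named snapshot of a dataset
zfs snapshot tank/data@before-upgrade

# List snapshots and the space they keep referenced
zfs list -t snapshot -r tank/data

# Discard everything written since the snapshot
# (the snapshot must be the most recent one, or use -r to destroy newer ones)
zfs rollback tank/data@before-upgrade
```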
Dataset replication
ZFS can calculate the difference or delta between transaction X and Y, i.e. what objects and blocks changed between state X and Y. With this information ZFS is able to offer very efficient file system replication, only needing to send the delta of new and/or modified blocks.
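A hedged sketch of incremental replication with zfs send/recv; the pool, dataset, snapshot and host names are illustrative:

```
# Initial (full) replication of a snapshot
zfs send tank/data@snap1 | zfs recv backup/data

# Incremental replication: only the delta between @snap1 and @snap2 is sent
zfs send -i tank/data@snap1 tank/data@snap2 | zfs recv backup/data

# The same delta sent to another host over ssh
zfs send -i tank/data@snap1 tank/data@snap2 | ssh backuphost zfs recv backup/data
```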
If you’d like to learn more details, have a look at my ZFS internals page [].

Resilvering
TODO - the process of rebuilding one or more vdev children and restoring a pool to a healthy state. There are a number of ZFS module parameters that control resilvering []. With mirror vdevs resilvering is a fairly straightforward computation; for other vdev types things can be more complex.
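A minimal sketch of triggering a resilver by replacing a failed device (pool and device names are illustrative):

```
# Replace a failed child device; the pool resilvers onto the new disk
zpool replace tank /dev/disk/by-id/ata-OLD_DISK /dev/disk/by-id/ata-NEW_DISK

# Watch resilver progress and overall pool health
zpool status -v tank
```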
raidz

See for raidz details, and info on raidz, vdevs and IOPS scaling.

My high level ZFS glossary
The zpool command manages ZFS storage pool configuration and the pools’ virtual devices, aka vdevs.
A ZFS zpool and the zpool command have similar characteristics to a physical volume manager.
The zfs command manages ZFS datasets, which are children of pools. zfs defines and manages the dataset hierarchy.
A ZFS dataset has similar characteristics to a logical volume. So, the zfs command is similar to a logical volume manager.
Dataset types include: filesystem, snapshot, volume (zvol), bookmark.
vdev = virtual device - made up of one or more drives or partitions or files.
zpools are populated with vdevs; datasets are created within zpools.
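Putting the glossary together, a hedged sketch of creating a pool and datasets; pool, dataset and device names are illustrative:

```
# Create a pool from a mirror vdev of two example disks
zpool create tank mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2

# Create a filesystem dataset and a 10G volume (zvol) inside the pool
zfs create tank/data
zfs create -V 10G tank/vm-disk0

# Inspect the pool (vdev level) and the dataset hierarchy
zpool status tank
zfs list -r -t filesystem,volume,snapshot tank
```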
zvol = volume storage i.e. virtual block device. >> avoid using them - see the PSA below <<

ZIL stands for ZFS Intent Log, which temporarily holds synchronous writes until they are written to the zpool. The ZIL is not used/written for async or sync=disabled workloads. By default the ZFS Intent Log is written to the same vdevs as data.
SLOG is a dedicated separate log vdev, aka slog, which can be added to a zpool with the aim of making synchronous write workloads go as fast as possible. The ZIL is moved to the slog. The slog device should be as fast as possible at synchronous writes, i.e. have very low write latency.
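A hedged sketch of adding a slog (pool and device names are illustrative):

```
# Add a dedicated log vdev (slog) to an existing pool
zpool add tank log /dev/disk/by-id/nvme-FAST_SSD

# ...or, alternatively, a mirrored slog to avoid a single point of failure
zpool add tank log mirror /dev/disk/by-id/nvme-SSD1 /dev/disk/by-id/nvme-SSD2
```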
(L1)ARC is ZFS’s level 1 cache. An Adaptive Replacement Cache located in RAM.
Level 1 ARC based in RAM. OpenZFS will always have an L1ARC, whereas the L2ARC is optional. An L1ARC is often referred to by users as ARC.
L2ARC is ZFS’s level 2 Adaptive Replacement Cache and should be on a fast device (like SSD or NVMe).
Level 2 ARC. A persistent and non-RAM ARC. When cached data overflows RAM and an L2ARC is present, it will be used to complement the L1ARC.
Devices can be added to a storage pool as “cache devices”. These devices provide an additional layer of caching between main memory and disk. For read-heavy workloads, where the working set size is much larger than what can be cached in main memory, using cache devices allows much more of this working set to be served from low latency media. Using cache devices provides the greatest performance improvement for random read-workloads of mostly static content.
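A hedged sketch of adding an L2ARC cache device and observing the ARC; pool and device names are illustrative, and the observation commands assume ZFS on Linux:

```
# Add a cache device (L2ARC) to an existing pool
zpool add tank cache /dev/disk/by-id/nvme-CACHE_SSD

# Observe ARC / L2ARC behaviour
arc_summary                          # if installed (part of many OpenZFS packages)
cat /proc/spl/kstat/zfs/arcstats     # raw kstats on Linux
```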
mirror - a mirror vdev consists of two or more child devices; each child is an exact copy (or mirror) of its siblings. A classic RAID 1 mirrored pair contains two disks.

spare - TODO
RAIDZ - citing the OpenZFS docs []: RAIDZ is a variation on RAID-5 that allows for better distribution of parity and eliminates the RAID-5 “write hole” (in which data and parity become inconsistent after a power loss). Data and parity is striped across all disks within a raidz group.
A raidz group can have single, double, or triple parity, meaning that the raidz group can sustain one, two, or three failures, respectively, without losing any data.
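For example, a hedged sketch of creating a double-parity raidz pool; pool and device names are illustrative:

```
# raidz2 (double parity): the group can sustain any two drive failures
zpool create tank raidz2 \
    /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 \
    /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4
```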
ZFS PSAs - Public Service Announcements
AVOID SMR drives with ZFS - use CMR drives where possible
See the issue I authored: ~30% performance degradation for ZFS vs. non-ZFS for large file transfer. It covers things in detail. In summary, ZFS performs significantly worse on SMR drives because of ZFS’s fundamental internals and their interplay with the physical storage mechanics used in SMR drives. See also the material from HGST Inc. about SMR and ZFS.

AVOID USING ZVOLs
Avoid using zvols unless you absolutely understand the implications. From my experience one needs to take a number of things into account and make tweaks to have zvols operate in a performant manner. zvols are very attractive for provisioning virtual block devices, especially for kvm or persistent volumes on k8s 💪, but unfortunately the zvol implementation in ZoL/OpenZFS has a number of issues which I’ve verified first hand (on my hardware).
Further reading: Extreme performance penalty, holdups and write amplification when writing to ZVOLs. As of 2021-09-15 AFAIK there is no roadmap to resolve these issues.
As of 2022-06 it looks like some commits related to the Issue went into OpenZFS:master; they require testing and a specific flag to be enabled: zvol_use_blk_mq. I doubt the commits fundamentally change the way zvols are implemented; rather, they provide more efficient IO scheduling? I could imagine this could reduce system loadavg when using zvols. Needs further testing and investigation. As of 2023-11 there was a bug reported in zvol_use_blk_mq as documented in - so ensure your version of ZFS includes this change if you plan to try zvol_use_blk_mq.
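A hedged sketch of enabling the flag on ZFS on Linux, assuming your OpenZFS build ships the zvol_use_blk_mq module parameter; the file paths are the conventional Linux ones, not taken from the text above:

```
# Persist the setting so it applies at module load time
echo "options zfs zvol_use_blk_mq=1" >> /etc/modprobe.d/zfs.conf

# Or set it at runtime (if your build exposes it as writable); it typically
# only takes effect for zvols created or re-attached afterwards
echo 1 > /sys/module/zfs/parameters/zvol_use_blk_mq

# Verify the current value
cat /sys/module/zfs/parameters/zvol_use_blk_mq
```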
HOWTO

gives a great overview of concepts and command lines.

Official Docs
and . These are great docs to get one’s head around ZFS concepts.

Things to consider
Understanding that RAID is not a backup
💡 Do not rely on a single logical copy of your important data - there is no redundancy in that approach.
The strategy of RAID and data mirroring is higher data availability, NOT data redundancy! RAID and ZFS alone will not protect you from data loss. In data redundancy terms: one copy of data is none, two is one and so on.
A single ZFS pool, even with multiple copies of data or parity levels, is still a single point (pool) of failure and logically a single copy of your data.
ZFS is great at providing highly available data pools AND ZFS can form part of your data redundancy and backup strategy. ZFS has many useful features that can be utilised to achieve data redundancy objectives, including tested and integrity verified backups.
I work with the principle that data is not safe until backups exist, and backups do not exist unless they have been restored and their integrity has been verified. ~kyle0r

In the event of a disk/drive/device failure, real-time RAID parity like in RAIDZ provides enough data redundancy to allow a zpool/array to continue to operate and prevents data from becoming unavailable. After a drive failure a sysop must replace the failed device, and the array is rebuilt (resilvered) via parity calculations across the blocks available on the remaining drives. Keep in mind that during the rebuild/resilver process there is a time window where the array is typically in a high-risk DEGRADED state, having lost its higher data availability because one or more drives became unavailable. With each further loss of pool redundancy, the risk of data loss increases, potentially up to the loss of the entire array/pool. It’s also important to remember that the drive rebuild/resilver process is typically a very heavy workload for the drives (high stress - especially in raidz), and this can increase the risk of additional drive failures during the rebuild/resilver process.
This is why it’s critical to maintain at least a second copy of your important data, and ideally you’d have >2 copies.
You can read my in-depth notes and strategies:
Understand the difference between async and sync IO
When it comes to write IO:
sync IO promises that an IO has been written to physical media before it is acknowledged and the next IO is written. Sync IO workload performance is therefore limited by the speed and number of concurrent write acknowledgements the vdev(s) can make.
async IO makes no promises and will try and write the IO as fast as it can, eventually flushing all IO to physical media.
So sync=immediate (slower and safer) vs. async=eventual consistency (faster and less safe).
With async IO there is a risk that not all IO reaches the physical disk. Consider an rsync process that finishes, where its IO was issued as async to the IO subsystem. The command has finished but the IO is still in flight in the IO subsystem, waiting to be flushed to the physical media. If something like a power failure happened before the IO was fully flushed to the physical media, then data loss would occur and rsync (and you) would be none the wiser.
Mission-critical and highly sensitive applications like databases typically require (want) sync IO; it depends on the business, non-functional and criticality requirements of the data being stored.
One has to choose the right IO characteristics and trade offs based on requirements and situation.
If the workload does not require sync IO guarantees then stick with async and consider dataset property sync=disabled to maximise disk throughput. A battery backup/UPS/USV can help to mitigate the risks of async IO during power outages.
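A hedged sketch of the relevant dataset property; dataset names are illustrative:

```
# Drop sync guarantees entirely (fastest, least safe)
zfs set sync=disabled tank/scratch

# Default: honour the application's sync requests (fsync, O_SYNC, ...)
zfs set sync=standard tank/scratch

# Treat every write as synchronous (slowest, safest)
zfs set sync=always tank/database
```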
Understand the difference between buffered and direct IO.
buffered IO uses the page cache; direct IO bypasses the page cache.
In some scenarios the page cache can make applications perform faster than the underlying physical media.
One downside of buffered IO is the “2x double write” cost: once to write into the page cache, and once to write to the physical media. Bypassing the 2x cost for large bulk IO is typically more performant.
There are pros and cons to both, workload and use case dependent.
My fio benchmarking of ZFS suggests that buffered IO, especially for high throughput/IO workloads, uses more system resources and in reality performs much worse than direct IO. zvols were unstable with buffered IO and should be avoided.
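A hedged sketch of the kind of fio comparison meant here; the paths, sizes and job parameters are illustrative rather than the author’s exact benchmark jobs, and O_DIRECT on ZFS datasets assumes a reasonably recent OpenZFS version:

```
# Buffered sequential write through the page cache
fio --name=buffered --directory=/tank/data --rw=write --bs=1M --size=4G \
    --ioengine=psync --direct=0 --numjobs=1 --group_reporting

# Direct (O_DIRECT) sequential write, bypassing the page cache
fio --name=direct --directory=/tank/data --rw=write --bs=1M --size=4G \
    --ioengine=psync --direct=1 --numjobs=1 --group_reporting
```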
If your workload involves writing a lot of unique, bulk data, like backups, why would you want to write it through the page cache? You probably won’t be reading it again any time soon, plus there’s the negative of the double write cost.
If you are reading some large data file, buffered reads will push existing pages out of the cache; this might have a negative impact on running applications that were making use of the page cache, aka cache poisoning.
As a general rule, use direct IO for bulk write workloads. For other use cases you have to consider all the variables and run benchmarks.
Not related to ZFS, but SANs (or storage controllers) with cache pools also have a role to play in buffered vs. direct IO. Typically it’s best to use direct IO and let the SAN perform the caching (buffering).