ZFS is a Copy-on-Write (COW) filesystem. This basically means that data written to the pool is immutable: when the time comes to add or change data, blocks that are already committed are never modified in place. Instead, they are copied to a new block containing the additions and/or changes, and that new block is then committed to the pool.
This means there is a cost (ZFS overhead) when writing to the pool: it might take multiple disk operations before the data is committed. In return we get transactional behaviour at the data block level, not just for metadata as journaling filesystems provide. Other advanced features, such as snapshots, make very efficient use of this design.
Objects are OpenZFS’ conception of an arbitrarily-large chunk of data. Objects are what the “upper” levels of OpenZFS deal with, and they are used to assemble more complicated constructs, like filesystems, volumes, snapshots, property lists, and so on.
... objects can be grouped together into a named collection; that collection is called an “object set”, and is itself an object. An object set forms a unique namespace for its object ... So, the full “path” of an object is the “object set id”, then the “object id”.
A typical OpenZFS pool has lots of object sets. Some of these are visible to the user, and are called “datasets”. You might know these better as “filesystems”, “snapshots” and “volumes”. There’s nothing special about those, they’re just collections of objects assembled in such a way as to provide useful and familiar facilities to the user.
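As a concrete illustration, those dataset types show up directly in the CLI. A rough sketch, where "tank" is a placeholder pool name:

```sh
# List the user-visible datasets in a pool: filesystems, volumes and snapshots.
zfs list -t filesystem,volume,snapshot -o name,type,used,refer -r tank
```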
Level 1 ARC, based in RAM. OpenZFS will always have an L1ARC, whereas the L2ARC is optional. The L1ARC is often referred to by users simply as the ARC.
Level 2 ARC. A non-RAM extension of the ARC on persistent media. When cached data overflows RAM and an L2ARC is present, it will be used to complement the L1ARC.
Devices can be added to a storage pool as “cache devices”. These devices provide an additional layer of caching between main memory and disk. For read-heavy workloads, where the working set size is much larger than what can be cached in main memory, using cache devices allows much more of this working set to be served from low latency media. Using cache devices provides the greatest performance improvement for random-read workloads of mostly static content.
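For example, a cache device can be attached to an existing pool with `zpool add`; the pool and device names below are placeholders:

```sh
# Add an SSD as an L2ARC (cache) device to the pool "tank".
zpool add tank cache /dev/nvme0n1

# Verify that it shows up under the "cache" section.
zpool status tank
```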
RAIDZ is a variation on RAID-5 that allows for better distribution of parity and eliminates the RAID-5 “write hole” (in which data and parity become inconsistent after a power loss). Data and parity is striped across all disks within a raidz group.
A raidz group can have single, double, or triple parity, meaning that the raidz group can sustain one, two, or three failures, respectively, without losing any data.
In the event of a disk/drive/device failure, real-time RAID parity like in RAIDZ provides enough data redundancy to allow a zpool/array to continue to operate and prevents data from becoming unavailable. After a drive failure a sysop must replace the failed device, and the array is rebuilt (resilvered) via parity calculations over the available blocks on the remaining drives. Keep in mind that during the rebuild/resilver process there is a window where the array is in a high-risk DEGRADED state, having lost some of its availability because one or more drives became unavailable. With each further loss of pool redundancy, the risk of data loss increases, up to and including loss of the entire array/pool. It's also important to remember that the rebuild/resilver process is typically a very heavy workload for the drives (high stress, especially in raidz), which increases the risk of additional drive failures during the rebuild/resilver.
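A typical replacement looks roughly like the sketch below; the pool and device names are hypothetical:

```sh
# Identify the failed device; the pool will show as DEGRADED.
zpool status tank

# Replace the failed disk with a new one and let the resilver run.
# "da3" is the failed device, "da7" the replacement (placeholder names).
zpool replace tank da3 da7

# Watch resilver progress.
zpool status -v tank
```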
Whenever your workload can be processed mainly from RAM, even a slow HD pool is nearly as fast as an all-Optane pool.
In my tests I used a pool of 4 x HGST HE8 disks with a combined raw sequential read/write performance of more than 1000 MB/s. As long as you can process your workload mainly from RAM, it is tremendously fast. The huge drop when using sync-write can be nearly eliminated by a fast Optane Slog like the 900P. Such a combination can be nearly as fast as a pure SSD pool at a fraction of the cost and with higher capacity. Even an SMB filer with secure write behaviour (sync-write=always) is now possible, as a 4 x HGST HE8 pool (Raid-0) with an Optane 900P Slog offered around 500-700 MB/s (needed for 10G networks) on OmniOS. Solaris with native ZFS was even faster.
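Enforcing that secure write behaviour is a per-dataset property; the dataset name below is a placeholder:

```sh
# Treat every write as synchronous (safest, and only fast with a good Slog).
zfs set sync=always tank/smbshare

# Check the effective value and where it is inherited from.
zfs get sync tank/smbshare
```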
Ordinary background radiation will randomly flip bits in computer memory, which causes undefined behaviour. These are known as “bit flips”.
Bit flips can have fairly dramatic consequences for all computer filesystems and ZFS is no exception. No technique used in ZFS (or any other filesystem) is capable of protecting against bit flips. Consequently, ECC Memory is highly recommended.
ZFS supports end-to-end checksumming of every data block. When a cryptographically secure checksum is being used (and compression is enabled) OpenZFS will compare the checksums of incoming writes to the checksums of the existing on-disk data and avoid issuing any write I/O for data that has not changed. This can help performance and snapshot space usage in situations where the same files are regularly overwritten with almost-identical data (e.g. regular full backups of large random-access files).
This feature submission adds new hash algorithms to ZFS with improved performance:
* SHA-512/256: 50% higher performance than SHA-256 on 64-bit hardware with minimum code changes.
* Skein: 80% higher performance than SHA-256 with a new and highly secure algorithm. Includes a KCF SW provider interface.
* Edon-R: >350% higher performance than SHA-256. Lower security margin than Skein, but much higher throughput.
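On pools where these features are enabled, the checksum algorithm is selected per dataset; the dataset name here is a placeholder:

```sh
# Select one of the newer checksum algorithms for a dataset.
zfs set checksum=skein tank/data

# Inspect the value in effect.
zfs get checksum tank/data
```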
IOPS scales per vdev, not per disk.
A 10-wide RAIDz2 is a single vdev, with IOPS that approximate that of a single disk.
A 10-wide pool of mirrors is five vdevs, with write IOPS approximating five times that of a single disk, and read IOPS (assuming parallel requests) approximating ten times that of a single disk.
It is not a subtle difference.
There's not a whole lot of difference between two RAIDz2 vdevs and two mirror vdevs... but there's a lot of difference between 20 disks in two 10-wide RAIDz2, and 20 disks in ten 2-wide mirrors!
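The two 20-disk layouts compared above would be created roughly as follows (alternative commands, you would pick one; device names are placeholders):

```sh
# 20 disks as two 10-wide RAIDz2 vdevs: roughly two vdevs' worth of IOPS.
zpool create tank \
  raidz2 da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 \
  raidz2 da10 da11 da12 da13 da14 da15 da16 da17 da18 da19

# The same 20 disks as ten 2-wide mirrors: roughly ten vdevs' worth of IOPS.
zpool create tank \
  mirror da0 da1   mirror da2 da3   mirror da4 da5   mirror da6 da7 \
  mirror da8 da9   mirror da10 da11 mirror da12 da13 mirror da14 da15 \
  mirror da16 da17 mirror da18 da19
```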
Before starting critical procedures that include destructive actions (e.g. zfs destroy or adding new vdevs), an administrator can checkpoint the pool's state and, in the case of a mistake or failure, rewind the entire pool back to the checkpoint. Otherwise, the checkpoint can be discarded when the procedure has completed successfully.
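A sketch of that workflow with the pool checkpoint commands (the pool name is a placeholder):

```sh
# Take a checkpoint before the risky operation.
zpool checkpoint tank

# ... perform the destructive procedure (zfs destroy, zpool add, ...) ...

# If something went wrong: rewind the whole pool to the checkpoint.
zpool export tank
zpool import --rewind-to-checkpoint tank

# If everything went fine: discard the checkpoint to reclaim its space.
zpool checkpoint -d tank
```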
ashift is actually a bit shift value, so an ashift value for 512 bytes is 9 (2^9 = 512) while the ashift value for 4,096 bytes is 12 (2^12 = 4,096).
Configuring ashift correctly is important because partial sector writes incur a penalty where the sector must be read into a buffer before it can be written. ZFS makes the implicit assumption that the sector size reported by drives is correct and calculates ashift based on that.
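Because drives can misreport their sector size, it is common to set ashift explicitly at pool creation instead of relying on that assumption; device names below are placeholders:

```sh
# Force 4 KiB sectors (2^12 = 4096) for the vdevs created here.
zpool create -o ashift=12 tank mirror da0 da1
```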
vdev removal isn't possible unless the vdev to be removed is a single disk or mirror, [there are no top-level RAIDZ vdevs], AND all vdevs in the pool have the same ashift setting. If you add a vdev with mismatched ashift, you can neither remove that vdev NOR any other. You're hosed until you destroy the whole pool and restore from backup.
Unless something has changed very recently, no it does not: adding a vdev without specifying ashift falls back to the default ashift, which depends on what the drive itself reports as its native sector size, and many drives, including Samsung SSDs, lie about that (claiming they're 512n when they're actually 512e/4kn).
This isn't just theoretical for me; I've ruined pools that way accidentally in the past by adding a new mirror of Samsung SSDs to a pool with existing ashift=13 Samsung SSD mirrors, and gotten ashift=9 on the newly added vdev.
And, yeah, the only fix for that is destroy the pool and restore from backup.
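A hedged way to avoid that trap is to state the ashift explicitly when adding the new vdev and to verify it before writing data; names and the ashift value are placeholders matching the example above:

```sh
# Add the new mirror with the same ashift as the existing vdevs (here 13, i.e. 8 KiB).
zpool add -o ashift=13 tank mirror /dev/sdc /dev/sdd

# Double-check the ashift actually used per vdev before putting data on it;
# a mismatch can only be fixed by destroying the pool and restoring from backup.
zdb -C tank | grep ashift
```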
A ZFS Striped Vdev pool is very similar to RAID0. You get to keep all of the available storage that your drives offer, but you have no resiliency to hard drive failure. If one drive in a Striped Vdev Zpool fails you will lose all of your data. You do still have checksums, so silent data corruption will be detected, but any physical failure of a drive will result in data loss. We strongly recommend never using this level of ZFS, as there is no resiliency to drive failure.
If your workload involves fsync or O_SYNC and your pool is backed by mechanical storage, consider adding one or more SLOG devices. Pools that have multiple SLOG devices will distribute ZIL operations across them. The best choice for SLOG device(s) is likely Optane / 3D XPoint SSDs; see the section describing them. If an Optane / 3D XPoint SSD is an option, the rest of this section on synchronous I/O need not be read. If Optane / 3D XPoint SSDs are not an option, see the suggestions for NAND flash SSDs and also read the information below.
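Adding SLOG devices is a `zpool add` operation; the device names are placeholders, and mirroring the log is a common precaution rather than a requirement:

```sh
# Add a single dedicated log (SLOG) device.
zpool add tank log /dev/nvme0n1

# Or add a mirrored pair of log devices so a single SLOG failure does not
# put recently acknowledged synchronous writes at risk.
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1
```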
A pool with a low FRAG percent has most of its remaining free space in large contiguous segments, while a pool with a high FRAG percentage has most of its free space broken up into small pieces. The FRAG percentage tells you nothing about how fragmented (or not fragmented) your data is, and thus how many seeks it will take to read it back. Instead it is part of how hard ZFS will have to work to find space for large chunks of new data (and how fragmented they may be forced to be when they get written out).
(How hard ZFS has to work to find space is also influenced by how much total free space is left in your pool. There's likely to be some correlation between low free space and higher FRAG numbers, but I wouldn't assume that they're inextricably yoked together.)
The way to defrag is to zfs send and then zfs recv the pool, but at 26% fragmentation you probably don't have any performance impact due to fragmentation, so why bother.
The process to move the data was the following: I did a zfs send -R oldtank@manualsnapshot to the backup pool, and then sequentially restored all six of my datasets.
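Checking the FRAG column and rewriting a pool via send/receive looks roughly like this; pool and snapshot names follow the example above and are otherwise placeholders:

```sh
# FRAG describes free-space fragmentation, not how fragmented your files are.
zpool list -o name,size,alloc,free,frag,cap

# Rewrite the data elsewhere, then restore it, to "defragment" free space.
zfs snapshot -r oldtank@manualsnapshot
zfs send -R oldtank@manualsnapshot | zfs recv -d backup
```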
Despite having 3.9G free on our pool, we can’t snapshot the zvol. If you don’t have at least as much free space in a pool as the REFER of a ZVOL on that pool, you can’t snapshot the ZVOL, period. ... In a real-world situation with VM images, this could easily be a case where you can’t snapshot your 15TB VM image without 15 terabytes of free space available
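A quick way to see whether you will hit that wall before attempting the snapshot (names are placeholders):

```sh
# Compare the zvol's REFER against the pool's free space.
zfs list -o name,used,refer,avail tank/vm-image
zpool list -o name,size,alloc,free tank
```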
Set either relatime=on or atime=off to minimize IOs used to update access time stamps. For backward compatibility with the small percentage of software that relies on access times, relatime is preferred when available and should be set on your entire pool; atime=off should be used more selectively.
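These are ordinary dataset properties, inherited by children; pool and dataset names are placeholders:

```sh
# Prefer relatime on the pool root so all datasets inherit it.
zfs set relatime=on tank

# Turn access times off entirely on datasets where nothing needs them.
zfs set atime=off tank/scratch
```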
(see also the reserved slop topic) Keep pool free space above 10% to avoid having many metaslabs reach the 4% free-space threshold at which ZFS switches from first-fit to best-fit allocation strategies. When the threshold is hit, the allocator becomes very CPU intensive in an attempt to protect itself from fragmentation. This reduces IOPS, especially as more metaslabs reach the 4% threshold.
The recommendation is 10% rather than 5% because metaslab selection considers both location and free space unless the global metaslab_lba_weighting_enabled tunable is set to 0. When that tunable is 0, ZFS considers only free space, so the expense of the best-fit allocator can be avoided by keeping free space above 5%. That setting should only be used on systems with pools that consist of solid-state drives, because it will reduce sequential IO performance on mechanical disks.
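A quick capacity check, plus the Linux-specific way of flipping that tunable (the module-parameter path assumes ZFS on Linux):

```sh
# Watch overall capacity; keeping CAP below ~90% keeps most metaslabs
# away from the best-fit threshold.
zpool list -o name,cap,frag,free

# On all-SSD pools only: disable LBA weighting so free space above 5% suffices.
echo 0 > /sys/module/zfs/parameters/metaslab_lba_weighting_enabled
```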
Set compression=lz4 (zstd might be better today?) on your pool's root dataset so that all datasets inherit it, unless you have a reason not to enable it. Userland tests of LZ4 compression of incompressible data in a single thread have shown that it can process 10GB/sec, so it is unlikely to be a bottleneck even on incompressible data. Furthermore, incompressible data is stored without compression, so reads of incompressible data with compression enabled are not subject to decompression. Writes are fast enough that incompressible data is unlikely to see a performance penalty from the use of LZ4 compression. The reduction in IO from LZ4 will typically be a performance win.
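Setting it once at the top of the pool is enough for inheritance; the pool name is a placeholder:

```sh
# Enable LZ4 on the pool root; child datasets inherit it automatically.
zfs set compression=lz4 tank

# See what is actually being achieved.
zfs get -r compression,compressratio tank
```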
Note that larger record sizes will increase compression ratios on compressible data by allowing compression algorithms to process more data at a time.
Virtual machine images on ZFS should be stored using either zvols or raw files to avoid unnecessary overhead. The recordsize/volblocksize and guest filesystem should be configured to match to avoid overhead from partial record modification. This would typically be 4K. If raw files are used, a separate dataset should be used to make it easy to configure recordsize independently of other things stored on ZFS.
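A sketch of both options, assuming a guest filesystem with 4K blocks; names and sizes are placeholders:

```sh
# Option 1: a zvol whose volblocksize matches the guest filesystem block size.
# (volblocksize can only be set at creation time.)
zfs create -V 32G -o volblocksize=4K tank/vm-disk0

# Option 2: raw image files in a dedicated dataset with a matching recordsize.
zfs create -o recordsize=4K tank/vm-images
```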
So what everyone says about the ZFS recordsize is completely true. ZFS files only ever have one (logical) block size, which starts out as small as it can be and then expands out as the file gets more data (or, more technically, as the maximum offset of data in the file increases). If you push it, ZFS will rewrite existing data you're not touching in order to expand the (logical) block size out to the dataset recordsize.
All files are stored either as a single block of varying sizes (up to the recordsize) or using multiple recordsize blocks.
If you're running a MySQL database on an SSD using ZFS, you should change your dataset's recordsize to 16K. If, however, you are running a media server on traditional HDDs, then you are probably fine with the default 128K, or may wish to increase it to 1M, but don't expect any massive performance boost.
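The property is set per dataset and only affects newly written blocks; dataset names are placeholders:

```sh
# InnoDB uses 16 KiB pages, so match the recordsize on the MySQL data dataset.
zfs set recordsize=16K tank/mysql

# Large sequential media files can use the largest default-allowed recordsize.
zfs set recordsize=1M tank/media
```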