home lab & data vault

Background on my goals and data storage concepts

First version 2021-05-04. Last update 2022-12-31. Published: 2022-12-31.
Purpose
Private family data vault and home lab. Reliable storage for home videos, music, photos and miscellaneous backups dating back to 2000. An approach to keeping important data safe, and an inheritance asset for my family and descendants.
Philosophy
My data is important, and the same applies to my family members' data - some of the most treasured blocks are those storing loving memories of those living in multiple planes of space-time ❤️✝️. My aim and hope is that their data, that of my life and works, and our family's media will live on with my children and their descendants for a long time to come.
Why?
“If only I’d made a backup 😩🤬” or “My backups didn’t work 😲”
Is there a technology user who has never regretted a delete or a rogue action... without a suitable backup to fall back on? Is there a Linux user without a regretful story about an rm or dd or similar command? A command copy+paste gone wrong? A bad shell-history execution?
Are the backups tested and checksum verified? This is often overlooked until it's too late.
RAID and parity are not backups. Mistakes happen, disks die, blocks become unrecoverable, bits get flipped, corruption happens over network links, encryption keys can become unusable, RAM goes bad, memory controllers/CPUs can go bad, storage controllers and cables can fault, software and firmware have bugs, fire and theft happen... I have 1st or 2nd hand experience of all of these scenarios.
In a nutshell: failures will happen, and they must be accounted for in safe and reliable data storage practices.
I recall this quote from a reddit user:
It's the way our reality works: shit happens, shit fails, so have a plan to recover your shit.
“Layer 8” human errors consistently prove to be destructive. Even the best intentions can go very wrong. If verified backups do not exist... one mistake can be catastrophic, with incalculable material and financial costs.
Approach
Based on these points and philosophy, and my 25+ years in the tech industry, for the most valuable and treasured data:
I work with the principle that data is not safe until backups exist, and backups do not exist unless they have been restored and their integrity has been verified.
A verified backup of important data should exist, and ideally located in safe cold storage and/or off-site storage. i.e. a copy of the backup should exist in a location geographically different to your primary data copy/backup.
I find the 3-2-1 rule near-perfect, and it can significantly reduce the risk of data loss towards 0%. To cite one common definition:
The 3-2-1 rule can aid in the backup process. It states that there should be at least 3 copies of the data, stored on 2 different types of storage media, and one copy should be kept off-site, in a remote location.
BUT... 3-2-1 stipulates 3 copies of all blocks, stored in geographically diverse locations and on diverse backup media, which is cost-prohibitive for SME/SOHO and private users.
3-2-1 is fine for large “money no object” budgets. Not so affordable and practical for Joe Bloggs and Erika Mustermann...
As of 2022-12-20 our family vault contains ~14 TB of data:
1 primary data copy (zfs datasets).
1 backup copy (replicated zfs datasets) (pending co-location).
1 detached snapraid triple parity of the primary data - this is a near-time asynchronous parity strategy.
1 detached set of checksums (sfv).
So to summarise:
Things go wrong, mistakes are made, have a plan to recover from those situations.
Primary data should be checksummed and scrubbed regularly to detect issues early, ideally via automation + alerts.
Keep at least 1 backup copy of data, keep 2 backup copies on different media if cost is no object.
For DR, at least 1 backup copy of data should be off-site, geographically separate from the primary copy.
I recommend setting up an occasional process to verify that older backups are still intact and functional, ideally via automation + alerts.
Consider whether maintaining near-time OR real-time parity of the primary data could be of benefit.
I recommend having a stand-alone/detached set of checksums for data that can be verified with utilities like cfv or rhash.
Single points of failure should be engineered out of infrastructure, systems and processes.
Data integrity and resilience should be engineered in to systems and processes.
Hardware and co
(RSC-2AT) chassis: up to 24 2.5" hot-swap SAS/SATA bays and dual hot-swap PSUs.
ASUS-Z10PR-D16 mainboard with iKVM, BIOS 3703.
2x Intel Xeon CPUs (6 core E5-2620 v3 @ 2.40GHz supporting up to DDR4 1866)
128 GB DDR4 ECC RAM @1866MT/s.
APC UPS with ~5 min graceful auto-shutdown window.
2x 10 GbE (OCP Mezzanine) with 1m DAC to 10 GbE switch.
NB: This chassis has a SAS/SATA backplane - it is a pre-U.2/NVMe chassis (DOM ~2017).
Storage:
2x Broadcom (LSI) SAS/SATA 9305-16i HBAs in pass-through IT mode. The I/O controller is the SAS3224. Up to 12 drives per HBA.
Connectors: 6x Mini-SAS HD SFF-8643 (3 per HBA) into 6 ports on the SAS 12Gb/s non-expander backplane, providing connectivity for up to 24 SATA/SAS drives.
Today up to 12 drives are used for active storage, and up to 12 drives for backup/cold storage.
~500 GB mirrored SATA SSD zpool for the proxmox “root filesystem” aka rpool. Stores container/VM OS-related volumes.
Intel SSD Optane 900p 280GB PCIe 3.0 x4, NVMe (SSDPED1D280GA) for ZFS slog/cache and projects that require high performance IO.


Notes to self: the drive backplane and the HBA cards do not support U.2/NVMe drives. The next-gen 9500-8i and 9500-16i support up to 32 NVMe drives. AIC has a chassis that supports NVMe/U.2 drives, and TYAN has some interesting stuff too. Looks like Gigabyte and Lenovo might be worth checking too.
The base setup
The system runs the Proxmox VE (pve) hypervisor, installed with OpenZFS ZoL as the boot+root file system. Container and kvm OS volumes are provisioned from ZFS-backed SSD storage. Storage volumes are provisioned from 2.5” 5600 rpm SMR Seagate Barracuda 5TB disks. An OpenMediaVault (omv) kvm is used to provision storage services for the home and office networks. Data storage is ~25 TiB useable; currently ~50% (~13 TiB) of that space is used.
Storage strategy from 2017 up until 2021-Q2
The data and parity physical disks were provisioned to the kvm via virtio_blk as full disks, with LUKS+XFS inside the kvm. OpenZFS was used for the rpool but was not involved in the data storage volumes. The omv kvm utilised snapraid with triple parity - up to 3 disks could fail prior to data loss - and mergerfs for the union file system. In my first-hand experience since 2017, snapraid is a tried and tested data integrity and recovery tool with great extensibility. Snapraid can be considered an asynchronous or near-time parity system vs. ZFS, which is typically a synchronous or real-time parity system.
Pros and Cons of single disks + snapraid
Pros
🟢 There is no array. Just a collection of simple storage disks which can be mounted individually and independently on any system supporting LUKS + xfs.
🟢 The use of mergerfs to span multiple disks/volumes into a single file system hierarchy.
🟢 Thanks to snapraid triple parity - up to 3 disks can fail concurrently without data loss.
🟢 snapraid provides disk recovery, undelete and integrity scrubbing.
🟢 Simply add a disk/volume to extend capacity of mergerfs.
🟢 No complex dependency on hardware+firmware or software raid-like systems.
🟢 Maximal storage space.
Cons
🔴 No block level replication or sync possible to 2nd copy of data. rsync or similar can be used on the file system level.
🔴 No snapshots.
🔴 Maximum read/write speeds limited to a single disk - already saturated 1GbE.
🔴🔥 No 2nd or 3rd copy of data, only parity.
Storage strategy as of 2021-05-04
To cover a catastrophic data loss e.g. fire and theft DR scenarios, I want to establish an off-site co-located zfs send target for my storage disks. This would mean converting my simple storage disks to single disk vdev zpools.
Storing 13+ TiB securely on cloud storage is cost prohibitive.
So why single vdev pools?
In my time I've experienced first-hand, and witnessed, multiple scenarios where arrays of disks have gone wrong and the full dataset was lost until it could be restored or recovered. I don't want to buy/have that problem for my data. The cost+risk vs. benefit evaluation tells me to just keep it as simple as possible.
I want to be in a position where I do not care if a single vdev zpool gets damaged, corrupted or zapped by a cosmic ray or something equivalent… e.g. if I fat finger a command on a zpool or child dataset… no big deal. Let's just make a new zpool and let snapraid rebuild the missing data from parity. With snapraid triple parity I would be able to sustain 3 single vdev zpool losses in a single location without data loss on that site. I could rebuild the pools from snapraid parity, or restore from cold storage and/or off-site backup.
Implementation
Disk by disk: convert each simple LUKS+XFS disk to single vdev zpools to get most of the zfs benefits, without the risk of having all vdevs in a single pool and/or disks being interdependent on each other. Also avoid potential risk and complexity of raidz etc.
Keep the zfs code base and/or feature dependency as limited and simple as possible.
Create a dependency on zfs for data block integrity, backups and snapshots, but not local vdev block resilience.
For global data replication and redundancy use zfs snapshot, send and replication capabilities.
For local data integrity and redundancy utilise the existing snapraid triple parity and zfs checksum integrity.
Storage capacity
Snapraid is configured for triple parity (3 disks); all the remaining disks can be used for storage. So the new strategy's max capacity with 5TB 2.5" disks is ~95TB, which should be fine for the next 5-10 years based on current growth and usage.
Compression, block contiguity and fragmentation
Storage capacity will be increased for compressible blocks via zfs lz4 compression. I can largely stop caring about compressing files individually unless I'm looking for some ultra compression level. Noting there is still a benefit in creating non-compressed archives, e.g. tar, for block contiguity optimisation: grouping lots of small files/dirs into tar archives optimises extent allocation. This should help to optimise block usage (more free space) and reduce file fragmentation.
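The tar-for-contiguity idea can be sketched as follows; file names and contents here are placeholders, not my real data layout:

```shell
# Illustrative only: bundle many small files into one uncompressed tar
# archive so the dataset allocates fewer, larger extents; lz4 on the
# zfs dataset still takes care of compression.
src=$(mktemp -d)
for i in 1 2 3; do echo "sample $i" > "$src/file$i.txt"; done

archive="$src/bundle.tar"
# -C changes directory so members are stored with relative paths
tar -cf "$archive" -C "$src" file1.txt file2.txt file3.txt

tar -tf "$archive"   # list archive members to confirm the round-trip
```

Extracting later with tar -xf recreates the small files, so nothing is lost by bundling.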
Privacy and data protection
I can switch from LUKS to OpenZFS native aes-256-gcm encryption, and one would assume online migration to future algorithms will be possible. My testing suggests there is a near-zero performance penalty for enabling encryption on CPUs that support hardware-accelerated encryption.
So the offline blocks will remain very well protected, as they were with LUKS.
As of writing, my residence country currently has a solid constitution and laws that mean compelled key disclosure is not a major concern.
I.e. in an offline context, my data only exists for someone who has been granted or has forced access to the encryption keys (or used some other exploit to obtain or bypass the keys which I calculate to be a highly improbable scenario for the value of my data to a 3rd party).
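A minimal sketch of creating a natively encrypted dataset; the pool/dataset names and key file location are assumptions, not my actual layout:

```shell
# Sketch (not run against a live pool here): create an OpenZFS dataset
# with native aes-256-gcm encryption and a raw key file.
create_encrypted_dataset() {
  local dataset="$1"   # e.g. tank/vault (illustrative name)
  zfs create \
    -o encryption=aes-256-gcm \
    -o keyformat=raw \
    -o keylocation=file:///root/keys/vault.key \
    "$dataset"
}
# usage: create_encrypted_dataset tank/vault
```

With keyformat=raw the key file must hold exactly 32 bytes; keyformat=passphrase is the interactive alternative.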
On-site snapshots
With OpenZFS implemented, it becomes possible to use a sanoid-like strategy to keep regular rolling snapshots of zfs datasets, providing hourly/daily/weekly rollback and recovery options.
Off-site backups & ZFS replication
Once a disk is changed to the zpool approach, thanks to the nature of CoW file systems and differential snapshots I can zfs send encrypted differential snapshots to an off-site target and/or even local backup media i.e. zfs replication.
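The differential send step can be sketched as below; dataset names, snapshot labels and the ssh target are illustrative assumptions:

```shell
# Sketch of differential replication: snapshot, then send only the
# delta between the previous and current snapshot to a backup host.
replicate_incremental() {
  local dataset="$1" prev="$2" curr="$3" target="$4"
  zfs snapshot "${dataset}@${curr}"
  # -w sends raw (still-encrypted) blocks; -i limits the stream to the delta
  zfs send -w -i "@${prev}" "${dataset}@${curr}" \
    | ssh "$target" zfs receive -u "backup/${dataset}"
}
# usage: replicate_incremental tank/vault 2022-12-30 2022-12-31 backup-host
```

Because only modified blocks travel, even TiB-scale datasets replicate quickly after the first full send.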
Data redundancy
In the local context, due to maintaining snapraid triple parity on plain disks, up to 3 zpools can concurrently fail without data loss.
If I experience 4 concurrent disk failures (4 zpools) then those 4 zpools are lost, but the remaining zpools are OK. If disk failures exceed 3, then data recovery from cold storage/off-site backups is required.
I.e. if the data loss exceeds local resilience levels, rely on cold storage/off-site backups to recover the data.
E.g. if a disk fails: create a new zpool with a new disk and let snapraid rebuild the missing blocks from parity on the file system level. This supports up to 3 concurrent zpool/disk failures.
Data integrity & Alerting
Any ZFS errors will be alerted by the ZED daemon, which listens for zpool events.
Real-time write operations are contained to a single vdev/volume running a journaled filesystem. There are no real-time mirror vdev or parity writes, which means I should be immune to the classic RAID write hole.
The base dataset is zfs which implements CoW and block level hierarchical checksumming.
snapraid is an on-demand, near-time parity system. When sync'ing the array my approach is:
1. Verify no zpool errors.
2. snapraid sync --pre-hash
- with --pre-hash snapraid calculates file checksums twice, before and during the sync, as an additional integrity fail-safe, and also verifies files being moved around within the array.
- zfs indirectly verifies checksums for all read+written blocks.
3. Verify no zpool errors.
4. snapraid scrub -p new
- for all newly synced blocks: verify data checksums and check parity for errors.
- zfs indirectly verifies checksums for all read blocks.
5. Verify no zpool errors.
If no zpool or snapraid errors exist after this sequence, then the data and parity should be intact.
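The sync sequence above can be sketched as a guarded shell function. Using zpool status -x for the "verify no zpool errors" steps is my assumption of how one might script it:

```shell
# Sketch of the sync cycle: abort at the first sign of pool trouble or
# a snapraid failure, otherwise report success.
sync_cycle() {
  zpool status -x | grep -q 'all pools are healthy' || return 1
  snapraid sync --pre-hash || return 1
  zpool status -x | grep -q 'all pools are healthy' || return 1
  snapraid scrub -p new || return 1
  zpool status -x | grep -q 'all pools are healthy' || return 1
  echo "sync cycle OK"
}
# usage: sync_cycle && echo "data and parity should be intact"
```

In practice one would wire the non-zero return codes into the alerting path rather than just aborting.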
I benefit from multiple data scrubbing capabilities:
Regular zpool scrubs to detect if a zpool has experienced corruption - currently all zpools are scrubbed on the 2nd Sunday of every month. If corruption is detected:
Send alerts via ZED (ZFS Event Daemon).
How bad is the corruption? disk replacement time? check SMART stats and error log. Replace disk if needed.
snapraid check -a -d storeX and then snapraid fix to restore the corrupt data.
OR
restore data from related replicated zpool.
Regular zfs checksum verification on access will increment error counters on zpools; if errors appear:
follow the corruption-handling steps above (alert, assess, repair or restore) and so on.
Regular snapraid scrub checks and/or stand-alone checksum checks to detect and repair file corruption.
Today I have a nightly systemd timer that scrubs 5% of blocks older than 10 days. This maintains the oldest scrubbed block at around 20 days old for ~12.7 TiB of data.
The timer service will send alerts in the case of errors.
Ad-hoc verifications with the zfs-autobackup project's zfs-check utility to manually calculate checksums for datasets or snapshots - useful to verify datasets and snapshots after zfs send/recv and replication.
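The nightly snapraid scrub timer mentioned above could look roughly like this pair of units; the unit names, paths and schedule are assumptions, while the 5% / 10-day plan matches snapraid's -p (plan) and -o (older-than) options:

```ini
# /etc/systemd/system/snapraid-scrub.service (sketch)
[Unit]
Description=snapraid scrub of 5%% of blocks older than 10 days

[Service]
Type=oneshot
ExecStart=/usr/bin/snapraid scrub -p 5 -o 10

# /etc/systemd/system/snapraid-scrub.timer (sketch)
[Unit]
Description=nightly snapraid scrub

[Timer]
OnCalendar=*-*-* 02:30:00
Persistent=true

[Install]
WantedBy=timers.target
```

A failing ExecStart leaves the service in a failed state, which an OnFailure= hook or monitoring agent can turn into the alerts described above.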
Deduplication
Handled on the file system level manually by invoking snapraid dup functionality. After de-duplicating run a discard on the xfs volumes to reclaim space so the blocks can be freed on zfs datasets (note that the freed blocks will still exist in dataset snapshots).
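The dedup pass can be sketched as follows; the mount point is illustrative, and the actual deletion of duplicates stays a manual decision:

```shell
# Sketch: report duplicates with snapraid, then (after manually deleting
# unwanted copies) issue discards so the thin raw volumes shrink on the
# zfs side.
dedup_report_and_trim() {
  snapraid dup            # report duplicate files across the array
  # ...manually delete unwanted duplicates, then:
  fstrim -v /srv/store1   # discard freed blocks on each xfs volume
}
```

Remember the freed blocks still live on in any existing dataset snapshots until those snapshots expire.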
Efficiency and speed ups
encryption workload on the hypervisor rather than kvm
Today the LUKS encryption is handled in the omv kvm; in the future the zfs native encryption workload will be handled directly on the host hypervisor, not in the guest. One less layer of virtualization, and all of the hypervisor's cores are available to perform the IOPS workload.
efficient off-site and cold storage backups
zfs send fully encrypted differential snapshots for off-site/cold storage backups. This is a very efficient and secure way to keep TiBs of data backed up regardless of which sequence of data blocks is modified. Low bandwidth and low IOPS.
ARC and L2ARC caches and slog
I noted in benchmarking that the ARC cache can make spinning rust perform like fast NVMe SSDs for read workloads.
To cite one ZFS benchmarking write-up:
1. The most important factor is RAM.
Whenever your workload can be mainly processed within your RAM, even a slow HD pool is nearly as fast as an ultimate Optane pool.
and
2. Even a pure HD pool can be nearly as fast as a NVMe pool.
In my tests I used a pool from 4 x HGST HE8 disks with a combined raw sequential read/write performance of more than 1000 MB/s. As long as you can process your workload mainly from RAM, it is tremendously fast. The huge fallback when using sync-write can be nearly eliminated by a fast Optane Slog like the 900P. Such a combination can be nearly as fast as a pure SSD pool at a fraction of the cost with higher capacity. Even an SMB filer with a secure write behaviour (sync-write=always) is now possible as a 4 x HGST HE8 pool (Raid-0) and an Optane 900P Slog offered around 500-700 MB/s (needed for 10G networks) on OmniOS. Solaris with native ZFS was even faster.
Future tweaks
I could consider migrating the parity disks to zfs for the added benefits of single vdev zpools, but it might be overkill because the multiple snapraid content files and snapraid scrubbing should mitigate the possibility of undetectable snapraid parity corruption. Changing to zpools for snapraid parity could reduce the necessity of regular snapraid scrubs, as checksum-on-access and zfs scrubs would cover all disks.
EDIT 2022-11-23 using zfs for snapraid parity would also provide the benefit of having coordinated data+parity snapshots.
Pros and Cons of single disks zpools + snapraid + pool replication
changes from the previous table:
Pros
🟢 There is no array. Just a collection of single disk zpools which can be imported/exported individually and independently on any system supporting ZoL.
🟢 The use of mergerfs to span multiple disks/volumes into a single file system hierarchy.
🟢 Thanks to snapraid triple parity - up to 3 disks (zpools) can fail concurrently without data loss.
🟢 snapraid provides disk recovery, undelete and integrity scrubbing.
🟢 Simply add a disk/volume to extend capacity of mergerfs.
🟢 Maximal storage space.
🟢 Nearly all the benefits of ZFS, excluding block resilience.
CoW, checksums and scrubs, encryption, compression, dataset snapshots, pool checkpoints, easy and cost effective off-site replication and differential backups.
Cons
🔴 Maximum read/write speeds limited to a single disk - already saturated 1GbE.
🔴 Dependency on OpenZFS/ZoL code functioning as it's designed, which is inherently complex code. Evidenced by the plentiful OpenZFS github issues.
🔴 no real-time mirror/parity/block level resilience. zpool scrub cannot automatically fix errors. New pool required in the event of a disk failure.
2021-Q3 - Implemented OpenZFS for storage volumes
Benchmarking on this hardware has shown me that raw disk partitions stored on zfs datasets are the winner for kvm workloads, at least for my use cases. I tested dataset vs. zvol vs. raw disk partitions.
So my storage disks/volumes have been migrated from simple volume to zpool. Each storage disk is now in single drive ZoL zpool, providing most of the OpenZFS benefits excluding real-time block resilience. I also procured enough disks to keep a 2nd backup copy of all blocks.
Data redundancy and resilience is provided by a 2nd copy of the primary data pools i.e. “pool replication”.
A 1:1 copy of each zpool (thanks to zfs send/recv). I still maintain the snapraid triple parity across the primary storage volumes.
The storage volumes are qemu raw disk partitions stored on the ZFS datasets. The raw disks are provisioned to the kvm via virtio_blk.
The same repurposing migration is planned for the snapraid parity disks one day, for now they remain virtio pass-through while I consider and test future changes.
The snapraid status output from the kvm confirms the setup.
It would be possible to adapt the approach to use zpools with mirror vdevs, but this causes me some issues for my future plans with a second chassis and off-site co-located backup. Right now I have the luxury that I can easily take the backup zpools (half the disks) off-site for safe keeping, and I don't have to break vdev mirrors etc.
2022-Q1
It's March and I'm currently weighing up the pros and cons of consolidating one set of data storage pools (1 copy of my data):
striped mirror vdevs vs. raidz vs. dRAID vs. single disk optane-boosted replicated zpools.
It's tricky weighing up the performance vs. resilience vs. storage capacity vs. simplicity aspects. Lots to think about.
❓ Do I really need real-time parity for long term glacial data storage? snapraid has served me so well since 2017.
❓ Do I really need sustained reads and writes above 100 MiB/s for my use cases?
Now that the main ZFS migration work is complete, I'm in the planning and procurement stage of another chassis, so I can then export/send one set of pools to the new hardware, and likely co-locate the older chassis in a regional data center, i.e. one of the chassis will become the off-site backup to protect against force majeure, fire and theft etc.
Thereafter it should be fairly trivial to use sanoid and syncoid to keep the various datasets in sync via zfs replication, transferring just the modified blocks is highly efficient.
Note to self: Currently using 6x 5TB Seagate drives for ~30TB storage, with 3x 5TB drives for triple-parity.
In this configuration, in the current chassis, that would be ~30TB of ~95TB theoretical max based on current hardware maximums.
The kvm df reports each xfs storage volume as having 4396 GB / 4095 GiB of blocks. This is because I allocated the raw volumes 4 TiB.
Each zpool has a 4.55 TiB disk (SIZE), with 4.52 TiB FREE when allowing for spa_slop_shift, to keep zfs free space healthy at circa 12%.
Maths: 24 drive slots: 24 - 2 (root zpool mirror) - 3 parity disks = 19 data disks.
19 disks x 5TB drives = 95 TB
In reality the allocatable blocks (accounting for zfs free space and slop) would be 19 x 4 TiB = 76 TiB / ~84 TB.
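The capacity arithmetic above, spelled out with the drive counts and sizes from the current chassis:

```shell
# 24 bays, minus the rpool mirror and the snapraid parity disks
total_bays=24
root_mirror=2
parity=3
data_disks=$((total_bays - root_mirror - parity))   # 24 - 2 - 3 = 19
raw_tb=$((data_disks * 5))                          # 19 x 5 TB  = 95 TB raw
alloc_tib=$((data_disks * 4))                       # 19 x 4 TiB = 76 TiB allocatable
# 76 TiB expressed in decimal TB: 76 * 2^40 / 10^12 ~= 83.6
awk "BEGIN{printf \"%d data disks, %d TB raw, %d TiB (~%.1f TB) allocatable\n\", \
  $data_disks, $raw_tb, $alloc_tib, $alloc_tib * 2^40 / 10^12}"
```

The TiB-to-TB conversion is where the ~84 TB figure comes from.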
Note: It would be possible to migrate the rpool to two internal mounts/ports to gain +2 on the storage disks, at the sacrifice of hot-swap. This would bump the storage from 95 TB to 105 TB.
I've also seen a chassis with 26 drive slots (24 SFF + 2 SFF), with the 2 extra slots in the rear. Such a chassis would bump the storage from 105 TB to 115 TB (current hardware maximums) and also provide NVMe/U.2 drive support. AIC has a comparable system.
In reality the allocatable blocks (accounting for zfs free space and slop) would be 23 x 4 TiB = 92 TiB / ~101 TB.
If I ever needed more than 84TB capacity then I could look at a JBOD/JBOX chassis connected via SFF-8644 to SFF-8643 (I have one spare SFF-8644 per HBA card).
Note to self: Based on my current single disk zpool approach... growing storage in the future for higher density disks should roughly go as follows: add a larger mirror disk to a zpool, wait for the re-silvering, detach the smaller disk, grow the pool, grow the raw disk partition, grow the file system.
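The growth procedure in that note can be sketched as below; pool and device names are illustrative, and growing the raw partition and file system inside the kvm remain separate steps:

```shell
# Sketch: grow a single-disk zpool onto a larger disk via a temporary
# mirror, then expand the pool to the new disk's size.
grow_pool() {
  local pool="$1" old_dev="$2" new_dev="$3"
  zpool attach "$pool" "$old_dev" "$new_dev"    # temporary mirror
  # wait for the resilver to complete before detaching the old disk
  while ! zpool status "$pool" | grep -q 'resilvered'; do sleep 60; done
  zpool detach "$pool" "$old_dev"
  zpool online -e "$pool" "$new_dev"            # expand to full disk size
}
# usage: grow_pool store1 /dev/sda /dev/sdb
```

Setting autoexpand=on on the pool would make the final online -e step automatic.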
Pros and Cons of raidz + pool replication
changes from the previous table:
Pros
🟢 The use of mergerfs to span multiple disks/volumes into a single file system hierarchy.
🟢 Thanks to raidz triple parity (raidz3) - up to 3 disks per vdev can fail concurrently without data loss.
🟢 Maximal storage space.
🟢 All the benefits of ZFS.
CoW, checksums and scrubs, encryption, compression, dataset snapshots, pool checkpoints, easy and cost effective off-site replication and differential backups.
🟢 real-time parity block redundancy/resilience.
🟢 standby spares.
🟢 re-silvering.
🟢 Latest OpenZFS supports raidz expansion.
Cons
🔴 Dependency on OpenZFS/ZoL code to function as its designed which is inherently complex code. Evidenced by the plentiful OpenZFS github issues.
🔴 raidz IO does not scale well vs. striped mirrors' linear scaling. In most cases limited to a single disk - already saturates 1GbE.
🔴 stripe width will reduce storage capacity?
🔴 raidz expansion does not re-stripe data during rebalancing, which will impact storage capacity.
Pros and Cons of dRAID + pool replication
changes from the previous table:
Pros
🟢 The use of mergerfs to span multiple disks/volumes into a single file system hierarchy.
🟢 Thanks to dRAID triple parity (draid3) - up to 3 disks per vdev can fail concurrently without data loss.
🟢 Maximal storage space.
🟢 All the benefits of ZFS.
CoW, checksums and scrubs, encryption, compression, dataset snapshots, pool checkpoints, easy and cost effective off-site replication and differential backups.
🟢 real-time parity block redundancy/resilience.
🟢 standby distributed active spares.
🟢 even faster re-silvering.
Cons
🔴 Dependency on OpenZFS/ZoL code to function as its designed which is inherently complex code. Evidenced by the plentiful OpenZFS github issues.
🔴 dRAID IO does not scale well vs. striped mirrors' linear scaling. In most cases limited to a single disk - already saturates 1GbE.
🔴 dRAID stripe width may reduce storage capacity even more than raidz?
🔴 dRAID cannot be expanded.
Pros and Cons of striped mirrors + pool replication
changes from the previous table:
Pros
🟢 The use of mergerfs to span multiple disks/volumes into a single file system hierarchy.
🟢 Thanks to mirror vdevs - each mirror vdev can sustain 1 disk failure concurrently without data loss (or more, depending on how many disks are in each mirror).
🟢 All the benefits of ZFS.
CoW, checksums and scrubs, encryption, compression, dataset snapshots, pool checkpoints, easy and cost effective off-site replication and differential backups.
🟢 real-time parity block redundancy/resilience.
🟢 standby distributed active spares.
🟢 3 copies of data, mirror + off-site replicated pool.
🟢 IO performance: a single mirror gives up to a 2x read increase.
🟢 IO performance: each additional mirror vdev stripe increases read and write IO performance linearly.
🟢 storage expansion: each additional vdev mirror stripe expands capacity.
Cons
🔴 Dependency on OpenZFS/ZoL code to function as its designed which is inherently complex code. Evidenced by the plentiful OpenZFS github issues.
🔴 storage capacity is halved (at least per chassis).

Some notes on storage bandwidth/performance
For my primary use case of data archiving and backups, I don’t really need read/write speeds beyond a single disk. My data archive drives provide 130-140 MiB/s seq read and 90-110 MiB/s seq write.
If I have a workload that demands something faster I can provision volumes on SSD or NVMe.
I’ve observed that when copying a few GB data over CIFS/SMB shares to the omv kvm, the transfer bandwidth is much faster than the spinning disks (thanks 10GbE network and NVMe source drive). I’ve done some basic empirical measures of why this is, I’ve certainly measured that the kvm memory buff/cache grows the same size as the transferred data but I’m yet to correlate exactly how this interacts with ZFS ARC. The kvm page faults spike, and the kernel dirty (memory waiting to be written to disk) is increased and decreased in relation to the IO.
I have observed the ZFS ARC hash counts increase during such data operations, which suggests ARC is caching at least some of those blocks? Which would make sense because the primarycache setting on the data set is default (all).
The kvm netdata graphs for the disk show write speeds of 500 to 900 MiB/s, which is obviously wildly beyond the 5400-RPM spindles. This suggests CIFS/SMB is writing to the page cache and ZFS on the host is first writing to ARC (sync=disabled on these datasets); the host netdata graphs show normal/expected async write speeds to the vdevs. Notes for the future:
Pay more attention to arc_summary next time I’m observing this behaviour. Is there such a thing as arctop/arcstat?
Why doesn’t netdata graph ARC writes the same as it graphs ARC reads? Counter-intuitive.
Is it possible to disable the page cache for CIFS/SMB and/or enable directio? Would be an interesting test to bypass the page cache. Start smbd under nocache?
Update 2022-03-24: to boost write IO performance I'm trying to procure an Intel Optane 900P (PCIe-card HHHL variant). The 900P is EoL and tricky to find in the market (phantom stock) but they are blazing fast with their 3D XPoint lithography. I think I'm close to hunting a few down.
My aim would then be adding an optane slog device for certain zpools. My research suggests that I could then set sync=always on the zfs datasets to force write IO into the ZIL/slog; write speed should be much faster than the underlying physical disks, and eventually the blocks will flush to the slower vdevs. Confirmed 2022-07-10: I was able to procure one, and it is crazy fast. I ran some quick fio tests and got very significant gains for randwrite IO on a zpool (primarycache=none, sync=always, encryption=aes-256-gcm, compression=lz4) with a single 2.5" 5600 rpm SMR Seagate Barracuda 5TB, without slog vs. with the Intel 900P slog:
60 second 4k randwrite fio test: 1531 times more IOPS, and 1446 times more BW.
60 second 128k randwrite fio test: 288 times more IOPS, and 2254 times more BW.
60 second 1M randwrite fio test: 11 times more IOPS and BW.
For read IO boost, one has to consider the workload, cold/non-cached reads from 2.5” 5400-RPM spindles is never going to be blazing fast (mirror vdevs can provide up to a simple 2x... 2x striped mirror vdevs would provide up to 4x and so on). ARC and perhaps L2ARC can certainly be warmed up to speed up predictive/repetitive read workloads, which I’ve seen in my benchmarking.
💡 Rather than trying to make slow pools go faster - optane backed datasets for workloads that require high performance read IO should work very well.
A note on zpool replication verification and maintaining stand-alone checksums
This is something I’ve maintained for a number of reasons historically, and I thought with my adoption of ZFS + snapraid I might be able to retire the practice. Stand-alone or detached checksum lists have been especially useful to me after doing large copy operations such as duplicating/sending/coping a full disk/dataset/volume etc.
A recent use case: I relied on them heavily in my migration from simple disks to zfs datasets to verify src and dst bytes/files. It gave me confidence in my methods and results, especially when finally retiring/wiping migrated disks and volumes.
If you consider that even mature file systems occasionally ship data-integrity bugs, and that features exist which change how blocks are stored and sent... the practice and benefit of stand-alone checksums is really justified.
I use tools like cfv (hopefully the python3 branch hits mainline soon) and rhash to create, maintain and verify my checksums.
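The create-then-verify pattern is the same whichever tool is used; a minimal sketch with coreutils sha256sum, where the files are placeholders for real media:

```shell
# Illustrative stand-alone checksum workflow: a detached manifest that
# can verify the data on any machine, independently of zfs or snapraid.
data=$(mktemp -d)
printf 'precious bytes\n' > "$data/home-video.mkv"
printf 'more bytes\n'     > "$data/photo.jpg"

# create the detached manifest (stored outside the data directory)
( cd "$data" && sha256sum home-video.mkv photo.jpg ) > "$data.sha256"

# ...then later verify the data against the manifest alone
( cd "$data" && sha256sum -c "$data.sha256" )
```

Keeping the manifest away from the data it describes is what makes it useful after large copy/migration operations.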
Cloud storage costs
As of 2022-03-27 I took a snapshot of cloud storage costs across providers.
💡 Research point: It would be interesting to see if Backblaze personal edition can back up cifs/smb shares mounted on a windows 10 machine? Answer: not possible, per Backblaze:
Backblaze online backup can theoretically backup a network drive, network share, or NAS device, but for business reasons do not allow it. Backing up mounted or network drives can easily be abused. A user could mount the 10 or 20 computers in their home or small business and back them all up to one account for $7/month.
However, using our B2 cloud storage service, NAS devices can be backed up off-site.
So this means at least 575 EUR year on year for ~11.8 TiB / ~13 TB cloud storage in aws s3 glacier.
As a comparison: 3 x 5TB disks at ~120 EUR each = 360 EUR one-off cost to “just store a copy of the blocks” as a simple cold storage backup.
Here it's easy to see that a ~360 EUR CapEx is much more cost effective than a 575 EUR year-on-year OpEx.
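The CapEx vs OpEx comparison above as plain arithmetic, using the prices from the 2022-03 snapshot:

```shell
cloud_per_year=575            # EUR/year for ~13 TB in aws s3 glacier
disks_one_off=$((120 * 3))    # EUR one-off for 3 x 5TB disks = 360
echo "cloud: ${cloud_per_year} EUR/yr vs disks: ${disks_one_off} EUR one-off"
# the disks pay for themselves before the first cloud year is up:
awk "BEGIN{printf \"breakeven after %.2f years\n\", $disks_one_off / $cloud_per_year}"
```

This ignores disk failure replacement and the electricity/handling of cold storage, but the gap is wide enough that the conclusion holds.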
2022-Q3
I’ve noticed that Intel have large capacity 2.5” U.2 drives. Could be an interesting storage upgrade path over time.
e.g. some 2nd-hand drives at ~78 EUR per TB. They are fast too: sequential read up to 3200 MB/s, sequential write up to 1600 MB/s, random read 580,000 IOPS (4K blocks).
A prerequisite is a chassis that supports U.2 drives - capacity in the same 2U chassis form factor would be tripled. Current disks: 19 x 4 TiB = 76 TiB / ~84 TB vs. 19 x 14 TiB = 266 TiB / ~292 TB, which is 3.5 times more storage.
It will be interesting to watch this kind of drive price vs. capacity over time.
2022-Q4
I started the migration process for the snapraid parity disks from virtio_blk full disks to raw disk partitions stored on single disk zpools (the same approach as for the storage data disks). Unfortunately I detected some issues - so that is on hold for now.