OpenZFS zvol performance issues

raw zfs send / recv issue

When I first detected this issue, the system had been up and stable for 54 days running zfs 2.0.3-pve2 on kernel 5.4.101-1-pve. Since boot, iostat showed ~75 TiB of reads from the disks and 6.4 TiB of writes. 16 of the 20 disks in the system are incorporated into zpools.
I wanted to test changing my zpool backup approach away from individual encryption keys for storeN and storeN-backup. That scheme worked well in practice for local zpools, but it had the side-effect that syncoid jobs required both pools to have their keys loaded, AND it made incremental raw send and receive impossible because the crypto keys were different. The long-term goal is to move the storeN-backup pools off-site, so raw incremental operations were a MUST HAVE. Why?
So the blocks being transmitted stay encrypted with their existing key during send+recv operations.
AND so I can perform incremental/delta block replication between the src and dst zpools over WAN in the future.
So I did a bit of research and went about performing a zfs send --replicate --raw equivalent with syncoid to seed storeN-backup as identically as possible to storeN, and then tested the raw incremental operations. In initial tests everything looked good...
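For reference, a minimal sketch of the native operation that this syncoid invocation approximates (assuming a manually created snapshot named @seed; syncoid layers its own snapshot handling, mbuffer and resume logic on top):
# Replicated (-R), raw (-w) send of the snapshot into the backup pool;
# -u keeps the received datasets unmounted, -s saves resumable receive state.
zfs send -R -w store6/data@seed | zfs recv -s -u store6-backup/data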
I did some testing first with a spare disk, then moved on to perform this for the smallest pool, store6, and finally store1. For store1 I ran into some issues, as follows:
1. Started a syncoid job to recursively seed the store6-backup zpool:
time syncoid --compress=none --sendoptions=Rw --no-sync-snap store6/data store6-backup/data
No issues detected.
2. Started a syncoid job to recursively seed the store1-backup zpool:
time syncoid --compress=none --sendoptions=Rw --no-sync-snap store1/data store1-backup/data
3. 🔥🔥🔥 Significant performance drop-off without recovery after ~3h of transfer, with the kernel logging:
task txg_sync:3748 blocked for more than 120 seconds
4. CTRL+C'd the syncoid process.
5. Double-checked that the dataset sync=disabled settings were being inherited in the send process, increased the mbuffer size, and resumed the step-2 job; the resume had to target the specific dataset:
time syncoid --compress=none --mbuffer-size=32M --sendoptions=w --no-sync-snap store1/data/vm/raw store1-backup/data/vm/raw
6. Performance did not recover to expected levels, so ran apt --full-upgrade, which included upgrades to the kernel and zfs:
kernel 5.4.101-1-pve to 5.4.157-1-pve
zfsutils-linux/stable 2.0.3-pve2 to 2.0.6-pve1~bpo10+1 amd64
7. After the reboot, resumed the job per step 5:
time syncoid --compress=none --mbuffer-size=32M --sendoptions=w --no-sync-snap store1/data/vm/raw store1-backup/data/vm/raw
8. Let the job finish and then followed up with:
time syncoid --sendoptions=w --recursive store1/data store1-backup/data
✅ to check that the recursive send+recv is good.
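One way to sanity-check the seeded copy beyond the recursive re-send above (a sketch of what verification could look like, not the exact commands I ran): compare snapshot lists on both sides and confirm the encryption roots carried over:
# Snapshot lists should be identical apart from the pool prefix.
diff <(zfs list -H -o name -t snapshot -r store1/data | sed 's|^store1||') \
     <(zfs list -H -o name -t snapshot -r store1-backup/data | sed 's|^store1-backup||')
# A raw receive should leave the backup datasets encrypted with the source key.
zfs get -r encryptionroot,keystatus store1-backup/data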
Note: I checked which HBAs the disks are connected to:
# check the disks LUN location:
root@viper:~# udevadm info --query=all --name=/dev/sdh | grep S:.*by-path # store6-bk
S: disk/by-path/pci-0000:02:00.0-sas-phy7-lun-0
root@viper:~# udevadm info --query=all --name=/dev/sdb | grep S:.*by-path # store1-bk
S: disk/by-path/pci-0000:02:00.0-sas-phy1-lun-0
root@viper:~# udevadm info --query=all --name=/dev/sdn | grep S:.*by-path # syncoid-test
S: disk/by-path/pci-0000:03:00.0-sas-phy6-lun-0
root@viper:~# udevadm info --query=all --name=/dev/sdi | grep S:.*by-path # store1
S: disk/by-path/pci-0000:02:00.0-sas-phy18-lun-0
root@viper:~# udevadm info --query=all --name=/dev/sdr | grep S:.*by-path # store6
S: disk/by-path/pci-0000:03:00.0-sas-phy18-lun-0
the store6 → syncoid-test transfer was same-HBA ✅
the store6 → store6-backup transfer was inter-HBA ✅
the store1 → store1-backup transfer was same-HBA 🔥
So there is no strong evidence of an HBA issue at the moment.
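A quicker way to get the same disk-to-HBA mapping for every disk at once (a sketch; it simply lists the by-path symlinks instead of querying udev per disk):
# Each SAS by-path entry encodes the HBA's PCI address and points at its sdX node.
ls -l /dev/disk/by-path/ | grep -- '-sas-' | grep -v part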
image.png
RAM+SWAP for the same time range:
image.png
store1-backup (sdb) notable disk stats:
note how the average write I/O time jumped during the issue, and again during the later part of the operation after the kernel upgrade.
the timestamps of the txg_sync log entries appear to match the degradation on the graph here.
image.png
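The same write-latency signal can also be watched live with extended iostat while a job runs (a sketch; w_await is the average write wait in milliseconds, and column names vary slightly between sysstat versions):
# Extended per-device stats every 5 seconds for the send and receive disks of the store1 job.
iostat -x 5 sdi sdb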
sample (1 of 5) of the kernel log entries for txg_sync:
Dec 10 18:24:46 viper kernel: INFO: task txg_sync:3748 blocked for more than 120 seconds.
Dec 10 18:24:46 viper kernel: Tainted: P O 5.4.101-1-pve #1
Dec 10 18:24:46 viper kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 10 18:24:46 viper kernel: txg_sync D 0 3748 2 0x80004000
Dec 10 18:24:46 viper kernel: Call Trace:
Dec 10 18:24:46 viper kernel: __schedule+0x2e6/0x6f0
Dec 10 18:24:46 viper kernel: schedule+0x33/0xa0
Dec 10 18:24:46 viper kernel: schedule_timeout+0x152/0x330
Dec 10 18:24:46 viper kernel: ? __next_timer_interrupt+0xd0/0xd0
Dec 10 18:24:46 viper kernel: io_schedule_timeout+0x1e/0x50
Dec 10 18:24:46 viper kernel: __cv_timedwait_common+0x138/0x170 [spl]
Dec 10 18:24:46 viper kernel: ? wait_woken+0x80/0x80
Dec 10 18:24:46 viper kernel: __cv_timedwait_io+0x19/0x20 [spl]
Dec 10 18:24:46 viper kernel: zio_wait+0x139/0x280 [zfs]
Dec 10 18:24:46 viper kernel: ? _cond_resched+0x19/0x30
Dec 10 18:24:46 viper kernel: dsl_pool_sync+0xdc/0x510 [zfs]
Dec 10 18:24:46 viper kernel: spa_sync+0x5a4/0xfe0 [zfs]
Dec 10 18:24:46 viper kernel: ? mutex_lock+0x12/0x30
Dec 10 18:24:46 viper kernel: ? spa_txg_history_init_io+0x104/0x110 [zfs]
Dec 10 18:24:46 viper kernel: txg_sync_thread+0x2e1/0x4a0 [zfs]
Dec 10 18:24:46 viper kernel: ? txg_thread_exit.isra.13+0x60/0x60 [zfs]
Dec 10 18:24:46 viper kernel: thread_generic_wrapper+0x74/0x90 [spl]
Dec 10 18:24:46 viper kernel: kthread+0x120/0x140
Dec 10 18:24:46 viper kernel: ? __thread_exit+0x20/0x20 [spl]
Dec 10 18:24:46 viper kernel: ? kthread_park+0x90/0x90
Dec 10 18:24:46 viper kernel: ret_from_fork+0x35/0x40
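For the next occurrence it may be worth watching the receiving pool's sync behaviour directly while the job runs, rather than only per-disk iostat (a sketch; the txgs kstat path is per pool):
# Per-vdev average latencies, including sync-queue wait, refreshed every 5 seconds.
zpool iostat -lv store1-backup 5
# Recent transaction group history: dirty bytes and open/quiesce/sync times per txg.
tail -n 20 /proc/spl/kstat/zfs/store1-backup/txgs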
iostat before the reboot:
kyle@viper:~$ uptime
22:44:53 up 54 days, 7:04, 1 user, load average: 0.61, 3.10, 4.54
kyle@viper:~$ iostat
Linux 5.4.101-1-pve (viper) 12/10/2021 _x86_64_ (24 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
0.54 0.00 0.91 0.25 0.00 98.31

Device tps kB_read/s kB_wrtn/s kB_read kB_wrtn
loop0 0.00 0.01 0.00 63649 0
sdg 19.26 21.82 228.97 102369166 1074133692
sdj 19.05 21.61 228.97 101351830 1074133692
sda 8.07 2129.45 15.54 9989454888 72889608
sde 0.66 602.21 0.52 2825035260 2450892
sdi (store1) 10.11 2323.14 15.54 10898051016 72905464
sdf 0.26 213.37 0.16 1000933524 770944
sdc 0.66 605.09 1.20 2838522632 5619112
sdd 0.66 593.64 0.48 2784796908 2255996
sdb (store1-bk) 1.18 585.46 271.15 2746468296 1272002744
sdh (store6-bk) 0.61 188.96 217.38 886421900 1019732520
sdr (store6) 4.58 1038.45 93.69 4871453828 439530028
sdl 5.96 1460.52 66.38 6851450978 311392323
sdq 9.31 2140.22 15.49 10039950012 72672928
sdo 5.96 1460.52 66.38 6851451078 311392340
sdm 5.96 1460.52 66.38 6851451298 311392319
sdp 3.02 605.20 15.17 2839068356 71175112
sdn (syncoid-test) 0.31 0.02 172.87 86290 810945685
sdk 8.63 2106.64 15.70 9882432536 73669984
sds 0.00 0.00 0.00 3223 0

Upgraded kernel 5.4.101-1-pve to 5.4.157-1-pve.
Upgraded zfsutils-linux/stable 2.0.3-pve2 to 2.0.6-pve1~bpo10+1 amd64
Rebooted

During the final part of the send+recv there was one more example of the txg_sync message in the kernel log:
Dec 11 03:28:03 viper kernel: INFO: task txg_sync:11367 blocked for more than 120 seconds.
Dec 11 03:28:03 viper kernel: Tainted: P O 5.4.157-1-pve #1
Dec 11 03:28:03 viper kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 11 03:28:03 viper kernel: txg_sync D 0 11367 2 0x80004000
Dec 11 03:28:03 viper kernel: Call Trace:
Dec 11 03:28:03 viper kernel: __schedule+0x2e6/0x6f0
Dec 11 03:28:03 viper kernel: schedule+0x33/0xa0
Dec 11 03:28:03 viper kernel: schedule_timeout+0x152/0x330
Dec 11 03:28:03 viper kernel: ? __next_timer_interrupt+0xd0/0xd0
Dec 11 03:28:03 viper kernel: io_schedule_timeout+0x1e/0x50
Dec 11 03:28:03 viper kernel: __cv_timedwait_common+0x138/0x170 [spl]
Dec 11 03:28:03 viper kernel: ? wait_woken+0x80/0x80
Dec 11 03:28:03 viper kernel: __cv_timedwait_io+0x19/0x20 [spl]
Dec 11 03:28:03 viper kernel: zio_wait+0x139/0x280 [zfs]
Dec 11 03:28:03 viper kernel: ? _cond_resched+0x19/0x30
Dec 11 03:28:03 viper kernel: dsl_pool_sync+0xdc/0x510 [zfs]
Dec 11 03:28:03 viper kernel: spa_sync+0x584/0xff0 [zfs]
Dec 11 03:28:03 viper kernel: ? mutex_lock+0x12/0x30
Dec 11 03:28:03 viper kernel: ? spa_txg_history_init_io+0x104/0x110 [zfs]
Dec 11 03:28:03 viper kernel: txg_sync_thread+0x2e1/0x4a0 [zfs]
Dec 11 03:28:03 viper kernel: ? txg_thread_exit.isra.13+0x60/0x60 [zfs]
Dec 11 03:28:03 viper kernel: thread_generic_wrapper+0x74/0x90 [spl]
Dec 11 03:28:03 viper kernel: kthread+0x120/0x140
Dec 11 03:28:03 viper kernel: ? __thread_exit+0x20/0x20 [spl]
Dec 11 03:28:03 viper kernel: ? kthread_park+0x90/0x90
Dec 11 03:28:03 viper kernel: ret_from_fork+0x35/0x40

summary and open points

related github issue: . I made a post on the issue.
The original issue documents a case with rsync, but other thread contributors report the same problem with send / recv operations.
The kernel/zfs upgrade + reboot seemed to help somewhat, but it's unclear which of them made the difference; will continue to monitor.
Note that the txg_sync log entries happened before and after the kernel upgrade/reboot.
From the GitHub issue, kirscheGIT commented on 12 May:
After upgrade to zfs-2.0.4-pve1 the problem mentioned here disappeared.
So that is promising as I upgraded from 2.0.3 to 2.0.6.
However... matti commented on 17 Oct:
same here with proxmox 7.0, kernel 5.11. no smart errors, tried multiple disks.
I guess only time will tell at this stage.
Q: could it be that the performance of the operation is related to the layout and content of the blocks on the sender? store6 contains a lot of media backups with compressed binary data, while store1 is more of a mixed bag of backups with varying file sizes and counts. I assume that because the transfer is --raw, the receiving side cannot perform any optimisation on the incoming blocks.
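One way to quantify that difference between the two senders would be to compare their dataset properties (a sketch):
# Compression ratio, recordsize and dataset sizes for the two source hierarchies.
zfs get -r compression,compressratio,recordsize,usedbydataset store1/data store6/data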
Q: TEST ME: can one create a dataset with the same key and perform a mixture of --raw and non --raw transfers? It sounds like this should be possible but it will break the raw incremental ability.
Note that if you do not use this flag (--raw) for sending encrypted datasets, data will be sent unencrypted and may be re-encrypted with a different encryption key on the receiving system, which will disable the ability to do a raw send to that system for incrementals.
A: 2021-12-13 citing from man zfs-receive:
The added security provided by raw sends adds some restrictions to the send and receive process. ZFS will not allow a mix of raw receives and non-raw receives. Specifically, ⚠ any raw incremental receives that are attempted after a non-raw receive will fail ⚠. Non-raw receives do not have this restriction and, therefore, are always possible. Because of this, it is best practice to always use either raw sends for their security benefits or non-raw sends for their flexibility when working with encrypted datasets, but not a combination.
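In other words, a sequence like the following is what the man page warns about (a hypothetical sketch, dataset names made up):
# 1. raw full receive into the backup pool: OK
zfs send -w tank/enc@a | zfs recv bak/enc
# 2. a later NON-raw incremental receive onto the same target also works
#    (the data is re-encrypted with the receiving dataset's key)
zfs send -i @a tank/enc@b | zfs recv bak/enc
# 3. any raw incremental attempted after that non-raw receive will fail
zfs send -w -i @b tank/enc@c | zfs recv bak/enc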
Overall, HBA or disk performance does NOT appear to be the issue; the following screenshot shows the system under heavy load: 1.65 GiB/s of read throughput during an active scrub of 13 zpools, each a single-vdev 2.5” HDD.
TODO: re-patch the zfs scrub script to scrub at most n zpools concurrently; the patch was overwritten during the zfsutils upgrade. Why didn't dpkg prompt to keep this modified file? Because it is outside of /etc? Make the file immutable? Would that fail package updates?
image.png
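On the dpkg question above: dpkg only prompts for registered conffiles (typically under /etc), so a locally patched script elsewhere gets replaced silently on upgrade. A dpkg-divert would be one way to keep the patch (a sketch; the paths are examples, substitute the actual scrub script location and the patched copy), whereas chattr +i would likely just make the next upgrade error out when dpkg tries to replace the file:
# Divert future packaged versions of the script to a .distrib name and keep the local copy.
dpkg-divert --add --rename --divert /usr/lib/zfs-linux/scrub.distrib /usr/lib/zfs-linux/scrub
cp /root/scrub.patched /usr/lib/zfs-linux/scrub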

2021-12-13 update

Happened again after ~3.5h during store2 → store2-backup --raw --replicate send / receive.
Dec 13 11:19:07 viper kernel: INFO: task txg_sync:11792 blocked for more than 120 seconds.
Dec 13 11:19:07 viper kernel: Tainted: P O 5.4.157-1-pve #1
Dec 13 11:19:07 viper kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 13 11:19:07 viper kernel: txg_sync D 0 11792 2 0x80004000
Dec 13 11:19:07 viper kernel: Call Trace:
Dec 13 11:19:07 viper kernel: __schedule+0x2e6/0x6f0
Dec 13 11:19:08 viper kernel: schedule+0x33/0xa0
Dec 13 11:19:08 viper kernel: schedule_timeout+0x152/0x330
Dec 13 11:19:08 viper kernel: ? __next_timer_interrupt+0xd0/0xd0
Dec 13 11:19:08 viper kernel: io_schedule_timeout+0x1e/0x50
Dec 13 11:19:08 viper kernel: __cv_timedwait_common+0x138/0x170 [spl]
Dec 13 11:19:08 viper kernel: ? wait_woken+0x80/0x80
Dec 13 11:19:08 viper kernel: __cv_timedwait_io+0x19/0x20 [spl]
Dec 13 11:19:08 viper kernel: zio_wait+0x139/0x280 [zfs]
Dec 13 11:19:08 viper kernel: ? _cond_resched+0x19/0x30
Dec 13 11:19:08 viper kernel: dsl_pool_sync+0xdc/0x510 [zfs]
Dec 13 11:19:08 viper kernel: spa_sync+0x584/0xff0 [zfs]
Dec 13 11:19:08 viper kernel: ? mutex_lock+0x12/0x30
Dec 13 11:19:08 viper kernel: ? spa_txg_history_init_io+0x104/0x110 [zfs]
Dec 13 11:19:08 viper kernel: txg_sync_thread+0x2e1/0x4a0 [zfs]
Dec 13 11:19:08 viper kernel: ? txg_thread_exit.isra.13+0x60/0x60 [zfs]
Dec 13 11:19:08 viper kernel: thread_generic_wrapper+0x74/0x90 [spl]
Dec 13 11:19:08 viper kernel: kthread+0x120/0x140
Dec 13 11:19:08 viper kernel: ? __thread_exit+0x20/0x20 [spl]
Dec 13 11:19:08 viper kernel: ? kthread_park+0x90/0x90
Dec 13 11:19:08 viper kernel: ret_from_fork+0x35/0x40
image.png
image.png
image.png
image.png
image.png

2022-11-26 update - rsync issue

hypervisor graphs
image.png
image.png
image.png
image.png
image.png
Nothing remarkable in the ZFS Cache graphs
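If this recurs, the ARC can also be sampled directly while the workload runs rather than relying only on the graphs (a sketch; arcstat ships with zfsutils):
# One-line ARC summary (reads, hit/miss rates, ARC size and target) every 5 seconds.
arcstat 5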
Nov 26 08:48:18 viper kernel: INFO: task txg_sync:447490 blocked for more than 120 seconds.
Nov 26 08:48:18 viper kernel: Tainted: P O 5.13.19-4-pve #1
Nov 26 08:48:18 viper kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 26 08:48:18 viper kernel: task:txg_sync state:D stack: 0 pid:447490 ppid: 2 flags:0x00004000