When I first detected this issue the system had been up and stable for 54 days, running zfs 2.0.3-pve2 and kernel 5.4.101-1-pve. Since boot, iostat showed ~75 TiB of reads from the disks and 6.4 TiB of writes. 16 of the 20 disks in the system are incorporated into zpools.
I wanted to test changing my zpool backup approach, which used individual encryption keys for storeN and storeN-backup. That worked well in practice for local zpools, but it had the side-effect that syncoid jobs required both pools to have their keys loaded, AND it made incremental raw send and receive impossible because the crypto keys were different. The long-term goal is to move the storeN-backup pools off-site, so raw incremental operations were a MUST HAVE. Why?
So the blocks being transmitted stay encrypted with the source key during send+recv operations, AND so I can perform incremental/delta block replication between the src and dst zpools over WAN in the future. So I did a bit of research and set about performing the equivalent of zfs send --replicate --raw with syncoid, to seed storeN-backup as identically as possible to storeN, and then tested the raw incremental operations. In initial tests everything looked good...
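For reference, what I am asking syncoid to do is roughly the plain-ZFS pipeline below (only a sketch: the @seed snapshot name is made up, and syncoid layers snapshot management, mbuffer and resume handling on top of this):
# recursive snapshot of the source tree (snapshot name is hypothetical)
zfs snapshot -r store6/data@seed
# replicate (-R) as a raw stream (-w) so blocks stay encrypted with the source key
zfs send -R -w store6/data@seed | zfs receive -u store6-backup/data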
I did some testing first with a spare disk, and then moved on to perform this for the smallest pool, store6, and then store1. For store1 I ran into some issues, as follows:
1. Started a syncoid job to recursively seed the store6-backup zpool: time syncoid --compress=none --sendoptions=Rw --no-sync-snap store6/data store6-backup/data
2. No issues detected.
3. Started a syncoid job to recursively seed the store1-backup zpool: time syncoid --compress=none --sendoptions=Rw --no-sync-snap store1/data store1-backup/data
4. 🔥🔥🔥 Significant performance drop-off, without recovery, after ~3h of transfer; the kernel logged: task txg_sync:3748 blocked for more than 120 seconds
5. CTRL+C'd the syncoid process, double-checked that the dataset sync=disabled settings were being inherited through the send+recv (see the property-check sketch after this list), increased the mbuffer size and resumed the store1 seed job from step 3; the resume had to target the specific dataset: time syncoid --compress=none --mbuffer-size=32M --sendoptions=w --no-sync-snap store1/data/vm/raw store1-backup/data/vm/raw
6. Performance did not recover to expected levels, so I ran apt --full-upgrade, which included upgrades to the kernel and ZFS: kernel 5.4.101-1-pve to 5.4.157-1-pve, zfsutils-linux/stable 2.0.3-pve2 to 2.0.6-pve1~bpo10+1 amd64. After the reboot I resumed the job with the same command as step 5: time syncoid --compress=none --mbuffer-size=32M --sendoptions=w --no-sync-snap store1/data/vm/raw store1-backup/data/vm/raw
7. Let the job finish and then followed up with time syncoid --sendoptions=w --recursive store1/data store1-backup/data to check that the recursive send+recv is good ✅
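The property check in step 5 boils down to something like this (a sketch using my dataset paths; the SOURCE column shows whether a value is local, inherited or received):
# confirm sync=disabled made it across, and where each dataset gets it from
zfs get -r sync store1-backup/data
# a partially received dataset carries a resume token, which is what lets syncoid resume
zfs get -H -o value receive_resume_token store1-backup/data/vm/raw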
Note: I checked which HBAs the disks were connected to:
# check each disk's LUN location:
root@viper:~# udevadm info --query=all --name=/dev/sdh | grep S:.*by-path # store6-bk
S: disk/by-path/pci-0000:02:00.0-sas-phy7-lun-0
root@viper:~# udevadm info --query=all --name=/dev/sdb | grep S:.*by-path # store1-bk
S: disk/by-path/pci-0000:02:00.0-sas-phy1-lun-0
root@viper:~# udevadm info --query=all --name=/dev/sdn | grep S:.*by-path # syncoid-test
S: disk/by-path/pci-0000:03:00.0-sas-phy6-lun-0
root@viper:~# udevadm info --query=all --name=/dev/sdi | grep S:.*by-path # store1
S: disk/by-path/pci-0000:02:00.0-sas-phy18-lun-0
root@viper:~# udevadm info --query=all --name=/dev/sdr | grep S:.*by-path # store6
S: disk/by-path/pci-0000:03:00.0-sas-phy18-lun-0
the store6 to syncoid-test transfer was same-HBA ✅
the store6 to store6-backup transfer was inter-HBA ✅
the store1 to store1-backup transfer was same-HBA 🔥
So there is no strong evidence of an HBA issue at the moment.
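To map every disk to its HBA in one pass rather than spot-checking, a quick loop over udevadm works (rough sketch):
# print the by-path symlink (HBA PCI address + phy) for every sd* device
for d in /dev/sd[a-z]; do
  printf '%s ' "$d"
  udevadm info --query=all --name="$d" | grep -o 'disk/by-path/pci-[^ ]*' | head -n1
done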
RAM+SWAP for the same time range:
store1-backup (sdb) notable disk stats.
Note how the average write I/O time jumped during the issue, and again during the later part of the operation after the kernel upgrade.
It looks like the timestamps of the txg_sync logs match the degradation on the graph here.
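The matching entries and their timestamps can be pulled straight from the journal with something like this (current boot only; add -b -1 and so on for earlier boots):
# hung-task reports plus their call traces, from the kernel ring buffer via journald
journalctl -k | grep -A 25 'blocked for more than 120 seconds'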
A sample (1 of 5) of the kernel log entries for txg_sync:
Dec 10 18:24:46 viper kernel: INFO: task txg_sync:3748 blocked for more than 120 seconds.
Dec 10 18:24:46 viper kernel: Tainted: P O 5.4.101-1-pve #1
Dec 10 18:24:46 viper kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 10 18:24:46 viper kernel: txg_sync D 0 3748 2 0x80004000
Dec 10 18:24:46 viper kernel: Call Trace:
Dec 10 18:24:46 viper kernel: __schedule+0x2e6/0x6f0
Dec 10 18:24:46 viper kernel: schedule+0x33/0xa0
Dec 10 18:24:46 viper kernel: schedule_timeout+0x152/0x330
Dec 10 18:24:46 viper kernel: ? __next_timer_interrupt+0xd0/0xd0
Dec 10 18:24:46 viper kernel: io_schedule_timeout+0x1e/0x50
Dec 10 18:24:46 viper kernel: __cv_timedwait_common+0x138/0x170 [spl]
Dec 10 18:24:46 viper kernel: ? wait_woken+0x80/0x80
Dec 10 18:24:46 viper kernel: __cv_timedwait_io+0x19/0x20 [spl]
Dec 10 18:24:46 viper kernel: zio_wait+0x139/0x280 [zfs]
Dec 10 18:24:46 viper kernel: ? _cond_resched+0x19/0x30
Dec 10 18:24:46 viper kernel: dsl_pool_sync+0xdc/0x510 [zfs]
Dec 10 18:24:46 viper kernel: spa_sync+0x5a4/0xfe0 [zfs]
Dec 10 18:24:46 viper kernel: ? mutex_lock+0x12/0x30
Dec 10 18:24:46 viper kernel: ? spa_txg_history_init_io+0x104/0x110 [zfs]
Dec 10 18:24:46 viper kernel: txg_sync_thread+0x2e1/0x4a0 [zfs]
Dec 10 18:24:46 viper kernel: ? txg_thread_exit.isra.13+0x60/0x60 [zfs]
Dec 10 18:24:46 viper kernel: thread_generic_wrapper+0x74/0x90 [spl]
Dec 10 18:24:46 viper kernel: kthread+0x120/0x140
Dec 10 18:24:46 viper kernel: ? __thread_exit+0x20/0x20 [spl]
Dec 10 18:24:46 viper kernel: ? kthread_park+0x90/0x90
Dec 10 18:24:46 viper kernel: ret_from_fork+0x35/0x40
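ZFS can also report per-vdev latency directly while a job like this is running, which would complement the node graphs (a sketch, not something I captured at the time):
# per-vdev request latency for the receiving pool, 5 second intervals, skipping the since-boot summary
zpool iostat -v -l -y store1-backup 5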
iostat before the reboot:
kyle@viper:~$ uptime
22:44:53 up 54 days, 7:04, 1 user, load average: 0.61, 3.10, 4.54
kyle@viper:~$ iostat
Linux 5.4.101-1-pve (viper) 12/10/2021 _x86_64_ (24 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0.54 0.00 0.91 0.25 0.00 98.31
Device tps kB_read/s kB_wrtn/s kB_read kB_wrtn
loop0 0.00 0.01 0.00 63649 0
sdg 19.26 21.82 228.97 102369166 1074133692
sdj 19.05 21.61 228.97 101351830 1074133692
sda 8.07 2129.45 15.54 9989454888 72889608
sde 0.66 602.21 0.52 2825035260 2450892
sdi (store1) 10.11 2323.14 15.54 10898051016 72905464
sdf 0.26 213.37 0.16 1000933524 770944
sdc 0.66 605.09 1.20 2838522632 5619112
sdd 0.66 593.64 0.48 2784796908 2255996
sdb (store1-bk) 1.18 585.46 271.15 2746468296 1272002744
sdh (store6-bk) 0.61 188.96 217.38 886421900 1019732520
sdr (store6) 4.58 1038.45 93.69 4871453828 439530028
sdl 5.96 1460.52 66.38 6851450978 311392323
sdq 9.31 2140.22 15.49 10039950012 72672928
sdo 5.96 1460.52 66.38 6851451078 311392340
sdm 5.96 1460.52 66.38 6851451298 311392319
sdp 3.02 605.20 15.17 2839068356 71175112
sdn (syncoid-test) 0.31 0.02 172.87 86290 810945685
sdk 8.63 2106.64 15.70 9882432536 73669984
sds 0.00 0.00 0.00 3223 0
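The average write I/O time shown in the graphs is the kind of figure iostat's extended output exposes directly (the w_await column); sampling it for the two disks of interest looks roughly like:
# extended per-device stats for the store1 pair, every 5 seconds
iostat -x sdi sdb 5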
Upgraded kernel 5.4.101-1-pve to 5.4.157-1-pve. Upgraded zfsutils-linux/stable 2.0.3-pve2 to 2.0.6-pve1~bpo10+1 amd64.
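To confirm what is actually running after a reboot like this:
# kernel actually booted
uname -r
# userland and kernel-module versions as reported by ZFS itself
zfs version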
During the final part of the send+recv there was one more example of the txg_sync message in the kernel log:
Dec 11 03:28:03 viper kernel: INFO: task txg_sync:11367 blocked for more than 120 seconds.
Dec 11 03:28:03 viper kernel: Tainted: P O 5.4.157-1-pve #1
Dec 11 03:28:03 viper kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 11 03:28:03 viper kernel: txg_sync D 0 11367 2 0x80004000
Dec 11 03:28:03 viper kernel: Call Trace:
Dec 11 03:28:03 viper kernel: __schedule+0x2e6/0x6f0
Dec 11 03:28:03 viper kernel: schedule+0x33/0xa0
Dec 11 03:28:03 viper kernel: schedule_timeout+0x152/0x330
Dec 11 03:28:03 viper kernel: ? __next_timer_interrupt+0xd0/0xd0
Dec 11 03:28:03 viper kernel: io_schedule_timeout+0x1e/0x50
Dec 11 03:28:03 viper kernel: __cv_timedwait_common+0x138/0x170 [spl]
Dec 11 03:28:03 viper kernel: ? wait_woken+0x80/0x80
Dec 11 03:28:03 viper kernel: __cv_timedwait_io+0x19/0x20 [spl]
Dec 11 03:28:03 viper kernel: zio_wait+0x139/0x280 [zfs]
Dec 11 03:28:03 viper kernel: ? _cond_resched+0x19/0x30
Dec 11 03:28:03 viper kernel: dsl_pool_sync+0xdc/0x510 [zfs]
Dec 11 03:28:03 viper kernel: spa_sync+0x584/0xff0 [zfs]
Dec 11 03:28:03 viper kernel: ? mutex_lock+0x12/0x30
Dec 11 03:28:03 viper kernel: ? spa_txg_history_init_io+0x104/0x110 [zfs]
Dec 11 03:28:03 viper kernel: txg_sync_thread+0x2e1/0x4a0 [zfs]
Dec 11 03:28:03 viper kernel: ? txg_thread_exit.isra.13+0x60/0x60 [zfs]
Dec 11 03:28:03 viper kernel: thread_generic_wrapper+0x74/0x90 [spl]
Dec 11 03:28:03 viper kernel: kthread+0x120/0x140
Dec 11 03:28:03 viper kernel: ? __thread_exit+0x20/0x20 [spl]
Dec 11 03:28:03 viper kernel: ? kthread_park+0x90/0x90
Dec 11 03:28:03 viper kernel: ret_from_fork+0x35/0x40
Summary and open points
Related GitHub issue: . I made a post on the issue . The initial issue documents a case with rsync, but other thread contributors describe hitting the same problem with send/recv operations.
The kernel/ZFS upgrade + reboot seemed to help somewhat, but it's unclear which of them made the difference; I will continue to monitor.
Note that the txg_sync log entries happened before and after the kernel upgrade/reboot.
From the GitHub issue, kirscheGIT commented on 12 May:
After upgrade to zfs-2.0.4-pve1 the problem mentioned here disappeared.
So that is promising as I upgraded from 2.0.3 to 2.0.6.
However... matti commented on 17 Oct:
same here with proxmox 7.0, kernel 5.11. no smart errors, tried multiple disks.
I guess only time will tell at this stage.
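In the meantime, a cheap way to watch for recurrences is just counting hung-task reports per day; a cron entry along these lines would do (the cron file and log path are made up):
# /etc/cron.d/txg-sync-watch (hypothetical): count hung-task reports from the last 24h
0 7 * * * root journalctl -k --since "-24h" | grep -c 'blocked for more than 120 seconds' >> /var/log/txg-sync-hangs.log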
Q: could it be that the performance of the operation is related to the layout and content of the blocks on the sender? store6 contains a lot of media backups with compressed binary data, while store1 is more of a mixed bag of backups with varying file sizes and counts. I assume that because the transfer is --raw, the receiving side cannot perform any optimisation on the incoming blocks.
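One way to at least quantify that difference between the two senders is to compare their dataset stats side by side (a sketch, depth-limited to keep the output readable):
# compression ratio, record size and logical vs physical usage for both source trees
zfs get -d 1 compressratio,recordsize,logicalused,used store1/data store6/data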
Q: TEST ME: can one create a dataset with the same key and perform a mixture of --raw and non-raw transfers? It sounds like this should be possible, but it will break the raw incremental ability (a possible test sketch follows the man-page quote below).
From the zfs send man page on the --raw flag: "Note that if you do not use this flag for sending encrypted datasets, data will be sent unencrypted and may be re-encrypted with a different encryption key on the receiving system, which will disable the ability to do a raw send to that system for incrementals."
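A rough way to run the TEST ME experiment above on a scratch pool, so nothing of value is at risk (pool and dataset names are made up, and the expected failure at the end is just my reading of the man-page note):
# encrypted parent on a scratch pool; child datasets inherit its key
zfs create -o encryption=aes-256-gcm -o keyformat=passphrase scratch/enc
# non-raw send of an encrypted dataset: the stream leaves as plaintext and is
# re-encrypted on receive under scratch/enc's key
zfs snapshot store1/data/test@s1
zfs send store1/data/test@s1 | zfs receive scratch/enc/test
# now attempt a raw incremental against that copy; per the quote above this
# should be refused, because the destination's ciphertext no longer matches
zfs snapshot store1/data/test@s2
zfs send -w -i @s1 store1/data/test@s2 | zfs receive scratch/enc/test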