Hi, I am running some experiments to evaluate the performance of peer-to-peer DMA.
I am using SPDK to control the NVMe drives, with the fio plugin compiled with
SPDK. I am seeing weird behavior: when I run 4K IOs at an IO depth of 1,
peer-to-peer DMA from an NVMe drive to a PCI device (which exposes memory
via BAR1) on a different NUMA node has a 50th percentile latency of about 17
usecs. The same experiment with the NVMe device and the PCI device on the same
NUMA node (node 1) has a latency of 38 usecs. In both cases fio was running
on a node 0 CPU core and the PCI device (which exposes memory via BAR1) was
connected to node 1. DMA from the NVMe device to host memory also takes 38
usecs.
To summarize the cases:
1. nvme (numa node 0) - pci device (numa node 1) --- 18 usecs
2. nvme (numa node 1) - pci device (numa node 1) --- 38 usecs
3. nvme (numa node 0) - host memory --- 38 usecs
fio running in numa node 0 cpu core in all cases.
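In case it helps anyone reproduce this, the NUMA placement in the cases above can be double-checked from sysfs. This is only a minimal sketch: it assumes Linux, where each PCI device has a numa_node file under /sys/bus/pci/devices, and the helper name and the two PCI addresses are hypothetical placeholders, not the actual addresses from this setup.

```python
from pathlib import Path

# Standard Linux sysfs location for PCI devices.
SYSFS_PCI = Path("/sys/bus/pci/devices")


def pci_numa_node(bdf, root=SYSFS_PCI):
    """Return the NUMA node reported for a PCI device, or None if unreadable.

    A value of -1 means the kernel has no NUMA affinity recorded for it.
    """
    path = Path(root) / bdf / "numa_node"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    # Hypothetical addresses; substitute the real domain:bus:device.function.
    devices = [("nvme", "0000:5e:00.0"), ("pci device", "0000:af:00.0")]
    for name, bdf in devices:
        print(name, bdf, "numa node:", pci_numa_node(bdf))
```

The lookup reads sysfs directly, so it should also work when the NVMe drive is unbound from the kernel nvme driver for SPDK.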
For higher IO depth values, latency in the cross-NUMA case (case 1 above)
increases steeply and it performs worse than cases 2 and 3.
Any pointers on why this could be happening?
The NVMe devices used are identical Intel datacenter SSDs (400G).