This email to the SPDK list is a follow-on to a brief discussion held during a recent SPDK community meeting (Tue Jun 26 UTC 15:00).
Lifted and edited from the Trello agenda item (https://trello.com/c/U291IBYx/91-best-practices-on-driver-binding-for-spd...):
During development, many (most?) people rely on SPDK's scripts/setup.sh to perform a number of initializations, among them unbinding the Linux kernel nvme driver from the NVMe controllers targeted for use by SPDK and then binding those controllers to either uio_pci_generic or vfio-pci. This script is suitable for development environments, but it is not intended for use in production systems employing SPDK.
I'd like to confer with my fellow SPDK community members on ideas, suggestions and best practices for handling this driver unbinding/binding. I wrote some udev rules, along with updates to some other Linux system conf files, for automatically loading either the uio_pci_generic or vfio-pci module. I also had to update my initramfs so that when the system comes all the way up, the desired NVMe controllers are already bound to the driver needed for SPDK operation. And, as a bonus, it should "just work" when a hotplug occurs as well. However, there may be additional considerations I have overlooked, on which I'd appreciate input. Further, there's the matter of whether and how to semi-automate this configuration via some kind of script, and how that might vary by Linux distro, to say nothing of the choice between uio_pci_generic and vfio-pci.
And, now some details:
1. I performed this on an Oracle Linux (OL) distro. I’m currently unaware of how and which configuration files might differ depending on the distro. Oracle Linux is Red Hat-compatible, so I’m confident my implementation should run similarly on Red Hat-based systems, but I’ve yet to delve into other distros like Debian, SuSE, etc.
2. In preparation for writing my own udev rules, I unbound a specific NVMe controller from the Linux nvme driver by hand. Then, in another window, I launched "udevadm monitor -k -p" so that I could observe the usual udev events when an NVMe controller is bound to the nvme driver. On my system, I observed four (4) udev kernel events (output abbreviated/edited to keep this from becoming excessively long):
KERNEL[382128.187273] add /devices/pci0000:00/0000:00:02.2/0000:30:00.0/nvme/nvme0 (nvme)
KERNEL[382128.244658] bind /devices/pci0000:00/0000:00:02.2/0000:30:00.0 (pci)
KERNEL[382130.697832] add /devices/virtual/bdi/259:0 (bdi)
KERNEL[382130.698192] add /devices/pci0000:00/0000:00:02.2/0000:30:00.0/nvme/nvme0/nvme0n1 (block)
3. My udev rule triggers on (Event 2) above: the bind action. Upon this action, my udev rule appends operations to the special udev RUN variable such that udev will essentially mirror that which is done in the SPDK’s scripts/setup.sh for unbinding from the nvme driver and binding to, in my case, the vfio-pci driver.
4. With my new udev rules in place, I was successful getting specific NVMe controllers (based on bus-device-function) to unbind from the Linux nvme driver and bind to vfio-pci. However, I made a couple of observations in the kernel log (dmesg). In particular, I was drawn to the following for an NVMe controller at BDF: 0000:40:00.0 for which I had a udev rule to unbind from nvme and bind to vfio-pci:
[ 35.534279] nvme nvme1: pci function 0000:40:00.0
[ 37.964945] nvme nvme1: failed to mark controller live
[ 37.964947] nvme nvme1: Removing after probe failure status: 0
One theory I have for the above is that my udev RUN rule was invoked while the nvme driver’s probe() was still running on this controller, and perhaps the unbind request came in before the probe() completed, hence the “nvme1: failed to mark controller live” message. This has left me wondering whether, instead of triggering on (Event 2) when the bind occurs, I should try to trigger on the “last” udev event, an “add”, where the NVMe namespaces are instantiated. Of course, I’d need to know ahead of time just how many namespaces exist on that controller so that I could trigger on the last one. I’m wondering if that may help to avoid what looks like a complaint in the middle of probe() of that particular controller. Then again, maybe I can just safely ignore that and not worry about it at all? Thoughts?
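The unbind/bind flow described in item 3 can be sketched as follows. This is a minimal Python sketch, not the actual udev rule and not necessarily the sequence used by scripts/setup.sh: it parses "udevadm monitor -k" lines like those shown above, picks out the PCI bind event for a wanted BDF, and lists the sysfs writes that would move the device to vfio-pci. The function names (parse_event, bdf_of, rebind_plan, plan_for_events) are hypothetical, and the use of driver_override followed by drivers_probe is an assumption about one common way to rebind a PCI device. The sketch only builds the list of writes; it does not touch sysfs.

```python
import re

# Matches kernel event lines printed by "udevadm monitor -k", for example:
# KERNEL[382128.244658] bind /devices/pci0000:00/0000:00:02.2/0000:30:00.0 (pci)
EVENT_RE = re.compile(
    r"^KERNEL\[(?P<ts>[\d.]+)\]\s+(?P<action>\w+)\s+"
    r"(?P<devpath>\S+)\s+\((?P<subsystem>[\w-]+)\)$"
)
# PCI bus-device-function in domain:bus:device.function form.
BDF_RE = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")

SAMPLE = [
    "KERNEL[382128.187273] add /devices/pci0000:00/0000:00:02.2/0000:30:00.0/nvme/nvme0 (nvme)",
    "KERNEL[382128.244658] bind /devices/pci0000:00/0000:00:02.2/0000:30:00.0 (pci)",
    "KERNEL[382130.697832] add /devices/virtual/bdi/259:0 (bdi)",
    "KERNEL[382130.698192] add /devices/pci0000:00/0000:00:02.2/0000:30:00.0/nvme/nvme0/nvme0n1 (block)",
]


def parse_event(line):
    """Return a dict with ts/action/devpath/subsystem, or None if no match."""
    m = EVENT_RE.match(line.strip())
    return m.groupdict() if m else None


def bdf_of(devpath):
    """Return the BDF if the last path component is a PCI address, else None."""
    last = devpath.rsplit("/", 1)[-1]
    return last if BDF_RE.match(last) else None


def rebind_plan(bdf, new_driver="vfio-pci"):
    """List (sysfs_path, value) writes to move a device to new_driver.

    Assumption: driver_override + drivers_probe is used for the rebind.
    """
    dev = f"/sys/bus/pci/devices/{bdf}"
    return [
        (f"{dev}/driver/unbind", bdf),            # detach the current driver
        (f"{dev}/driver_override", new_driver),   # pin the next driver
        ("/sys/bus/pci/drivers_probe", bdf),      # ask the kernel to re-probe
    ]


def plan_for_events(lines, wanted_bdfs, new_driver="vfio-pci"):
    """Return the rebind plan for the first PCI bind event on a wanted BDF."""
    for line in lines:
        ev = parse_event(line)
        if ev is None:
            continue
        if ev["action"] != "bind" or ev["subsystem"] != "pci":
            continue
        bdf = bdf_of(ev["devpath"])
        if bdf in wanted_bdfs:
            return rebind_plan(bdf, new_driver)
    return []


if __name__ == "__main__":
    for path, value in plan_for_events(SAMPLE, {"0000:30:00.0"}):
        print(f"echo {value} > {path}")
```

Note that this sketch says nothing about the probe() timing question raised above. A real rule would presumably also need to check that the driver being bound is nvme, so that the later bind to vfio-pci does not retrigger it.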
I discovered another issue during this experimentation that is somewhat tangential to this task, but I’ll write a separate email on that topic.
thanks for any feedback,
There has been a rash of failures on the test pool starting last night. I was able to trace the root cause of the failures to a point in the NVMe-oF shutdown tests. The main substance of the failure is that QAT and the DPDK framework don't always play well with secondary DPDK processes. In the interest of avoiding these failures on future builds, please rebase your changes on the following patch series, which includes the fix of not running bdevperf as a secondary process in the NVMe-oF shutdown tests.
I was out of town last week and missed the meeting but saw on Trello you had the topic below:
"a few idea: log structured data store , data store with compression, and metadata replication of Blobstore"
I'd be pretty interested in working on this with you, or at least hearing more about it. When you get a chance, no hurry, can you please expand a little on how the conversation went and what you're looking at specifically?
I have submitted the py-spdk code at https://review.gerrithub.io/#/c/379741/. I would be very grateful if you could take some time to look at it.
py-spdk is a client that helps upper-level applications communicate with SPDK-based applications (such as nvmf_tgt, vhost, iscsi_tgt, etc.). Should I submit it to a separate repo that I set up rather than to the SPDK repo? I ask because I think it is a relatively independent kit built on top of SPDK.
If you have any thoughts about py-spdk, please share them with me.
I was trying to share DMA memory between two processes with SPDK in this
way but failed:
* Process A creates a memory region (2M-aligned pinned memory from
anonymous mmap with MAP_HUGETLB)
* Process B then registers the above memory region from A, so that SPDK
can use it as DMA memory for nvme driver.
My current approach is (if there are better ways please let me know~):
1. Use a kernel module to store the starting physical address of this
region shared from process A.
2. Process B then also uses the kernel module to map this range of
physical region to its virtual space (phys_addr->virt_addr_2) (using
3. Process B uses spdk_mem_register to register virt_addr_2.
My questions are:
i. The remap_pfn_range in step 2 only gives me 4K mappings (but they are
actually backed by hugepages allocated in process A). Does this violate the
"The memory region must map to pinned huge pages (2MB or greater)"
requirement in spdk_mem_register (since the mappings are in 4K)? This
ii. Currently the registration fails with -14 in process B. I traced
into SPDK and found that the VFIO_IOMMU_MAP_DMA ioctl failed in
vtophys_iommu_map_dma (note that I had vfio enabled). Am I missing some
steps (like in
https://lists.01.org/pipermail/spdk/2018-May/001884.html)? Or might the
failure just be caused by the 4K mappings I had?
iii. If memory from process A is mmapped from a DAX device, are there any
pitfalls I need to be aware of (in the presence of vfio)?
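For questions i and ii, two quick checks may help narrow things down. The sketch below, written in Python for brevity, decodes a negative return code into its errno name and checks whether an address and length are multiples of 2 MB. It assumes that the registration expects both the virtual address and the length to be 2 MB aligned, which is a reading of the quoted requirement rather than something confirmed here. The helper names and the example address are hypothetical.

```python
import errno
import os

SIZE_2MB = 2 * 1024 * 1024
MASK_2MB = SIZE_2MB - 1


def describe_rc(rc):
    """Turn a negative errno-style return code into 'NAME: message'."""
    if rc >= 0:
        return "OK"
    code = -rc
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


def is_2mb_aligned(vaddr, length):
    """True if both the address and a non-zero length are 2 MB multiples."""
    return length > 0 and (vaddr & MASK_2MB) == 0 and (length & MASK_2MB) == 0


def round_up_2mb(length):
    """Round a length up to the next multiple of 2 MB."""
    return (length + MASK_2MB) & ~MASK_2MB


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    virt_addr_2 = 0x7F0000200000
    region_len = 4 * 1024 * 1024
    print(describe_rc(-14))
    print(is_2mb_aligned(virt_addr_2, region_len))
```

Decoding -14 this way shows which errno the ioctl path reported, which can then be compared against the conditions under which VFIO_IOMMU_MAP_DMA fails.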
I'm writing a simple application that starts off by initializing rte_eal and then calls spdk_thread_lib_init before creating an spdk_thread. However, I'm not sure why spdk_thread_lib_init(NULL, 0) keeps returning -1, causing a segmentation fault in the app, although I have allocated enough hugepages at the beginning (100 1G pages). Does anyone have an idea of what could be causing the error?
SPDK CI Downtime Announcement
July 5th 4:00 PM UTC to July 8th 4:00 AM UTC
Power shutdown in some of the buildings on campus.
Some of the buildings on campus will be shut down for electrical maintenance.
Our CI servers will remain powered, but there may be an outage in network connectivity, and CI may be disrupted.
As of today, the FreeBSD machines run tests with FIO enabled, but for this to work a patch merged over a week ago is required:
As a result, fio plugin compilation failures on FreeBSD will occur on patches that haven't been rebased recently.
If you see those on your patches, please rebase them on top of master.
I was looking at the NVMe Data-in and Data-out flow for the NVMe-oF RDMA transport. It looks like SPDK is capable of handling in-capsule data (for Write operations from the host) by pre-posting in-capsule data buffer SGLs to RDMA recv buffers. However, it looks like SPDK doesn't use in-line data buffers when sending responses to NVMe read commands from the host, and ends up using RDMA write to send the data. Can someone please confirm whether my observation is correct?
I've been waiting for this patchset:
What version of vpp is this supposed to be used with? I assume it's
19.01+. I want to use 19.01 and it looks like this new session API is
required to work with the later versions of vpp. Just wondering about
the timeline for this being merged.