This email to the SPDK list is a follow-on to a brief discussion held during a recent SPDK community meeting (Tue Jun 26 UTC 15:00).
Lifted and edited from the Trello agenda item (https://trello.com/c/U291IBYx/91-best-practices-on-driver-binding-for-spd...):
During development, many (most?) people rely on running SPDK's scripts/setup.sh to perform a number of initializations, among them unbinding the Linux kernel nvme driver from NVMe controllers targeted for use by SPDK and then binding them to either uio_pci_generic or vfio-pci. This script is intended for development environments, not for production systems employing SPDK.
I'd like to confer with my fellow SPDK community members on ideas, suggestions, and best practices for handling this driver unbinding/binding. I wrote some udev rules, along with updates to some other Linux system conf files, for automatically loading either the uio_pci_generic or vfio-pci module. I also had to update my initramfs so that when the system comes all the way up, the desired NVMe controllers are already bound to the driver SPDK needs. And, as a bonus, it should "just work" when a hotplug occurs as well. However, there may be additional considerations I've overlooked, on which I'd appreciate input. Further, there's the question of how (and whether) to semi-automate this configuration via some kind of script, how that might vary across Linux distros, to say nothing of deciding between uio_pci_generic and vfio-pci.
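For concreteness, here is a minimal sketch of the sysfs sequence that scripts/setup.sh effectively performs for a single controller. The function name, the driver_override mechanism, and the DRY_RUN knob are my illustrative choices, not a drop-in replacement for the script; it needs root on real hardware.

```shell
#!/bin/sh
# Sketch only: detach the kernel nvme driver from one controller and attach
# vfio-pci (or uio_pci_generic) through sysfs. DRY_RUN=1 prints the writes
# instead of performing them.
rebind() {
    bdf=$1   # PCI bus-device-function, e.g. 0000:40:00.0
    drv=$2   # target driver: vfio-pci or uio_pci_generic
    do_write() {
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "echo $1 > $2"
        else
            echo "$1" > "$2"
        fi
    }
    # 1. detach the controller from whatever driver currently owns it
    do_write "$bdf" "/sys/bus/pci/devices/$bdf/driver/unbind"
    # 2. pin the next probe to the desired driver, then re-probe the device
    do_write "$drv" "/sys/bus/pci/devices/$bdf/driver_override"
    do_write "$bdf" "/sys/bus/pci/drivers_probe"
    # 3. clear the override so future probes aren't silently captured
    do_write "" "/sys/bus/pci/devices/$bdf/driver_override"
}
```

The same sequence in reverse (clear the override, re-probe) hands the device back to the kernel nvme driver.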
And, now some details:
1. I performed this on an Oracle Linux (OL) distro. I'm currently unaware of which configuration files might differ between distros. Oracle Linux is RedHat-compatible, so I'm confident my implementation should work similarly on RedHat-based systems, but I've yet to delve into other distros like Debian, SuSE, etc.
2. In preparation for writing my own udev rules, I unbound a specific NVMe controller from the Linux nvme driver by hand. Then, in another window, I launched "udevadm monitor -k -p" so that I could observe the usual udev events when an NVMe controller is bound to the nvme driver. On my system, I observed four (4) udev kernel events (abbreviated/edited output to avoid this becoming excessively long):
KERNEL[382128.187273] add /devices/pci0000:00/0000:00:02.2/0000:30:00.0/nvme/nvme0 (nvme)
KERNEL[382128.244658] bind /devices/pci0000:00/0000:00:02.2/0000:30:00.0 (pci)
KERNEL[382130.697832] add /devices/virtual/bdi/259:0 (bdi)
KERNEL[382130.698192] add /devices/pci0000:00/0000:00:02.2/0000:30:00.0/nvme/nvme0/nvme0n1 (block)
3. My udev rule triggers on (Event 2) above: the bind action. Upon this action, my udev rule appends operations to the special udev RUN variable so that udev essentially mirrors what SPDK's scripts/setup.sh does: unbind from the nvme driver and bind to, in my case, the vfio-pci driver.
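As an illustration (not my exact rule - the BDF and the helper script path are placeholders), such a rule looks roughly like:

```
# /etc/udev/rules.d/99-spdk-vfio.rules (illustrative placeholder values)
# On the kernel "bind" event for this controller, queue a helper that
# unbinds nvme and binds vfio-pci, mirroring scripts/setup.sh.
ACTION=="bind", SUBSYSTEM=="pci", KERNEL=="0000:40:00.0", DRIVER=="nvme", \
    RUN+="/usr/local/bin/spdk-rebind.sh %k vfio-pci"
```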
4. With my new udev rules in place, I succeeded in getting specific NVMe controllers (selected by bus-device-function) to unbind from the Linux nvme driver and bind to vfio-pci. However, I made a couple of observations in the kernel log (dmesg). In particular, I was drawn to the following for an NVMe controller at BDF 0000:40:00.0, for which I had a udev rule to unbind from nvme and bind to vfio-pci:
[ 35.534279] nvme nvme1: pci function 0000:40:00.0
[ 37.964945] nvme nvme1: failed to mark controller live
[ 37.964947] nvme nvme1: Removing after probe failure status: 0
One theory I have for the above is that my udev RUN rule was invoked while the nvme driver's probe() was still running on this controller, and the unbind request came in before probe() completed, hence the "nvme1: failed to mark controller live". This has left me wondering whether, instead of triggering on (Event 2) when the bind occurs, I should try to trigger on the "last" udev event, an "add", where the NVMe namespaces are instantiated. Of course, I'd need to know ahead of time how many namespaces exist on that controller in order to trigger on the last one. I'm wondering if that may help avoid what looks like a complaint in the middle of probe() of that particular controller. Then again, maybe I can just safely ignore it and not worry about it at all? Thoughts?
I discovered another issue during this experimentation that is somewhat tangential to this task, but I’ll write a separate email on that topic.
thanks for any feedback,
There has been a rash of failures on the test pool starting last night. I was able to root-cause the failures to a point in the NVMe-oF shutdown tests. The substance of the failure is that QAT and the DPDK framework don't always play well with secondary DPDK processes. In the interest of avoiding these failures on future builds, please rebase your changes on the following patch series, which includes the fix of not running bdevperf as a secondary process in the NVMe-oF shutdown tests.
I was out of town last week and missed the meeting but saw on Trello you had the topic below:
"a few idea: log structured data store , data store with compression, and metadata replication of Blobstore"
I'd be pretty interested in working on that with you, or at least hearing more about it. When you get a chance (no hurry), could you expand a little on how the conversation went and what you're looking at specifically?
I have submitted the py-spdk code at https://review.gerrithub.io/#/c/379741/; please take some time to review it. I would be very grateful.
py-spdk is a client that helps upper-level apps communicate with SPDK-based apps (such as nvmf_tgt, vhost, iscsi_tgt, etc.). Should I submit it to a separate repo rather than the SPDK repo? I think it is a relatively independent kit built on top of SPDK.
If you have any thoughts about py-spdk, please share them with me.
I'm SPDK core maintainer responsible for the vhost library.
I saw your virtio-vhost-user patch series on GerritHub. I know you
talked about it at an SPDK community meeting over a month ago,
although I was on holiday at that time.
I wanted to give you some background of what is currently going on
around SPDK vhost.
SPDK currently keeps an internal copy of DPDK's rte_vhost with a
couple of storage specific changes. We have tried to upstream those
changes to DPDK, but they were rejected. Although they were
critical to supporting vhost-scsi and vhost-blk, they also altered how
vhost-net operated, and that was DPDK's major concern. We kept the
internal rte_vhost copy but still haven't decided whether to try to
switch to DPDK's version or to completely derive from DPDK and
maintain our own vhost library. At one point we also put together a
list of rte_vhost issues - one of which was a vhost-user specification
non-compliance that eventually made our vhost-scsi unusable with QEMU
2.12+. The amount of "fixes" that rte_vhost required was huge.
Instead, we tried to create a new, even lower-level vhost library in
DPDK. The initial API proposal was warmly welcomed, but a few
months later, after a PoC implementation was ready, the whole library
was rejected as well. (One of the concerns the new library would
address was creating an abstraction and environment for
virtio-vhost-user, but apparently the DPDK team didn't find that
useful at the time.)
We still have the rte_vhost copy in SPDK and we still haven't decided
on its future strategy, which is why we were so reluctant to review
your series.
Just last week we seem to have finally made some progress, as a DPDK
patch that would potentially allow SPDK to use DPDK's rte_vhost
directly was approved for DPDK 19.05. Around the end of February I
believe SPDK will try to stop using its rte_vhost copy and switch to
DPDK's rte_vhost with the mentioned patch.
After that happens, I would like to ask you to rebase your patches on
latest DPDK's rte_vhost and resubmit them to DPDK. I can certainly
help with upstreaming vfio no-iommu support in SPDK and am even
willing to implement registering non-2MB-aligned memory, but rte_vhost
changes belong in DPDK.
I'm sorry for the previous lack of transparency in this matter.
I’d like to do some housecleaning on the open SPDK patches on GerritHub. I suspect a lot of older patches out there have been abandoned in spirit, but not abandoned on GerritHub. Cleaning these up will make it easier for patch reviewers (especially the maintainers) to know what actually needs review.
If you have a patch that has not been updated in the last 3 months, please do one of the following:
1. Abandon the patch in GerritHub yourself if it’s no longer relevant.
2. Rebase your patch on top of latest master and push the new revision to GerritHub. This will reset the clock and indicate the patch is still relevant and in need of review.
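For option 2, the typical flow is just a fetch, a rebase, and a push back to Gerrit's review ref. This is a sketch; the remote name "origin" and branch "master" are assumptions about your checkout.

```shell
#!/bin/sh
# Refresh a stale Gerrit change: rebase it on the latest master and upload
# the rebased revision as a new patch set.
refresh_change() {
    git fetch origin master &&
    git rebase FETCH_HEAD &&               # replay the change on latest master
    git push origin HEAD:refs/for/master   # upload the new revision to Gerrit
}
```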
Any patch that has not been updated in 3 months or more will be abandoned by one of the core maintainers starting 2 weeks from now, on April 11th.
Also note that any patch that is abandoned is not deleted – you still have the option to restore the patch and then push a rebased version.
I was surprised to see that bdev descriptors can be closed only from the same thread that opened them. Vhost doesn't respect this rule. As expected, I was able to trigger assert(desc->thread == spdk_get_thread()) while closing a vhost scsi descriptor on the latest SPDK master. This could probably be fixed by always scheduling the spdk_bdev_close() onto the proper thread. Maybe vhost could even immediately assume the descriptor is closed and set its `desc` pointer to NULL without waiting for spdk_bdev_close() to actually be called. But why does the descriptor need to be closed from a specific thread in the first place? Would it be possible for spdk_bdev_close() to internally schedule itself on desc->thread?
A descriptor cannot be closed until all associated channels have been destroyed - that's what the bdev programming guide says. When there are multiple I/O channels, there has to be some multi-thread management involved. Also, those channels can't be closed until all their pending I/O has finished. So closing a descriptor will likely have the following flow:
external request (e.g. bdev hotremove or some RPC) -> start draining I/O on all threads -> destroy each I/O channel after its pending I/O has finished -> on the last thread to destroy a channel, schedule closing the desc on the proper thread -> close the desc
This additional scheduling of spdk_bdev_close() looks completely unnecessary - it also forces the upper layer to maintain a pointer to the desc thread somewhere, because desc->thread is private to the bdev layer. So, to repeat the question: would it be possible for spdk_bdev_close() to internally schedule itself on desc->thread, so that it can be called from any thread?
We are very fortunate to have received a large number of abstracts for SPDK
talks, so the conference is jam-packed with talks from all across the
storage industry. The topics range from cutting edge new development to
experiences with deploying in production.
This year the summit is expanding with three parallel tracks to cover a wide
range of topics. We are especially excited to have developers from the PMDK
community join us for several talks on persistent memory in one track, as well as
developers from Intel VTune Amplifier discussing performance monitoring and
optimization tools for both storage and persistent memory in another track.
April 16-17, 2019
Dolce Hayes Mansion
200 Edenvale Ave, San Jose, CA 95136
Register Here: http://cvent.com/d/3bqgy8
The summit remains free to attend. If you plan to book a room at the hotel,
we do ask that you do it through the conference registration system.
We'll see you there!
Is there a way to supply a log handler that can format and redirect SPDK output rather than dumping to stdout/err?
I'm trying to collect profiling information via vtune on an SPDK
application and found the --with-vtune=<path> configuration flag. But the
build fails with
vtune.c:47:10: fatal error: ittnotify_static.c: No such file or directory
bdev.c:55:10: fatal error: ittnotify.h: No such file or directory
which I sort of understand, as neither of those files exists under
My question is, how do those files get installed? I'd assumed that vtune
took care of that part, but I'm now wondering if I missed something. TIA.