Comments inlined below:
On Jul 2, 2018, at 12:35 PM, Harris, James R wrote:
From: SPDK <spdk-bounces@lists.01.org> on behalf of Lance Hartmann ORACLE <lance.hartmann@oracle.com>
Reply-To: Storage Performance Development Kit <spdk@lists.01.org>
Date: Tuesday, June 26, 2018 at 9:56 AM
To: Storage Performance Development Kit <spdk@lists.01.org>
Subject: [SPDK] Best practices on driver binding for SPDK in production environments
4. With my new udev rules in place, I was successful in getting specific NVMe controllers
(based on bus-device-function) to unbind from the Linux nvme driver and bind to vfio-pci.
However, I made a couple of observations in the kernel log (dmesg). In particular, I
was drawn to the following for an NVMe controller at BDF: 0000:40:00.0 for which I had a
udev rule to unbind from nvme and bind to vfio-pci:
[ 35.534279] nvme nvme1: pci function 0000:40:00.0
[ 37.964945] nvme nvme1: failed to mark controller live
[ 37.964947] nvme nvme1: Removing after probe failure status: 0
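For reference, the sysfs sequence a udev RUN script typically performs for this rebinding looks roughly like the following. This is a sketch only: the BDF and the use of driver_override are assumptions, and the script just echoes the commands so it can be dry-run safely.

```shell
#!/bin/sh
# Dry-run sketch of the unbind/bind sequence (commands are echoed, not
# executed).  BDF and the driver_override approach are assumptions.
BDF="0000:40:00.0"

CMDS="echo $BDF > /sys/bus/pci/devices/$BDF/driver/unbind
echo vfio-pci > /sys/bus/pci/devices/$BDF/driver_override
echo $BDF > /sys/bus/pci/drivers/vfio-pci/bind"

printf '%s\n' "$CMDS"
```

The older alternative is writing the vendor/device ID to /sys/bus/pci/drivers/vfio-pci/new_id, but that matches every device with that ID rather than a single BDF.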
One theory I have for the above is that my udev RUN rule was invoked while the nvme
driver’s probe() was still running on this controller, and that the unbind request arrived
before probe() completed, hence the “nvme1: failed to mark controller live” message. This
has left lingering in my mind that maybe, instead of triggering on the bind (Event 2), I
should instead try to trigger on the “last” udev event, an “add”, where the NVMe
namespaces are instantiated. Of course, I’d need to know ahead of time just how many
namespaces exist on that controller so I could trigger on the last one. I’m wondering if
that may help avoid what looks like a complaint during the middle of probe() on that
particular controller. Then again, maybe I can just safely ignore that and not worry
about it at all? Thoughts?
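One possible shape for that alternative trigger, sketched as a udev rule. This is only an illustration: the ID_PATH match, the script path, and the assumption that the last namespace’s “add” event is a safe point are all unverified.

```
# Hypothetical: trigger on the block-device "add" event for namespaces of
# the controller at 0000:40:00.0, instead of on the nvme bind event.
ACTION=="add", SUBSYSTEM=="block", KERNEL=="nvme*n*", \
  ENV{ID_PATH}=="pci-0000:40:00.0*", \
  RUN+="/usr/local/bin/spdk-rebind.sh 0000:40:00.0"
```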
[Jim] Can you confirm your suspicion - maybe add a 1 or 2 second delay after detecting
Event 2 before unbinding – and see if that eliminates the probe failures? I’m not
suggesting that as a workaround or solution – just want to know for sure if we need to
worry about deferring the unbind until after the kernel driver’s probe has completed. It
sounds like these error messages are benign but would be nice to avoid them.
For experimentation purposes, yes, I might be able to instrument a delay to see if the
kernel nvme probe failures go away. I don’t know if udev execution is multi-threaded or
not, and thus whether such a delay would block other udev events from getting processed
while mine sleeps, but I can explore this at least as an experiment.
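On the blocking question: udev expects RUN+= programs to be short-lived, and a long in-place sleep risks stalling (or being killed by) the udev worker handling that device, so an in-place delay is really only suitable as an experiment. Detaching the delayed work is one safer shape; a sketch, where the BDF, the 2-second delay, and the systemd-run detachment are all assumptions, and the command is echoed rather than executed:

```shell
#!/bin/sh
# Experiment sketch: delay the unbind so the kernel nvme probe() can finish.
# BDF, the 2-second delay, and the use of systemd-run are assumptions.
BDF="0000:40:00.0"
DELAY=2

DETACHED="sleep $DELAY; echo $BDF > /sys/bus/pci/devices/$BDF/driver/unbind"
# In a udev rule this might become (illustrative only):
#   RUN+="/usr/bin/systemd-run --no-block /bin/sh -c '$DETACHED'"
echo "sh -c '$DETACHED'"
```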
Let me emphasize another point. While playing with this further, I did subsequently
discover that the end result, at least with my particular NVMe drives, was in fact not
benign. That is, although the NVMe controller did appear successfully bound to vfio-pci,
execution of any SPDK apps (e.g. perf, identify) returned a failure attempting to
communicate with the controller. I then removed my udev rule, then manually unbound the
controller from vfio-pci and rebound it to the kernel’s nvme driver. After doing that,
inspection of dmesg revealed a complaint from the nvme driver accessing the device. And,
so, I then rebooted the system — again, having ensured that my udev rule was not in place
(neither in my rootfs nor the initramfs) — to see how the controller would behave
following a reboot, coming up with the kernel nvme driver in the default scenario.
Again, dmesg revealed complaints about accessing that particular NVMe controller.
Finally, I power-cycled the host, and lo and behold, after doing that the NVMe
controller came up fine.
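For completeness, the manual recovery steps described above can be sketched as follows. The sysfs paths are standard, but treat this as an illustration: the BDF is an assumption and the script echoes the commands rather than executing them. A PCI remove/rescan can sometimes substitute for a full power cycle, though evidently not always.

```shell
#!/bin/sh
# Dry-run sketch: commands are echoed, not executed.  BDF is an assumption.
BDF="0000:40:00.0"

# Return the controller from vfio-pci to the kernel nvme driver.
REBIND="echo $BDF > /sys/bus/pci/drivers/vfio-pci/unbind
echo > /sys/bus/pci/devices/$BDF/driver_override
echo $BDF > /sys/bus/pci/drivers/nvme/bind"

# If the device is wedged, a PCI remove/rescan can sometimes recover it
# without a full power cycle (though evidently not always).
RESCAN="echo 1 > /sys/bus/pci/devices/$BDF/remove
echo 1 > /sys/bus/pci/rescan"

printf '%s\n' "$REBIND" "$RESCAN"
```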
In summary, I will at least attempt the delay experiment and see if that helps us sidestep
the probe failure and avoid leaving the NVMe controller in a bad state. If that works, I
may then alter the udev rule to trigger on the add action of the last namespace instead
of the bind action to the nvme driver and see how that works.
[Jim] Overall this seems like a reasonable approach though. How do you see this working if a
system has multiple NVMe SSDs – one of which has the OS install, and the rest should be
assigned to uio/vfio?
We do have this exact scenario; i.e. systems with NVMe controllers (on which file systems
are mounted) which depend on the kernel nvme driver where other NVMe controllers are
‘reserved’ for SPDK use. Among my udev rule’s trigger criteria is the BDF
(bus-device-function), so this should work fine. We just have to make abundantly clear
how careful one must be when configuring the system to use this mechanism, to avoid
inadvertently triggering on an NVMe controller that’s needed for use with the kernel nvme
driver.
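That per-BDF restriction can be expressed directly in the rule. A sketch only, with the PCI address and script path as placeholders, and relying on udev’s “bind” action being available on the kernel in use:

```
# Hypothetical: restrict the rule to a single controller by PCI address, so
# controllers hosting kernel-mounted filesystems are never touched.
ACTION=="bind", SUBSYSTEM=="pci", DRIVER=="nvme", KERNEL=="0000:40:00.0", \
  RUN+="/usr/local/bin/spdk-rebind.sh 0000:40:00.0"
```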