This email to the SPDK list is a follow-on to a brief discussion held during a recent SPDK community meeting (Tue Jun 26 UTC 15:00).
Lifted and edited from the Trello agenda item (https://trello.com/c/U291IBYx/91-best-practices-on-driver-binding-for-spd...):
During development many (most?) people rely on running SPDK's scripts/setup.sh to perform a number of initializations, among them unbinding the Linux kernel nvme driver from the NVMe controllers targeted for use by SPDK and then binding them to either uio_pci_generic or vfio-pci. This script is suitable for development environments, but is not targeted for use in production systems employing SPDK.
I'd like to confer with my fellow SPDK community members on ideas, suggestions and best practices for handling this driver unbinding/binding. I wrote some udev rules along with updates to some other Linux system conf files for automatically loading either the uio_pci_generic or vfio-pci modules. I also had to update my initramfs so that when the system comes all the way up, the desired NVMe controllers are already bound to the needed driver for SPDK operation. And, as a bonus, it should "just work" when a hotplug occurs as well. However, there may be additional considerations I might have overlooked on which I'd appreciate input. Further, there's the matter of how and whether to semi-automate this configuration via some kind of script and how that might vary according to Linux distro to say nothing of the determination of employing uio_pci_generic vs vfio-pci.
And, now some details:
1. I performed this on an Oracle Linux (OL) distro. I’m currently unaware how and which configuration files might differ depending on the distro. Oracle Linux is RedHat-compatible, so I’m confident my implementation should run similarly on RedHat-based systems, but I’ve yet to delve into other distros like Debian, SuSE, etc.
2. In preparation for writing my own udev rules, I unbound a specific NVMe controller from the Linux nvme driver by hand. Then, in another window, I launched "udevadm monitor -k -p" so that I could observe the usual udev events when an NVMe controller is bound to the nvme driver. On my system, I observed four (4) udev kernel events (output abbreviated/edited to keep this from becoming excessively long):
KERNEL[382128.187273] add /devices/pci0000:00/0000:00:02.2/0000:30:00.0/nvme/nvme0 (nvme)
KERNEL[382128.244658] bind /devices/pci0000:00/0000:00:02.2/0000:30:00.0 (pci)
KERNEL[382130.697832] add /devices/virtual/bdi/259:0 (bdi)
KERNEL[382130.698192] add /devices/pci0000:00/0000:00:02.2/0000:30:00.0/nvme/nvme0/nvme0n1 (block)
3. My udev rule triggers on (Event 2) above: the bind action. Upon this action, my udev rule appends operations to the special udev RUN variable such that udev will essentially mirror that which is done in the SPDK’s scripts/setup.sh for unbinding from the nvme driver and binding to, in my case, the vfio-pci driver.
4. With my new udev rules in place, I was successful getting specific NVMe controllers (based on bus-device-function) to unbind from the Linux nvme driver and bind to vfio-pci. However, I made a couple of observations in the kernel log (dmesg). In particular, I was drawn to the following for an NVMe controller at BDF: 0000:40:00.0 for which I had a udev rule to unbind from nvme and bind to vfio-pci:
[ 35.534279] nvme nvme1: pci function 0000:40:00.0
[ 37.964945] nvme nvme1: failed to mark controller live
[ 37.964947] nvme nvme1: Removing after probe failure status: 0
One theory I have for the above is that my udev RUN rule was invoked while the nvme driver’s probe() was still running on this controller, and perhaps the unbind request came in before probe() completed, hence this “nvme1: failed to mark controller live”. This has left me wondering whether, instead of triggering on (Event 2) when the bind occurs, I should try to derive a trigger on the “last” udev event, an “add”, where the NVMe namespaces are instantiated. Of course, I’d need to know ahead of time just how many namespaces exist on that controller so that I’d trigger on the last one. I’m wondering if that may help to avoid what looks like a complaint during the middle of probe() of that particular controller. Then again, maybe I can just safely ignore that and not worry about it at all? Thoughts?
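For readers who want to see the unbind/bind step from items 3 and 4 spelled out, here is a minimal sketch of the generic Linux sysfs mechanism. It is not a copy of scripts/setup.sh or of the udev rule described above. The function name rebind_pci_device and the sysfs_root parameter are made up for illustration; the unbind, driver_override and drivers_probe attributes are standard kernel sysfs files. Writing to them needs root, and the target module (vfio-pci here) must already be loaded.

```python
import os

def rebind_pci_device(bdf, new_driver="vfio-pci", sysfs_root="/sys"):
    """Move one PCI function (e.g. "0000:40:00.0") to new_driver.

    Hypothetical helper: sysfs_root exists only so the logic can be
    exercised against a scratch directory instead of the real /sys.
    Returns False if the device is already bound to new_driver.
    """
    dev_dir = os.path.join(sysfs_root, "bus", "pci", "devices", bdf)
    current = os.path.join(dev_dir, "driver")

    # "driver" is a symlink to the bound driver's directory; it is absent
    # when no driver currently owns the device.
    if os.path.exists(current):
        if os.path.basename(os.path.realpath(current)) == new_driver:
            return False
        with open(os.path.join(current, "unbind"), "w") as f:
            f.write(bdf)

    # driver_override limits which driver may claim this one device, so
    # other controllers with the same vendor/device ID are not affected.
    with open(os.path.join(dev_dir, "driver_override"), "w") as f:
        f.write(new_driver)

    # Ask the PCI core to probe the device again; only new_driver matches.
    with open(os.path.join(sysfs_root, "bus", "pci", "drivers_probe"), "w") as f:
        f.write(bdf)
    return True
```

A udev RUN entry could call a script like this with the BDF from the event. Whether it is safe to run it while the nvme driver's probe() is still in flight is exactly the open question above; the sketch does nothing to avoid that race.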
I discovered another issue during this experimentation that is somewhat tangential to this task, but I’ll write a separate email on that topic.
thanks for any feedback,
There has been a rash of failures on the test pool starting last night. I was able to root cause the failures to a point in the NVMe-oF shutdown tests. The main substance of the failure is that QAT and the DPDK framework don't always play well with secondary dpdk processes. In the interest of avoiding these failures on future builds, please rebase your changes on the following patch series which includes the fix of not running bdevperf as a secondary process in the NVMe-oF shutdown tests.
I was out of town last week and missed the meeting but saw on Trello you had the topic below:
"a few idea: log structured data store , data store with compression, and metadata replication of Blobstore"
Which I'd be pretty interested in working on with you or at least hearing more about it. When you get a chance, no hurry, can you please expand a little on how the conversation went and what you're looking at specifically?
I have submitted the py-spdk code at https://review.gerrithub.io/#/c/379741/. Please take some time to look at it; I would be very grateful.
py-spdk is a client that helps upper-level apps communicate with SPDK-based apps (such as nvmf_tgt, vhost, iscsi_tgt, etc.). Should I submit it to a separate repo that I set up rather than to the SPDK repo? I think it is a relatively independent kit built on top of SPDK.
If you have some thoughts about the py-spdk, please share with me.
We are using SPDK in a multi-process environment. Each process has a different role, but the processes communicate using the shared memory pool and ring (using the SPDK API).
Because we are using the shared memory functionality, we need to designate one of the processes as primary and all the others as secondary.
Here our problems start: we have two types of processes. One type interfaces with an NVMe device and the other type interfaces with the network.
We need the network process to serve as primary for the shared memory, but we don't want it to be exposed to the NVMe devices.
* How can the two types of processes be initialised to meet our requirements?
* One solution that crossed my mind was to mount two hugepage directories, one for shared memory and one for the nvme driver, but I couldn't find a way of doing it.
While testing a new bdev module with bdevperf, I'm noticing the bdev's
module_fini function being called multiple times. The module finish is
synchronous and does not call spdk_bdev_module_finish_done(). Tracing
the execution of spdk_bdev_module_finish_iter() shows the function
working its way backwards through the list of modules until it reaches
the first module:
spdk_bdev_module_finish_iter:1151 start last(xxxx)
spdk_bdev_module_finish_iter:1159 module xxxx(has fini)
spdk_bdev_module_finish_iter:1159 module virtio_blk(no fini)
spdk_bdev_module_finish_iter:1159 module virtio_scsi(has async fini)

spdk_bdev_module_finish_iter:1155 start resume(aio)
spdk_bdev_module_finish_iter:1159 module aio(has fini)
spdk_bdev_module_finish_iter:1159 module nvme(has fini)
spdk_bdev_module_finish_iter:1159 module null(has async fini)

spdk_bdev_module_finish_iter:1155 start resume(malloc)
spdk_bdev_module_finish_iter:1159 module malloc(no fini)
spdk_bdev_module_finish_iter:1159 module lvol(no fini)
spdk_bdev_module_finish_iter:1159 module ftl(has async fini)

spdk_bdev_module_finish_iter:1155 start resume(passthru)
spdk_bdev_module_finish_iter:1159 module passthru(has fini)
spdk_bdev_module_finish_iter:1159 module error(has fini)
spdk_bdev_module_finish_iter:1159 module gpt(no fini)
spdk_bdev_module_finish_iter:1159 module split(has fini)
spdk_bdev_module_finish_iter:1159 module delay(has fini)
spdk_bdev_module_finish_iter:1159 module raid(has fini)
Note that blank lines in the above trace indicate entry to
spdk_bdev_module_finish_iter() and the 'xxxx' is the module under
development. After this process finishes, it appears that one of the
modules with an asynchronous finish function kicks off the process
again. Stopping in the new .module_fini with gdb shows a backtrace of:
Thread 1 "reactor_0" hit Breakpoint 1, bdev_xxxx_finish () at bdev_xxxx.c:524
524 printf("%s: enter\n", __func__);
#0 bdev_xxxx_finish () at bdev_xxxx.c:524
#1 0x00005555555ffd61 in spdk_bdev_module_finish_iter (arg=0x0) at bdev.c:1170
#2 0x00005555555ffde9 in spdk_bdev_module_finish_done () at bdev.c:1191
#3 0x00005555555a304e in bdev_ftl_ftl_module_fini_cb (ctx=0x0,
status=0) at bdev_ftl.c:1075
#4 0x00005555555e3aff in ftl_anm_unregister_poller_cb
(ctx=0x555557a0c920) at ftl_anm.c:524
#5 0x0000555555612deb in _spdk_msg_queue_run_batch
(thread=0x555555df7a10, max_msgs=8) at thread.c:406
#6 0x0000555555613005 in spdk_thread_poll (thread=0x555555df7a10,
max_msgs=0, now=8284681201660426) at thread.c:462
#7 0x000055555560d1c9 in _spdk_reactor_run (arg=0x555555df7640) at
#8 0x000055555560d59d in spdk_reactors_start () at reactor.c:381
#9 0x000055555560bc96 in spdk_app_start (opts=0x7fffffffe260,
start_fn=0x555555573b14 <bdevperf_run>, arg1=0x0) at app.c:687
#10 0x0000555555574159 in main (argc=11, argv=0x7fffffffe3f8) at bdevperf.c:1404
Do bdev modules need to handle multiple calls to their .module_fini
function or does this behavior indicate a problem in my bdev module?
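To make the expected behaviour concrete, here is a toy model of the walk shown in the trace above: iterate from the last module towards the first, pause at each asynchronous fini, and resume from the next module once it reports completion. This is only a sketch in Python for reasoning about call counts, not the code in bdev.c; Module, run_shutdown and the "sync"/"async" labels are invented for illustration.

```python
from collections import Counter

class Module:
    def __init__(self, name, fini="sync"):
        self.name = name
        self.fini = fini  # "sync", "async", or None when there is no fini

def run_shutdown(modules):
    """Walk modules last-to-first and count how often each fini runs."""
    calls = Counter()
    pending = []  # resume indexes left behind by async fini callbacks

    def finish_iter(start):
        # Resume at index `start` and walk towards the head of the list.
        for i in range(start, -1, -1):
            mod = modules[i]
            if mod.fini is None:
                continue
            calls[mod.name] += 1
            if mod.fini == "async":
                # Stop here. The walk continues with the next module only
                # after this one signals that its shutdown has completed.
                pending.append(i - 1)
                return

    finish_iter(len(modules) - 1)
    while pending:
        # Stands in for the async module's "finish done" callback.
        finish_iter(pending.pop())
    return calls
```

In this model every fini runs exactly once, because each resume starts one module past the async one that paused the walk. If a real trace shows a second call into a module that was already finished, the resume point or an extra completion callback would be the first things to compare against the model.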
I have a couple of observations/questions regarding vendor specific
commands over NVMf and I was hoping to get your take on this.
- I can get vendor specific *IO* commands to work in my custom bdev by
supporting SPDK_BDEV_IO_TYPE_NVME_IO. However, doing the same for vendor
specific Admin commands fails. Is there a reason why we are blocking the
vendor specific range of C0h-FFh for Admin commands (see
spdk_nvmf_ctrlr_process_admin_cmd)?
- The current bdev API to complete an NVMe specific request is
spdk_bdev_io_complete_nvme_status(struct spdk_bdev_io *bdev_io, int sct,
int sc), but it only takes the two status codes that are written to the
completion queue status field. I would also like to set the CDW0 of the
completion queue entry. Are there any plans to support this or do we want
to keep the bdev API as front-end protocol agnostic as possible?
I guess one way to support it would be to add an additional field to
bdev_io's structure like there is for the NVMe/SCSI specific status code
handling, but again, do we want to add more protocol specifics to bdev_io?
Are there any other options to set CDW0?
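For context on where these values end up, the following sketch packs a 16-byte NVMe completion queue entry by hand, using the layout I understand from the NVMe base specification (DW0 command specific; DW3 carrying the command ID, the phase tag and the status field with SC and SCT). It is plain Python for illustration and does not touch any SPDK API; pack_cqe and is_vendor_specific_admin_opcode are made-up names.

```python
import struct

def is_vendor_specific_admin_opcode(opc):
    """True for the C0h-FFh Admin opcode range mentioned above."""
    return 0xC0 <= opc <= 0xFF

def pack_cqe(cdw0, sqhd, sqid, cid, phase, sct, sc, dnr=False):
    """Build one little-endian completion queue entry (4 dwords)."""
    # 15-bit status field: SC in bits 7:0, SCT in bits 10:8, DNR in bit 14.
    status = (sc & 0xFF) | ((sct & 0x7) << 8) | (int(bool(dnr)) << 14)
    dw2 = (sqhd & 0xFFFF) | ((sqid & 0xFFFF) << 16)
    # DW3: command ID in bits 15:0, phase tag in bit 16, status in bits 31:17.
    dw3 = (cid & 0xFFFF) | ((phase & 0x1) << 16) | ((status & 0x7FFF) << 17)
    # DW0 is the command specific dword (CDW0); DW1 is reserved.
    return struct.pack("<IIII", cdw0 & 0xFFFFFFFF, 0, dw2, dw3)
```

The point of the sketch is that SCT/SC and CDW0 live in different dwords of the same entry, so an API that only accepts sct and sc has no way to carry a command specific result back to the host.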
Thanks for your help.
On 8/12/19, 9:20 AM, "SPDK on behalf of Mittal, Rishabh via SPDK" <spdk-bounces(a)lists.01.org on behalf of spdk(a)lists.01.org> wrote:
<<As I’m sure you’re aware, SPDK apps use spdk_alloc() with the SPDK_MALLOC_DMA which is backed by huge pages that are effectively pinned already. SPDK does virt to phy transition on memory allocated this way very efficiently using spdk_vtophys(). It would be an interesting experiment though. Your app is not in a VM right?>>
I am thinking of passing the physical address of the buffers in bio to spdk. I don’t know if it is already pinned by the kernel or do we need to explicitly do it. And also, spdk has some requirements on the alignment of physical address. I don’t know if address in bio conforms to those requirements.
SPDK won’t be running in VM.
SPDK relies on data buffers being mapped into the SPDK application's address space, and they are passed as virtual addresses throughout the SPDK stack. Once the buffer reaches a module that requires a physical address (such as the NVMe driver for a PCIe-attached device), SPDK translates the virtual address to a physical address. Note that the NVMe fabrics transports (RDMA and TCP) both deal with virtual addresses, not physical addresses. The RDMA transport is built on top of ibverbs, where we register virtual address areas as memory regions for describing data transfers.
So for nbd, pinning the buffers and getting the physical address(es) to SPDK wouldn't be enough. Those physical address regions would also need to get dynamically mapped into the SPDK address space.
Do you have any profiling data that shows the relative cost of the data copy vs. the system calls themselves on your system? There may be some optimization opportunities on the system calls to look at as well.
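As a rough illustration of what a user-space virtual-to-physical lookup involves on Linux, the sketch below reads /proc/<pid>/pagemap. This is not how spdk_vtophys() is implemented; it only shows that the lookup starts from a virtual address that is already mapped into the process. The function names are made up. The pagemap format (one 64-bit entry per page, PFN in bits 0-54, present flag in bit 63) is from the kernel documentation, and the kernel reports the PFN as zero to callers without CAP_SYS_ADMIN.

```python
import os
import struct

PAGEMAP_ENTRY_BYTES = 8
PFN_MASK = (1 << 55) - 1   # bits 0-54: page frame number
PRESENT_BIT = 1 << 63      # bit 63: page is present in RAM

def decode_pagemap_entry(entry, vaddr, page_size):
    """Turn one pagemap entry into a physical address, or None."""
    if not entry & PRESENT_BIT:
        return None            # not resident, so there is no stable address
    pfn = entry & PFN_MASK
    if pfn == 0:
        return None            # PFN hidden from unprivileged callers
    return pfn * page_size + (vaddr % page_size)

def virt_to_phys(vaddr, pid="self"):
    """Look up the physical address behind vaddr in the given process."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek((vaddr // page_size) * PAGEMAP_ENTRY_BYTES)
        data = f.read(PAGEMAP_ENTRY_BYTES)
    if len(data) != PAGEMAP_ENTRY_BYTES:
        return None            # address is outside the mapped range
    (entry,) = struct.unpack("=Q", data)   # entries are native-endian u64
    return decode_pagemap_entry(entry, vaddr, page_size)
```

Going the other direction, from a bare physical address handed over by the kernel to something the SPDK stack can pass around, has no equivalent file to read, which is why the buffer would first have to be mapped into the SPDK process.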
From: "Luse, Paul E" <paul.e.luse(a)intel.com>
Date: Sunday, August 11, 2019 at 12:53 PM
To: "Mittal, Rishabh" <rimittal(a)ebay.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.org>
Cc: "Kadayam, Hari" <hkadayam(a)ebay.com>, "Chen, Xiaoxi" <xiaoxchen(a)ebay.com>, "Szmyd, Brian" <bszmyd(a)ebay.com>
Subject: RE: NBD with SPDK
Thanks for the question. I was talking to Jim and Ben about this a bit, one of them may want to elaborate but we’re thinking the cost of mmap and also making sure the memory is pinned is probably prohibitive. As I’m sure you’re aware, SPDK apps use spdk_alloc() with the SPDK_MALLOC_DMA which is backed by huge pages that are effectively pinned already. SPDK does virt to phy transition on memory allocated this way very efficiently using spdk_vtophys(). It would be an interesting experiment though. Your app is not in a VM right?
From: Mittal, Rishabh [mailto:email@example.com]
Sent: Saturday, August 10, 2019 6:09 PM
Cc: Luse, Paul E <paul.e.luse(a)intel.com>; Kadayam, Hari <hkadayam(a)ebay.com>; Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
Subject: NBD with SPDK
We are trying to use NBD and SPDK on the client side. The data path looks like this:
File System ----> NBD client ------>SPDK------->NVMEoF
Currently we are seeing high latency, on the order of 50 us, using this path. It seems that a data buffer copy happens for write commands from kernel to user space when SPDK nbd reads data from the nbd socket.
I think there could be two ways to prevent the data copy.
1. Memory map the kernel buffers into the SPDK virtual address space. I am not sure if it is possible to mmap a buffer, or what the impact of calling mmap for each IO would be.
2. If the NBD kernel side gives the physical address of a buffer, SPDK can use that to DMA it to NVMEoF. I think SPDK must also be changing a virtual address to a physical address before sending it to NVMEoF.
Option 2 makes more sense to me. Please let me know if option 2 is feasible in SPDK.
SPDK mailing list
I saw that the value of TSAS byte 1 (RDMA Provider Type - RDMA_PRTYPE) is 0 (not specified).
We expect this value to be 2 (Infiniband RoCEV2).
Everything is working properly and we succeeded in connecting, but I just wanted to verify that the 'not specified' value is OK and supports Infiniband RoCEV2.
From: Michal BenHaim <michal.benhaim(a)kaminario.com>
Sent: Sunday, August 11, 2019 11:42 AM
To: Limor Halutzi <Limor.Halutzi(a)kaminario.com>
Subject: Fw: [SPDK] Comparing discovery log page to the spec definition
Michal Ben Haim
From: Walker, Benjamin <benjamin.walker(a)intel.com>
Sent: Friday, August 9, 2019 7:26 PM
Cc: Michal BenHaim
Subject: Re: [SPDK] Comparing discovery log page to the spec definition
On Wed, 2019-08-07 at 12:54 +0000, Limor Halutzi wrote:
> Hi Guys,
> I am working on comparing the discovery log page to the spec and I noticed
> that there is a trello task on this issue (Audit and make fully spec compliant
> the implementation of the Discovery Log Page) and I have some questions:
> 1. TREQ (byte 03) - We don't need a secure session and in our system this
> field is set to not specified (0).
> Do we need to change it to not required (2)?
I think we should change this to not required as you indicate, but what we're
returning seems technically correct.
> What is the difference between these two options?
> 2. I compared the log page entry struct with the log page entry from the
> spec and they are the same.
> In addition, I checked that the values are valid for my machine.
> What is required to close this task?
I moved it to closed. Thanks!
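For anyone repeating this audit, a small sketch of decoding the TREQ byte from a raw discovery log page entry may help. It assumes the entry is available as bytes and uses the offset quoted above (byte 3). The values 0 (not specified) and 2 (not required) are the ones discussed in this thread; 1 (required) is my reading of the NVMe-oF specification. The names are made up and this is not SPDK code.

```python
TREQ_OFFSET = 3  # byte 03 of a discovery log page entry

# Secure channel requirement, held in the low two bits of TREQ.
SECURE_CHANNEL = {
    0: "not specified",
    1: "required",
    2: "not required",
}

def secure_channel_requirement(entry: bytes) -> str:
    """Return the secure channel setting advertised by one log entry."""
    treq = entry[TREQ_OFFSET]
    return SECURE_CHANNEL.get(treq & 0x3, "reserved")
```

Calling secure_channel_requirement() on each entry makes it easy to see which of the two settings a target currently reports before and after a change.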