Many thanks for the detailed explanation. I will experiment with such a patch inclusive
of passing in some flags so that the driver is aware of the intent never to split, and
then attempt to evaluate for myself, at least initially, whether the overhead and extra
code are worth it for this case. I’ll also dig into understanding the basis for the trackers. While I
did of course see them when I was going through the code, I had hoped that maybe they only
served some ancillary purpose involving something like performance or debugging. As
evidenced by the content of your reply, apparently that was a bad theory on my part,
*grin*, but your providing context to their relationship with PRP lists I’m sure will
serve as very helpful toward my understanding of their use. Thank you again.
On Aug 4, 2017, at 6:46 PM, Harris, James R wrote:
Great question. In a lot of cases, saving off cb_fn/cb_arg would work as you suggest.
But there are a few cases where it would not – I’ll try to explain.
The io_queue_requests value represents how many nvme_request structures are allocated per
IO queue pair. In the normal case, one call to spdk_nvme_read/write (or one of its
variants) consumes one nvme_request structure and results in one I/O to the SSD. But
there are a few cases where we need to split the user request into multiple I/O to the SSD
– each of which will consume its own nvme_request structure (plus one for the original parent request):
1) If the user request exceeds the controller’s MDTS (Max Data Transfer Size), SPDK will
split the request into multiple I/O, each of which is MDTS or less in size.
2) Some Intel NVMe SSDs benefit from splitting I/O in host software if the I/O would
otherwise span a 128KB boundary. For example, a 64KB I/O that starts at offset 96KB (LBA
192 on a 512 byte formatted namespace) would span such a 128KB boundary. The SSD will
handle such an I/O correctly, but performance will be better if the driver splits it into
two I/O – 32KB with offset LBA 192, and 32KB with offset LBA 256. The SPDK NVMe driver
does this splitting to ensure the best performance on these SSDs and to relieve the
application of the burden of understanding these boundaries and doing the splits itself.
3) Most NVMe SSDs do not support SGL for describing the host memory buffers, but instead
use a format called PRP (Physical Region Pages). PRP has some limitations compared to
full SGL. If you’re not familiar with PRP, the NVMe specification has a good explanation.
Some usage models such as vhost generate I/O that are not PRP compatible, which also
requires splitting the I/O into two or more children so that each is PRP compatible.
So, in the scenario we are discussing, if one request completes, but the new I/O
(submitted from the user’s callback) requires more than one request due to one of these
conditions, the spdk_nvme_read/write function would still have an insufficient number of
free nvme_request structures available and would fail.
Your suggestion is still a good one, especially for applications that know they will not
submit I/O which require splitting, and that want to depend on an ENOMEM to stop
submitting new I/O requests and wait for a completion to submit a new one in the context
of the completion handler. Would you be willing to submit your patch to GerritHub so the
SPDK community can review it and merge it upstream?
P.S. - There is also the matter of nvme_tracker structures. An nvme_tracker structure is
tied to an nvme_request that has been submitted to the SSD. nvme_tracker structures have
memory allocated for a worst-case PRP list (4KB) and one is allocated per slot in the NVMe
qpair (io_queue_size). Note that this is a lot more memory than the size of an
nvme_request (248 bytes). If an nvme_request is submitted to the qpair and no
nvme_tracker is available, that nvme_request is queued and will be submitted later once an
nvme_tracker becomes available. So freeing the nvme_request earlier would usually allow
the application to submit a new I/O from the context of the completion callback, but there
would be a bit of inefficiency if the nvme_tracker was not also freed earlier. Freeing the
tracker earlier too would add more complication to the I/O path, though, and needs more
investigation to quantify the level of inefficiency and determine whether the extra
complication is even warranted.
On 8/4/17, 8:58 AM, "SPDK on behalf of Lance Hartmann ORACLE"
<spdk-bounces(a)lists.01.org on behalf of lance.hartmann(a)oracle.com> wrote:
I just joined the SPDK email list after dabbling with the SPDK for a few weeks. I
looked through the SPDK email list’s archives and didn’t see any subjects focused on the
topic of order of operations in NVMe completions, hence my inaugural posting to this list
begins with a question about this. I did see and read through the thread in July 2016
(Re: Callback passed to spdk_nvme_ns_cmd_read not being called sometimes) but it didn’t
touch on the specific question I have. Please accept my apologies in advance if I missed
something else related to NVMe completions when scanning through the SPDK email archives.
In the SPDK NVMe driver I noted that the order of operations in processing completions
is as follows:
1. If a callback was provided, issue the callback.
2. Invoke nvme_free_request().
As the caller’s callback info was saved in the nvme_request struct allocated during
the I/O request, it appears logical that one would reference those respective callback
fields and invoke the callback, before freeing the request struct that contains them.
However, I’m curious to know if there’s any risk or possible concern for diverging from
the programming model if the following sequence is done instead:
1. Save off the req->cb_fn and req->cb_arg in new temporary pointers (local
variables on the stack of the completion function).
2. Invoke nvme_free_request() for the req.
3. If a callback was provided (as saved in the new temporary pointer), issue the
callback via the temporarily saved function pointer, passing the temporarily saved cb_arg.
I’m inquiring about this alteration for the scenario where in the I/O completion
callback code path, the application may (and likely will) wish to issue a new I/O.
However, if this is the first completion invoked after a maximum number of requests (based
on opts->io_queue_requests) have already been submitted, then the I/O callback’s
attempt to issue a new I/O will fail because there are no free requests available based on
the current implementation of the completion code. If on the other hand, the completion
code is altered to invoke nvme_free_request() before calling the application’s I/O
callback, a request entry will be available to that application as it was just freed, thus
enabling it to issue a new I/O successfully.
I made such a change in my sandbox and it appeared to work, but as I’m new to the
SPDK NVMe driver I’m eager to solicit feedback from the seasoned developers here to learn
if I’m overlooking something.