On Wed, 2017-03-22 at 05:51 -0700, Paul Von-Stamwitz wrote:
We had an off-line discussion on implementing an NVMe pass-through
command at the bdev level, and I thought to include the community in
the discussion. Our primary use case is for the retrieval of
SMART/Health information via the Get Log page, but it could be used
for other purposes.
How do you envision this?
Should the upper layer send down a raw NVMe command which gets passed
down to blockdev_nvme and is handled similarly to nvmf/direct.c?
Since multiple bdev contexts can share the same admin queue pair,
should we limit which context is allowed to use the pass-through?
Technically we could have an I/O pass-through, but I think we should
limit it to admin commands.
Should we put checks on what is allowed (i.e. read-only commands) or
let anything go through?
We've spent some time thinking about this so let me lay out how I think
this should work. The bdev layer today supports 5 commands -
read/write/unmap/flush/reset. We're (collectively) proposing to add a
6th - NVMe passthru. The command would consist of a 64 byte NVMe
command, an optional pointer to a data buffer, and the length of that
buffer. This part is simple.
The tricky part, as you note, is that there are really two categories
of NVMe commands - I/O and Admin - and they need to be submitted on
different types of NVMe queue pairs. The API the user sees at the bdev
level exposes an spdk_io_channel object on which the user may submit
commands, but that channel is not typed. In reality, it is a thin
wrapper around an NVMe I/O queue pair, so it cannot be used for Admin
commands. This has worked well until now, because the user never needed
to submit any operation that resulted in an Admin command from the bdev
layer. The easiest way to implement NVMe passthru would be to only
allow I/O commands, but that isn't particularly interesting. All of the
commands that we envision people would want to send are Admin commands.
The SPDK NVMe driver already protects the one global Admin queue pair
per controller using a lock, so it's safe to submit Admin commands from
multiple threads. On the submission side in the bdev layer, then, we
can look at the NVMe command being passed in and decide if it is Admin
or I/O and route to the associated NVMe I/O queue pair or the global
Admin queue pair. That part will work out fine.
The challenge is on the completion side. The spdk_io_channel object is
tied to a thread, so that means each NVMe I/O queue pair is also tied
to a thread. When the user submits a command on a channel, they provide
a callback that will be called when the command completes. The bdev
layer guarantees that the callback will be called on the thread that
the command was submitted on (i.e. the one associated with the
channel). Today, since all the commands go through I/O queue pairs, we
set up a poller per channel (on the thread it is associated with) that
polls the underlying NVMe I/O queue pair. If we were to instead route
some commands to the global Admin queue pair, we'll run into the case
where that Admin queue pair was polled by a different thread, causing
the completion callback to execute on a different thread. This would
then require users of the bdev layer to coordinate with locks, which is
something we very much want to avoid.
I think the solution is to add a completion queue to each
spdk_io_channel in the blockdev_nvme code. We can have a single thread
polling the Admin queue pair as we do today, but when each command
completes it drops a message onto the appropriate spdk_io_channel's
completion queue. The next time that spdk_io_channel is polled for
completions, it can execute the user callbacks (which will now be on
the correct thread).
There is another set of problems that I haven't touched on yet either.
The bdev layer doesn't expose the concept of a namespace or LUN - each
bdev is just one sequential collection of blocks. For devices that
support multiple namespaces/LUNs, we expose a different bdev for each
one. If the user is limited to just doing I/O commands, this works out
fine. However, a number of Admin commands can change the size or number
of namespaces, or change the state of the NVMe controller more
globally, so sending an Admin command to a bdev may impact other bdevs
backed by the same controller.
I think there are a few ways we could work this out. One way is to only
allow informational Admin commands through (log pages and such). This
mostly fixes the problem, except getting a log page actually does
update global state on the controller regarding asynchronous event
requests. However, if we don't allow the user to generate asynchronous
event requests through the bdev layer (and handle them entirely
internally), then I think we can still work this out.
The other option is to only allow NVMe passthrough on devices with one
namespace/LUN and just block it otherwise. This is also reasonably
simple and probably meets your needs.
I would appreciate your thoughts, since we would like to get started
on this soon.