On Wed, 2018-05-30 at 14:46 +0300, Baruch Even wrote:
As for threading, the only thing I need to integrate with my green threads is a
function call that performs a context switch if needed (an spdk_yield() of
sorts), one that is empty/a no-op for most users and can be a compile-time
option (weak symbol? define?) to use some user-provided function. This would
immediately integrate with any green-thread system by switching to another
user thread and returning for a poll once other actions are taken. This way
the POSIX thread will not block and no special async logic needs to happen.
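A minimal sketch of the hook I have in mind (spdk_yield() is a name I'm proposing, not an existing SPDK API, and the wait loop below is just a stand-in for wherever the driver would otherwise spin):

```c
/* Proposed hook: a no-op default yield that a green-thread runtime can
 * override at link time via a weak symbol.  Hypothetical, not SPDK API. */
__attribute__((weak)) void spdk_yield(void)
{
	/* default: do nothing, so regular users pay only an empty call */
}

/* Inside any potentially blocking wait the driver would invoke the hook on
 * every spin, letting a coroutine scheduler run other user threads. */
int wait_for_completion(volatile int *done)
{
	int polls = 0;

	while (!*done) {
		spdk_yield();		/* green-thread users context-switch here */
		if (++polls > 3)
			*done = 1;	/* stand-in for real completion detection */
	}
	return polls;
}
```

A runtime using coroutines would then simply define its own spdk_yield() that switches to another user thread; everyone else links the empty default.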
Understood, and we'd like to accommodate frameworks like the one you are using.
See the ongoing thread titled "SPDK Dynamic Threading Model"; we're just getting
started on that effort. In parallel, any time you see code that performs a
blocking or excessively long operation, please do let us know (or help us remove
it).
As for the low level api, my application already has its own queue for each
device, and we split our requests as we need to and combine them as possible
to improve performance. It took us by surprise to find that SPDK internally
breaks requests into smaller chunks, queues them internally, and that
completions may not really complete the full request. There was even a scenario
where we hit a deadlock because we overflowed the spdk queues. Once we
discovered this we made sure to expose all the device constraints to our
application and to split the IOs with the device limitations in mind, so now
all the extra work and checks that SPDK does are a waste of CPU cycles as far
as we are concerned.
SPDK will indeed split I/O automatically based on device characteristics.
Splitting is non-trivial because I/O may need to be split for a large number of
reasons (max I/O size, device-internal striping, PRP/SGL restrictions, etc.).
Getting this right is very difficult and I'd highly recommend that people avoid
re-implementing it (see the top half of lib/nvme/nvme_ns_cmd.c). SPDK will also
queue I/O sent beyond the available queue depth. These were both done so that
code using the NVMe driver does not have to deal with device-specific concerns
(i.e. the splitting required on one device is different than the splitting on
another, and the queue depth available on one device is different than the
queue depth available on another).
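To illustrate why this lives in the driver, here is a toy version of the splitting loop (the constants are hypothetical device limits, not SPDK internals, and real splitting also has to honor PRP/SGL restrictions):

```c
#include <stdint.h>

/* Hypothetical device limits - the point is that they differ per device. */
#define MAX_XFER  (128 * 1024)	/* max transfer size */
#define STRIPE_SZ (64 * 1024)	/* device-internal stripe size */

/* Split one logical I/O so that no child command exceeds the max transfer
 * size or crosses a stripe boundary; returns the number of children. */
int split_io(uint64_t off, uint64_t len)
{
	int children = 0;

	while (len > 0) {
		uint64_t chunk = len < MAX_XFER ? len : MAX_XFER;
		/* shrink the chunk so it ends at the next stripe boundary */
		uint64_t to_boundary = STRIPE_SZ - (off % STRIPE_SZ);

		if (chunk > to_boundary)
			chunk = to_boundary;
		off += chunk;
		len -= chunk;
		children++;	/* one child command per chunk */
	}
	return children;
}
```

For example, a 200 KiB read starting 16 KiB into a stripe becomes four children (48K + 64K + 64K + 24K) on this hypothetical device, but a different count on a device with different limits - which is exactly why callers shouldn't hard-code it.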
The splitting and queueing should all be entirely transparent to the user
though. There should never be a case where a user completion callback is called
before all split fragments of the I/O have completed - you should get your
completion callback exactly once for each I/O you submitted at the public NVMe
API level.
There is also the issue that we track our IOs ourselves, and SPDK, because of
the extra work it does, has IO tracking of its own. This means there is an
extra memory allocation on the data path for no good reason from our point of
view.
The memory that SPDK uses for this tracking is allocated when the queue pair is
allocated, so while an extra cache line is certainly accessed, it isn't a
memory allocation on the I/O path.
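The pattern is roughly a preallocated tracker pool - a sketch with illustrative names, not the actual SPDK internals:

```c
#include <stddef.h>
#include <stdlib.h>

/* Per-I/O tracking state carved out of one allocation made when the queue
 * pair is created, so the hot path only pops a free-list entry. */
struct tracker {
	struct tracker *next;	/* free-list link */
	void *cb_arg;		/* per-I/O state the driver needs */
};

struct qpair {
	struct tracker *free_head;
	struct tracker *pool;	/* single allocation, sized at creation */
};

int qpair_init(struct qpair *q, size_t depth)
{
	q->pool = calloc(depth, sizeof(struct tracker));
	if (q->pool == NULL)
		return -1;
	q->free_head = NULL;
	for (size_t i = 0; i < depth; i++) {
		q->pool[i].next = q->free_head;
		q->free_head = &q->pool[i];
	}
	return 0;
}

/* Hot path: no allocation, just a pointer pop (NULL = queue saturated). */
struct tracker *tracker_get(struct qpair *q)
{
	struct tracker *t = q->free_head;

	if (t != NULL)
		q->free_head = t->next;
	return t;
}

void tracker_put(struct qpair *q, struct tracker *t)
{
	t->next = q->free_head;
	q->free_head = t;
}
```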
When I'm building my command in the queue I need to keep the list of buffers in
some data structure; currently I keep it in a vector which then gets translated
to a PRP/SGL when the command is submitted. I could save that step if I could
maintain the PRP/SGL myself and pass the ready-made list with the command. This
would save me some memory and time.
Building valid PRP/SGL data structures is challenging, but for those up to the
task I'd be willing to consider an API that let the user provide them (we'll see
if the rest of the community agrees). I'm always in favor of improvements for
performance, but that seems like a lot of extra work to save what is probably
just a handful of CPU instructions.
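For reference, the rules come from the NVMe spec rather than from SPDK: PRP1 may carry an in-page offset, every subsequent entry must be page-aligned, and transfers longer than two pages need a separate PRP list. A simplified sketch of the entry computation (it ignores the PRP-list-page chaining a real driver must do):

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Compute the PRP entries for one physically contiguous buffer.
 * Returns the number of entries written, or -1 on overflow. */
int build_prps(uint64_t phys, uint32_t len,
	       uint64_t *entries, size_t max_entries)
{
	size_t n = 0;
	/* PRP1 is the only entry allowed to have an in-page offset */
	uint32_t first = PAGE_SIZE - (uint32_t)(phys % PAGE_SIZE);

	if (first > len)
		first = len;
	if (n >= max_entries)
		return -1;
	entries[n++] = phys;
	phys += first;
	len -= first;

	while (len > 0) {	/* all later entries are page-aligned */
		uint32_t chunk = len < PAGE_SIZE ? len : PAGE_SIZE;

		if (n >= max_entries)
			return -1;
		entries[n++] = phys;
		phys += chunk;
		len -= chunk;
	}
	return (int)n;
}
```

Even this sketch glosses over the hard parts (chaining PRP list pages, SGL descriptor types, per-controller quirks), which is why I'd rather users not re-implement it.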
All of this just means that spdk is wasting some CPU cycles on things we don't
want it to do, and after all the main reason to use spdk and dpdk is to eke out
all the possible performance from the system (CPU/nvme devices) through a more
integrated design. I understand that other users of spdk do want the higher
levels and do need the features provided, but we have better places to perform
these actions.
On Thu, May 24, 2018 at 7:44 PM Walker, Benjamin <benjamin.walker(a)intel.com> wrote:
> Hi Baruch,
> Regarding blocking threads – only the NVMe initialization path does that
> today. We considered doing a fully asynchronous initialization path for NVMe
> devices, but we received a lot of feedback that it makes it very difficult
> to use. I’d personally be open to adding a second initialization path that
> was entirely asynchronous and required the user to explicitly poll some
> initialization context to advance the state machine until the devices come
> online. This clearly makes much more sense in the context of hot plug since
> the device initialization may occur on a thread that is also processing I/O.
> Contributions are always very welcome!
> I do expect to see some fairly major movement this year regarding
> abstractions around threading models and I’d love to define some abstraction
> that allows SPDK to nicely integrate into applications using green threads
> or coroutines without tying SPDK to any specific implementation. This is
> definitely at the front of mind currently (as is dynamic memory management –
> the other major hurdle for integrating with existing applications). I think
> over the coming months there are going to be several lively discussions on
> the mailing list and in IRC as we try to sort all of this out.
> Regarding a lower level API – we have some lower level interfaces, although
> not as low-level as you’re talking about. See spdk_nvme_ctrlr_cmd_io_raw(),
> for example, which lets you build an arbitrary NVMe command and send it. We
> haven’t exposed any knobs to control internal queueing or to allow a user to
> build their own SGL/PRP. Can you outline what the use cases for those are?
> Why would you want to deviate from what SPDK does by default today? We’re
> always amenable to providing more control, if that control has value.
> From: SPDK [mailto:firstname.lastname@example.org] On Behalf Of Baruch Even
> Sent: Thursday, May 24, 2018 1:17 AM
> To: Storage Performance Development Kit <spdk(a)lists.01.org>
> Subject: Re: [SPDK] Handling of physical disk removals
> I logged the issue as https://github.com/spdk/spdk/issues/310
> I do call detach but I do not use the empty probe call, instead I have
> another part of my application monitor the same uevent information and
> trigger a larger process to remove the drive. I also have this triggered by
> the timeout callback. I had to remove the timeout callback or I'd get
> bombarded with it until the removal actually happened; it looks like the
> timeout callback assumes that the reset call will happen from inside it and
> complete before it returns, but that doesn't work very well with our design.
> I have an issue with the threading model you are using as we are using user-
> space-threads (aka green threads or coroutines) and spdk will block the
> entire application for no good reason. It would have been nice if there were
> a yield callback I could provide so you won't block the main thread. For
> now I need to resort to sending the spdk admin calls to a separate thread
> since spdk may block.
> If I am on a ranting roll, I also would have preferred a lower-level
> interface than currently provided, one that doesn't implement so much
> internal queuing and internal allocation and doesn't hide all the nvme
> details (prp, sgl), though by that stage I'm probably better off with unvme
> instead of spdk. If it supported uio_pci_generic in addition to vfio-pci it
> would probably be a better direction for me.
> On Wed, May 23, 2018 at 8:18 PM Harris, James R <james.r.harris(a)intel.com> wrote:
> > Hi Baruch,
> > Thanks for raising this issue – there are absolutely changes that SPDK
> > needs to make here.
> > Can you describe your code path a bit more? Or let me try to guess and
> > you can tell me where I’m wrong.
> > 1) You’re using spdk_nvme_probe() with a NULL trid and a remove_cb
> > handler to detect physically removed devices.
> > 2) In your remove_cb handler, you call spdk_nvme_detach().
> > The SPDK bdev nvme module doesn’t call spdk_nvme_detach() in its remove_cb,
> > which is why the SPDK automated tests don’t run into this issue. But I
> > don’t think this is correct – we should be calling spdk_nvme_detach() at
> > some point to clean up any allocated resources. It needs to make sure any
> > associated IO channels are freed up first (to avoid racing between the
> > remove callback and different threads submitting IO to that removed
> > controller).
> > Could you file this as a bug in github? Please add any additional details
> > on how you’re hitting this issue if it’s different than what I’ve guessed
> > above.
> > https://github.com/spdk/spdk/issues
> > Thanks,
> > Jim
> > From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Baruch Even <baruch(a)weka.io>
> > Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > Date: Wednesday, May 23, 2018 at 1:44 AM
> > To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > Subject: [SPDK] Handling of physical disk removals
> > Hi,
> > I'm using spdk for local nvme through the nvme interface, I find that
> > physical disk removals are not handled properly for my use case and wonder
> > if others see it that way as well and if there is an intention to fix
> > this.
> > Our system uses long-running processes that control one or more disks at a
> > time. If a disk fails it may drop completely from the pcie bus, and it
> > will also look the same if the disk is physically removed (say, a
> > technician mistakes which disk he should replace).
> > The problem that I see is that spdk doesn't consider that a device may
> > completely disappear from the bus and will try to release the io qpair by
> > sending the delete io sq and delete io cq commands. Both of these will
> > never get an answer (the device is not on the pcie bus anymore) and there
> > is no timeout logic in that code path. This means two things: the process
> > will halt forever, and there is an effective memory leak, which currently
> > means that we need to restart the process. Now, our system is resilient
> > enough that restarting the process is not a big deal, but it is a very
> > messy way to go about handling a physical drive removal.
> > Have others seen this behavior? Does it bother others?
> > For my own use I put a timeout in there of a few seconds and that solves
> > it for me.
> > Baruch Even
> > --
> > Baruch Even, Software Developer
> > E baruch(a)weka.io
> > www.weka.io
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> > https://lists.01.org/mailman/listinfo/spdk