I read from the SPDK doc "NVMe Driver Design -- Scaling Performance" (here
<http://www.spdk.io/doc/nvme.html#nvme_design>), which says:
" For example, if a device claims to be capable of 450,000 I/O per second
at queue depth 128, in practice it does not matter if the driver is using 4
queue pairs each with queue depth 32, or a single queue pair with queue
Does this consider the queuing latency? I am guessing the latency in the
two cases will be different (qp/qd = 4/32 versus qp/qd = 1/128). In the
4 threads case, the latency will be 1/4 of the 1 thread case. Do I get it
right?
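The back-of-the-envelope reasoning behind that guess (Little's law, average
latency ~ outstanding I/Os / IOPS, and assuming each queue pair's I/Os are
still served at the device's full 450,000 IOPS):

  - 1 qpair at QD 128: 128 / 450,000 ~ 284 us
  - 4 qpairs at QD 32:  32 / 450,000 ~  71 us per queue pair, i.e. about 1/4

If instead the 450,000 IOPS is shared across the 4 queue pairs (each seeing
~112,500 IOPS), the two cases come out the same (32 / 112,500 ~ 284 us),
which is part of what I am asking about.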
If so, then I am confused, because the document also says:
"In order to take full advantage of this scaling, applications should
consider organizing their internal data structures such that data is
assigned exclusively to a single thread."
Please correct me if I have it wrong. I understand that if a dedicated I/O
thread has total ownership of the I/O data structures, there is no lock
contention to slow down the I/O. I believe BlobFS is also designed with
this philosophy, in that only one thread does the I/O.
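As a concrete picture of that single-thread-ownership pattern, here is
roughly what I have in mind (a sketch only, not taken from the SPDK sources;
controller probing and error paths are omitted, and the
spdk_nvme_ctrlr_alloc_io_qpair() signature may differ between SPDK versions):

#include <stdio.h>
#include "spdk/nvme.h"

/* Everything a worker thread touches lives in its own context:
 * one queue pair plus the data structures only this thread owns,
 * so no locking is needed on the I/O path. */
struct worker_ctx {
	struct spdk_nvme_ns *ns;
	struct spdk_nvme_qpair *qpair;
};

static void
write_done(void *cb_arg, const struct spdk_nvme_cpl *cpl)
{
	/* Invoked from process_completions() below, i.e. on the owning thread. */
	(void)cb_arg;
	if (spdk_nvme_cpl_is_error(cpl)) {
		fprintf(stderr, "write failed\n");
	}
}

/* Each worker thread runs something like this with its own ctx. */
static void
worker_write(struct spdk_nvme_ctrlr *ctrlr, struct worker_ctx *ctx,
	     void *buf, uint64_t lba, uint32_t lba_count)
{
	ctx->qpair = spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, NULL, 0);
	if (ctx->qpair == NULL) {
		return;
	}
	if (spdk_nvme_ns_cmd_write(ctx->ns, ctx->qpair, buf, lba, lba_count,
				   write_done, ctx, 0) != 0) {
		return;
	}
	/* Completions for this qpair are only reaped here, on this thread. */
	while (spdk_nvme_qpair_process_completions(ctx->qpair, 0) == 0) {
		/* busy poll */
	}
}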
But consider the RocksDB case: if the shared data structures are already
largely taken care of by the RocksDB logic via locking (which is inevitable
anyway), the I/O requests each RocksDB thread sends to BlobFS could also go
through that thread's own queue pair. More I/O threads would mean a shorter
queue depth per queue pair and a smaller queuing delay.
Even if some FS metadata operations require locking, I would guess such
metadata operations make up only a small portion of the workload.
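Concretely, what I have in mind is roughly this (only a sketch; I am
assuming the synchronous BlobFS calls used by the RocksDB env plugin,
spdk_fs_alloc_io_channel_sync() and spdk_file_write(), and the exact names
may differ between SPDK versions):

#include "spdk/blobfs.h"

/* Hypothetical per-RocksDB-thread context: each RocksDB thread gets its
 * own BlobFS channel and, underneath it, its own queue pair. */
struct rocksdb_thread_ctx {
	struct spdk_io_channel *ch;
};

static void
rocksdb_thread_init(struct spdk_filesystem *fs, struct rocksdb_thread_ctx *ctx)
{
	ctx->ch = spdk_fs_alloc_io_channel_sync(fs);
}

static int
rocksdb_thread_write(struct rocksdb_thread_ctx *ctx, struct spdk_file *file,
		     void *payload, uint64_t offset, uint64_t length)
{
	/* Issued on the calling RocksDB thread through its own channel,
	 * instead of being funneled through a single BlobFS I/O thread. */
	return spdk_file_write(file, ctx->ch, payload, offset, length);
}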
Therefore, is it a viable idea to have more I/O threads in BlobFS to serve
the multi-threaded RocksDB for a smaller delay? What would be the pitfalls
or challenges?
Any thoughts/comments are appreciated. Thank you very much!
On our system we make extensive use of hugepages, so only a fraction of the hugepages are available for SPDK, and the memory allocated may be fragmented at the hugepage level.
Initially we used "--socket-mem=2048,0", but the init time was very long, probably because DPDK builds its hugepage info from all the hugepages on the system.
Currently I am working around the long init time with this patch to DPDK:
diff --git a/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c b/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c
index 18858e2..f7e8199 100644
@@ -97,6 +97,10 @@ get_num_hugepages(const char *subdir)
 	if (num_pages > UINT32_MAX)
 		num_pages = UINT32_MAX;
+
+#define MAX_NUM_HUGEPAGES (2048)
+	if (num_pages > MAX_NUM_HUGEPAGES)
+		num_pages = MAX_NUM_HUGEPAGES;
To deal with the fragmentation, I am running a small program that initializes DPDK before the rest of the hugepage owners start allocating their pages (see the sketch below).
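The program is basically just an early EAL init (sketch below; the program
name is made up, and I have left out how the rest of our system later picks
up these pages):

#include <rte_eal.h>

int
main(void)
{
	/* Initialize DPDK's EAL before the other hugepage consumers start,
	 * so the hugepages it maps are still contiguous. */
	char *eal_args[] = { "hugepage_prealloc", "--socket-mem=2048,0" };
	int eal_argc = sizeof(eal_args) / sizeof(eal_args[0]);

	if (rte_eal_init(eal_argc, eal_args) < 0) {
		return 1;
	}

	/* ... rest of the program omitted ... */
	return 0;
}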
Is there a better way to limit the number of pages that DPDK works on, and to preallocate a contiguous range of hugepages?
Recently I have been writing an SPDK block device in a multi-core environment.
I initialize the iSCSI target with multiple cores (./app/iscsi_tgt/iscsi_tgt -c
./etc/spdk/iscsi.conf -m 0xFF0).
I use spdk_bdev_write_blocks() to send IO to the lower layer of my device,
which is an NVMe device.
But I found that the lcore executing the IO callback is not the same as the
lcore that called spdk_bdev_write_blocks().
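For reference, my submission path looks roughly like this (simplified; the
buf/offset variables are placeholders, and I am using the desc/channel based
spdk_bdev_write_blocks() signature):

#include "spdk/stdinc.h"
#include "spdk/bdev.h"

static void
write_complete(struct spdk_bdev_io *bdev_io, bool success, void *cb_arg)
{
	/* I expected this to run on the lcore that submitted the write,
	 * but it shows up on a different lcore. */
	(void)success;
	(void)cb_arg;
	spdk_bdev_free_io(bdev_io);
}

static int
submit_write(struct spdk_bdev_desc *desc, struct spdk_io_channel *ch,
	     void *buf, uint64_t offset_blocks, uint64_t num_blocks)
{
	/* ch was obtained with spdk_bdev_get_io_channel(desc) on this lcore. */
	return spdk_bdev_write_blocks(desc, ch, buf, offset_blocks, num_blocks,
				      write_complete, NULL);
}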
Is this behavior correct?
Or did I make a mistake somewhere?
I separate the data structure into multiple partitions and assign each
lcore a partition.
If the lcore that executes the callback is not the same as the one that sent
the IO, then I have to use the event architecture to send the IO result to
the correct lcore (roughly as sketched below). That is a little troublesome.
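That workaround would look something like this (a sketch only; partition_ctx
and handle_io_result are placeholder names of mine, and I am assuming the
spdk_event_allocate()/spdk_event_call() pair from the event framework):

#include "spdk/stdinc.h"
#include "spdk/env.h"
#include "spdk/event.h"

/* Placeholder for the per-lcore partition that owns this IO's data. */
struct partition_ctx {
	uint32_t owner_lcore;
};

static void
handle_io_result(void *arg1, void *arg2)
{
	/* Runs on the lcore that owns the partition. */
	struct partition_ctx *part = arg1;
	bool success = (arg2 != NULL);

	(void)part;
	(void)success;
}

static void
forward_result_to_owner(struct partition_ctx *part, bool success)
{
	if (part->owner_lcore == spdk_env_get_current_core()) {
		handle_io_result(part, (void *)(uintptr_t)success);
		return;
	}

	/* The completion arrived on the wrong lcore: bounce it over
	 * to the owning lcore with an event. */
	struct spdk_event *e = spdk_event_allocate(part->owner_lcore,
						   handle_io_result,
						   part,
						   (void *)(uintptr_t)success);
	spdk_event_call(e);
}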
So, can anybody give me some hints on how to make the lcore that sends the
IO and the lcore that executes the IO callback the same?
Any hints are appreciated
Thank you very much.