Persistent memory is cool. But, currently, you have to rewrite
your applications to use it. Wouldn't it be cool if you could
just have it show up in your system like normal RAM and get to
it like a slow blob of memory? Well... have I got the patch
series for you!
This series adds a new "driver" to which pmem devices can be
attached. Once attached, the memory "owned" by the device is
hot-added to the kernel and managed like any other memory. On
systems with an HMAT (a new ACPI table), each socket (roughly)
will have a separate NUMA node for its persistent memory so
this newly-added memory can be selected by its unique NUMA node.
This is highly RFC, and I really want the feedback from the
nvdimm/pmem folks about whether this is a viable long-term
perversion of their code and device mode. It's insufficiently
documented and probably not bisectable either.
1. The device re-binding hacks are ham-fisted at best. We
need a better way of doing this, especially so the kmem
driver does not get in the way of normal pmem devices.
2. When the device has no proper node, we default it to
NUMA node 0. Is that OK?
3. We muck with the 'struct resource' code quite a bit. It
definitely needs a once-over from folks more familiar
with it than I.
4. Is there a better way to do this than starting with a
copy of pmem.c?
Here's how I set up a system to test this thing:
1. Boot qemu with lots of memory: "-m 4096", for instance
2. Reserve 512MB of physical memory. Reserving a spot at 2GB
   physical seems to work: memmap=512M!0x0000000080000000
This will end up looking like a pmem device at boot.
3. When booted, convert fsdax device to "device dax":
ndctl create-namespace -fe namespace0.0 -m dax
4. In the background, the kmem driver will probably bind to the
   new device and hot-add its memory.
5. Now, online the new memory sections. Perhaps:
grep ^MemTotal /proc/meminfo
for f in `grep -vl online /sys/devices/system/memory/*/state`; do
	echo $f: `cat $f`
	echo online > $f
done
grep ^MemTotal /proc/meminfo
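The onlining loop in step 5 can be exercised without real hardware. A minimal sketch, assuming only a POSIX shell, that mimics the sysfs layout in a throwaway temp directory (the mktemp path is a stand-in; the real files live under /sys/devices/system/memory and need root):

```shell
# Simulate the onlining loop against a fake sysfs tree.
SYSFS=$(mktemp -d)
for i in 0 1 2 3; do
	mkdir "$SYSFS/memory$i"
	echo offline > "$SYSFS/memory$i/state"
done
echo online > "$SYSFS/memory0/state"	# pretend one section came up already

# Same shape as the loop above, just pointed at the fake tree:
for f in $(grep -vl online "$SYSFS"/memory*/state); do
	echo "$f: $(cat "$f")"
	echo online > "$f"
done

grep -l online "$SYSFS"/memory*/state | wc -l
```

Once the loop finishes, the final count should equal the number of memory section directories, since every state file now reads "online".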
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Dave Jiang <dave.jiang(a)intel.com>
Cc: Ross Zwisler <zwisler(a)kernel.org>
Cc: Vishal Verma <vishal.l.verma(a)intel.com>
Cc: Tom Lendacky <thomas.lendacky(a)amd.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Huang Ying <ying.huang(a)intel.com>
Cc: Fengguang Wu <fengguang.wu(a)intel.com>
This patch set provides functionality that will help to improve the
locality of the async_schedule calls used to provide deferred
initialization.
This patch set originally started out focused on just the one call to
async_schedule_domain in the nvdimm tree that was being used to defer the
device_add call. However, after doing some digging I realized the scope of
this was much broader than I had originally planned. As such, I went
through and reworked the underlying infrastructure, down to replacing the
queue_work call itself with a function of my own, and opted to try to
provide a NUMA-aware solution that would work for a broader audience.
In addition I have added several tweaks and/or clean-ups to the front of the
patch set. Patches 1 through 4 address a number of issues that were keeping
the existing async_schedule calls from showing the performance they could,
either by not scaling on a per-device basis or by risking a potential
deadlock. For example, patch 4 addresses the fact that we were calling
async_schedule once per driver instead of once per device; without
addressing that first, we would still have ended up with devices being
probed on a non-local node.
- Dropped nvdimm patch to submit later.
- It relies on code in the libnvdimm development tree.
- Simplified queue_work_near to just convert the node into a CPU.
- Split up the driver core and PM core patches.
- Renamed queue_work_near to queue_work_node.
- Added WARN_ON_ONCE if we use queue_work_node with a per-cpu workqueue.
- Added Acked-by for the queue_work_node patch.
- Continued rename from _near to _node to be consistent with queue_work_node.
- Renamed async_schedule_near_domain to async_schedule_node_domain.
- Renamed async_schedule_near to async_schedule_node.
- Added kerneldoc for the new async_schedule_XXX functions.
- Updated patch description for patch 4 to include data on potential gains.
- Added patch to consolidate use of need_parent_lock.
- Made asynchronous driver probing explicit about use of drvdata.
- Added patch to move async_synchronize_full to address a deadlock.
- Added bit async_probe to act as a mutex for probe/remove calls.
- Added back the nvdimm patch as the code it relies on is now in Linus's tree.
- Incorporated review comments on parent & device locking consolidation.
- Rebased on latest linux-next.
- Dropped the "This patch" or "This change" from the start of patch descriptions.
- Dropped unnecessary parenthesis in the first patch.
- Used the same wording for "selecting a CPU" in comments added in the first patch.
- Added kernel documentation for the async_probe member of struct device.
- Fixed up comments for async_schedule calls in patch 2.
- Moved code related to setting the async driver out of device.h and into dd.c.
- Added Reviewed-by for several patches.
- Fixed a typo which had kernel doc refer to "lock" when I meant "unlock".
- Dropped "bool X:1" to "u8 X:1" from the patch description.
- Added async_driver to the device_private structure to store the driver.
- Dropped unnecessary code shuffle from the async_probe patch.
- Reordered patches to move fixes up to the front.
- Added Reviewed-by for several patches.
- Updated cover page and patch descriptions throughout the set.
Alexander Duyck (9):
driver core: Move async_synchronize_full call
driver core: Establish clear order of operations for deferred probe and remove
device core: Consolidate locking and unlocking of parent and device
driver core: Probe devices asynchronously instead of the driver
workqueue: Provide queue_work_node to queue work near a given NUMA node
async: Add support for queueing on specific NUMA node
driver core: Attach devices on CPU local to device node
PM core: Use new async_schedule_dev command
libnvdimm: Schedule device registration on node local to the device
drivers/base/base.h | 4 +
drivers/base/bus.c | 46 ++---------
drivers/base/dd.c | 182 +++++++++++++++++++++++++++++++++++++++------
drivers/base/power/main.c | 12 +--
drivers/nvdimm/bus.c | 11 ++-
include/linux/async.h | 82 ++++++++++++++++++++
include/linux/device.h | 3 +
include/linux/workqueue.h | 2
kernel/async.c | 53 +++++++------
kernel/workqueue.c | 84 +++++++++++++++++++++
10 files changed, 380 insertions(+), 99 deletions(-)
This patch set proposes KUnit, a lightweight unit testing and mocking
framework for the Linux kernel.
Unlike Autotest and kselftest, KUnit is a true unit testing framework;
it does not require installing the kernel on a test machine or in a VM
and does not require tests to be written in userspace running on a host
kernel. Additionally, KUnit is fast: From invocation to completion KUnit
can run several dozen tests in under a second. Currently, the entire
KUnit test suite for KUnit runs in under a second from the initial
invocation (build time excluded).
KUnit is heavily inspired by JUnit, Python's unittest.mock, and
Googletest/Googlemock for C++. KUnit provides facilities for defining
unit test cases, grouping related test cases into test suites, providing
common infrastructure for running tests, mocking, spying, and much more.
## What's so special about unit testing?
A unit test is supposed to test a single unit of code in isolation,
hence the name. There should be no dependencies outside the control of
the test; this means no external dependencies, which makes tests orders
of magnitude faster. Likewise, since there are no external dependencies,
there are no hoops to jump through to run the tests. Additionally, this
makes unit tests deterministic: a failing unit test always indicates a
problem. Finally, because unit tests necessarily have finer granularity,
they are able to test all code paths easily, solving the classic problem
of difficulty in exercising error handling code.
## Is KUnit trying to replace other testing frameworks for the kernel?
No. Most existing tests for the Linux kernel are end-to-end tests, which
have their place. A well tested system has lots of unit tests, a
reasonable number of integration tests, and some end-to-end tests. KUnit
is just trying to address the unit test space, which is currently not
well covered.
## More information on KUnit
There is a bunch of documentation near the end of this patch set that
describes how to use KUnit and best practices for writing unit tests.
For convenience I am hosting the compiled docs here:
## Changes Since Last Version
- Updated patchset to apply cleanly on 4.19.
- Stripped down patchset to focus on just the core features (I dropped
mocking, spying, and the MMIO stuff for now; you can find these
patches here: https://kunit-review.googlesource.com/c/linux/+/1132),
as suggested by Rob.
- Cleaned up some of the commit messages and tweaked commit order a
bit based on suggestions.
On Mon, Nov 26, 2018 at 11:40:20AM +0100, gregkh(a)linuxfoundation.org wrote:
> The patch below does not apply to the 4.19-stable tree.
> If someone wants it applied there, or to any other stable or longterm
> tree, then please email the backport, including the original git commit
> id to <stable(a)vger.kernel.org>.
The fix for 4.19 is rather more complex because we don't have the right information in the right places. Dan, does this look right to you?
diff --git a/fs/dax.c b/fs/dax.c
index 0fb270f0a0ef6..b8dd66f1951a6 100644
@@ -227,7 +227,9 @@ static inline void *unlock_slot(struct address_space *mapping, void **slot)
* Must be called with the i_pages lock held.
static void *__get_unlocked_mapping_entry(struct address_space *mapping,
- pgoff_t index, void ***slotp, bool (*wait_fn)(void))
+ pgoff_t index, void ***slotp,
+ bool (*wait_fn)(struct address_space *mapping,
+ pgoff_t index, void *entry))
void *entry, **slot;
struct wait_exceptional_entry_queue ewait;
@@ -253,7 +255,7 @@ static void *__get_unlocked_mapping_entry(struct address_space *mapping,
- revalidate = wait_fn();
+ revalidate = wait_fn(mapping, index, entry);
@@ -261,7 +263,8 @@ static void *__get_unlocked_mapping_entry(struct address_space *mapping,
-static bool entry_wait(void)
+static bool entry_wait(struct address_space *mapping, unsigned long index,
+ void *entry)
@@ -393,12 +396,18 @@ static struct page *dax_busy_page(void *entry)
-static bool entry_wait_revalidate(void)
+static bool entry_wait_revalidate(struct address_space *mapping,
+ unsigned long index, void *entry)
+ * We're not going to do anything with this entry; wake the next
+ * task in line
+ put_unlocked_mapping_entry(mapping, index, entry);
* Tell __get_unlocked_mapping_entry() to take a break, we need
* to revalidate page->mapping after dropping locks
> ------------------ original commit in Linus's tree ------------------
> From 25bbe21bf427a81b8e3ccd480ea0e1d940256156 Mon Sep 17 00:00:00 2001
> From: Matthew Wilcox <willy(a)infradead.org>
> Date: Fri, 16 Nov 2018 15:50:02 -0500
> Subject: [PATCH] dax: Avoid losing wakeup in dax_lock_mapping_entry
> After calling get_unlocked_entry(), you have to call
> put_unlocked_entry() to avoid subsequent waiters losing wakeups.
> Fixes: c2a7d2a11552 ("filesystem-dax: Introduce dax_lock_mapping_entry()")
> Cc: stable(a)vger.kernel.org
> Signed-off-by: Matthew Wilcox <willy(a)infradead.org>
> diff --git a/fs/dax.c b/fs/dax.c
> index cf2394e2bf4b..9bcce89ea18e 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -391,6 +391,7 @@ bool dax_lock_mapping_entry(struct page *page)
> entry = get_unlocked_entry(&xas);
> + put_unlocked_entry(&xas, entry);
Attempts to build ndctl in an epel-6 environment fail with the following
error:
In file included from util/filter.c:24:
./daxctl/libdaxctl.h:22: error: redefinition of typedef 'uuid_t'
./ndctl/libndctl.h:26: note: previous declaration of 'uuid_t' was here
Newer compilers discard the error since they see the types are
compatible, but the definition should come from uuid.h when built
with libuuid. Arrange for an AC_DEFINE of the HAVE_UUID symbol when
libuuid is detected by PKG_CHECK_MODULES.
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
configure.ac | 3 ++-
daxctl/libdaxctl.h | 2 +-
ndctl/libndctl.h | 2 +-
3 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/configure.ac b/configure.ac
index bb6b03324ea8..de5b84cec670 100644
@@ -112,7 +112,8 @@ AM_CONDITIONAL([ENABLE_POISON],
+ [AC_DEFINE([HAVE_UUID], , [Define to 1 if using libuuid])])
diff --git a/daxctl/libdaxctl.h b/daxctl/libdaxctl.h
index 21bc376ce629..1d13ea291f6f 100644
@@ -16,7 +16,7 @@
typedef unsigned char uuid_t;
diff --git a/ndctl/libndctl.h b/ndctl/libndctl.h
index 62cef9e82da3..c81cc032ebae 100644
@@ -20,7 +20,7 @@
typedef unsigned char uuid_t;
These both fix race conditions in dax_lock_mapping_entry(). I've tagged
them both for 4.19 backport, which will fail and I'll do the equivalent
patch for it. Dan, do you want to take these through your tree?
Matthew Wilcox (2):
dax: Check page->mapping isn't NULL
dax: Don't access a freed inode
fs/dax.c | 28 ++++++++++++++++++++++++----
1 file changed, 24 insertions(+), 4 deletions(-)
From: Huaisheng Ye <yehs1(a)lenovo.com>
Add NULL functions for origin_dax_direct_access and
origin_dax_copy_from/to_iter in order to avoid a build
error when CONFIG_DAX_DRIVER has not been enabled.
This patch series implements the dax_operations for dm-snapshot
on persistent memory devices.
Here are the steps about how to verify the function.
1. Configure the persistent memory to fs-dax mode and create namespace with ndctl;
2. find them in /dev;
# ndctl list
3. create lv_pmem (here is 4G size) for testing;
# pvcreate /dev/pmem0
# vgcreate vg_pmem /dev/pmem0
# lvcreate -L 4G -n lv_pmem vg_pmem
4. create a filesystem (ext2 or ext4) on /dev/vg_pmem/lv_pmem
# mkfs.ext2 -b 4096 /dev/vg_pmem/lv_pmem
5. mount the pmem LV with the DAX option;
# mkdir /mnt/lv_pmem
# mount -o dax /dev/vg_pmem/lv_pmem /mnt/lv_pmem/
6. cp some files to /mnt/lv_pmem;
# cp linear_table03.log /mnt/lv_pmem/
# cp test0.log /mnt/lv_pmem/
7. create snapshot for test (here I limit it to 1G size);
# lvcreate -L 1G -n snap_pmem -s /dev/vg_pmem/lv_pmem
8. modify the files copied with vim or copy more other new files;
# vim /mnt/lv_pmem/test0.log
9. umount the pmem device;
# umount /mnt/lv_pmem/
10. merge the snapshot back to the origin;
# lvconvert --merge /dev/vg_pmem/snap_pmem
11. mount the pmem device again to check the content of the files;
# mount -o dax /dev/vg_pmem/lv_pmem /mnt/lv_pmem/
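One way to make step 11's check concrete is to record checksums before taking the snapshot and compare after the merge. A minimal sketch of the bookkeeping, run here on ordinary files in a temp directory since the LVM steps above need root and a real pmem device ($WORK stands in for /mnt/lv_pmem):

```shell
# Record the pre-snapshot state (would happen before step 7).
WORK=$(mktemp -d)
echo "original contents" > "$WORK/test0.log"
( cd "$WORK" && sha256sum test0.log > before.sums )

# ... snapshot, modify, unmount, merge, remount happen here ...

# After a successful merge back to the origin, the file should match
# its pre-snapshot state again (would happen after step 11):
( cd "$WORK" && sha256sum -c before.sums )
```

A clean `sha256sum -c` run after the remount confirms the merge restored the origin's pre-snapshot contents.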
Huaisheng Ye (3):
dm: enable dax_operations for dm-snapshot
dm: expand hc_map in mapped_device for lack of map
dm: expand valid types for dm-ioctl
drivers/md/dm-core.h | 1 +
drivers/md/dm-ioctl.c | 4 +++-
drivers/md/dm-snap.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++--
drivers/md/dm.c | 15 +++++++++++++++
4 files changed, 68 insertions(+), 3 deletions(-)