[LSF/MM TOPIC] The end of the DAX experiment
by Dan Williams
Before people get too excited this isn't a proposal to kill DAX. The
topic proposal is a discussion to resolve lingering open questions
that currently motivate ext4 and xfs to scream "EXPERIMENTAL" when the
current DAX facilities are enabled. There are 2 primary concerns to
resolve: enumerate the remaining features/fixes, and identify a path
to implement it all without regressing any existing application use
cases.
An enumeration of remaining projects follows, please expand this list
if I missed something:
* "DAX" has no specific meaning by itself, users have 2 use cases for
"DAX" capabilities: userspace cache management via MAP_SYNC, and page
cache avoidance where the latter aspect of DAX has no current api to
discover / use it. The project is to supplement MAP_SYNC with a
MAP_DIRECT facility and MADV_SYNC / MADV_DIRECT to indicate the same
dynamically via madvise. Similar to O_DIRECT, MAP_DIRECT would be an
application hint to avoid / minimize page cache usage, but no strict
guarantee like what MAP_SYNC provides.
* Resolve all "if (dax) goto fail;" patterns in the kernel. Outside of
longterm-GUP (a topic in its own right) the projects here are
XFS-reflink and XFS-realtime-device support. DAX+reflink effectively
requires a given physical page to be mapped into two different inodes
at different (page->index) offsets. The challenge is to support
DAX-reflink without violating any existing application visible
semantics, the operating assumption / strawman to debate is that
experimental status is not blanket permission to go change existing
semantics in backwards incompatible ways.
* Deprecate, but not remove, the DAX mount option. Too many flows
depend on the option so it will never go away, but the facility is too
coarse. Provide an option to enable MAP_SYNC and
more-likely-to-do-something-useful-MAP_DIRECT on a per-directory
basis. The current proposal is to allow this property to only be
toggled while the directory is empty to avoid the complications of
racing page invalidation with new DAX mappings.
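For reference, a minimal userspace sketch of how an application can ask for MAP_SYNC today and degrade gracefully when it is not available. MAP_DIRECT is only proposed above, so it is not shown. The helper name map_prefer_sync() is made up for illustration, and the fallback #defines assume the generic Linux UAPI values for libc headers that predate these flags:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumed generic Linux UAPI values, only used if libc lacks them. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

/*
 * Hypothetical helper: map `len` bytes of `fd` read/write.
 * MAP_SYNC is tried first (it must be combined with MAP_SHARED_VALIDATE
 * so that unsupported flags are rejected rather than ignored). If the
 * kernel or filesystem refuses it, fall back to a plain shared mapping.
 * *is_sync reports which kind of mapping was obtained.
 * Returns MAP_FAILED on any other error.
 */
void *map_prefer_sync(int fd, size_t len, int *is_sync)
{
	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);

	if (addr != MAP_FAILED) {
		*is_sync = 1;
		return addr;
	}
	/* Anything other than "flag not supported" is a real failure. */
	if (errno != EOPNOTSUPP && errno != EINVAL)
		return MAP_FAILED;

	*is_sync = 0;
	return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}
```

A caller that gets is_sync == 0 would still need msync() / fsync() for durability, which is exactly the distinction the MAP_SYNC vs MAP_DIRECT split is meant to make explicit.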
Secondary projects, i.e. important but I would submit are not in the
critical path to removing the "experimental" designation:
* Filesystem-integrated badblock management. Hook up the media error
notifications from libnvdimm to the filesystem to allow for operations
like "list files with media errors" and "enumerate bad file offsets on
a granularity smaller than a page". Another consideration along these
lines is to integrate machine-check-handling and dynamic error
notification into a filesystem interface. I've heard complaints that
the sigaction() based mechanism to receive BUS_MCEERR_* information,
while sufficient for the "System RAM" use case, is not precise enough
for the "Persistent Memory / DAX" use case where errors are repairable
and sub-page error information is useful.
* Userfaultfd for file-backed mappings and DAX
Ideally all the usual DAX, persistent memory, and GUP suspects could
be in the room to discuss this:
* Jan Kara
* Dave Chinner
* Christoph Hellwig
* Jeff Moyer
* Johannes Thumshirn
* Matthew Wilcox
* John Hubbard
* Jérôme Glisse
* MM folks for the reflink vs 'struct page' vs Xarray considerations
[RFC v3 00/19] kunit: introduce KUnit, the Linux kernel unit testing framework
by Brendan Higgins
This patch set proposes KUnit, a lightweight unit testing and mocking
framework for the Linux kernel.
Unlike Autotest and kselftest, KUnit is a true unit testing framework;
it does not require installing the kernel on a test machine or in a VM
and does not require tests to be written in userspace running on a host
kernel. Additionally, KUnit is fast: From invocation to completion KUnit
can run several dozen tests in under a second. Currently, the entire
KUnit test suite for KUnit runs in under a second from the initial
invocation (build time excluded).
KUnit is heavily inspired by JUnit, Python's unittest.mock, and
Googletest/Googlemock for C++. KUnit provides facilities for defining
unit test cases, grouping related test cases into test suites, providing
common infrastructure for running tests, mocking, spying, and much more.
## What's so special about unit testing?
A unit test is supposed to test a single unit of code in isolation,
hence the name. There should be no dependencies outside the control of
the test; this means no external dependencies, which makes tests orders
of magnitude faster. Likewise, since there are no external dependencies,
there are no hoops to jump through to run the tests. Additionally, this
makes unit tests deterministic: a failing unit test always indicates a
problem. Finally, because unit tests necessarily have finer granularity,
they are able to test all code paths easily, solving the classic problem
of difficulty in exercising error handling code.
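To make the error-handling point concrete, here is a small sketch in plain C. It does not use the KUnit API; dup_name(), alloc_fn, and failing_alloc() are hypothetical names. The unit under test receives its allocator as a parameter, so a test can substitute one that always fails and hit the -ENOMEM path deterministically:

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* The dependency is injected, so the test fully controls it. */
typedef void *(*alloc_fn)(size_t size);

/*
 * Unit under test: duplicate `src` into *out.
 * Returns 0 on success, or -ENOMEM (with *out set to NULL) if the
 * allocator fails.
 */
int dup_name(alloc_fn alloc, const char *src, char **out)
{
	size_t len = strlen(src) + 1;
	char *copy = alloc(len);

	if (!copy) {
		*out = NULL;
		return -ENOMEM;
	}
	memcpy(copy, src, len);
	*out = copy;
	return 0;
}

/* Fake dependency: always fails, so the error path runs every time. */
void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}
```

Calling dup_name(failing_alloc, ...) exercises the failure branch without having to exhaust memory on a real system, while dup_name(malloc, ...) covers the success branch.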
## Is KUnit trying to replace other testing frameworks for the kernel?
No. Most existing tests for the Linux kernel are end-to-end tests, which
have their place. A well tested system has lots of unit tests, a
reasonable number of integration tests, and some end-to-end tests. KUnit
is just trying to address the unit test space which is currently not
being addressed.
## More information on KUnit
There is a bunch of documentation near the end of this patch set that
describes how to use KUnit and best practices for writing unit tests.
For convenience I am hosting the compiled docs here:
https://google.github.io/kunit-docs/third_party/kernel/docs/
Additionally for convenience, I have applied these patches to a branch:
https://kunit.googlesource.com/linux/+/kunit/rfc/4.19/v3
The repo may be cloned with:
git clone https://kunit.googlesource.com/linux
This patchset is on the kunit/rfc/4.19/v3 branch.
## Changes Since Last Version
- Changed namespace prefix from `test_*` to `kunit_*` as requested by
Shuah.
- Started converting/cleaning up the device tree unittest to use KUnit.
- Started adding KUnit expectations with custom messages.
--
2.20.0.rc0.387.gc7a69e6b6c-goog
[PATCH] libnvdimm, region: Use struct_size() in kzalloc()
by Gustavo A. R. Silva
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct nd_region {
...
struct nd_mapping mapping[0];
};
instance = kzalloc(sizeof(struct nd_region) + sizeof(struct nd_mapping) *
count, GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
instance = kzalloc(struct_size(instance, mapping, count), GFP_KERNEL);
This code was detected with the help of Coccinelle.
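As an illustration only, the following userspace sketch models what the helper computes: the size of the structure plus count trailing elements, saturating to SIZE_MAX on overflow so that the allocation fails instead of coming up short. The types and the names region_size() / region_alloc() are stand-ins, not the kernel definitions, and the overflow builtins assume GCC or Clang:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in types; the field layout is hypothetical. */
struct mapping {
	uint64_t start;
	uint64_t size;
};

struct region {
	size_t num_mappings;
	struct mapping mapping[];	/* flexible array member */
};

/*
 * Model of struct_size(r, mapping, count):
 * sizeof(struct region) + count * sizeof(struct mapping),
 * or SIZE_MAX if either step overflows.
 */
size_t region_size(size_t count)
{
	size_t bytes;

	if (__builtin_mul_overflow(count, sizeof(struct mapping), &bytes) ||
	    __builtin_add_overflow(bytes, sizeof(struct region), &bytes))
		return SIZE_MAX;
	return bytes;
}

/* Zeroed allocation sized for `count` mappings; NULL on overflow or OOM. */
struct region *region_alloc(size_t count)
{
	size_t bytes = region_size(count);
	struct region *r;

	if (bytes == SIZE_MAX)
		return NULL;
	r = calloc(1, bytes);
	if (r)
		r->num_mappings = count;
	return r;
}
```

The open-coded form has no such check: a large enough count silently wraps and yields a buffer smaller than the caller expects.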
Signed-off-by: Gustavo A. R. Silva <gustavo(a)embeddedor.com>
---
drivers/nvdimm/region_devs.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b4ef7d9ff22e..88becc87e234 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1027,10 +1027,9 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
}
region_buf = ndbr;
} else {
- nd_region = kzalloc(sizeof(struct nd_region)
- + sizeof(struct nd_mapping)
- * ndr_desc->num_mappings,
- GFP_KERNEL);
+ nd_region = kzalloc(struct_size(nd_region, mapping,
+ ndr_desc->num_mappings),
+ GFP_KERNEL);
region_buf = nd_region;
}
--
2.21.0
[PATCH v3 00/10] EFI Specific Purpose Memory Support
by Dan Williams
Changes since v2:
- Consolidate the new E820_TYPE and IORES_DESC and EFI configuration
symbol on an "_APPLICATION_RESERVED" suffix. (Ard).
- Rework the implementation to drop the new MEMBLOCK_APP_SPECIFIC
memblock and move the reservation earlier to e820__memblock_setup().
(Mike)
- Move efi_fake_mem support for EFI_MEMORY_SP to its own implementation
that does not require memblock allocations.
- Move is_efi_application_reserved() into the x86 efi implementation.
(Ard)
[1]: https://lists.01.org/pipermail/linux-nvdimm/2019-May/021668.html
---
Merge logistics: These patches touch core-efi, acpi, device-dax, and
x86. Given the regression risk is highest for the x86 changes it seems
tip.git is the best tree to host the series.
---
The EFI 2.8 Specification [2] introduces the EFI_MEMORY_SP ("specific
purpose") memory attribute. This attribute bit replaces the deprecated
ACPI HMAT "reservation hint" that was introduced in ACPI 6.2 and removed
in ACPI 6.3.
Given the increasing diversity of memory types that might be advertised
to the operating system, there is a need for platform firmware to hint
which memory ranges are free for the OS to use as general purpose memory
and which ranges are intended for application specific usage. For
example, an application with prior knowledge of the platform may expect
to be able to exclusively allocate a precious / limited pool of high
bandwidth memory. Alternatively, for the general purpose case, the
operating system may want to make the memory available on a best effort
basis as a unique numa-node with performance properties conveyed by the new
CONFIG_HMEM_REPORTING [3] facility.
In support of optionally allowing either application-exclusive or
core-kernel-mm managed access to differentiated memory, claim
EFI_MEMORY_SP ranges for exposure as device-dax instances by default.
Such instances can be directly owned / mapped by a
platform-topology-aware application. Alternatively, with the new kmem
facility [4], the administrator has the option to instead designate that
those memory ranges be hot-added to the core-kernel-mm as a unique
memory numa-node. In short, allow for the decision about what software
agent manages specific-purpose memory to be made at runtime.
The patches are based on the new HMAT+HMEM_REPORTING facilities merged
for v5.2-rc1. The implementation is tested with qemu emulation of HMAT
[5] plus the efi_fake_mem facility for applying the EFI_MEMORY_SP
attribute.
[2]: https://uefi.org/sites/default/files/resources/UEFI_Spec_2_8_final.pdf
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit...
[4]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit...
[5]: http://patchwork.ozlabs.org/cover/1096737/
---
Dan Williams (10):
acpi/numa: Establish a new drivers/acpi/numa/ directory
acpi/numa/hmat: Skip publishing target info for nodes with no online memory
efi: Enumerate EFI_MEMORY_SP
x86, efi: Push EFI_MEMMAP check into leaf routines
x86, efi: Reserve UEFI 2.8 Specific Purpose Memory for dax
x86, efi: Add efi_fake_mem support for EFI_MEMORY_SP
lib/memregion: Uplevel the pmem "region" ida to a global allocator
device-dax: Add a driver for "hmem" devices
acpi/numa/hmat: Register HMAT at device_initcall level
acpi/numa/hmat: Register "specific purpose" memory as an "hmem" device
arch/x86/Kconfig | 21 +++++
arch/x86/boot/compressed/eboot.c | 5 +
arch/x86/boot/compressed/kaslr.c | 3 -
arch/x86/include/asm/e820/types.h | 9 ++
arch/x86/include/asm/efi.h | 15 ++++
arch/x86/kernel/e820.c | 12 ++-
arch/x86/kernel/setup.c | 21 +++--
arch/x86/platform/efi/efi.c | 40 ++++++++-
arch/x86/platform/efi/quirks.c | 3 +
drivers/acpi/Kconfig | 9 --
drivers/acpi/Makefile | 3 -
drivers/acpi/hmat/Makefile | 2
drivers/acpi/numa/Kconfig | 8 ++
drivers/acpi/numa/Makefile | 3 +
drivers/acpi/numa/hmat.c | 149 +++++++++++++++++++++++++++++++----
drivers/acpi/numa/srat.c | 0
drivers/dax/Kconfig | 27 +++++-
drivers/dax/Makefile | 2
drivers/dax/hmem.c | 58 ++++++++++++++
drivers/firmware/efi/Makefile | 5 +
drivers/firmware/efi/efi.c | 5 +
drivers/firmware/efi/esrt.c | 3 +
drivers/firmware/efi/fake_mem-x86.c | 69 ++++++++++++++++
drivers/firmware/efi/fake_mem.c | 26 +++---
drivers/firmware/efi/fake_mem.h | 10 ++
drivers/nvdimm/Kconfig | 1
drivers/nvdimm/core.c | 1
drivers/nvdimm/nd-core.h | 1
drivers/nvdimm/region_devs.c | 13 +--
include/linux/efi.h | 1
include/linux/ioport.h | 1
include/linux/memregion.h | 11 +++
lib/Kconfig | 7 ++
lib/Makefile | 1
lib/memregion.c | 15 ++++
35 files changed, 481 insertions(+), 79 deletions(-)
delete mode 100644 drivers/acpi/hmat/Makefile
rename drivers/acpi/{hmat/Kconfig => numa/Kconfig} (70%)
create mode 100644 drivers/acpi/numa/Makefile
rename drivers/acpi/{hmat/hmat.c => numa/hmat.c} (81%)
rename drivers/acpi/{numa.c => numa/srat.c} (100%)
create mode 100644 drivers/dax/hmem.c
create mode 100644 drivers/firmware/efi/fake_mem-x86.c
create mode 100644 drivers/firmware/efi/fake_mem.h
create mode 100644 include/linux/memregion.h
create mode 100644 lib/memregion.c
Re: Picking 0th namespace if it is idle
by Aneesh Kumar K.V
aneesh.kumar(a)linux.ibm.com (Aneesh Kumar K.V) writes:
> Hi Dan,
>
> With the patch series to mark the namespace disabled if we have mismatch
> in pfn superblock, we can end up with namespace0 marked idle/disabled.
>
> I am wondering why we do the below in ndctl.
>
>
> static struct ndctl_namespace *region_get_namespace(struct ndctl_region *region)
> {
> struct ndctl_namespace *ndns;
>
> /* prefer the 0th namespace if it is idle */
> ndctl_namespace_foreach(region, ndns)
> if (ndctl_namespace_get_id(ndns) == 0
> && !is_namespace_active(ndns))
> return ndns;
> return ndctl_region_get_namespace_seed(region);
> }
>
> I have a kernel patch that will create a namespace_seed even if we fail
> to enable a pfn backing device. Something like below
>
> @@ -747,12 +752,23 @@ static void nd_region_notify_driver_action(struct nvdimm_bus *nvdimm_bus,
> }
> }
> if (dev->parent && is_nd_region(dev->parent) && probe) {
> nd_region = to_nd_region(dev->parent);
> nvdimm_bus_lock(dev);
> if (nd_region->ns_seed == dev)
> nd_region_create_ns_seed(nd_region);
> nvdimm_bus_unlock(dev);
> }
> +
> + if (dev->parent && is_nd_region(dev->parent) && !probe && (ret == -EOPNOTSUPP)) {
> + nd_region = to_nd_region(dev->parent);
> + nvdimm_bus_lock(dev);
> + if (nd_region->ns_seed == dev)
> + nd_region_create_ns_seed(nd_region);
> + nvdimm_bus_unlock(dev);
> + }
> +
>
> With that we can end up with something like the below after boot.
> :/sys/bus/nd/devices/region0$ sudo ndctl list -Ni
> [
> {
> "dev":"namespace0.1",
> "mode":"fsdax",
> "map":"mem",
> "size":0,
> "uuid":"00000000-0000-0000-0000-000000000000",
> "state":"disabled"
> },
> {
> "dev":"namespace0.0",
> "mode":"fsdax",
> "map":"mem",
> "size":2147483648,
> "uuid":"094e703b-4bf8-4078-ad42-50bebc03e538",
> "state":"disabled"
> }
> ]
>
> namespace0.0 is the one we failed to initialize due to PAGE_SIZE
> mismatch.
>
> We do have namespace_seed pointing to namespace0.1 correctly. But an ndctl
> create-namespace will pick namespace0.0 even if we have seed file
> pointing to namespace0.1.
>
>
> I am trying to resolve the issues related to creation of new namespaces
> when we have some namespace marked disabled due to pfn_sb setting
> mismatch.
>
> -aneesh
With that ndctl namespace0.0 selection commented out, we do pick the
right idle namespace.
#ndctl list -Ni
[
{
"dev":"namespace0.1",
"mode":"fsdax",
"map":"mem",
"size":0,
"uuid":"00000000-0000-0000-0000-000000000000",
"state":"disabled"
},
{
"dev":"namespace0.0",
"mode":"fsdax",
"map":"mem",
"size":2147483648,
"uuid":"0c31ae4b-b053-43c7-82ff-88574e2585b0",
"state":"disabled"
}
]
after ndctl create-namespace -s 2G -r region0
# ndctl list -Ni
[
{
"dev":"namespace0.2",
"mode":"fsdax",
"map":"mem",
"size":0,
"uuid":"00000000-0000-0000-0000-000000000000",
"state":"disabled"
},
{
"dev":"namespace0.1",
"mode":"fsdax",
"map":"dev",
"size":2130706432,
"uuid":"60970059-9412-4eeb-9e7a-b314585a4da3",
"align":65536,
"blockdev":"pmem0.1",
"supported_alignments":[
65536
]
},
{
"dev":"namespace0.0",
"mode":"fsdax",
"map":"mem",
"size":2147483648,
"uuid":"0c31ae4b-b053-43c7-82ff-88574e2585b0",
"state":"disabled"
}
]
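The difference between the two selection policies can be modelled in a few lines. This is a simplified stand-in, not the libndctl data model: struct ns, pick_prefer_zero(), and pick_seed_only() are hypothetical names, and "active" collapses whatever is_namespace_active() checks into one flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified namespace record for illustration. */
struct ns {
	int id;
	bool active;
};

/* Policy quoted above: prefer namespace 0 if it is idle, else the seed. */
const struct ns *pick_prefer_zero(const struct ns *list, size_t n,
				  const struct ns *seed)
{
	for (size_t i = 0; i < n; i++)
		if (list[i].id == 0 && !list[i].active)
			return &list[i];
	return seed;
}

/* Variant with the 0th-namespace preference removed: always the seed. */
const struct ns *pick_seed_only(const struct ns *list, size_t n,
				const struct ns *seed)
{
	(void)list;
	(void)n;
	return seed;
}
```

With namespace0.0 disabled (so not active) and the seed pointing at namespace0.1, the first policy returns namespace0.0 and the second returns namespace0.1, which matches the behaviour shown in the listings above.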
[PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines
by Vivek Goyal
Hi,
Here are the RFC patches for V2 of virtio-fs. These patches apply on top
of 5.1 kernel. These patches are also available here.
https://github.com/rhvgoyal/linux/commits/virtio-fs-dev-5.1
Patches for V1 were posted here.
https://lwn.net/ml/linux-fsdevel/20181210171318.16998-1-vgoyal@redhat.com/
This is still work in progress. As of now one can passthrough a host
directory in to guest and it works reasonably well. pjdfstests test
suite passes and blogbench runs. But this directory can't be shared
between guests and host can't modify files in directory yet. That's
still TBD.
Posting another version to gather feedback and comments on progress so far.
More information about the project can be found here.
https://virtio-fs.gitlab.io/
Changes from V1
===============
- Various bug fixes
- virtio-fs dax huge page size working, leading to improved performance.
- Fixed kernel automated tests warnings.
- Better handling of shared cache region reporting by virtio device.
Description from V1 posting
---------------------------
Problem Description
===================
We want to be able to take a directory tree on the host and share it with
guest[s]. Our goal is to be able to do it in a fast, consistent and secure
manner. Our primary use case is kata containers, but it should be usable in
other scenarios as well.
Containers may rely on local file system semantics for shared volumes,
read-write mounts that multiple containers access simultaneously. File
system changes must be visible to other containers with the same consistency
expected of a local file system, including mmap MAP_SHARED.
Existing Solutions
==================
We looked at existing solutions and virtio-9p already provides basic shared
file system functionality although does not offer local file system semantics,
causing some workloads and test suites to fail. In addition, virtio-9p
performance has been an issue for Kata Containers and we believe this cannot
be alleviated without major changes that do not fit into the 9P protocol.
Design Overview
===============
With the goal of designing something with better performance and local file
system semantics, a bunch of ideas were proposed.
- Use fuse protocol (instead of 9p) for communication between guest
and host. Guest kernel will be fuse client and a fuse server will
run on host to serve the requests. Benchmark results are encouraging and
show this approach performs well (2x to 8x improvement depending on test
being run).
- For data access inside guest, mmap portion of file in QEMU address
space and guest accesses this memory using dax. That way guest page
cache is bypassed and there is only one copy of data (on host). This
will also enable mmap(MAP_SHARED) between guests.
- For metadata coherency, there is a shared memory region which contains
version number associated with metadata and any guest changing metadata
updates version number and other guests refresh metadata on next
access. This is yet to be implemented.
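Since the metadata version scheme is not implemented yet, the following is only a sketch of the idea using C11 atomics. All names (shared_version, cached_meta, meta_changed(), meta_needs_refresh()) are hypothetical and the layout of the real shared region is not defined here:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* One slot of a hypothetical version table shared by all guests. */
struct shared_version {
	_Atomic uint64_t version;
};

/* Per-guest cached metadata plus the version it was fetched at. */
struct cached_meta {
	uint64_t seen_version;
	uint64_t size;		/* example cached attribute */
};

/* A guest that changes metadata bumps the shared version. */
void meta_changed(struct shared_version *sv)
{
	atomic_fetch_add_explicit(&sv->version, 1, memory_order_release);
}

/*
 * On the next access another guest compares versions. A mismatch means
 * the cached copy is stale and has to be refetched from the host; a
 * match means no communication is needed.
 */
bool meta_needs_refresh(struct shared_version *sv,
			const struct cached_meta *cm)
{
	return atomic_load_explicit(&sv->version, memory_order_acquire) !=
	       cm->seen_version;
}
```

The common case (unchanged metadata) costs one shared-memory load and no request to the host, which is the point of the design.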
How virtio-fs differs from existing approaches
==============================================
The unique idea behind virtio-fs is to take advantage of the co-location
of the virtual machine and hypervisor to avoid communication (vmexits).
DAX allows file contents to be accessed without communication with the
hypervisor. The shared memory region for metadata avoids communication in
the common case where metadata is unchanged.
By replacing expensive communication with cheaper shared memory accesses,
we expect to achieve better performance than approaches based on network
file system protocols. In addition, this also makes it easier to achieve
local file system semantics (coherency).
These techniques are not applicable to network file system protocols since
the communications channel is bypassed by taking advantage of shared memory
on a local machine. This is why we decided to build virtio-fs rather than
focus on 9P or NFS.
HOWTO
======
We have put instructions on how to use it here.
https://virtio-fs.gitlab.io/
Caching Modes
=============
Like virtio-9p, different caching modes are supported which determine the
coherency level as well. The “cache=FOO” and “writeback” options control the
level of coherence between the guest and host filesystems. The “shared” option
only has an effect on coherence between virtio-fs filesystem instances
running inside different guests.
- cache=none
metadata, data and pathname lookup are not cached in guest. They are always
fetched from host and any changes are immediately pushed to host.
- cache=always
metadata, data and pathname lookup are cached in guest and never expire.
- cache=auto
metadata and pathname lookup cache expires after a configured amount of time
(default is 1 second). Data is cached while the file is open (close to open
consistency).
- writeback/no_writeback
These options control the writeback strategy. If writeback is disabled,
then normal writes will immediately be synchronized with the host fs. If
writeback is enabled, then writes may be cached in the guest until the file
is closed or an fsync(2) performed. This option has no effect on mmap-ed
writes or writes going through the DAX mechanism.
- shared/no_shared
These options control the use of the shared version table. If shared mode
is enabled then metadata and pathname lookup is cached in guest, but is
refreshed due to changes in another virtio-fs instance.
DAX
===
- dax can be turned on/off when mounting virtio-fs inside guest.
TODO
====
- Implement "cache=shared" option.
- Improve error handling on host. If page fault on host fails, we need
to propagate it into guest.
- Try to fine tune for performance.
- Bug fixes
RESULTS
=======
- pjdfstests are passing (have tried cache=none/auto/always and dax on/off).
https://github.com/pjd/pjdfstest
(one symlink test fails and that seems to be due to xfs on host. Yet to
look into it).
- Ran blogbench and that works too.
Thanks
Vivek
Miklos Szeredi (2):
fuse: delete dentry if timeout is zero
fuse: Use default_file_splice_read for direct IO
Sebastien Boeuf (3):
virtio: Add get_shm_region method
virtio: Implement get_shm_region for PCI transport
virtio: Implement get_shm_region for MMIO transport
Stefan Hajnoczi (10):
fuse: export fuse_end_request()
fuse: export fuse_len_args()
fuse: export fuse_get_unique()
fuse: extract fuse_fill_super_common()
fuse: add fuse_iqueue_ops callbacks
virtio_fs: add skeleton virtio_fs.ko module
dax: remove block device dependencies
fuse, dax: add fuse_conn->dax_dev field
virtio_fs, dax: Set up virtio_fs dax_device
fuse, dax: add DAX mmap support
Vivek Goyal (15):
fuse: Clear setuid bit even in cache=never path
fuse: Export fuse_send_init_request()
fuse: Separate fuse device allocation and installation in fuse_conn
dax: Pass dax_dev to dax_writeback_mapping_range()
fuse: Keep a list of free dax memory ranges
fuse: Introduce setupmapping/removemapping commands
fuse, dax: Implement dax read/write operations
fuse: Define dax address space operations
fuse, dax: Take ->i_mmap_sem lock during dax page fault
fuse: Maintain a list of busy elements
fuse: Add logic to free up a memory range
fuse: Release file in process context
fuse: Reschedule dax free work if too many EAGAIN attempts
fuse: Take inode lock for dax inode truncation
virtio-fs: Do not provide abort interface in fusectl
drivers/dax/super.c | 3 +-
drivers/virtio/virtio_mmio.c | 32 +
drivers/virtio/virtio_pci_modern.c | 108 +++
fs/dax.c | 23 +-
fs/ext2/inode.c | 2 +-
fs/ext4/inode.c | 2 +-
fs/fuse/Kconfig | 11 +
fs/fuse/Makefile | 1 +
fs/fuse/control.c | 4 +-
fs/fuse/cuse.c | 5 +-
fs/fuse/dev.c | 80 +-
fs/fuse/dir.c | 28 +-
fs/fuse/file.c | 953 ++++++++++++++++++++++-
fs/fuse/fuse_i.h | 206 ++++-
fs/fuse/inode.c | 307 ++++++--
fs/fuse/virtio_fs.c | 1129 ++++++++++++++++++++++++++++
fs/splice.c | 3 +-
fs/xfs/xfs_aops.c | 2 +-
include/linux/dax.h | 6 +-
include/linux/fs.h | 2 +
include/linux/virtio_config.h | 17 +
include/uapi/linux/fuse.h | 34 +
include/uapi/linux/virtio_fs.h | 44 ++
include/uapi/linux/virtio_ids.h | 1 +
include/uapi/linux/virtio_mmio.h | 11 +
include/uapi/linux/virtio_pci.h | 10 +
26 files changed, 2875 insertions(+), 149 deletions(-)
create mode 100644 fs/fuse/virtio_fs.c
create mode 100644 include/uapi/linux/virtio_fs.h
--
2.20.1
[PATCH v4 0/6] Fixes related namespace alignment/page size/big endian
by Aneesh Kumar K.V
This series handles configs where hugepage support is not enabled by default.
Also, we update some of the information messages to make sure we use PAGE_SIZE instead
of SZ_4K. We now store page size and struct page size in pfn_sb and do an extra check
before enabling the namespace. There is also an endianness fix.
The patch series is on top of subsection v10 patchset
http://lore.kernel.org/linux-mm/156092349300.979959.17603710711957735135....
Changes from V3:
* Dropped the change related to PFN_MIN_VERSION
* for pfn_sb minor version < 4, we default page_size to PAGE_SIZE instead of SZ_4k.
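A rough userspace model of the extra check, for illustration. The struct below is a hypothetical subset of the superblock, not the real pfn_sb layout, and the function names are made up. Fields are stored as little-endian bytes and decoded explicitly, which is the kind of conversion the endianness fix is about. Treating the struct page size as matching for minor version < 4 is an assumption of this sketch; the cover letter only states the page_size default.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical subset of the on-media superblock, little-endian bytes. */
struct pfn_sb_model {
	uint8_t version_minor[2];
	uint8_t page_size[4];
	uint8_t page_struct_size[2];
};

static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Returns 0 if the namespace can be enabled with the running kernel's
 * page size and struct page size, -EOPNOTSUPP on a mismatch.
 */
int pfn_sb_check(const struct pfn_sb_model *sb, uint32_t kernel_page_size,
		 uint16_t kernel_struct_page_size)
{
	uint32_t page_size = get_le32(sb->page_size);
	uint16_t struct_size = get_le16(sb->page_struct_size);

	/* Older superblocks did not record these fields: use defaults. */
	if (get_le16(sb->version_minor) < 4) {
		page_size = kernel_page_size;
		struct_size = kernel_struct_page_size;	/* sketch assumption */
	}
	if (page_size != kernel_page_size ||
	    struct_size != kernel_struct_page_size)
		return -EOPNOTSUPP;
	return 0;
}
```

A namespace created with 64K pages and then probed on a 4K kernel would return -EOPNOTSUPP here, which is the case the first patch in the series treats as a non-fatal probe result.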
Aneesh Kumar K.V (6):
nvdimm: Consider probe return -EOPNOTSUPP as success
mm/nvdimm: Add page size and struct page size to pfn superblock
mm/nvdimm: Use correct #defines instead of open coding
mm/nvdimm: Pick the right alignment default when creating dax devices
mm/nvdimm: Use correct alignment when looking at first pfn from a
region
mm/nvdimm: Fix endian conversion issues
arch/powerpc/include/asm/libnvdimm.h | 9 ++++
arch/powerpc/mm/Makefile | 1 +
arch/powerpc/mm/nvdimm.c | 34 +++++++++++++++
arch/x86/include/asm/libnvdimm.h | 19 +++++++++
drivers/nvdimm/btt.c | 8 ++--
drivers/nvdimm/bus.c | 4 +-
drivers/nvdimm/label.c | 2 +-
drivers/nvdimm/namespace_devs.c | 13 +++---
drivers/nvdimm/nd-core.h | 3 +-
drivers/nvdimm/nd.h | 6 ---
drivers/nvdimm/pfn.h | 5 ++-
drivers/nvdimm/pfn_devs.c | 62 ++++++++++++++++++++++++++--
drivers/nvdimm/pmem.c | 26 ++++++++++--
drivers/nvdimm/region_devs.c | 27 ++++++++----
include/linux/huge_mm.h | 7 +++-
kernel/memremap.c | 8 ++--
16 files changed, 194 insertions(+), 40 deletions(-)
create mode 100644 arch/powerpc/include/asm/libnvdimm.h
create mode 100644 arch/powerpc/mm/nvdimm.c
create mode 100644 arch/x86/include/asm/libnvdimm.h
--
2.21.0
[PATCH] filesystem-dax: Disable PMD support
by Dan Williams
Ever since the conversion of DAX to the Xarray a RocksDB benchmark has
been encountering intermittent lockups. The backtraces always include
the filesystem-DAX PMD path, multi-order entries have been a source of
bugs in the past, and disabling the PMD path allows a test that fails in
minutes to run for an hour.
The regression has been bisected to "dax: Convert page fault handlers to
XArray", but little progress has been made on the root cause debug.
Unless / until root cause can be identified, mark CONFIG_FS_DAX_PMD
broken to preclude intermittent lockups. Reverting the Xarray conversion
also works, but that change is too big to backport. The implementation
is committed to Xarray at this point.
Link: https://lore.kernel.org/linux-fsdevel/CAPcyv4hwHpX-MkUEqxwdTj7wCCZCN4RV-L...
Fixes: b15cd800682f ("dax: Convert page fault handlers to XArray")
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Jan Kara <jack(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Reported-by: Robert Barror <robert.barror(a)intel.com>
Reported-by: Seema Pandit <seema.pandit(a)intel.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
fs/Kconfig | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/Kconfig b/fs/Kconfig
index f1046cf6ad85..85eecd0d4c5d 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -66,6 +66,9 @@ config FS_DAX_PMD
depends on FS_DAX
depends on ZONE_DEVICE
depends on TRANSPARENT_HUGEPAGE
+ # intermittent lockups since commit b15cd800682f "dax: Convert
+ # page fault handlers to XArray"
+ depends on BROKEN
# Selected by DAX drivers that do not expect filesystem DAX to support
# get_user_pages() of DAX mappings. I.e. "limited" indicates no support
[PATCH v5 00/18] kunit: introduce KUnit, the Linux kernel unit testing framework
by Brendan Higgins
## TL;DR
A not so quick follow-up to Stephen's suggestions on PATCH v4. Nothing
that really changes any functionality or usage with the minor exception
of a couple public functions that Stephen asked me to rename.
Nevertheless, a good deal of clean up and fixes. See changes below.
As for our current status, right now we got Reviewed-bys on all patches
except:
- [PATCH v5 08/18] objtool: add kunit_try_catch_throw to the noreturn
list
However, it would probably be good to get reviews/acks from the
subsystem maintainers on:
- [PATCH v5 06/18] kbuild: enable building KUnit
- [PATCH v5 08/18] objtool: add kunit_try_catch_throw to the noreturn
list
- [PATCH v5 15/18] Documentation: kunit: add documentation for KUnit
- [PATCH v5 17/18] kernel/sysctl-test: Add null pointer test for
sysctl.c:proc_dointvec()
- [PATCH v5 18/18] MAINTAINERS: add proc sysctl KUnit test to PROC
SYSCTL section
Other than that, I think we should be good to go.
One last thing, I updated the background to include my thoughts on KUnit
vs. in kernel testing with kselftest in the background sections as
suggested by Frank in the discussion on PATCH v2.
## Background
This patch set proposes KUnit, a lightweight unit testing and mocking
framework for the Linux kernel.
Unlike Autotest and kselftest, KUnit is a true unit testing framework;
it does not require installing the kernel on a test machine or in a VM
(however, KUnit still allows you to run tests on test machines or in VMs
if you want[1]) and does not require tests to be written in userspace
running on a host kernel. Additionally, KUnit is fast: From invocation
to completion KUnit can run several dozen tests in under a second.
Currently, the entire KUnit test suite for KUnit runs in under a second
from the initial invocation (build time excluded).
KUnit is heavily inspired by JUnit, Python's unittest.mock, and
Googletest/Googlemock for C++. KUnit provides facilities for defining
unit test cases, grouping related test cases into test suites, providing
common infrastructure for running tests, mocking, spying, and much more.
### But wait! Doesn't kselftest support in kernel testing?!
In a previous version of this patchset Frank pointed out that kselftest
already supports writing a test that resides in the kernel using the
test module feature[2]. LWN did a really great summary on this
discussion here[3].
Kselftest has a feature that allows a test module to be loaded into a
kernel using the kselftest framework; this does allow someone to write
tests against kernel code not directly exposed to userland; however, it
does not provide much of a framework around how to structure the tests.
The kselftest test module feature just provides a header which has a
standardized way of reporting test failures, and then provides
infrastructure to load and run the tests using the kselftest test
harness.
The kselftest test module does not seem to be opinionated at all in
regards to how tests are structured, how they check for failures, how
tests are organized. Even the method it provides for reporting
failures is pretty simple; it doesn't have any more advanced failure
reporting or logging features. Given what's there, I think it is fair to
say that it is not actually a framework, but a feature that makes it
possible for someone to do some checks in kernel space.
Furthermore, kselftest test module has very few users. I checked for all
the tests that use it using the following grep command:
grep -Hrn -e 'kselftest_module\.h'
and only got three results: lib/test_strscpy.c, lib/test_printf.c, and
lib/test_bitmap.c.
So despite kselftest test module's existence, there really is no feature
overlap between kselftest and KUnit, save one: that you can use either
to write an in-kernel test, but this is a very small feature in
comparison to everything that KUnit allows you to do. KUnit is a full
x-unit style unit testing framework, whereas kselftest looks a lot more
like an end-to-end/functional testing framework, with a feature that
makes it possible to write in-kernel tests.
### What's so special about unit testing?
A unit test is supposed to test a single unit of code in isolation,
hence the name. There should be no dependencies outside the control of
the test; this means no external dependencies, which makes tests orders
of magnitude faster. Likewise, since there are no external dependencies,
there are no hoops to jump through to run the tests. Additionally, this
makes unit tests deterministic: a failing unit test always indicates a
problem. Finally, because unit tests necessarily have finer granularity,
they are able to test all code paths easily, solving the classic problem
of difficulty in exercising error handling code.
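
To make the point about error handling concrete, here is a minimal
userspace sketch. It does not use the KUnit API; `make_buffer()`,
`alloc_fn`, `failing_alloc()` and `run_tests()` are hypothetical names
invented for illustration. The unit under test receives its allocator as
a parameter, so a test can substitute one that always fails and reach the
error path deterministically:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical allocator type: the unit never calls malloc() directly. */
typedef void *(*alloc_fn)(size_t size);

/* Hypothetical unit under test: zeroed buffer, or NULL if allocation fails. */
static char *make_buffer(size_t len, alloc_fn alloc)
{
	char *buf = alloc(len);

	if (!buf)
		return NULL;	/* error path, hard to trigger end-to-end */
	memset(buf, 0, len);
	return buf;
}

/* Test double: an allocator that always reports failure. */
static void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}

/* Returns 1 if the success path yields a zeroed buffer. */
static int test_success_path(void)
{
	char *buf = make_buffer(8, malloc);
	int ok = buf && buf[0] == 0 && buf[7] == 0;

	free(buf);
	return ok;
}

/* Returns 1 if an allocation failure is propagated as NULL. */
static int test_alloc_failure_path(void)
{
	return make_buffer(8, failing_alloc) == NULL;
}

/* Runs both tests and returns the number of failures. */
static int run_tests(void)
{
	int failures = 0;

	failures += !test_success_path();
	failures += !test_alloc_failure_path();
	return failures;
}
```

Calling `run_tests()` from a `main()` exercises both paths without any
external setup; the failure case needs no memory pressure or fault
injection machinery because the dependency is under the test's control.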
### Is KUnit trying to replace other testing frameworks for the kernel?
No. Most existing tests for the Linux kernel are end-to-end tests, which
have their place. A well tested system has lots of unit tests, a
reasonable number of integration tests, and some end-to-end tests. KUnit
is just trying to address the unit test space which is currently not
being addressed.
### More information on KUnit
There is a bunch of documentation near the end of this patch set that
describes how to use KUnit and best practices for writing unit tests.
For convenience I am hosting the compiled docs here[4].
Additionally for convenience, I have applied these patches to a
branch[5]. The repo may be cloned with:
git clone https://kunit.googlesource.com/linux
This patchset is on the kunit/rfc/v5.2-rc4/v5 branch.
## Changes Since Last Version
Aside from a couple public function renames, there isn't really anything
in here that changes any functionality.
- Went through and fixed a couple of anti-patterns pointed out by Stephen
  Boyd. Things like:
  - Dropping an else clause at the end of a function.
  - Dropping the comma on the closing sentinel, `{}`, of a list.
- Inlined a bunch of functions in the test case running logic in patch
  01/18 to make it more readable, as suggested by Stephen Boyd.
- Found and fixed a bug in the resource deallocation logic in patch 02/18.
  The bug was discovered as a result of making a change suggested by
  Stephen Boyd. This does not substantially change how any of the code
  works conceptually.
- Renamed new_string_stream() to alloc_string_stream() as suggested by
  Stephen Boyd.
- Made string-stream a KUnit managed object, based on a suggestion made
  by Stephen Boyd.
- Renamed kunit_new_stream() to alloc_kunit_stream() as suggested by
  Stephen Boyd.
- Removed the ability to set log level after allocating a kunit_stream,
  as suggested by Stephen Boyd.
[1] https://google.github.io/kunit-docs/third_party/kernel/docs/usage.html#ku...
[2] https://www.kernel.org/doc/html/latest/dev-tools/kselftest.html#test-module
[3] https://lwn.net/Articles/790235/
[4] https://google.github.io/kunit-docs/third_party/kernel/docs/
[5] https://kunit.googlesource.com/linux/+/kunit/rfc/v5.2-rc4/v5
--
2.22.0.410.gd8fdbe21b5-goog
[PATCH v4 00/10] EFI Specific Purpose Memory Support
by Dan Williams
Changes since v3 [1]:
- Clarify in the changelog that the policy decision of how to treat
specific-purpose memory is x86 only until other archs grow a
translation to IORES_DESC_APPLICATION_RESERVED. The EFI spec does not
mandate a behavior for the EFI_MEMORY_SP attribute so the decision is
kept out of the core EFI implementation. (prompted by Ard)
- Merge the memregion ida into kernel/resource.c and provide static
inline wrappers around an exported 'struct ida memregion_ids'
instance. (Willy)
- Fix a set of compile errors in the CONFIG_EFI_FAKE_MEMMAP=n case.
(0day)
- Collect Dave's reviewed-by on the series.
[1]: https://lore.kernel.org/lkml/155993563277.3036719.17400338098057706494.st...
---
Merge logistics: These patches touch core-efi, acpi, device-dax, and
x86. Given the regression risk is highest for the x86 changes it seems
tip.git is the best tree to host the series.
---
The EFI 2.8 Specification [2] introduces the EFI_MEMORY_SP ("specific
purpose") memory attribute. This attribute bit replaces the deprecated
ACPI HMAT "reservation hint" that was introduced in ACPI 6.2 and removed
in ACPI 6.3.
Given the increasing diversity of memory types that might be advertised
to the operating system, there is a need for platform firmware to hint
which memory ranges are free for the OS to use as general purpose memory
and which ranges are intended for application specific usage. For
example, an application with prior knowledge of the platform may expect
to be able to exclusively allocate a precious / limited pool of high
bandwidth memory. Alternatively, for the general purpose case, the
operating system may want to make the memory available on a best effort
basis as a unique numa-node with performance properties reported by the
new CONFIG_HMEM_REPORTING [3] facility.
In support of optionally allowing either application-exclusive or
core-kernel-mm managed access to differentiated memory, claim
EFI_MEMORY_SP ranges for exposure as device-dax instances by default.
Such instances can be directly owned / mapped by a
platform-topology-aware application. Alternatively, with the new kmem
facility [4], the administrator has the option to instead designate that
those memory ranges be hot-added to the core-kernel-mm as a unique
memory numa-node. In short, allow for the decision about what software
agent manages specific-purpose memory to be made at runtime.
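
As a rough illustration of the default policy described above, the
following userspace sketch classifies memory ranges by the EFI_MEMORY_SP
attribute bit. Only the attribute values are taken from the UEFI
specification; `struct mem_range`, `enum range_owner`, `default_owner()`
and `count_sp_pages()` are hypothetical names for illustration and are
not the code in this series:

```c
#include <stddef.h>
#include <stdint.h>

/* Attribute bits as defined by the UEFI specification. */
#define EFI_MEMORY_WB	0x0000000000000008ULL
#define EFI_MEMORY_SP	0x0000000000040000ULL

/* Hypothetical, simplified stand-in for an EFI memory descriptor. */
struct mem_range {
	uint64_t phys_addr;
	uint64_t num_pages;
	uint64_t attribute;
};

/* Hypothetical labels for who manages a range by default. */
enum range_owner {
	OWNER_GENERAL_PURPOSE,	/* given to the core-kernel-mm at boot */
	OWNER_DEVICE_DAX,	/* held back and exposed as device-dax */
};

/* Default policy sketch: specific-purpose ranges are held back. */
static enum range_owner default_owner(const struct mem_range *r)
{
	if (r->attribute & EFI_MEMORY_SP)
		return OWNER_DEVICE_DAX;
	return OWNER_GENERAL_PURPOSE;
}

/* Sum the pages that the default policy would hold back. */
static uint64_t count_sp_pages(const struct mem_range *map, size_t n)
{
	uint64_t pages = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (default_owner(&map[i]) == OWNER_DEVICE_DAX)
			pages += map[i].num_pages;
	return pages;
}
```

The sketch only models the boot-time default. The runtime choice to
hot-add such a range as a numa-node instead is an administrator action
and is deliberately left out.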
The patches build on the new HMAT+HMEM_REPORTING facilities merged
for v5.2-rc1. The implementation is tested with qemu emulation of HMAT
[5] plus the efi_fake_mem facility for applying the EFI_MEMORY_SP
attribute.
[2]: https://uefi.org/sites/default/files/resources/UEFI_Spec_2_8_final.pdf
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit...
[4]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit...
[5]: http://patchwork.ozlabs.org/cover/1096737/
---
Dan Williams (10):
acpi/numa: Establish a new drivers/acpi/numa/ directory
acpi/numa/hmat: Skip publishing target info for nodes with no online memory
efi: Enumerate EFI_MEMORY_SP
x86, efi: Push EFI_MEMMAP check into leaf routines
x86, efi: Reserve UEFI 2.8 Specific Purpose Memory for dax
x86, efi: Add efi_fake_mem support for EFI_MEMORY_SP
resource: Uplevel the pmem "region" ida to a global allocator
device-dax: Add a driver for "hmem" devices
acpi/numa/hmat: Register HMAT at device_initcall level
acpi/numa/hmat: Register "specific purpose" memory as an "hmem" device
arch/x86/Kconfig | 23 +++++
arch/x86/boot/compressed/eboot.c | 5 +
arch/x86/boot/compressed/kaslr.c | 3 -
arch/x86/include/asm/e820/types.h | 9 ++
arch/x86/include/asm/efi.h | 34 ++++++++
arch/x86/kernel/e820.c | 12 ++-
arch/x86/kernel/setup.c | 21 +++--
arch/x86/platform/efi/efi.c | 40 +++++++++
arch/x86/platform/efi/quirks.c | 3 +
drivers/acpi/Kconfig | 9 --
drivers/acpi/Makefile | 3 -
drivers/acpi/hmat/Makefile | 2
drivers/acpi/numa/Kconfig | 8 ++
drivers/acpi/numa/Makefile | 3 +
drivers/acpi/numa/hmat.c | 148 +++++++++++++++++++++++++++++++----
drivers/acpi/numa/srat.c | 0
drivers/dax/Kconfig | 27 +++++-
drivers/dax/Makefile | 2
drivers/dax/hmem.c | 57 +++++++++++++
drivers/firmware/efi/Makefile | 5 +
drivers/firmware/efi/efi.c | 5 +
drivers/firmware/efi/esrt.c | 3 +
drivers/firmware/efi/fake_mem.c | 26 +++---
drivers/firmware/efi/fake_mem.h | 10 ++
drivers/firmware/efi/x86-fake_mem.c | 69 ++++++++++++++++
drivers/nvdimm/Kconfig | 1
drivers/nvdimm/core.c | 1
drivers/nvdimm/nd-core.h | 1
drivers/nvdimm/region_devs.c | 12 +--
include/linux/efi.h | 3 -
include/linux/ioport.h | 32 ++++++++
kernel/resource.c | 6 +
lib/Kconfig | 3 +
33 files changed, 504 insertions(+), 82 deletions(-)
delete mode 100644 drivers/acpi/hmat/Makefile
rename drivers/acpi/{hmat/Kconfig => numa/Kconfig} (70%)
create mode 100644 drivers/acpi/numa/Makefile
rename drivers/acpi/{hmat/hmat.c => numa/hmat.c} (81%)
rename drivers/acpi/{numa.c => numa/srat.c} (100%)
create mode 100644 drivers/dax/hmem.c
create mode 100644 drivers/firmware/efi/fake_mem.h
create mode 100644 drivers/firmware/efi/x86-fake_mem.c