When creating a new namespace, the kernel needs to make sure the namespace size and
start address are correctly aligned to SUBSECTION_SIZE. This ensures the kernel can
enable/disable the namespace without conflicting with memory hotplug rules. Without this,
a namespace that partially covers a subsection can prevent the creation of an
adjacent namespace because the hotplug subsystem will find the subsection already active.
To make sure a new kernel doesn't break an existing install with an unaligned start/size attribute,
while initializing the namespace the kernel validates these attributes against the direct-map
mapping page size rather than the subsection size.
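
As a rough illustration of the check described above (not the code from this
series), the alignment rule can be sketched in user-space C. The names
EXAMPLE_SUBSECTION_SIZE, example_is_aligned() and example_namespace_range_ok()
are hypothetical, and the 2 MiB value is an assumption about the subsection
size; the caller picks which alignment applies (subsection size for a new
namespace, direct-map mapping page size for an existing one).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption for illustration: a 2 MiB subsection size. */
#define EXAMPLE_SUBSECTION_SIZE (UINT64_C(1) << 21)

/* True if value is a multiple of align; align must be a power of two. */
bool example_is_aligned(uint64_t value, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (value & (align - 1)) == 0;
}

/*
 * Hypothetical helper: a namespace range is acceptable only if both its
 * start address and its size are aligned to the required granularity.
 */
bool example_namespace_range_ok(uint64_t start, uint64_t size, uint64_t align)
{
	return size != 0 &&
	       example_is_aligned(start, align) &&
	       example_is_aligned(size, align);
}
```

With these assumptions, a 1 GiB range starting at 4 GiB passes the check, while
the same range shifted by 4 KiB fails it against the subsection size but would
still pass against a 4 KiB mapping page size.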
Aneesh Kumar K.V (6):
libnvdimm/namespace: Make namespace size validation arch dependent
libnvdimm/namespace: Validate namespace start addr and size
libnvdimm/namespace: Add arch dependent callback for namespace create
libnvdimm/namespace: Validate namespace size when creating a new
libnvdimm/namespace: Align DPA based on arch restrictions
libnvdimm/namespace: Expose arch specific supported size align value
arch/arm64/mm/flush.c | 13 +++++
arch/powerpc/lib/pmem.c | 20 ++++++++
arch/x86/mm/pageattr.c | 13 +++++
drivers/nvdimm/namespace_devs.c | 85 +++++++++++++++++++++++++++++++--
include/linux/libnvdimm.h | 2 +
5 files changed, 128 insertions(+), 5 deletions(-)
This series is a mix of bug fixes, cleanup and new support in KVM's
handling of huge pages. The series initially stemmed from a syzkaller
bug report, which is fixed by patch 02, "mm: thp: KVM: Explicitly
check for THP when populating secondary MMU".
While investigating options for fixing the syzkaller bug, I realized KVM
could reuse the approach from Barret's series to enable huge pages for DAX
mappings in KVM for all types of huge mappings, i.e. walk the host page
tables instead of querying metadata (patches 05 - 09).
Walking the host page tables sidesteps the issues with refcounting and
identifying THP mappings (in theory), and using a common method for
identifying huge mappings should improve (haven't actually measured) KVM's
overall page fault latency by eliminating the vma lookup that is currently
used to identify HugeTLB mappings. Eliminating the HugeTLB specific code
also allows for additional cleanup (patches 10 - 13).
Testing the page walk approach revealed several pre-existing bugs that
are included here (patches 01, 03 and 04) because the changes interact
with the rest of the series, e.g. without the read-only memslots fix,
walking the host page tables without explicitly filtering out HugeTLB
mappings would pick up read-only memslots and introduce a completely
unintended functional change.
Lastly, with the page walk infrastructure in place, supporting DAX-based
huge mappings becomes a trivial change (patch 14).
Based on kvm/queue, commit e41a90be9659 ("KVM: x86/mmu: WARN if root_hpa
is invalid when handling a page fault")
Paolo, assuming I understand your workflow, patch 01 can be squashed with
the buggy commit as it's still sitting in kvm/queue.
Sean Christopherson (14):
KVM: x86/mmu: Enforce max_level on HugeTLB mappings
mm: thp: KVM: Explicitly check for THP when populating secondary MMU
KVM: Use vcpu-specific gva->hva translation when querying host page
KVM: Play nice with read-only memslots when querying host page size
x86/mm: Introduce lookup_address_in_mm()
KVM: x86/mmu: Refactor THP adjust to prep for changing query
KVM: x86/mmu: Walk host page tables to find THP mappings
KVM: x86/mmu: Drop level optimization from fast_page_fault()
KVM: x86/mmu: Rely on host page tables to find HugeTLB mappings
KVM: x86/mmu: Remove obsolete gfn restoration in FNAME(fetch)
KVM: x86/mmu: Zap any compound page when collapsing sptes
KVM: x86/mmu: Fold max_mapping_level() into kvm_mmu_hugepage_adjust()
KVM: x86/mmu: Remove lpage_is_disallowed() check from set_spte()
KVM: x86/mmu: Use huge pages for DAX-backed files
arch/powerpc/kvm/book3s_xive_native.c | 2 +-
arch/x86/include/asm/pgtable_types.h | 4 +
arch/x86/kvm/mmu/mmu.c | 208 ++++++++++----------------
arch/x86/kvm/mmu/paging_tmpl.h | 29 +---
arch/x86/mm/pageattr.c | 11 ++
include/linux/huge_mm.h | 6 +
include/linux/kvm_host.h | 3 +-
mm/huge_memory.c | 11 ++
virt/kvm/arm/mmu.c | 8 +-
virt/kvm/kvm_main.c | 24 ++-
10 files changed, 145 insertions(+), 161 deletions(-)
Changes since v2:
- Fix numa_cleanup_meminfo() to skip trimming reserved ranges to max_pfn.
- Collect Michael's acked-by
x86 folks: This has an ack from Rafael for ACPI, and Michael for Power.
With an x86 ack I plan to take this through the libnvdimm tree provided
the x86 touches look ok to you.
Arrange for platform numa info to be preserved for determining
'target_node' data, where a 'target_node' is the node a reserved memory
range will be assigned to when it is onlined.
This new infrastructure is expected to be more valuable over time for
Memory Tiers / Hierarchy management as more platforms (via the ACPI HMAT
and EFI Specific Purpose Memory) publish reserved or "soft-reserved"
ranges to Linux. Linux system administrators will expect to be able to
interact with those ranges with a unique numa node number when/if that
memory is onlined via the dax_kmem driver.
One configuration that currently fails to properly convey the target
node for the resulting memory hotplug operation is persistent memory
defined by the memmap=nn!ss parameter. For example, today if node1 is a
memory only node, and all the memory from node1 is specified to
memmap=nn!ss and subsequently onlined, it will end up being onlined as
node0 memory. As it stands, memory_add_physaddr_to_nid() can only
identify online nodes, and since node1 in this example has no online cpus
/ memory, the target node is initialized to node0.
The fix is to preserve rather than discard the numa_meminfo entries that
are relevant for reserved memory ranges, and to uplevel the node
distance helper for determining the "local" (closest) node relative to
an initiator node.
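
To make the "closest online node" idea concrete, here is a minimal user-space
sketch, not the kernel implementation. The names example_map_to_online_node(),
EXAMPLE_MAX_NODES and EXAMPLE_NO_NODE are hypothetical, and the online mask and
distance table are passed in explicitly rather than read from kernel state.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define EXAMPLE_MAX_NODES 4
#define EXAMPLE_NO_NODE (-1)

/*
 * Return node unchanged if it is "no node" or already online; otherwise
 * return the online node with the smallest distance from it. Returns
 * EXAMPLE_NO_NODE if no node is online at all.
 */
int example_map_to_online_node(int node,
		const bool online[EXAMPLE_MAX_NODES],
		const int distance[EXAMPLE_MAX_NODES][EXAMPLE_MAX_NODES])
{
	int best = EXAMPLE_NO_NODE;
	int best_dist = INT_MAX;

	if (node == EXAMPLE_NO_NODE)
		return node;
	assert(node >= 0 && node < EXAMPLE_MAX_NODES);
	if (online[node])
		return node;

	for (int n = 0; n < EXAMPLE_MAX_NODES; n++) {
		if (!online[n])
			continue;
		if (distance[node][n] < best_dist) {
			best_dist = distance[node][n];
			best = n;
		}
	}
	return best;
}
```

The early return for an already-online node reflects the intent of the second
patch in the shortlog below; the distance values themselves would come from the
platform firmware tables in a real system.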
Dan Williams (6):
ACPI: NUMA: Up-level "map to online node" functionality
mm/numa: Skip NUMA_NO_NODE and online nodes in numa_map_to_online_node()
powerpc/papr_scm: Switch to numa_map_to_online_node()
x86/mm: Introduce CONFIG_KEEP_NUMA
x86/numa: Provide a range-to-target_node lookup facility
libnvdimm/e820: Retrieve and populate correct 'target_node' info
arch/powerpc/platforms/pseries/papr_scm.c | 21 --------
arch/x86/Kconfig | 1
arch/x86/mm/numa.c | 74 +++++++++++++++++++++++------
drivers/acpi/numa/srat.c | 41 ----------------
drivers/nvdimm/e820.c | 18 ++-----
include/linux/acpi.h | 23 +++++++++
include/linux/numa.h | 23 +++++++++
mm/Kconfig | 5 ++
mm/mempolicy.c | 35 ++++++++++++++
9 files changed, 149 insertions(+), 92 deletions(-)
Since the last RFC patch set much of the discussion of supporting RDMA with
FS DAX has been around the semantics of the lease mechanism. Within that
thread it was suggested I try and write some documentation and/or tests for the
new mechanism being proposed. I have created a foundation to test lease
functionality within xfstests. This should be close to being accepted.
Before writing additional lease tests, or changing lots of kernel code, this
email presents documentation for the new proposed "layout lease" semantic.
At Linux Plumbers just over a week ago, I presented the current state of the
patch set and the outstanding issues. Based on the discussion there, as well as
follow up emails, I propose the following addition to the fcntl() man page.
<fcntl man page addition>
Layout (F_LAYOUT) leases are special leases which can be used to control and/or
be informed about the manipulation of the underlying layout of a file.
A layout is defined as the logical file block -> physical file block mapping
including the file size and sharing of physical blocks among files. Note that
the unwritten state of a block is not considered part of file layout.
**Read layout lease (F_RDLCK | F_LAYOUT)**
Read layout leases can be used to be informed of layout changes by the
system or other users. This lease is similar to the standard read (F_RDLCK)
lease in that any attempt to change the _layout_ of the file will be reported to
the process through the lease break process. But this lease is different
because the file can be opened for write and data can be read and/or written to
the file as long as the underlying layout of the file does not change.
Therefore, the lease is not broken if the file is simply open for write, but
_may_ be broken if an operation such as truncate(), fallocate() or write()
results in changing the underlying layout.
**Write layout lease (F_WRLCK | F_LAYOUT)**
Write Layout leases can be used to break read layout leases to indicate that
the process intends to change the underlying layout of the file.
A process which has taken a write layout lease has exclusive ownership of the
file layout and can modify that layout as long as the lease is held.
Operations which change the layout are allowed by that process. But operations
from other file descriptors which attempt to change the layout will break the
lease through the standard lease break process. The F_LAYOUT flag is used to
indicate a difference between a regular F_WRLCK and F_WRLCK with F_LAYOUT. In
the F_LAYOUT case opens for write do not break the lease. But some operations,
if they change the underlying layout, may.
The distinction between read layout leases and write layout leases is that
write layout leases can change the layout without breaking the lease within the
owning process. This is useful to guarantee a layout prior to specifying the
unbreakable flag described below.
**Unbreakable Layout Leases (F_UNBREAK)**
In order to support pinning of file pages by direct user space users an
unbreakable flag (F_UNBREAK) can be used to modify the read and write layout
lease. When specified, F_UNBREAK indicates that any user attempting to break
the lease will fail with ETXTBSY rather than follow the normal breaking procedure.
Both read and write layout leases can have the unbreakable flag (F_UNBREAK)
specified. The difference between an unbreakable read layout lease and an
unbreakable write layout lease is that an unbreakable read layout lease is
_not_ exclusive. This means that once a layout is established on a file,
multiple unbreakable read layout leases can be taken by multiple processes and
used to pin the underlying pages of that file.
Care must therefore be taken to ensure that the layout of the file is as the
user wants prior to using the unbreakable read layout lease. A safe mechanism
to do this would be to take a write layout lease and use fallocate() to set the
layout of the file. The layout lease can then be "downgraded" to unbreakable
read layout as long as no other user broke the write layout lease.
</fcntl man page addition>
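
To help review the proposed semantics, the rules above can be written down as a
small decision function. This is only a model of the text, not kernel code and
not a use of the real fcntl() interface: every example_* name is hypothetical,
EXAMPLE_LAYOUT_CHANGE stands for an operation that actually changes the layout,
and the return convention (0 = proceeds, 1 = triggers a lease break, -ETXTBSY =
refused) is an assumption made for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum example_lease_type { EXAMPLE_READ_LAYOUT, EXAMPLE_WRITE_LAYOUT };

enum example_op {
	EXAMPLE_OPEN_FOR_WRITE,	/* open(2) with write access */
	EXAMPLE_DATA_IO,	/* read/write that leaves the layout alone */
	EXAMPLE_LAYOUT_CHANGE,	/* truncate/fallocate/write that changes layout */
};

struct example_lease {
	enum example_lease_type type;
	bool unbreakable;	/* models the proposed F_UNBREAK flag */
	int owner;		/* hypothetical id of the lease holder */
};

/*
 * Model of the proposal: returns 0 if the operation proceeds without
 * touching the lease, 1 if it triggers a lease break, and -ETXTBSY if
 * the lease is unbreakable and the operation is refused.
 */
int example_lease_check(const struct example_lease *lease, int caller,
			enum example_op op)
{
	assert(lease != NULL);

	/* Opens for write and plain data I/O never break a layout lease. */
	if (op != EXAMPLE_LAYOUT_CHANGE)
		return 0;

	/*
	 * The holder of a write layout lease may change the layout.
	 * Assumption: this also holds when the lease is unbreakable;
	 * the text above does not say either way.
	 */
	if (lease->type == EXAMPLE_WRITE_LAYOUT && caller == lease->owner)
		return 0;

	if (lease->unbreakable)
		return -ETXTBSY;

	return 1;
}
```

Under this model a read layout lease is broken by a layout change from any
caller, including its holder, which matches the "any attempt" wording in the
read layout lease section.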