[RFC v2 0/2] kvm "fake DAX" device flushing
by Pankaj Gupta
This is RFC v2 of the 'fake DAX' flushing interface, shared
for review. This patchset has two main parts:
- Guest virtio-pmem driver
  The guest driver reads the persistent memory range from the
  paravirt device and registers it with 'nvdimm_bus'. The
  'nvdimm/pmem' driver uses this information to allocate the
  persistent memory range. We have also implemented the guest
  side of the VIRTIO flushing interface.
- Qemu virtio-pmem device
  It exposes a persistent memory range to the KVM guest; on
  the host side this is file-backed memory that works as a
  persistent memory device. In addition, it provides the
  virtio device handling of the flushing interface: the KVM
  guest triggers an asynchronous Qemu-side sync through it.
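For reference, a minimal sketch of what the guest-side flush request
could look like; the request layout and all names here are illustrative
assumptions, not the RFC's actual drivers/virtio/virtio_pmem.c code,
and a real driver would sleep on a completion rather than busy-wait:

#include <linux/virtio.h>
#include <linux/scatterlist.h>
#include <linux/gfp.h>

struct virtio_pmem_request {
	__le32 type;	/* hypothetical: VIRTIO_PMEM_REQ_FLUSH */
	__le32 ret;	/* host writes back its fsync() result */
};

static int virtio_pmem_flush_sketch(struct virtqueue *vq,
				    struct virtio_pmem_request *req)
{
	struct scatterlist sg_out, sg_in, *sgs[2];
	unsigned int len;
	int err;

	/* One driver-to-device buffer, one device-to-driver buffer */
	sg_init_one(&sg_out, &req->type, sizeof(req->type));
	sg_init_one(&sg_in, &req->ret, sizeof(req->ret));
	sgs[0] = &sg_out;
	sgs[1] = &sg_in;

	err = virtqueue_add_sgs(vq, sgs, 1, 1, req, GFP_KERNEL);
	if (err)
		return err;
	virtqueue_kick(vq);

	/* Qemu performs the asynchronous host-side sync, then completes */
	while (!virtqueue_get_buf(vq, &len))
		cpu_relax();

	return le32_to_cpu(req->ret) ? -EIO : 0;
}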
Changes from previous RFC[1]:
- Reuse existing 'pmem' code for registering persistent
memory and other operations instead of creating an entirely
new block driver.
- Use VIRTIO driver to register memory information with
nvdimm_bus and create region_type accordingly.
- Call VIRTIO flush from existing pmem driver.
Details of the project idea for the 'fake DAX' flushing interface
are shared in [2] & [3].
Pankaj Gupta (2):
Add virtio-pmem guest driver
pmem: device flush over VIRTIO
[1] https://marc.info/?l=linux-mm&m=150782346802290&w=2
[2] https://www.spinics.net/lists/kvm/msg149761.html
[3] https://www.spinics.net/lists/kvm/msg153095.html
drivers/nvdimm/region_devs.c | 7 ++
drivers/virtio/Kconfig | 12 +++
drivers/virtio/Makefile | 1
drivers/virtio/virtio_pmem.c | 118 +++++++++++++++++++++++++++++++++++++++
include/linux/libnvdimm.h | 4 +
include/uapi/linux/virtio_ids.h | 1
include/uapi/linux/virtio_pmem.h | 58 +++++++++++++++++++
7 files changed, 201 insertions(+)
Re: [patch 4/4] dm-writecache: use new API for flushing
by Mike Snitzer
On Fri, May 25 2018 at 2:17am -0400,
Mikulas Patocka <mpatocka(a)redhat.com> wrote:
>
>
> On Thu, 24 May 2018, Dan Williams wrote:
>
> > On Fri, May 18, 2018 at 10:25 PM, Mikulas Patocka <mpatocka(a)redhat.com> wrote:
> > > Use new API for flushing persistent memory.
> > >
> > > The problem is this:
> > > * on X86-64, non-temporal stores have the best performance
> > > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
> > > should flush cache as late as possible, because it performs better this
> > > way.
> > >
> > > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
> > > data persistently, all three functions must be called.
> > >
> > > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
> > > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
> > >
> > > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> > > pmem_commit is wmb.
> > >
> > > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> > > pmem_commit is empty.
> >
> > I don't want to grow driver-local wrappers for pmem. You should use
> > memcpy_flushcache directly() and if an architecture does not define
> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this
> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't
> > see a need to add a standalone flush operation if all relevant archs
> > provide memcpy_flushcache(). As for commit, I'd say just use wmb()
> > directly since all archs define it. Alternatively we could introduce
> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy
> > routine and memcpy_flushcache() would imply a wmb().
>
> But memcpy_flushcache() on ARM64 is slow.
Yes. Dan, can you please take some time to fully appreciate what this
small wrapper API is providing? (Maybe you've done that, but your recent
reply sends a mixed message.) It seems you're keeping the innovation it
provides at arm's length. Basically, the PMEM APIs you've helped
construct are lacking, and forcing DM developers to own fixing them is
an inversion that only serves to stonewall.
Please, less time on stonewalling and more time lifting this wrapper
API; otherwise the dm-writecache local wrapper API is a near-term means
to an end.
I revised the dm-writecache patch header yesterday, please review:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.gi...
(the last paragraph in particular, I intend to move forward with this
wrapper unless someone on the PMEM side of the house steps up and lifts
it up between now and when the 4.18 merge window opens)
dm: add writecache target
The writecache target caches writes on persistent memory or SSD.
It is intended for databases or other programs that need extremely low
commit latency.
The writecache target doesn't cache reads because reads are supposed to
be cached in page cache in normal RAM.
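As a hypothetical usage sketch, assuming the constructor arguments
described in the target's documentation (type 'p' for pmem or 's' for
SSD, then origin device, cache device, block size, and the count of
optional arguments; device names here are made up):

$ dmsetup create wc --table "0 $(blockdev --getsz /dev/sdb) \
    writecache p /dev/sdb /dev/pmem0 4096 0"

This would cache writes to /dev/sdb on the persistent memory at
/dev/pmem0 with a 4096-byte block size and no optional arguments.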
The following describes the approach used to provide the most efficient
flushing of persistent memory on X86_64 vs ARM64:
* On X86_64 non-temporal stores (i.e. memcpy_flushcache) are faster
than cached writes and cache flushing.
* The ARM64 architecture doesn't have non-temporal stores. So,
memcpy_flushcache on ARM does a memcpy (which writes data through the
cache) and then flushes the cache. But this eager cache flushing is
slower than late cache flushing.
The optimal code sequence on ARM to write to persistent memory is to
call memcpy, then do something else, and then call arch_wb_cache_pmem as
late as possible. And this ARM-optimized code sequence is just horribly
slow on X86.
This issue can't be "fixed" in ARM-specific source code. The ARM
processor has characteristics such that eager cache flushing is slower
than late cache flushing - and that's it - you can't change processor
behavior.
We introduce a wrapper API for flushing persistent memory with functions
pmem_memcpy, pmem_flush and pmem_commit. To commit data persistently,
all three functions must be called.
The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
(unlike pmem_memcpy) guarantees that 8-byte values are written
atomically.
On X86_64, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
pmem_commit is wmb.
On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
pmem_commit is empty.
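For concreteness, a minimal sketch of the wrapper described above,
assuming the arch config symbol is the selector; the actual
dm-writecache definitions may differ in detail:

#include <linux/string.h>	/* memcpy_flushcache() */
#include <linux/libnvdimm.h>	/* arch_wb_cache_pmem() */
#include <asm/barrier.h>	/* wmb() */

#if defined(CONFIG_X86_64)
/* Non-temporal copy now; nothing to flush later; order with wmb(). */
#define pmem_memcpy(dest, src, len)	memcpy_flushcache((dest), (src), (len))
#define pmem_flush(addr, len)		do { } while (0)
#define pmem_commit()			wmb()
#else
/* Cached copy now; flush as late as possible; no extra commit step. */
#define pmem_memcpy(dest, src, len)	memcpy((dest), (src), (len))
#define pmem_flush(addr, len)		arch_wb_cache_pmem((addr), (len))
#define pmem_commit()			do { } while (0)
#endif

/*
 * pmem_assign: unlike pmem_memcpy, this relies on the underlying copy
 * using a single aligned store for 8-byte sizes, so readers never
 * observe a torn value.
 */
#define pmem_assign(dest, src)					\
do {								\
	typeof(dest) uniq = (src);				\
	pmem_memcpy(&(dest), &uniq, sizeof(dest));		\
} while (0)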
It is clear that this wrapper API for flushing persistent memory needs
to be elevated out of this dm-writecache driver. But that can happen
later without requiring DM developers to blaze new trails on pmem
specific implementation details/quirks (pmem developers need to clean up
their APIs given they are already spread across CONFIG_ARCH_HAS_PMEM_API
and CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE and they absolutely don't take
into account the duality of the different programming models needed to
achieve optimal cross-architecture use of persistent memory).
Signed-off-by: Mikulas Patocka <mpatocka(a)redhat.com>
Signed-off-by: Mike Snitzer <msnitzer(a)redhat.com>
[PATCH v10] mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
by Dan Williams
In preparation for fixing dax-dma-vs-unmap issues, filesystems need to
be able to rely on the fact that they will get wakeups on dev_pagemap
page-idle events. Introduce MEMORY_DEVICE_FS_DAX and
generic_dax_page_free() as common indicator / infrastructure for dax
filesystems to require. With this change there are no users of the
MEMORY_DEVICE_HOST designation, so remove it.
The HMM sub-system extended dev_pagemap to arrange a callback when a
dev_pagemap managed page is freed. Since a dev_pagemap page is free /
idle when its reference count is 1, it requires an additional branch to
check the page-type at put_page() time. Given put_page() is a hot-path
we do not want to incur that check if HMM is not in use, so a static
branch is used to avoid that overhead when not necessary.
Now, the FS_DAX implementation wants to reuse this mechanism for
receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
static-key into a generic mechanism that either HMM or FS_DAX code paths
can enable.
For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support,
care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure.
However, we still need to support FS_DAX in the FS_DAX_LIMITED case
implemented by the s390/dcssblk driver.
Cc: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Cc: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Reported-by: kbuild test robot <lkp(a)intel.com>
Reported-by: Thomas Meyer <thomas(a)m3y3r.de>
Reported-by: Dave Jiang <dave.jiang(a)intel.com>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: "Jérôme Glisse" <jglisse(a)redhat.com>
Cc: Jan Kara <jack(a)suse.cz>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
This patch replaces and consolidates patch 2 [1] and 4 [2] from the v9
series [3] for "dax: fix dma vs truncate/hole-punch".
The original implementation which introduced fs_dax_claim() was broken
in the presence of partitions as filesystems on partitions of a pmem
device would collide when attempting to issue fs_dax_claim().
Instead, since this new page wakeup behavior is a property of
dev_pagemap pages and there is a 1:1 relationship between a pmem device
and its dev_pagemap instance, make the pmem driver own the page wakeup
initialization rather than the filesystem.
This simplifies the implementation considerably. The diffstat for the
series is now:
21 files changed, 546 insertions(+), 277 deletions(-)
...down from:
24 files changed, 730 insertions(+), 297 deletions(-)
The other patches in the series are not included since they did not
change in any meaningful way. Let me know if anyone wants a full resend,
it will otherwise be available in -next shortly. Given the change in
approach I did not carry Reviewed-by tags from patch 2 and 4 to this
patch.
[1]: [PATCH v9 2/9] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
https://lists.01.org/pipermail/linux-nvdimm/2018-April/015459.html
[2]: [PATCH v9 4/9] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
https://lists.01.org/pipermail/linux-nvdimm/2018-April/015461.html
[3]: [PATCH v9 0/9] dax: fix dma vs truncate/hole-punch
https://lists.01.org/pipermail/linux-nvdimm/2018-April/015457.html
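To illustrate the wakeup contract, the consumer side could pair with
generic_dax_pagefree() roughly like this; dax_wait_page_idle() is a
hypothetical helper for illustration, not part of this series:

#include <linux/mm.h>
#include <linux/wait_bit.h>

/* Hypothetical sketch: block until a dev_pagemap page goes idle,
 * i.e. until generic_dax_pagefree() fires wake_up_var() on the
 * refcount and the count has dropped to the idle value (1). */
static void dax_wait_page_idle(struct page *page)
{
	wait_var_event(&page->_refcount,
		       page_ref_count(page) == 1);
}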
drivers/dax/super.c | 18 ++++++++---
drivers/nvdimm/pfn_devs.c | 2 -
drivers/nvdimm/pmem.c | 20 +++++++++++++
fs/Kconfig | 1 +
include/linux/memremap.h | 41 ++++++++++----------------
include/linux/mm.h | 71 ++++++++++++++++++++++++++++++++++-----------
kernel/memremap.c | 37 +++++++++++++++++++++--
mm/Kconfig | 5 +++
mm/hmm.c | 13 +-------
mm/swap.c | 3 +-
10 files changed, 143 insertions(+), 68 deletions(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 2b2332b605e4..4928b7fcfb71 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -134,15 +134,21 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
* on being able to do (page_address(pfn_to_page())).
*/
WARN_ON(IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API));
+ return 0;
} else if (pfn_t_devmap(pfn)) {
- /* pass */;
- } else {
- pr_debug("VFS (%s): error: dax support not enabled\n",
- sb->s_id);
- return -EOPNOTSUPP;
+ struct dev_pagemap *pgmap;
+
+ pgmap = get_dev_pagemap(pfn_t_to_pfn(pfn), NULL);
+ if (pgmap && pgmap->type == MEMORY_DEVICE_FS_DAX) {
+ put_dev_pagemap(pgmap);
+ return 0;
+ }
+ put_dev_pagemap(pgmap);
}
- return 0;
+ pr_debug("VFS (%s): error: dax support not enabled\n",
+ sb->s_id);
+ return -EOPNOTSUPP;
}
EXPORT_SYMBOL_GPL(__bdev_dax_supported);
#endif
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 30b08791597d..3f7ad5bc443e 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -561,8 +561,6 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
res->start += start_pad;
res->end -= end_trunc;
- pgmap->type = MEMORY_DEVICE_HOST;
-
if (nd_pfn->mode == PFN_MODE_RAM) {
if (offset < SZ_8K)
return -EINVAL;
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 9d714926ecf5..e6d94604e9a4 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -294,6 +294,22 @@ static void pmem_release_disk(void *__pmem)
put_disk(pmem->disk);
}
+static void pmem_release_pgmap_ops(void *__pgmap)
+{
+ dev_pagemap_put_ops();
+}
+
+static int setup_pagemap_fsdax(struct device *dev, struct dev_pagemap *pgmap)
+{
+ dev_pagemap_get_ops();
+ if (devm_add_action_or_reset(dev, pmem_release_pgmap_ops, pgmap))
+ return -ENOMEM;
+ pgmap->type = MEMORY_DEVICE_FS_DAX;
+ pgmap->page_free = generic_dax_pagefree;
+
+ return 0;
+}
+
static int pmem_attach_disk(struct device *dev,
struct nd_namespace_common *ndns)
{
@@ -353,6 +369,8 @@ static int pmem_attach_disk(struct device *dev,
pmem->pfn_flags = PFN_DEV;
pmem->pgmap.ref = &q->q_usage_counter;
if (is_nd_pfn(dev)) {
+ if (setup_pagemap_fsdax(dev, &pmem->pgmap))
+ return -ENOMEM;
addr = devm_memremap_pages(dev, &pmem->pgmap);
pfn_sb = nd_pfn->pfn_sb;
pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
@@ -364,6 +382,8 @@ static int pmem_attach_disk(struct device *dev,
} else if (pmem_should_map_pages(dev)) {
memcpy(&pmem->pgmap.res, &nsio->res, sizeof(pmem->pgmap.res));
pmem->pgmap.altmap_valid = false;
+ if (setup_pagemap_fsdax(dev, &pmem->pgmap))
+ return -ENOMEM;
addr = devm_memremap_pages(dev, &pmem->pgmap);
pmem->pfn_flags |= PFN_MAP;
memcpy(&bb_res, &pmem->pgmap.res, sizeof(bb_res));
diff --git a/fs/Kconfig b/fs/Kconfig
index bc821a86d965..1e050e012eb9 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -38,6 +38,7 @@ config FS_DAX
bool "Direct Access (DAX) support"
depends on MMU
depends on !(ARM || MIPS || SPARC)
+ select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED)
select FS_IOMAP
select DAX
help
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7b4899c06f49..29ea63544c4d 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -1,7 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_MEMREMAP_H_
#define _LINUX_MEMREMAP_H_
-#include <linux/mm.h>
#include <linux/ioport.h>
#include <linux/percpu-refcount.h>
@@ -30,13 +29,6 @@ struct vmem_altmap {
* Specialize ZONE_DEVICE memory into multiple types each having differents
* usage.
*
- * MEMORY_DEVICE_HOST:
- * Persistent device memory (pmem): struct page might be allocated in different
- * memory and architecture might want to perform special actions. It is similar
- * to regular memory, in that the CPU can access it transparently. However,
- * it is likely to have different bandwidth and latency than regular memory.
- * See Documentation/nvdimm/nvdimm.txt for more information.
- *
* MEMORY_DEVICE_PRIVATE:
* Device memory that is not directly addressable by the CPU: CPU can neither
* read nor write private memory. In this case, we do still have struct pages
@@ -53,11 +45,19 @@ struct vmem_altmap {
* driver can hotplug the device memory using ZONE_DEVICE and with that memory
* type. Any page of a process can be migrated to such memory. However no one
* should be allow to pin such memory so that it can always be evicted.
+ *
+ * MEMORY_DEVICE_FS_DAX:
+ * Host memory that has similar access semantics as System RAM i.e. DMA
+ * coherent and supports page pinning. In support of coordinating page
+ * pinning vs other operations MEMORY_DEVICE_FS_DAX arranges for a
+ * wakeup event whenever a page is unpinned and becomes idle. This
+ * wakeup is used to coordinate physical address space management (ex:
+ * fs truncate/hole punch) vs pinned pages (ex: device dma).
*/
enum memory_type {
- MEMORY_DEVICE_HOST = 0,
- MEMORY_DEVICE_PRIVATE,
+ MEMORY_DEVICE_PRIVATE = 1,
MEMORY_DEVICE_PUBLIC,
+ MEMORY_DEVICE_FS_DAX,
};
/*
@@ -123,15 +123,18 @@ struct dev_pagemap {
};
#ifdef CONFIG_ZONE_DEVICE
+void generic_dax_pagefree(struct page *page, void *data);
void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
struct dev_pagemap *pgmap);
unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
-
-static inline bool is_zone_device_page(const struct page *page);
#else
+static inline void generic_dax_pagefree(struct page *page, void *data)
+{
+}
+
static inline void *devm_memremap_pages(struct device *dev,
struct dev_pagemap *pgmap)
{
@@ -161,20 +164,6 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
}
#endif /* CONFIG_ZONE_DEVICE */
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-static inline bool is_device_private_page(const struct page *page)
-{
- return is_zone_device_page(page) &&
- page->pgmap->type == MEMORY_DEVICE_PRIVATE;
-}
-
-static inline bool is_device_public_page(const struct page *page)
-{
- return is_zone_device_page(page) &&
- page->pgmap->type == MEMORY_DEVICE_PUBLIC;
-}
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-
static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
{
if (pgmap)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ac1f06a4be6..6e19265ee8f8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -821,27 +821,65 @@ static inline bool is_zone_device_page(const struct page *page)
}
#endif
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-void put_zone_device_private_or_public_page(struct page *page);
-DECLARE_STATIC_KEY_FALSE(device_private_key);
-#define IS_HMM_ENABLED static_branch_unlikely(&device_private_key)
-static inline bool is_device_private_page(const struct page *page);
-static inline bool is_device_public_page(const struct page *page);
-#else /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-static inline void put_zone_device_private_or_public_page(struct page *page)
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+void dev_pagemap_get_ops(void);
+void dev_pagemap_put_ops(void);
+void __put_devmap_managed_page(struct page *page);
+DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
+static inline bool put_devmap_managed_page(struct page *page)
+{
+ if (!static_branch_unlikely(&devmap_managed_key))
+ return false;
+ if (!is_zone_device_page(page))
+ return false;
+ switch (page->pgmap->type) {
+ case MEMORY_DEVICE_PRIVATE:
+ case MEMORY_DEVICE_PUBLIC:
+ case MEMORY_DEVICE_FS_DAX:
+ __put_devmap_managed_page(page);
+ return true;
+ default:
+ break;
+ }
+ return false;
+}
+
+static inline bool is_device_private_page(const struct page *page)
{
+ return is_zone_device_page(page) &&
+ page->pgmap->type == MEMORY_DEVICE_PRIVATE;
}
-#define IS_HMM_ENABLED 0
+
+static inline bool is_device_public_page(const struct page *page)
+{
+ return is_zone_device_page(page) &&
+ page->pgmap->type == MEMORY_DEVICE_PUBLIC;
+}
+
+#else /* CONFIG_DEV_PAGEMAP_OPS */
+static inline void dev_pagemap_get_ops(void)
+{
+}
+
+static inline void dev_pagemap_put_ops(void)
+{
+}
+
+static inline bool put_devmap_managed_page(struct page *page)
+{
+ return false;
+}
+
static inline bool is_device_private_page(const struct page *page)
{
return false;
}
+
static inline bool is_device_public_page(const struct page *page)
{
return false;
}
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
static inline void get_page(struct page *page)
{
@@ -859,16 +897,13 @@ static inline void put_page(struct page *page)
page = compound_head(page);
/*
- * For private device pages we need to catch refcount transition from
- * 2 to 1, when refcount reach one it means the private device page is
- * free and we need to inform the device driver through callback. See
+ * For devmap managed pages we need to catch refcount transition from
+ * 2 to 1, when refcount reach one it means the page is free and we
+ * need to inform the device driver through callback. See
* include/linux/memremap.h and HMM for details.
*/
- if (IS_HMM_ENABLED && unlikely(is_device_private_page(page) ||
- unlikely(is_device_public_page(page)))) {
- put_zone_device_private_or_public_page(page);
+ if (put_devmap_managed_page(page))
return;
- }
if (put_page_testzero(page))
__put_page(page);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 37a9604133f6..3e546c6beb42 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -9,12 +9,19 @@
#include <linux/memory_hotplug.h>
#include <linux/swap.h>
#include <linux/swapops.h>
+#include <linux/wait_bit.h>
static DEFINE_MUTEX(pgmap_lock);
static RADIX_TREE(pgmap_radix, GFP_KERNEL);
#define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
#define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
+void generic_dax_pagefree(struct page *page, void *data)
+{
+ wake_up_var(&page->_refcount);
+}
+EXPORT_SYMBOL_GPL(generic_dax_pagefree);
+
static unsigned long order_at(struct resource *res, unsigned long pgoff)
{
unsigned long phys_pgoff = PHYS_PFN(res->start) + pgoff;
@@ -301,8 +308,30 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
return pgmap;
}
-#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) || IS_ENABLED(CONFIG_DEVICE_PUBLIC)
-void put_zone_device_private_or_public_page(struct page *page)
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
+EXPORT_SYMBOL_GPL(devmap_managed_key);
+static atomic_t devmap_enable;
+
+/*
+ * Toggle the static key for ->page_free() callbacks when dev_pagemap
+ * pages go idle.
+ */
+void dev_pagemap_get_ops(void)
+{
+ if (atomic_inc_return(&devmap_enable) == 1)
+ static_branch_enable(&devmap_managed_key);
+}
+EXPORT_SYMBOL_GPL(dev_pagemap_get_ops);
+
+void dev_pagemap_put_ops(void)
+{
+ if (atomic_dec_and_test(&devmap_enable))
+ static_branch_disable(&devmap_managed_key);
+}
+EXPORT_SYMBOL_GPL(dev_pagemap_put_ops);
+
+void __put_devmap_managed_page(struct page *page)
{
int count = page_ref_dec_return(page);
@@ -322,5 +351,5 @@ void put_zone_device_private_or_public_page(struct page *page)
} else if (!count)
__put_page(page);
}
-EXPORT_SYMBOL(put_zone_device_private_or_public_page);
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
+EXPORT_SYMBOL_GPL(__put_devmap_managed_page);
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/Kconfig b/mm/Kconfig
index d5004d82a1d6..bf9d6366bced 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -692,6 +692,9 @@ config ARCH_HAS_HMM
config MIGRATE_VMA_HELPER
bool
+config DEV_PAGEMAP_OPS
+ bool
+
config HMM
bool
select MIGRATE_VMA_HELPER
@@ -712,6 +715,7 @@ config DEVICE_PRIVATE
bool "Unaddressable device memory (GPU memory, ...)"
depends on ARCH_HAS_HMM
select HMM
+ select DEV_PAGEMAP_OPS
help
Allows creation of struct pages to represent unaddressable device
@@ -722,6 +726,7 @@ config DEVICE_PUBLIC
bool "Addressable device memory (like GPU memory)"
depends on ARCH_HAS_HMM
select HMM
+ select DEV_PAGEMAP_OPS
help
Allows creation of struct pages to represent addressable device
diff --git a/mm/hmm.c b/mm/hmm.c
index 486dc394a5a3..de7b6bf77201 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -35,15 +35,6 @@
#define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-/*
- * Device private memory see HMM (Documentation/vm/hmm.txt) or hmm.h
- */
-DEFINE_STATIC_KEY_FALSE(device_private_key);
-EXPORT_SYMBOL(device_private_key);
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-
-
#if IS_ENABLED(CONFIG_HMM_MIRROR)
static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
@@ -1167,7 +1158,7 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
resource_size_t addr;
int ret;
- static_branch_enable(&device_private_key);
+ dev_pagemap_get_ops();
devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
GFP_KERNEL, dev_to_node(device));
@@ -1261,7 +1252,7 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
return ERR_PTR(-EINVAL);
- static_branch_enable(&device_private_key);
+ dev_pagemap_get_ops();
devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
GFP_KERNEL, dev_to_node(device));
diff --git a/mm/swap.c b/mm/swap.c
index 3dd518832096..26fc9b5f1b6c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -29,6 +29,7 @@
#include <linux/cpu.h>
#include <linux/notifier.h>
#include <linux/backing-dev.h>
+#include <linux/memremap.h>
#include <linux/memcontrol.h>
#include <linux/gfp.h>
#include <linux/uio.h>
@@ -743,7 +744,7 @@ void release_pages(struct page **pages, int nr)
flags);
locked_pgdat = NULL;
}
- put_zone_device_private_or_public_page(page);
+ put_devmap_managed_page(page);
continue;
}
[PATCH] nvdimm: preserve read-only setting for pmem devices
by Robert Elliott
The pmem driver doesn't honor a forced read-only setting for very long:
$ blockdev --setro /dev/pmem0
$ blockdev --getro /dev/pmem0
1
followed by various commands like these:
$ blockdev --rereadpt /dev/pmem0
or
$ mkfs.ext4 /dev/pmem0
results in this in the kernel serial log:
nd_pmem namespace0.0: region0 read-write, marking pmem0 read-write
with the read-only setting lost:
$ blockdev --getro /dev/pmem0
0
That's from bus.c nvdimm_revalidate_disk(), which always applies the
setting from nd_region (which is initially based on the ACPI NFIT
NVDIMM state flags not_armed bit).
In contrast, commit 20bd1d026aac ("scsi: sd: Keep disk read-only when
re-reading partition") fixed this issue for SCSI devices to preserve
the previous setting if it was set to read-only.
This patch modifies bus.c to preserve any previous read-only setting.
It also eliminates the kernel serial log print except for cases where
read-write is changed to read-only, so it no longer prints a message
for read-only to read-only non-changes.
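With this patch applied, the same sequence should preserve the setting:
$ blockdev --setro /dev/pmem0
$ blockdev --rereadpt /dev/pmem0
$ blockdev --getro /dev/pmem0
1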
Signed-off-by: Robert Elliott <elliott(a)hpe.com>
---
drivers/nvdimm/bus.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index a64023690cad..87762afa45b0 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -566,14 +566,17 @@ int nvdimm_revalidate_disk(struct gendisk *disk)
{
struct device *dev = disk_to_dev(disk)->parent;
struct nd_region *nd_region = to_nd_region(dev->parent);
- const char *pol = nd_region->ro ? "only" : "write";
+ int disk_ro = get_disk_ro(disk);
- if (nd_region->ro == get_disk_ro(disk))
+ /* upgrade to read-only if the region is read-only
+ * preserve as read-only if the disk is already read-only
+ */
+ if (disk_ro || nd_region->ro == disk_ro)
return 0;
- dev_info(dev, "%s read-%s, marking %s read-%s\n",
- dev_name(&nd_region->dev), pol, disk->disk_name, pol);
- set_disk_ro(disk, nd_region->ro);
+ dev_info(dev, "%s read-only, marking %s read-only\n",
+ dev_name(&nd_region->dev), disk->disk_name);
+ set_disk_ro(disk, 1);
return 0;
--
2.14.3
[PATCH 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages
by Dan Williams
As it stands, memory_failure() gets thoroughly confused by dev_pagemap
backed mappings. The recovery code has specific enabling for several
possible page states and needs new enabling to handle poison in dax
mappings.
In order to support reliable reverse mapping of user space addresses,
add new locking in the fsdax implementation to prevent races between
page-address_space disassociation events and the rmap performed in the
memory_failure() path. Additionally, since dev_pagemap pages are hidden
from the page allocator, add a mechanism to determine the size of the
mapping that encompasses a given poisoned pfn. Lastly, since pmem errors
can be repaired, change the speculatively accessed poison protection,
mce_unmap_kpfn(), to be reversible and otherwise allow ongoing access
from the kernel.
---
Dan Williams (11):
device-dax: convert to vmf_insert_mixed and vm_fault_t
device-dax: cleanup vm_fault de-reference chains
device-dax: enable page_mapping()
device-dax: set page->index
filesystem-dax: set page->index
filesystem-dax: perform __dax_invalidate_mapping_entry() under the page lock
mm, madvise_inject_error: fix page count leak
x86, memory_failure: introduce {set,clear}_mce_nospec()
mm, memory_failure: pass page size to kill_proc()
mm, memory_failure: teach memory_failure() about dev_pagemap pages
libnvdimm, pmem: restore page attributes when clearing errors
arch/x86/include/asm/set_memory.h | 29 ++++++
arch/x86/kernel/cpu/mcheck/mce-internal.h | 15 ---
arch/x86/kernel/cpu/mcheck/mce.c | 38 +-------
drivers/dax/device.c | 91 ++++++++++++--------
drivers/nvdimm/pmem.c | 26 ++++++
drivers/nvdimm/pmem.h | 13 +++
fs/dax.c | 102 ++++++++++++++++++++--
include/linux/huge_mm.h | 5 +
include/linux/set_memory.h | 14 +++
mm/huge_memory.c | 4 -
mm/madvise.c | 11 ++
mm/memory-failure.c | 133 +++++++++++++++++++++++++++--
12 files changed, 370 insertions(+), 111 deletions(-)
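As a sketch of the last patch's idea, clearing a pmem error could
restore regular kernel access by reversing the protection per pfn;
clear_mce_nospec() is named by the series, but this caller and the
exact signature are assumptions for illustration:

#include <asm/set_memory.h>

/* Hypothetical sketch: after a poisoned pmem range is repaired,
 * undo the speculative-access protection for each pfn. */
static void pmem_restore_access_sketch(unsigned long pfn,
				       unsigned long npfns)
{
	unsigned long i;

	for (i = 0; i < npfns; i++)
		clear_mce_nospec(pfn + i);
}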
arch_wb_cache_pmem on ARM64
by Mikulas Patocka
Hi
I'd like to ask what's the purpose of dmb(osh) in the function
arch_wb_cache_pmem in arch/arm64/mm/flush.c?
void arch_wb_cache_pmem(void *addr, size_t size)
{
/* Ensure order against any prior non-cacheable writes */
dmb(osh);
__clean_dcache_area_pop(addr, size);
}
The processor may flush the cache spontaneously, which means that all
the flushing may actually happen before the dmb(osh) instruction - so
what does that dmb instruction guard against?
Mikulas
Re: [dm-devel] [patch 4/4] dm-writecache: use new API for flushing
by Christoph Hellwig
On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote:
> Use new API for flushing persistent memory.
The sentence doesn't make much sense. 'A new API', 'a better
abstraction' maybe?
>
> The problem is this:
> * on X86-64, non-temporal stores have the best performance
> * ARM64 doesn't have non-temporal stores, so we must flush cache. We
> should flush cache as late as possible, because it performs better this
> way.
>
> We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
> data persistently, all three functions must be called.
>
> The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
> (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
>
> On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> pmem_commit is wmb.
>
> On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> pmem_commit is empty.
All these should be provided by the pmem layer, and be properly
documented. And be sorted before adding your new target that uses
them.
[ndctl PATCH] ndctl, list: display the 'map' location in listings
by Vishal Verma
For 'fsdax' and 'devdax' namespaces, a 'map' location may be specified
for page structure storage. This can be 'mem', for system RAM, or 'dev',
for using pmem as the backing storage. Once set, there was no way to
tell, using ndctl, which of the two locations a namespace was
configured for. Add this in util_namespace_to_json so that all
namespace listings contain the map location.
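For illustration, a namespace listing with the new field might look
like this (field values hypothetical):
$ ndctl list
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":4292870144,
  "blockdev":"pmem0"
}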
Reported-by: "Yigal Korman" <yigal.korman(a)netapp.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma(a)intel.com>
---
util/json.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/util/json.c b/util/json.c
index c606e1c..17dd90c 100644
--- a/util/json.c
+++ b/util/json.c
@@ -667,11 +667,17 @@ struct json_object *util_namespace_to_json(struct ndctl_namespace *ndns,
{
struct json_object *jndns = json_object_new_object();
struct json_object *jobj, *jbbs = NULL;
+ const char *locations[] = {
+ [NDCTL_PFN_LOC_NONE] = "none",
+ [NDCTL_PFN_LOC_RAM] = "mem",
+ [NDCTL_PFN_LOC_PMEM] = "dev",
+ };
unsigned long long size = ULLONG_MAX;
unsigned int sector_size = UINT_MAX;
enum ndctl_namespace_mode mode;
const char *bdev = NULL, *name;
unsigned int bb_count = 0;
+ enum ndctl_pfn_loc loc;
struct ndctl_btt *btt;
struct ndctl_pfn *pfn;
struct ndctl_dax *dax;
@@ -749,6 +755,12 @@ struct json_object *util_namespace_to_json(struct ndctl_namespace *ndns,
jobj = util_raw_uuid(ndns);
if (jobj)
json_object_object_add(jndns, "raw_uuid", jobj);
+ loc = ndctl_pfn_get_location(pfn);
+ jobj = json_object_new_string(locations[loc]);
+ if (!jobj)
+ goto err;
+ if (jobj)
+ json_object_object_add(jndns, "map", jobj);
bdev = ndctl_pfn_get_block_device(pfn);
} else if (dax) {
struct daxctl_region *dax_region;
@@ -763,6 +775,12 @@ struct json_object *util_namespace_to_json(struct ndctl_namespace *ndns,
jobj = util_raw_uuid(ndns);
if (jobj)
json_object_object_add(jndns, "raw_uuid", jobj);
+ loc = ndctl_dax_get_location(dax);
+ jobj = json_object_new_string(locations[loc]);
+ if (!jobj)
+ goto err;
+ if (jobj)
+ json_object_object_add(jndns, "map", jobj);
if ((flags & UTIL_JSON_DAX) && dax_region) {
jobj = util_daxctl_region_to_json(dax_region, NULL,
flags);
--
2.17.0