[PATCH v3 0/6] BTT error clearing rework
by Vishal Verma
Changes in v3:
- Change the dynamically allocated (during IO) zerobuf to the kernel's
ZERO_PAGE for error clearing (patch 5) (Dan).
- Move the NOIO fixes a level down into nvdimm_clear_poison since both
btt and pmem poison clearing go through that (Dan).
Changes in v2:
- Drop the ACPI allocation change patch. Instead use
memalloc_noio_{save,restore} to set the GFP_NOIO flag around anything
that can be expected to call into ACPI for clearing errors. (Rafael, Dan).
Clearing errors or badblocks during a BTT write requires sending an ACPI
DSM, which means potentially sleeping. Since a BTT IO happens in atomic
context (preemption disabled, spinlocks may be held), we cannot perform
error clearing in the course of an IO. Because of this, error clearing for
BTT IOs has hitherto been disabled.
This series fixes these problems by moving the error clearing out of
the atomic sections in the BTT.
Also fix a potential deadlock that can occur while clearing errors
from either BTT or pmem due to memory allocations in the IO path.
Vishal Verma (6):
btt: fix a missed NVDIMM_IO_ATOMIC case in the write path
btt: refactor map entry operations with macros
btt: ensure that flags were also unchanged during a map_read
btt: cache sector_size in arena_info
libnvdimm, btt: rework error clearing
libnvdimm: fix potential deadlock while clearing errors
drivers/nvdimm/btt.c | 117 ++++++++++++++++++++++++++++++++++++++++++-------
drivers/nvdimm/btt.h | 11 +++++
drivers/nvdimm/bus.c | 6 +++
drivers/nvdimm/claim.c | 9 +---
4 files changed, 118 insertions(+), 25 deletions(-)
--
2.9.3
3 years, 5 months
Re: [PATCH v8 1/1] f2fs: dax: implement direct access
by Dan Williams
[ adding linux-nvdimm ]
On Thu, Jul 20, 2017 at 5:10 AM, sunqiuyang <sunqiuyang(a)huawei.com> wrote:
> From: Qiuyang Sun <sunqiuyang(a)huawei.com>
>
> This patch implements Direct Access (DAX) in F2FS, including:
> - a mount option to choose whether to enable DAX or not
We're in the process of walking back and potentially deprecating the
use of the dax mount option for xfs and ext4 since dax can have
negative performance implications if page cache memory happens to be
faster than pmem. It should be limited to applications that
specifically want the semantic, not globally enabled for the entire
mount. xfs has gone ahead and added the XFS_DIFLAG2_DAX inode flag for
per-inode enabling of dax.
I'm wondering if any new filesystem that adds dax support at this
point should do so with inode flags and not a mount option?
3 years, 5 months
[ndctl PATCH] ndctl, create-namespace: fix size alignment validation
by Dan Williams
Linda reports:
I was creating a namespace on a 4-way interleave set and got an error
I didn't expect:
$ sudo ndctl create-namespace -m sector -s 10G -n number2
Error: '--size=' must align to interleave-width: 4 and alignment: 2097152
did you intend --size=12G?
What's happening is that since I specified my size in units of G,
the function wants the namespace to be 1G * 4 aligned rather than
2M * 4 aligned.
Fix this by using the base "align * interleave_ways" for the alignment
check, and later calculate a 'friendly' recommended size value that
takes into account the units specified for --size.
Reported-by: Linda Knippers <linda.knippers(a)hpe.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
ndctl/namespace.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/ndctl/namespace.c b/ndctl/namespace.c
index 73a422633a0f..c4d70c39c6c4 100644
--- a/ndctl/namespace.c
+++ b/ndctl/namespace.c
@@ -557,8 +557,7 @@ static int validate_namespace_options(struct ndctl_region *region,
/* (re-)validate that the size satisfies the alignment */
ways = ndctl_region_get_interleave_ways(region);
- size_align = max(units, size_align) * ways;
- if (p->size % size_align) {
+ if (p->size % (size_align * ways)) {
char *suffix = "";
if (units == SZ_1K)
@@ -570,6 +569,12 @@ static int validate_namespace_options(struct ndctl_region *region,
else if (units == SZ_1T)
suffix = "T";
+ /*
+ * Make the recommendation in the units of the '--size'
+ * option
+ */
+ size_align = max(units, size_align) * ways;
+
p->size /= size_align;
p->size++;
p->size *= size_align;
3 years, 5 months
ndctl: 10G not a multiple of 2M?
by Linda Knippers
Hi Dan,
I was creating a namespace on a 4-way interleave set and got an error
I didn't expect:
$ sudo ndctl create-namespace -m sector -s 10G -n number2
Error: '--size=' must align to interleave-width: 4 and alignment: 2097152
did you intend --size=12G?
I think there's a bug in validate_namespace_options().
What's happening is that since I specified my size in units of G,
the function wants the namespace to be 1G * 4 aligned rather than
2M * 4 aligned. I suspect if I specified my size in M, it would
have worked but I can't test that at the moment.
> size_align = max(units, size_align) * ways;
Why is units part of the equation?
-- ljk
3 years, 5 months
[PATCH][V2] dm,dax: Make sure dm_dax_flush() is called if device supports it
by Vivek Goyal
Right now, dm_dax_flush() is not being called, and I think that is
because DAXDEV_WRITE_CACHE is not set on the dm dax device.
If the underlying dax device supports a write cache, set DAXDEV_WRITE_CACHE
on the dm dax device. This causes dm_dax_flush() to be called.
Acked-by: Dan Williams <dan.j.williams(a)intel.com>
Signed-off-by: Vivek Goyal <vgoyal(a)redhat.com>
---
drivers/dax/super.c | 6 ++++++
drivers/md/dm-table.c | 35 +++++++++++++++++++++++++++++++++++
include/linux/dax.h | 1 +
3 files changed, 42 insertions(+)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index ce9e563..938eb48 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -278,6 +278,12 @@ void dax_write_cache(struct dax_device *dax_dev, bool wc)
}
EXPORT_SYMBOL_GPL(dax_write_cache);
+bool dax_write_cache_enabled(struct dax_device *dax_dev)
+{
+ return test_bit(DAXDEV_WRITE_CACHE, &dax_dev->flags);
+}
+EXPORT_SYMBOL_GPL(dax_write_cache_enabled);
+
bool dax_alive(struct dax_device *dax_dev)
{
lockdep_assert_held(&dax_srcu);
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index a39bcd9..28a4071 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -20,6 +20,7 @@
#include <linux/atomic.h>
#include <linux/blk-mq.h>
#include <linux/mount.h>
+#include <linux/dax.h>
#define DM_MSG_PREFIX "table"
@@ -1630,6 +1631,37 @@ static bool dm_table_supports_flush(struct dm_table *t, unsigned long flush)
return false;
}
+static int device_dax_write_cache_enabled(struct dm_target *ti,
+ struct dm_dev *dev, sector_t start,
+ sector_t len, void *data)
+{
+ struct dax_device *dax_dev = dev->dax_dev;
+
+ if (!dax_dev)
+ return false;
+
+ if (dax_write_cache_enabled(dax_dev))
+ return true;
+ return false;
+}
+
+static int dm_table_supports_dax_write_cache(struct dm_table *t)
+{
+ struct dm_target *ti;
+ unsigned i;
+
+ for (i = 0; i < dm_table_get_num_targets(t); i++) {
+ ti = dm_table_get_target(t, i);
+
+ if (ti->type->iterate_devices &&
+ ti->type->iterate_devices(ti,
+ device_dax_write_cache_enabled, NULL))
+ return true;
+ }
+
+ return false;
+}
+
static int device_is_nonrot(struct dm_target *ti, struct dm_dev *dev,
sector_t start, sector_t len, void *data)
{
@@ -1785,6 +1817,9 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
}
blk_queue_write_cache(q, wc, fua);
+ if (dm_table_supports_dax_write_cache(t))
+ dax_write_cache(t->md->dax_dev, true);
+
/* Ensure that all underlying devices are non-rotational. */
if (dm_table_all_devices_attribute(t, device_is_nonrot))
queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 7948118..df97b7a 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -87,6 +87,7 @@ size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
void dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
size_t size);
void dax_write_cache(struct dax_device *dax_dev, bool wc);
+bool dax_write_cache_enabled(struct dax_device *dax_dev);
/*
* We use lowest available bit in exceptional entry for locking, one bit for
--
2.5.5
3 years, 5 months
Re: KVM "fake DAX" flushing interface - discussion
by Pankaj Gupta
>
> On Tue, 2017-07-25 at 07:46 -0700, Dan Williams wrote:
> > On Tue, Jul 25, 2017 at 7:27 AM, Pankaj Gupta <pagupta(a)redhat.com>
> > wrote:
> > >
> > > Looks like the only way to send a flush (blk dev) from guest to host
> > > with nvdimm is using flush hint addresses. Is this the correct
> > > interface I am looking for?
> > >
> > > blkdev_issue_flush
> > > submit_bio_wait
> > > submit_bio
> > > generic_make_request
> > > pmem_make_request
> > > ...
> > > if (bio->bi_opf & REQ_FLUSH)
> > > nvdimm_flush(nd_region);
> >
> > I would inject a paravirtualized version of pmem_make_request() that
> > sends an async flush operation over virtio to the host. Don't try to
> > use flush hint addresses for this, they don't have the proper
> > semantics. The guest should be allowed to issue the flush and receive
> > the completion asynchronously rather than taking a vm exit and
> > blocking on that request.
>
> That is my feeling, too. A slower IO device benefits
> greatly from an asynchronous flush mechanism.
Thanks for all the suggestions!
Just want to summarize here (high level):
This will require implementing a new 'virtio-pmem' device which presents
a DAX address range (like pmem) to the guest with read/write (direct access)
& device flush functionality. Also, qemu should implement corresponding
support for flush using virtio.
Thanks,
Pankaj
>
> --
> All rights reversed
3 years, 5 months