[PATCH v3 0/2] Support ACPI 6.1 update in NFIT Control Region Structure
by Toshi Kani
ACPI 6.1, Table 5-133, updates NVDIMM Control Region Structure as
follows.
- Valid Fields, Manufacturing Location, and Manufacturing Date
are added, carved out of a previously reserved range. There is no
change in the structure size.
- IDs (SPD values) are stored as arrays of bytes, i.e. in big-endian
byte order. The spec clarifies that they need to be represented
as arrays of bytes as well.
Patch 1 changes the NFIT driver to comply with ACPI 6.1.
Patch 2 adds a new sysfs file "id" to show NVDIMM ID defined in ACPI 6.1.
The patch set applies on top of the linux-pm.git 'acpica' branch.
link: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
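For illustration, a hedged sketch of what the new "id" attribute could look
like, assuming the ACPICA field and flag names from commit 138a95547ab0
(e.g. ACPI_NFIT_CONTROL_MFG_INFO_VALID) and a to_nfit_dcr() accessor like
the driver's other DIMM attributes use; the merged patch is authoritative:

static ssize_t id_show(struct device *dev,
		struct device_attribute *attr, char *buf)
{
	struct acpi_nfit_control_region *dcr = to_nfit_dcr(dev);

	/* SPD fields are byte arrays, so decode them as big-endian */
	if (dcr->valid_fields & ACPI_NFIT_CONTROL_MFG_INFO_VALID)
		return sprintf(buf, "%04x-%02x-%04x-%08x\n",
				be16_to_cpu(dcr->vendor_id),
				dcr->manufacturing_location,
				be16_to_cpu(dcr->manufacturing_date),
				be32_to_cpu(dcr->serial_number));
	return sprintf(buf, "%04x-%08x\n",
			be16_to_cpu(dcr->vendor_id),
			be32_to_cpu(dcr->serial_number));
}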
---
v3:
- Need to coordinate with ACPICA update (Bob Moore, Dan Williams)
- Integrate with ACPICA changes in struct acpi_nfit_control_region.
(commit 138a95547ab0)
v2:
- Remove 'mfg_location' and 'mfg_date'. (Dan Williams)
- Rename 'unique_id' to 'id' and make this change as a separate patch.
(Dan Williams)
---
Toshi Kani (2):
1/2 acpi/nfit: Update nfit driver to comply with ACPI 6.1
2/2 acpi/nfit: Add sysfs "id" for NVDIMM ID
---
drivers/acpi/nfit.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
2 years, 8 months
DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
by Dan Williams
[ adding linux-fsdevel and linux-nvdimm ]
On Wed, Sep 7, 2016 at 8:36 PM, Xiao Guangrong
<guangrong.xiao(a)linux.intel.com> wrote:
[..]
> However, it is not easy to handle the case where the new VMA overlaps
> with the old VMA already obtained by userspace. I think we have some
> choices:
>
> 1: One way is to completely skip the new VMA region, as the current
>    kernel code does, but I do not think this is good as the later
>    VMAs will be dropped.
>
> 2: Show the un-overlapped portion of the new VMA. In your case, we
>    would just show the region (0x2000 -> 0x3000); however, this cannot
>    work well if the VMA is a newly created region with different
>    attributes.
>
> 3: Completely show the new VMA, as this patch does.
>
> Which one do you prefer?
>
I don't have a preference, but perhaps this breakage and uncertainty
is a good opportunity to propose a more reliable interface for NVML to
get the information it needs?
My understanding is that it is looking for the VM_MIXEDMAP flag, which
is already ambiguous for determining whether DAX is enabled, even if this
dynamic listing issue is fixed. XFS has arranged for DAX to be a
per-inode capability and has an XFS-specific inode flag. We can make
that a common inode flag, but it seems we should have a way to
interrogate the mapping itself in the case where the inode is unknown
or unavailable. I'm thinking of extending mincore(2) with flags for
DAX, and possibly for whether the page is part of a PTE, PMD, or PUD
mapping. Just floating that idea before starting to look into the
implementation; comments or other ideas welcome...
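To make that concrete, a purely hypothetical userspace sketch of such an
extension; the MINCORE_DAX/PMD/PUD names and bit values are invented for
illustration and are not a real ABI:

#include <stdio.h>
#include <sys/mman.h>

#define MINCORE_INCORE	0x01	/* existing semantics: page resident */
#define MINCORE_DAX	0x02	/* hypothetical: mapping is DAX */
#define MINCORE_PMD	0x04	/* hypothetical: pmd-sized mapping */
#define MINCORE_PUD	0x08	/* hypothetical: pud-sized mapping */

/* query one page-aligned page; today mincore() only defines bit 0 */
static void report(void *addr)
{
	unsigned char vec[1];

	if (mincore(addr, 4096, vec) == 0)
		printf("%p:%s%s%s\n", addr,
			(vec[0] & MINCORE_INCORE) ? " resident" : " absent",
			(vec[0] & MINCORE_DAX) ? " dax" : "",
			(vec[0] & MINCORE_PMD) ? " pmd" : "");
}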
4 years, 2 months
[PATCH 0/20 v3] dax: Clear dirty bits after flushing caches
by Jan Kara
Hello,
this is the third revision of my patches to clear dirty bits from the radix
tree of DAX inodes when the caches for the corresponding pfns have been
flushed. This patch set is significantly larger than the previous version
because I'm changing how the ->fault, ->page_mkwrite, and ->pfn_mkwrite
handlers may choose to handle the fault, so that we don't have to leak
details about DAX locking into the generic code. In principle, these patches
enable handlers to easily update PTEs and do other work necessary to finish
the fault without duplicating the functionality present in the generic code.
I'd really like feedback from mm folks on whether such changes to the fault
handling code are fine, or what they'd do differently.
The patches pass testing with xfstests on ext4 and xfs on my end
- just be aware they are the basis for further DAX fixes without which some
stress tests can still trigger failures. I'll be sending those fixes
separately in order to keep this series a reasonable size. For full testing,
you can pull all the patches from
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git dax
but be aware I will likely rebase that branch and do other nasty stuff with
it, so don't use it as a basis for your git trees.
Changes since v2:
* rebased on top of 4.8-rc8 - this involved dealing with new fault_env
structure
* changed calling convention for fault helpers
Changes since v1:
* make sure all PTE updates happen under radix tree entry lock to protect
against races between faults & write-protecting code
* remove information about DAX locking from mm/memory.c
* smaller updates based on Ross' feedback
----
Background information regarding the motivation:
Currently we never clear dirty bits in the radix tree of a DAX inode. Thus
fsync(2) flushes all the dirty pfns again and again. These patches implement
clearing of the dirty tag in the radix tree so that we issue a flush only
when needed.
The difficulty with clearing the dirty tag is that we have to protect against
a concurrent page fault setting the dirty tag and writing new data into the
page. So we need a lock serializing page faults against clearing of the dirty
tag and write-protecting of PTEs (so that we get another page fault when the
pfn is written to again and we have to set the dirty tag again).
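In pseudo-C, the per-entry flush pass being described looks roughly like
this. The radix tree tag calls are the real kernel API of that era; the
lock/unlock and flush/write-protect helpers are illustrative names only:

/* sketch of the ordering, not the actual patch */
entry = lock_dax_entry(mapping, index);		/* serializes new faults */
if (radix_tree_tag_get(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY)) {
	flush_cache_for_pfn(pfn, size);		/* write back CPU caches */
	write_protect_ptes(mapping, index);	/* next write faults again */
	radix_tree_tag_clear(&mapping->page_tree, index,
			PAGECACHE_TAG_DIRTY);
}
unlock_dax_entry(mapping, index, entry);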
The effect of the patch set is easily visible:
Writing 1 GB of data via mmap, then fsync twice.
Before this patch set both fsyncs take ~205 ms on my test machine, after the
patch set the first fsync takes ~283 ms (the additional cost of walking PTEs,
clearing dirty bits etc. is very noticeable), the second fsync takes below
1 us.
As a bonus, these patches make filesystem freezing for DAX filesystems
reliable, because mappings are now properly write-protected while freezing
the fs.
Patches have passed xfstests for both xfs and ext4.
Honza
4 years, 4 months
[PATCH v4 00/12] re-enable DAX PMD support
by Ross Zwisler
DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking. This series allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled.
Ted, can you please take the ext2 + ext4 patches through your tree? Dave,
can you please take the rest through the XFS tree?
Changes since v3:
- Corrected dax iomap code namespace for functions defined in fs/dax.c.
(Dave Chinner)
- Added leading "dax" namespace to new static functions in fs/dax.c.
(Dave Chinner)
- Made all DAX PMD code in fs/dax.c conditionally compiled based on
CONFIG_FS_DAX_PMD. Otherwise a stub in include/linux/dax.h that just
returns VM_FAULT_FALLBACK will be used; a sketch of such a stub follows
this list. (Dave Chinner)
- Removed dynamic debugging messages from DAX PMD fault path. Debugging
tracepoints will be added later to both the PTE and PMD paths via a
later patch set. (Dave Chinner)
- Added a comment to ext2_dax_vm_ops explaining why we don't support DAX
PMD faults in ext2. (Dave Chinner)
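For reference, a stub of the kind described in the CONFIG_FS_DAX_PMD bullet
above might look like the following. This is a sketch only; the exact
fault-handler signature should be taken from the series itself:

/* include/linux/dax.h -- sketch, signature may differ from the series */
#ifdef CONFIG_FS_DAX_PMD
int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops);
#else
static inline int dax_iomap_pmd_fault(struct vm_area_struct *vma,
		unsigned long address, pmd_t *pmd, unsigned int flags,
		struct iomap_ops *ops)
{
	/* without compiled-in PMD support, fall back to PTE-sized faults */
	return VM_FAULT_FALLBACK;
}
#endif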
This was built upon xfs/for-next with PMD performance fixes from Toshi Kani
and Dan Williams. Dan's patch has already been merged for v4.8, and
Toshi's patches are currently queued in Andrew Morton's mm tree for v4.9
inclusion.
Here is a tree containing my changes and all the fixes that I've been testing:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=dax...
This tree has passed xfstests for ext2, ext4 and XFS both with and without DAX,
and has passed targeted testing where I inserted, removed and flushed DAX PTEs
and PMDs in every combination I could think of.
Previously reported performance numbers:
In some simple mmap I/O testing with FIO the use of PMD faults more than
doubles I/O performance as compared with PTE faults. Here is the FIO script I
used for my testing:
[global]
bs=4k
size=2G
directory=/mnt/pmem0
ioengine=mmap
[randrw]
rw=randrw
Here are the performance results with XFS using only pte faults:
READ: io=1022.7MB, aggrb=557610KB/s, minb=557610KB/s, maxb=557610KB/s, mint=1878msec, maxt=1878msec
WRITE: io=1025.4MB, aggrb=559084KB/s, minb=559084KB/s, maxb=559084KB/s, mint=1878msec, maxt=1878msec
Here are performance numbers for that same test using PMD faults:
READ: io=1022.7MB, aggrb=1406.7MB/s, minb=1406.7MB/s, maxb=1406.7MB/s, mint=727msec, maxt=727msec
WRITE: io=1025.4MB, aggrb=1410.4MB/s, minb=1410.4MB/s, maxb=1410.4MB/s, mint=727msec, maxt=727msec
This was done on a random lab machine with a PMEM device made from memmap'd
RAM. To get XFS to use PMD faults, I did the following (the 2 MiB stripe
unit and extent size hint keep file allocations 2 MiB aligned, which is what
makes PMD-sized mappings possible):
mkfs.xfs -f -d su=2m,sw=1 /dev/pmem0
mount -o dax /dev/pmem0 /mnt/pmem0
xfs_io -c "extsize 2m" /mnt/pmem0
Ross Zwisler (12):
ext4: allow DAX writeback for hole punch
ext4: tell DAX the size of allocation holes
dax: remove buffer_size_valid()
ext2: remove support for DAX PMD faults
dax: make 'wait_table' global variable static
dax: consistent variable naming for DAX entries
dax: coordinate locking for offsets in PMD range
dax: remove dax_pmd_fault()
dax: correct dax iomap code namespace
dax: add struct iomap based DAX PMD support
xfs: use struct iomap based DAX PMD fault path
dax: remove "depends on BROKEN" from FS_DAX_PMD
fs/Kconfig | 1 -
fs/dax.c | 650 +++++++++++++++++++++++++++-------------------------
fs/ext2/file.c | 35 +--
fs/ext4/inode.c | 7 +-
fs/xfs/xfs_aops.c | 25 +-
fs/xfs/xfs_aops.h | 3 -
fs/xfs/xfs_file.c | 10 +-
include/linux/dax.h | 48 +++-
mm/filemap.c | 6 +-
9 files changed, 402 insertions(+), 383 deletions(-)
--
2.7.4
4 years, 5 months
[PATCH 0/6] dax: Page invalidation fixes
by Jan Kara
Hello,
these patches fix races when invalidating hole pages in DAX mappings. See
changelogs for details. The series is based on my patches to write-protect
DAX PTEs because we really need to closely track dirtiness (and cleanness!)
of radix tree entries in DAX mappings in order to avoid discarding valid
dirty bits leading to missed cache flushes on fsync(2).
Honza
4 years, 5 months
[ndctl PATCH v2] test/clear.sh: test to make sure cleared badblocks don't reappear
by Vishal Verma
From v4.9 onwards, cleared badblocks won't reappear on an ARS or simply
after disabling/re-enabling a namespace. Add a test to make sure this
doesn't regress.
Signed-off-by: Vishal Verma <vishal.l.verma(a)intel.com>
---
v2: Add a kernel-version check so the stale entries test only runs on kernels >= 4.9.0
test/clear.sh | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/test/clear.sh b/test/clear.sh
index 7765c10..ddae0e6 100755
--- a/test/clear.sh
+++ b/test/clear.sh
@@ -13,6 +13,15 @@ err() {
exit $rc
}
+check_min_kver()
+{
+ local ver="$1"
+ : "${KVER:=$(uname -r)}"
+
+ [ -n "$ver" ] || return 1
+ [[ "$ver" == "$(echo -e "$ver\n$KVER" | sort -V | head -1)" ]]
+}
+
eval $(uname -r | awk -F. '{print "maj="$1 ";" "min="$2}')
if [ $maj -lt 4 ]; then
echo "kernel $maj.$min lacks clear poison support"
@@ -69,6 +78,18 @@ if read sector len < /sys/block/$blockdev/badblocks; then
echo "fail: $LINENO" && exit 1
fi
+if check_min_kver "4.9.0"; then
+ # check for re-appearance of stale badblocks from poison_list
+ $NDCTL disable-region $BUS all
+ $NDCTL enable-region $BUS all
+
+ # since we have cleared the errors, a disable/reenable shouldn't bring them back
+ if read sector len < /sys/block/$blockdev/badblocks; then
+ # fail if reading badblocks returns data
+ echo "fail: $LINENO" && exit 1
+ fi
+fi
+
$NDCTL disable-region $BUS all
$NDCTL disable-region $BUS1 all
modprobe -r nfit_test
--
2.7.4
4 years, 5 months
[PATCH 0/3] misc updates for Address Range Scrub
by Vishal Verma
Changes in v3:
- Rename MCE_SCRUB* to HW_ERROR_SCRUB* (Dan)
- Make the default scrub_mode '0' so it doesn't have to be set
explicitly (Dan)
Changes in v2:
- Change the 'scrub' attribute to only show the number of completed scrubs,
and to start a new one on demand. (Dan)
- Add a new attribute 'hw_error_scrub' which controls whether or not a full
scrub will run on hardware memory errors. (Dan)
Patch 1 changes the default behaviour on machine check exceptions to
just adding the error address to badblocks accounting instead of starting
a full ARS. The old behaviour can be enabled via sysfs.
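The decision point can be pictured as follows; this is a sketch with
illustrative identifiers (the enum is the HW_ERROR_SCRUB* rename noted in
the v3 changelog, and the rescan/poison helpers are stand-ins):

/* sketch of the hardware-error notifier policy */
if (acpi_desc->scrub_mode == HW_ERROR_SCRUB_ON) {
	/* opted in via sysfs: kick off a full address range scrub */
	acpi_nfit_ars_rescan(acpi_desc);
} else {
	/* default: only record the failed address in badblocks */
	nvdimm_bus_add_poison(acpi_desc->nvdimm_bus, addr, L1_CACHE_BYTES);
}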
Patches 2 and 3 fix a problem where stale badblocks could show up after an
on-demand ARS, an MCE-triggered scrub, or even a namespace disable/enable
cycle, because when clearing poison we didn't clear the internal
nvdimm_bus->poison_list.
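The poison_list fix amounts to a helper along these lines; a sketch with
assumed names (struct nd_poison entries carrying start/length), and the
real patch must also trim entries that only partially overlap the cleared
range:

/* sketch: drop poison_list entries fully covered by a cleared range */
static void forget_poison(struct nvdimm_bus *nvdimm_bus,
		phys_addr_t start, unsigned int len)
{
	struct nd_poison *pl, *next;
	u64 clr_end = start + len - 1;

	list_for_each_entry_safe(pl, next, &nvdimm_bus->poison_list, list) {
		u64 pl_end = pl->start + pl->length - 1;

		if (pl->start >= start && pl_end <= clr_end) {
			list_del(&pl->list);
			kfree(pl);
		}
	}
}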
Vishal Verma (3):
nfit: don't start a full scrub by default for an MCE
pmem: reduce kmap_atomic sections to the memcpys only
libnvdimm: clear the internal poison_list when clearing badblocks
drivers/acpi/nfit/core.c | 53 ++++++++++++++++++++++++++++++++++
drivers/acpi/nfit/mce.c | 24 ++++++++++++----
drivers/acpi/nfit/nfit.h | 6 ++++
drivers/nvdimm/bus.c | 2 ++
drivers/nvdimm/core.c | 73 ++++++++++++++++++++++++++++++++++++++++++++---
drivers/nvdimm/pmem.c | 28 ++++++++++++++----
include/linux/libnvdimm.h | 2 ++
7 files changed, 174 insertions(+), 14 deletions(-)
--
2.7.4
4 years, 5 months
[ndctl PATCH] test/clear.sh: test to make sure cleared badblocks don't reappear
by Vishal Verma
From v4.9 onwards, cleared badblocks won't reappear on an ARS or simply
after disabling/re-enabling a namespace. Add a test to make sure this
doesn't regress.
Signed-off-by: Vishal Verma <vishal.l.verma(a)intel.com>
---
test/clear.sh | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/test/clear.sh b/test/clear.sh
index 7765c10..336ee44 100755
--- a/test/clear.sh
+++ b/test/clear.sh
@@ -69,6 +69,16 @@ if read sector len < /sys/block/$blockdev/badblocks; then
echo "fail: $LINENO" && exit 1
fi
+# check for re-appearance of stale badblocks from poison_list
+$NDCTL disable-region $BUS all
+$NDCTL enable-region $BUS all
+
+# since we have cleared the errors, a disable/reenable shouldn't bring them back
+if read sector len < /sys/block/$blockdev/badblocks; then
+ # fail if reading badblocks returns data
+ echo "fail: $LINENO" && exit 1
+fi
+
$NDCTL disable-region $BUS all
$NDCTL disable-region $BUS1 all
modprobe -r nfit_test
--
2.7.4
4 years, 5 months
[PATCH v2 0/3] misc updates for Address Range Scrub
by Vishal Verma
Changes in v2:
- Change the 'scrub' attribute to only show the number of completed scrubs,
and to start a new one on demand. (Dan)
- Add a new attribute 'hw_error_scrub' which controls whether or not a full
scrub will run on hardware memory errors. (Dan)
Patch 1 changes the default behaviour on machine check exceptions to
just adding the error address to badblocks accounting instead of starting
a full ARS. The old behaviour can be enabled via sysfs.
Patches 2 and 3 fix a problem where stale badblocks could show up after an
on-demand ARS or an MCE-triggered scrub, because when clearing poison we
didn't clear the internal nvdimm_bus->poison_list.
Vishal Verma (3):
nfit: don't start a full scrub by default for an MCE
pmem: reduce kmap_atomic sections to the memcpys only
libnvdimm: clear the internal poison_list when clearing badblocks
drivers/acpi/nfit/core.c | 54 +++++++++++++++++++++++++++++++++++
drivers/acpi/nfit/mce.c | 24 ++++++++++++----
drivers/acpi/nfit/nfit.h | 6 ++++
drivers/nvdimm/bus.c | 2 ++
drivers/nvdimm/core.c | 73 ++++++++++++++++++++++++++++++++++++++++++++---
drivers/nvdimm/pmem.c | 28 ++++++++++++++----
include/linux/libnvdimm.h | 2 ++
7 files changed, 175 insertions(+), 14 deletions(-)
--
2.7.4
4 years, 5 months
[GIT PULL] libnvdimm fixes for 4.8
by Williams, Dan J
Hi Linus, please pull from:
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm libnvdimm-fixes
...to receive the following:
- (4) fixes for "flush hint" support. Flush hints are addresses
advertised by the ACPI 6+ NFIT (NVDIMM Firmware Interface Table) that
when written and fenced guarantee that writes pending in platform write
buffers (outside the cpu) have been flushed to media. They might also
be used by hypervisors as a trigger condition to flush guest-persistent
memory ranges to storage.
Fix a potential data corruption issue, a broken definition of the hint
array, a wrong allocation size for the unit test implementation of the
flush hint table, and a missing NULL check in an error path. The unit
test, while it did not prevent these bugs from being merged, at least
triggered occasional crashes in advance of production usage.
- Fix handling of ACPI DSM error status results. The DSM mechanism
allows communication with platform and memory device firmware. We
correctly parse known errors, but were silently ignoring others. Fix
it to consistently fail any command with a non-zero status return that
we otherwise do not interpret / handle.
These changes have a build success notification from the 0day robot and
have appeared in a -next release over the past week.
The following changes since commit 3be7988674ab33565700a37b210f502563d932e6:
Linux 4.8-rc7 (2016-09-18 17:27:41 -0700)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm libnvdimm-fixes
for you to fetch changes up to 595c73071e6641e59b83911fbb4026e767471000:
libnvdimm, region: fix flush hint table thinko (2016-09-24 11:45:38 -0700)
----------------------------------------------------------------
Dan Williams (4):
tools/testing/nvdimm: fix allocation range for mock flush hint tables
libnvdimm: fix devm_nvdimm_memremap() error path
nfit: fail DSMs that return non-zero status by default
libnvdimm, region: fix flush hint table thinko
Oliver O'Halloran (1):
nvdimm: fix PHYS_PFN/PFN_PHYS mixup
drivers/acpi/nfit/core.c | 48 +++++++++++++++++++++++-----------------
drivers/nvdimm/core.c | 8 ++++++-
drivers/nvdimm/nd.h | 22 ++++++++++++++++--
drivers/nvdimm/region_devs.c | 22 ++++++++++--------
tools/testing/nvdimm/test/nfit.c | 3 ++-
5 files changed, 70 insertions(+), 33 deletions(-)
commit 480b6837aa579991c6acc113bccf838e6a90843c
Author: Oliver O'Halloran <oohall(a)gmail.com>
Date: Mon Sep 19 20:19:00 2016 +1000
nvdimm: fix PHYS_PFN/PFN_PHYS mixup
nd_activate_region() iomaps any hint addresses required when activating
a region. To prevent duplicate mappings it checks the PFN of the hint to
be mapped against the PFNs of the already mapped hints. Unfortunately it
doesn't convert the PFN back into a physical address before passing it
to devm_nvdimm_ioremap(). Instead it applies PHYS_PFN a second time
which ends about as well as you would imagine.
Signed-off-by: Oliver O'Halloran <oohall(a)gmail.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
commit 9d15ce9caaf9ecbec74e3be156a4a57451ed16c2
Author: Dan Williams <dan.j.williams(a)intel.com>
Date: Mon Sep 19 13:49:48 2016 -0700
tools/testing/nvdimm: fix allocation range for mock flush hint tables
Commit 480b6837aa57 "nvdimm: fix PHYS_PFN/PFN_PHYS mixup" identified
that we were passing an invalid address to devm_nvdimm_ioremap(). With
that fixed it exposed a bug in the memory reservation size for flush
hint tables. Since we map a full page we need to mock a full page of
memory to back the flush hint table entries.
Cc: Oliver O'Halloran <oohall(a)gmail.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
commit ecfb6d8a041cc2ca80bc69ffc20c00067d190df5
Author: Dan Williams <dan.j.williams(a)intel.com>
Date: Wed Sep 21 09:22:33 2016 -0700
libnvdimm: fix devm_nvdimm_memremap() error path
The internal alloc_nvdimm_map() helper might fail, particularly if the
memory region is already busy. Report request_mem_region() failures and
check for the failure.
Reported-by: Ryan Chen <ryan.chan105(a)gmail.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
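i.e. something along these lines (a sketch of the hardened path, with the
printk format chosen for illustration):

res = request_mem_region(offset, size, dev_name(&nvdimm_bus->dev));
if (!res) {
	/* the range may already be busy; report it instead of oopsing later */
	dev_err(&nvdimm_bus->dev, "failed to request %pa + %zu\n",
			&offset, size);
	return NULL;
}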
commit 11294d63ac915230a36b0603c62134ef7b173d0a
Author: Dan Williams <dan.j.williams(a)intel.com>
Date: Wed Sep 21 09:21:26 2016 -0700
nfit: fail DSMs that return non-zero status by default
For the DSMs where the kernel knows the format of the output buffer and
originates those DSMs from within the kernel, return -EIO for any
non-zero status. If the BIOS is indicating a status that we do not know
how to handle, fail the DSM.
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
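The policy can be sketched as below; ND_CMD_ARS_START is a real command
number, while ARS_BUSY is an invented stand-in for whatever known-busy
status encoding the translation actually checks:

/* sketch of the default-deny status translation */
static int xlat_status(u32 status, unsigned int cmd)
{
	if (status == 0)
		return 0;		/* success */
	if (cmd == ND_CMD_ARS_START && status == ARS_BUSY)
		return -EBUSY;		/* known, retryable status */
	return -EIO;			/* unknown non-zero status: fail */
}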
commit 595c73071e6641e59b83911fbb4026e767471000
Author: Dan Williams <dan.j.williams(a)intel.com>
Date: Fri Sep 23 17:53:52 2016 -0700
libnvdimm, region: fix flush hint table thinko
The definition of the flush hint table as:
void __iomem *flush_wpq[0][0];
...passed the unit test, but is broken as flush_wpq[0][1] and
flush_wpq[1][0] refer to the same entry. Fix this to use a helper that
calculates a slot in the table based on the geometry of flush hints in
the region. This is important to get right since virtualization
solutions use this mechanism to trigger hypervisor flushes to platform
persistence.
Reported-by: Dave Jiang <dave.jiang(a)intel.com>
Tested-by: Dave Jiang <dave.jiang(a)intel.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
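The fix can be pictured as a flat array plus a helper that linearizes
(dimm, hint) into a single slot, so that entries no longer alias; the field
and helper names below are illustrative:

/* sketch: slot computed from the region's flush-hint geometry */
static void __iomem *get_flush_wpq(struct nd_region_data *ndrd,
		int dimm, int hint)
{
	return ndrd->flush_wpq[dimm * ndrd->hints_per_dimm + hint];
}

static void flush_region(struct nd_region_data *ndrd, int num_dimms, int hint)
{
	int i;

	wmb();		/* fence pending stores before poking the hints */
	for (i = 0; i < num_dimms; i++)
		if (get_flush_wpq(ndrd, i, 0))
			writeq(1, get_flush_wpq(ndrd, i, hint));
	wmb();
}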
4 years, 5 months