[PATCH v7 0/3] Machine check recovery when kernel accesses poison
by Tony Luck
This series is initially targeted at the folks doing filesystems
on top of NVDIMMs. They really want to be able to return -EIO
when there is a h/w error (just like spinning rust and SSDs do).
I plan to use the same infrastructure to write a machine check aware
"copy_from_user()" that will SIGBUS the calling application when a
syscall touches poison in user space (just like we do when the application
touches the poison itself).
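
As a rough illustration of the goal described above, the following user-space sketch shows a copy routine that reports a trap number and a remaining-byte count in separate fields, and a filesystem-style caller that turns a trapped copy into -EIO. This is not the kernel implementation: sim_mcsafe_copy(), sim_read_block(), struct sim_copy_ret and the poison_off parameter are hypothetical names, and the poison is simulated with an offset rather than a real machine check. Only the value 18 for the x86 machine check vector is taken from the architecture.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the x86 machine check vector (X86_TRAP_MC is 18). */
#define SIM_TRAP_MC 18

/* Illustrative result type: the fault indicator and the remaining byte
 * count live in separate fields instead of sharing one word. */
struct sim_copy_ret {
	unsigned int trapnr;	/* 0 on success, else the trap that stopped the copy */
	size_t remain;		/* bytes that were not copied */
};

/*
 * Simulated poison-aware copy. 'poison_off' is the offset of the first
 * "bad" byte in src; pass (size_t)-1 for a clean source. Real hardware
 * would raise a machine check here, this sketch only pretends.
 */
static struct sim_copy_ret sim_mcsafe_copy(void *dst, const void *src,
					   size_t cnt, size_t poison_off)
{
	struct sim_copy_ret ret = { 0, 0 };
	size_t good = cnt;

	assert(dst && src);
	if (poison_off < cnt)
		good = poison_off;	/* copy only up to the poison */
	memcpy(dst, src, good);
	if (good != cnt) {
		ret.trapnr = SIM_TRAP_MC;
		ret.remain = cnt - good;
	}
	return ret;
}

/* A filesystem-style caller: any trapped copy becomes -EIO. */
static int sim_read_block(void *dst, const void *src, size_t cnt,
			  size_t poison_off)
{
	struct sim_copy_ret ret = sim_mcsafe_copy(dst, src, cnt, poison_off);

	return ret.trapnr ? -EIO : 0;
}
```

Keeping the trap number and the byte count apart means a caller never has to decode a flag bit out of the length.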
Changes V6-V7:
Boris: Why add/subtract 0x20000000? Added better comment provided by Andy
Boris: Churn. Part2 changes things only introduced in part1.
Merged parts 1&2 into one patch.
Ingo: Missing my sign off on part1. Added.
Changes V5-V6
Andy: Provoked massive re-write by providing what is now part1 of this
patch series. This frees up two bits in the exception table
fixup field that can be used to tag exception table entries
as different "classes". This means we don't need my separate
exception table for machine checks. Also avoids duplicating
fixup actions for #PF and #MC cases that were in version 5.
Andy: Use C99 array initializers to tie the various class fixup
functions back to the definitions of each class. Also give the
functions meaningful names (not fixup_class0() etc.).
Boris: Cleaned up my lousy assembly code removing many spurious 'l'
modifiers on instructions.
Boris: Provided some helper functions for the machine check severity
calculation that make the code more readable.
Boris: Have __mcsafe_copy() return a structure with the 'remaining bytes'
in a separate field from the fault indicator. Boris had suggested
Linux -EFAULT/-EINVAL ... but I thought it made more sense to return
the exception number (X86_TRAP_MC, etc.) This finally kills off
BIT(63) which has been controversial throughout all the early versions
of this patch series.
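
The class-tagging idea in the V5-V6 notes can be sketched in user space as follows. This is a simplified illustration under stated assumptions, not the encoding used by the patch: the real fixup field holds a relative offset, whereas this sketch just packs a class number into the top two bits of a 32-bit word. All sim_* names, the two example classes, and the policy that only the "fault" class recovers from a machine check are hypothetical. The handler table uses C99 designated initializers so each handler sits next to the class it serves.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_TRAP_PF 14	/* x86 page fault vector */
#define SIM_TRAP_MC 18	/* x86 machine check vector */

/* Hypothetical classes; two bits allow up to four. */
enum sim_class {
	SIM_CLASS_DEFAULT,
	SIM_CLASS_FAULT,
	SIM_CLASS_COUNT = 4
};

#define SIM_CLASS_SHIFT	30
#define SIM_OFFSET_MASK	((UINT32_C(1) << SIM_CLASS_SHIFT) - 1)

/* Pack a class into the two high bits above a 30-bit payload. */
static uint32_t sim_pack(enum sim_class c, uint32_t offset)
{
	assert(offset <= SIM_OFFSET_MASK);
	return ((uint32_t)c << SIM_CLASS_SHIFT) | offset;
}

static enum sim_class sim_class_of(uint32_t fixup)
{
	return (enum sim_class)(fixup >> SIM_CLASS_SHIFT);
}

static uint32_t sim_offset_of(uint32_t fixup)
{
	return fixup & SIM_OFFSET_MASK;
}

/* A handler returns 1 if the trap may be fixed up, 0 otherwise. */
typedef int (*sim_handler_t)(uint32_t offset, int trapnr);

/* Assumed policy: legacy entries do not recover from machine checks. */
static int sim_fixup_default(uint32_t offset, int trapnr)
{
	(void)offset;
	return trapnr != SIM_TRAP_MC;
}

/* Assumed policy: "fault" entries recover from #PF and #MC alike. */
static int sim_fixup_fault(uint32_t offset, int trapnr)
{
	(void)offset;
	return trapnr == SIM_TRAP_PF || trapnr == SIM_TRAP_MC;
}

/* Designated initializers tie each handler to its class. */
static const sim_handler_t sim_handlers[SIM_CLASS_COUNT] = {
	[SIM_CLASS_DEFAULT]	= sim_fixup_default,
	[SIM_CLASS_FAULT]	= sim_fixup_fault,
};

/* Dispatch on the class bits; unassigned classes never fix up. */
static int sim_fixup(uint32_t fixup, int trapnr)
{
	sim_handler_t h = sim_handlers[sim_class_of(fixup)];

	return h ? h(sim_offset_of(fixup), trapnr) : 0;
}
```

With a single table carrying the class, one lookup serves both the page-fault and the machine-check paths.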
Changes V4-V5
Tony: Extended __mcsafe_copy() to have fixup entries for both machine
check and page fault.
Changes V3-V4:
Andy: Simplify fixup_mcexception() by dropping used-once local variable
Andy: "Reviewed-by" tag added to part1
Boris: Moved new functions to memcpy_64.S and declaration to asm/string_64.h
Boris: Changed name s/mcsafe_memcpy/__mcsafe_copy/ to make it clear that this
is an internal function and that return value doesn't follow memcpy() semantics.
Boris: "Reviewed-by" tag added to parts 1&2
Changes V2-V3:
Andy: Don't hack "regs->ax = BIT(63) | addr;" in the machine check
handler. Now have better fixup code that computes the number
of remaining bytes (just like page-fault fixup).
Andy: #define for BIT(63). Done, plus couple of extra macros using it.
Boris: Don't clutter up generic code (like mm/extable.c) with this.
I moved everything under arch/x86 (the asm-generic change is
a more generic #define).
Boris: Dependencies for CONFIG_MCE_KERNEL_RECOVERY are too generic.
I made it a real menu item with default "n". Dan Williams
will use "select MCE_KERNEL_RECOVERY" from his persistent
filesystem code.
Boris: Simplify conditionals in mce.c by moving tolerant/kill_it
checks earlier, with a skip to end if they aren't set.
Boris: Miscellaneous grammar/punctuation. Fixed.
Boris: Don't leak spurious __start_mcextable symbols into kernels
that didn't configure MCE_KERNEL_RECOVERY. Done.
Tony: New code doesn't belong in user_copy_64.S/uaccess*.h. Moved
to new .S/.h files
Elliott: Caching behavior non-optimal. Could use movntdqa or vmovntdqa
on source addresses. I didn't fix this yet. Think
of the current mcsafe_memcpy() as the first of several functions.
This one is useful for small copies (meta-data) where the overhead
of saving SSE/AVX state isn't justified.
Changes V1->V2:
0-day: Reported build errors and warnings on 32-bit systems. Fixed
0-day: Reported bloat to tinyconfig. Fixed
Boris: Suggestions to use extra macros to reduce code duplication in _ASM_*EXTABLE. Done
Boris: Re-write "tolerant==3" check to reduce indentation level. See below.
Andy: Check IP is valid before searching kernel exception tables. Done.
Andy: Explain use of BIT(63) on return value from mcsafe_memcpy(). Done (added decode macros).
Andy: Untangle mess of code in tail of do_machine_check() to make it
clear what is going on (e.g. that we only enter the ist_begin_non_atomic()
if we were called from user code, not from kernel!). Done.
Tony Luck (3):
x86: Add classes to exception tables
x86, mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception
table entries
x86, mce: Add __mcsafe_copy()
arch/x86/Kconfig | 10 +++
arch/x86/include/asm/asm.h | 102 ++++++++++++++++------
arch/x86/include/asm/string_64.h | 10 +++
arch/x86/include/asm/uaccess.h | 17 +++-
arch/x86/kernel/cpu/mcheck/mce-severity.c | 32 ++++++-
arch/x86/kernel/cpu/mcheck/mce.c | 71 ++++++++--------
arch/x86/kernel/kprobes/core.c | 2 +-
arch/x86/kernel/traps.c | 6 +-
arch/x86/kernel/x8664_ksyms_64.c | 4 +
arch/x86/lib/memcpy_64.S | 136 ++++++++++++++++++++++++++++++
arch/x86/mm/extable.c | 66 ++++++++++-----
arch/x86/mm/fault.c | 2 +-
12 files changed, 369 insertions(+), 89 deletions(-)
--
2.1.4
A blocksize problem about dax and ext4
by Cholerae Hu
Hi all,
Recently I have been experimenting with persistent memory, so I followed the
quick setup guide on https://nvdimm.wiki.kernel.org to configure DAX. But when
I mounted /dev/pmem0 I got an error. The output is:
[root@localhost cholerae]# mount -o dax /dev/pmem0 /mnt/mem
mount: wrong fs type, bad option, bad superblock on /dev/pmem0,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so.
[root@localhost cholerae]# dmesg | tail
[ 27.357100] cfg80211: DFS Master region: FCC
[ 27.357101] cfg80211: (start_freq - end_freq @ bandwidth),
(max_antenna_gain, max_eirp), (dfs_cac_time)
[ 27.357104] cfg80211: (2402000 KHz - 2482000 KHz @ 40000 KHz), (N/A,
2000 mBm), (N/A)
[ 27.357107] cfg80211: (5170000 KHz - 5250000 KHz @ 80000 KHz, 160000
KHz AUTO), (N/A, 2300 mBm), (N/A)
[ 27.357110] cfg80211: (5250000 KHz - 5330000 KHz @ 80000 KHz, 160000
KHz AUTO), (N/A, 2300 mBm), (0 s)
[ 27.357112] cfg80211: (5735000 KHz - 5835000 KHz @ 80000 KHz), (N/A,
3000 mBm), (N/A)
[ 27.357114] cfg80211: (57240000 KHz - 59400000 KHz @ 2160000 KHz),
(N/A, 2800 mBm), (N/A)
[ 27.357116] cfg80211: (59400000 KHz - 63720000 KHz @ 2160000 KHz),
(N/A, 4400 mBm), (N/A)
[ 27.357118] cfg80211: (63720000 KHz - 65880000 KHz @ 2160000 KHz),
(N/A, 2800 mBm), (N/A)
[ 81.779582] EXT4-fs (pmem0): error: unsupported blocksize for dax
My kernel version is 4.2.8 and my distribution is Fedora 23. I can't work out
the cause. Can anyone give me a hand? Thanks.
[PATCH v3 0/5] fs, block: handle end of life
by Dan Williams
Changes since v2 [1]:
1/ Split "block: introduce del_gendisk_queue()" into a patch that
introduces del_gendisk_queue() with no functional difference to the
open coded version and "block: introduce force_failure_partition()
and unmap_dax_inodes()" which adds the new behavior.
2/ Collect Dave's ack on "xfs: unmap dax at shutdown (force_failure)"
---
As mentioned in [PATCH v3 3/5] "block: introduce
force_failure_partition() and unmap_dax_inodes()" historically we have
waited for filesystem specific heuristics to attempt to guess when a
block device is gone. Sometimes this works, but in other cases the
system can hang waiting for the fs to trigger its shutdown protocol.
Now with DAX we need new actions, like unmapping all inodes, to be taken
upon a device loss event or fs corruption event.
For now, the approach taken in the following patches only affects xfs
and block drivers that are converted to use del_gendisk_queue(). We can
add more filesystems and driver support over time.
---
Dan Williams (5):
block: prepare for del_gendisk_queue()
block: introduce del_gendisk_queue()
block: introduce force_failure_partition() and unmap_dax_inodes()
xfs: unmap dax at shutdown (force_failure)
block, xfs: implement 'force_failure' notifications
block/genhd.c | 87 +++++++++++++++++++++++++++++++++++-------
drivers/block/brd.c | 9 +---
drivers/nvdimm/pmem.c | 3 -
drivers/s390/block/dcssblk.c | 6 +--
fs/block_dev.c | 22 +++++++++++
fs/inode.c | 28 ++++++++++++++
fs/xfs/xfs_fsops.c | 9 ++++
fs/xfs/xfs_super.c | 8 ++++
include/linux/fs.h | 3 +
include/linux/genhd.h | 1
10 files changed, 150 insertions(+), 26 deletions(-)
[PATCH v2 0/4] fs, block: handle end of life
by Dan Williams
Changes since v1 [1]:
1/ move the del_gendisk() refactoring to its own patch (Dave)
2/ add unmap_dax_inodes to the xfs shutdown path (Dave)
3/ kill the unnecessary ->quiesce super operation and rename ->bdi_gone
to ->force_failure. (Dave)
4/ rework tricky call to get_super() with a NULL bdev parameter. (Dave)
[1]: https://lists.01.org/pipermail/linux-nvdimm/2016-January/003797.html
---
As mentioned in [PATCH v2 2/4] "block: introduce del_gendisk_queue()",
historically we have waited for filesystem specific heuristics to
attempt to guess when a block device is gone. Sometimes this works, but
in other cases the system can hang waiting for the fs to trigger its
shutdown protocol.
Now with DAX we need new actions, like unmapping all inodes, to be taken
upon a device loss event or fs corruption event.
For now, the approach taken in the following patches only affects xfs
and block drivers that are converted to use del_gendisk_queue(). We can
add more filesystems and driver support over time.
---
Dan Williams (4):
block: prepare for del_gendisk_queue()
block: introduce del_gendisk_queue()
xfs: unmap dax at shutdown (force_failure)
block, xfs: implement 'force_failure' notifications
block/genhd.c | 87 +++++++++++++++++++++++++++++++++++-------
drivers/block/brd.c | 9 +---
drivers/nvdimm/pmem.c | 3 -
drivers/s390/block/dcssblk.c | 6 +--
fs/block_dev.c | 22 +++++++++++
fs/inode.c | 28 ++++++++++++++
fs/xfs/xfs_fsops.c | 9 ++++
fs/xfs/xfs_super.c | 8 ++++
include/linux/fs.h | 3 +
include/linux/genhd.h | 1
10 files changed, 150 insertions(+), 26 deletions(-)
[PATCH] nfit: add a module parameter to ignore ars errors
by Vishal Verma
Normally, if a platform does not advertise support for Address Range
Scrub (ARS), we skip it. But if ARS is advertised, it is expected to
always succeed. If it fails, we normally fail initialization at that
point.
Add a module parameter to nfit that lets it ignore ARS failures and
continue with initialization for debugging.
Signed-off-by: Vishal Verma <vishal.l.verma(a)intel.com>
---
This applies on top of both of the previous error handling series
(badblocks and libnvdimm poison list). The tree at:
https://git.kernel.org/cgit/linux/kernel/git/vishal/nvdimm.git/log/?h=err...
has been updated with this patch.
drivers/acpi/nfit.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index ad6d8c6..0a152f1 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -34,6 +34,10 @@ static bool force_enable_dimms;
 module_param(force_enable_dimms, bool, S_IRUGO|S_IWUSR);
 MODULE_PARM_DESC(force_enable_dimms, "Ignore _STA (ACPI DIMM device) status");
 
+static bool ignore_ars;
+module_param(ignore_ars, bool, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(ignore_ars, "Ignore ARS (Address Range Scrub) failures");
+
 struct nfit_table_prev {
 	struct list_head spas;
 	struct list_head memdevs;
@@ -1786,7 +1790,10 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 			dev_err(acpi_desc->dev,
 				"error while performing ARS to find poison: %d\n",
 				rc);
-			return rc;
+			if (ignore_ars)
+				; /* continue initialization */
+			else
+				return rc;
 		}
 		if (!nvdimm_pmem_region_create(nvdimm_bus, ndr_desc))
 			return -ENOMEM;
--
2.5.0
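
The control flow this patch introduces can be modelled in user space as below. This is a sketch under stated assumptions, not the driver code: sim_register_region(), sim_ignore_ars and the create_ok argument are hypothetical stand-ins, and the scrub result is passed in rather than obtained from firmware. The empty-statement if/else from the patch is written here as a single negated test, which behaves the same way.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for the ignore_ars module parameter. */
static bool sim_ignore_ars;

/*
 * Simulated tail of region registration. 'ars_rc' is the result of the
 * scrub step (0 or a negative errno, by kernel convention); 'create_ok'
 * says whether the later region creation succeeds.
 */
static int sim_register_region(int ars_rc, bool create_ok)
{
	assert(ars_rc <= 0);

	/* Default behaviour: a failed scrub aborts initialization. */
	if (ars_rc && !sim_ignore_ars)
		return ars_rc;

	/* With the override set, fall through and keep initializing. */
	if (!create_ok)
		return -ENOMEM;
	return 0;
}
```

Note that the override only suppresses the scrub error; later failures are still reported.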
[PATCH v3 14/17] x86, nvdimm, kexec: Use walk_iomem_res_desc() for iomem search
by Toshi Kani
Change the callers of walk_iomem_res() with the following names
to use walk_iomem_res_desc(), instead.
"ACPI Tables"
"ACPI Non-volatile Storage"
"Persistent Memory (legacy)"
"Crash kernel"
Note, the caller of walk_iomem_res() with "GART" will be removed
in a later patch.
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Dave Young <dyoung(a)redhat.com>
Cc: Minfei Huang <mhuang(a)redhat.com>
Cc: x86(a)kernel.org
Cc: linux-nvdimm(a)lists.01.org
Cc: kexec(a)lists.infradead.org
Signed-off-by: Toshi Kani <toshi.kani(a)hpe.com>
---
arch/x86/kernel/crash.c | 4 ++--
arch/x86/kernel/pmem.c | 4 ++--
drivers/nvdimm/e820.c | 2 +-
kernel/kexec_file.c | 8 ++++----
4 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 2c1910f..082373b 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -588,12 +588,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
 	/* Add ACPI tables */
 	cmd.type = E820_ACPI;
 	flags = IORESOURCE_MEM | IORESOURCE_BUSY;
-	walk_iomem_res("ACPI Tables", flags, 0, -1, &cmd,
+	walk_iomem_res_desc(IORES_DESC_ACPI_TABLES, flags, 0, -1, &cmd,
 			memmap_entry_callback);
 
 	/* Add ACPI Non-volatile Storage */
 	cmd.type = E820_NVS;
-	walk_iomem_res("ACPI Non-volatile Storage", flags, 0, -1, &cmd,
+	walk_iomem_res_desc(IORES_DESC_ACPI_NV_STORAGE, flags, 0, -1, &cmd,
 			memmap_entry_callback);
 
 	/* Add crashk_low_res region */
diff --git a/arch/x86/kernel/pmem.c b/arch/x86/kernel/pmem.c
index 14415af..92f7014 100644
--- a/arch/x86/kernel/pmem.c
+++ b/arch/x86/kernel/pmem.c
@@ -13,11 +13,11 @@ static int found(u64 start, u64 end, void *data)
 
 static __init int register_e820_pmem(void)
 {
-	char *pmem = "Persistent Memory (legacy)";
 	struct platform_device *pdev;
 	int rc;
 
-	rc = walk_iomem_res(pmem, IORESOURCE_MEM, 0, -1, NULL, found);
+	rc = walk_iomem_res_desc(IORES_DESC_PERSISTENT_MEMORY_LEGACY,
+				 IORESOURCE_MEM, 0, -1, NULL, found);
 	if (rc <= 0)
 		return 0;
 
diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c
index b0045a5..95825b3 100644
--- a/drivers/nvdimm/e820.c
+++ b/drivers/nvdimm/e820.c
@@ -55,7 +55,7 @@ static int e820_pmem_probe(struct platform_device *pdev)
 	for (p = iomem_resource.child; p ; p = p->sibling) {
 		struct nd_region_desc ndr_desc;
 
-		if (strncmp(p->name, "Persistent Memory (legacy)", 26) != 0)
+		if (p->desc != IORES_DESC_PERSISTENT_MEMORY_LEGACY)
 			continue;
 
 		memset(&ndr_desc, 0, sizeof(ndr_desc));
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index c245085..6e31cea 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -522,10 +522,10 @@ int kexec_add_buffer(struct kimage *image, char *buffer, unsigned long bufsz,
 
 	/* Walk the RAM ranges and allocate a suitable range for the buffer */
 	if (image->type == KEXEC_TYPE_CRASH)
-		ret = walk_iomem_res("Crash kernel",
-				     IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
-				     crashk_res.start, crashk_res.end, kbuf,
-				     locate_mem_hole_callback);
+		ret = walk_iomem_res_desc(crashk_res.desc,
+				IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
+				crashk_res.start, crashk_res.end, kbuf,
+				locate_mem_hole_callback);
 	else
 		ret = walk_system_ram_res(0, -1, kbuf,
 					  locate_mem_hole_callback);
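
The change from name matching to descriptor matching can be illustrated with a small user-space model. This is only a sketch: struct sim_res, enum sim_desc, sim_walk_res_desc() and sim_count() are hypothetical and far simpler than the kernel's resource tree, and the "-1 when nothing matched" return value is a choice made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor values standing in for the IORES_DESC_* ids. */
enum sim_desc {
	SIM_DESC_NONE,
	SIM_DESC_ACPI_TABLES,
	SIM_DESC_PMEM_LEGACY
};

struct sim_res {
	const char *name;	/* human-readable label, no longer used for matching */
	enum sim_desc desc;	/* stable identifier used for matching */
	uint64_t start, end;
};

typedef int (*sim_walk_fn)(uint64_t start, uint64_t end, void *arg);

/*
 * Call 'func' for every table entry whose descriptor equals 'desc'.
 * Stops early if the callback returns non-zero. Returns the last
 * callback result, or -1 if no entry matched.
 */
static int sim_walk_res_desc(const struct sim_res *tbl, size_t n,
			     enum sim_desc desc, void *arg, sim_walk_fn func)
{
	int ret = -1;
	size_t i;

	assert(tbl && func);
	for (i = 0; i < n; i++) {
		if (tbl[i].desc != desc)
			continue;	/* integer compare instead of strcmp() */
		ret = func(tbl[i].start, tbl[i].end, arg);
		if (ret)
			break;
	}
	return ret;
}

/* Example callback: count the matching ranges. */
static int sim_count(uint64_t start, uint64_t end, void *arg)
{
	(void)start;
	(void)end;
	(*(int *)arg)++;
	return 0;
}
```

Matching on an enum keeps callers working even if the display string of a resource is reworded.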
[PATCH 0/9] libnvdimm, pmem: handle media errors
by Dan Williams
Following both the enabling to retrieve a platform provided list of
nvdimm media errors [1], and the new mcsafe_copy() enabling to recover
from media read errors [2], implement error handling in the pmem i/o
paths. This adds the following capabilities:
1/ Consult badblocks in the pmem_make_request() path to fail block
layer requests to platform advertised bad address ranges.
2/ Use mcsafe_copy(), on capable platforms, to handle dynamically
discovered media errors.
3/ Consult badblocks in the nvdimm_read_bytes() path to communicate
errors for namespace metadata reads and btt data reads.
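
Capability 1/ above, failing requests that touch known-bad ranges, amounts to an interval-overlap test before the I/O is attempted. The sketch below shows that test in user space. It is an assumption-laden illustration, not the block layer's badblocks API: struct sim_bad_range, sim_is_bad() and sim_do_io() are hypothetical, and the bad ranges live in a plain array.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* One known-bad extent, expressed in sectors. */
struct sim_bad_range {
	uint64_t start;
	uint64_t len;
};

/* Return 1 if [sector, sector + nsect) overlaps any bad extent. */
static int sim_is_bad(const struct sim_bad_range *bb, size_t n,
		      uint64_t sector, uint64_t nsect)
{
	size_t i;

	assert(bb || n == 0);
	if (nsect == 0)
		return 0;	/* an empty request touches nothing */
	for (i = 0; i < n; i++) {
		/* Half-open intervals overlap when each starts before the other ends. */
		if (sector < bb[i].start + bb[i].len &&
		    bb[i].start < sector + nsect)
			return 1;
	}
	return 0;
}

/* Fail the request up front instead of touching poisoned media. */
static int sim_do_io(const struct sim_bad_range *bb, size_t n,
		     uint64_t sector, uint64_t nsect)
{
	return sim_is_bad(bb, n, sector, nsect) ? -EIO : 0;
}
```

Checking before the access covers errors the platform has already advertised; errors discovered later still need a recovery path such as the one in 2/.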
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-December/003706.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2016-January/003819.html
---
Dan Williams (9):
libnvdimm, pmem: move definition of nvdimm_namespace_add_poison to nd.h
block: clarify badblocks lifetime
pmem: fail io-requests to known bad blocks
block, dax: disable dax in the presence of bad blocks
x86, pmem: use __mcsafe_copy() for memcpy_from_pmem()
libnvdimm, pmem: prepare for handling badblocks via nvdimm_read_bytes()
block, badblocks, pmem: introduce devm_alloc_badblocks
libnvdimm, pmem: nvdimm_read_bytes() badblocks support
block: kill disk_{check|set|clear|alloc}_badblocks
arch/x86/include/asm/pmem.h | 16 ++++++++
block/badblocks.c | 66 ++++++++++++++++++++++++++++-----
block/genhd.c | 47 ------------------------
block/ioctl.c | 9 +++++
drivers/nvdimm/Kconfig | 1 +
drivers/nvdimm/btt.c | 11 ++++++
drivers/nvdimm/btt_devs.c | 10 +++++
drivers/nvdimm/core.c | 71 ++++++++++++++++++++++--------------
drivers/nvdimm/nd-core.h | 2 -
drivers/nvdimm/nd.h | 2 +
drivers/nvdimm/pfn_devs.c | 10 +++++
drivers/nvdimm/pmem.c | 83 ++++++++++++++++++++++++++++++------------
drivers/nvdimm/region_devs.c | 6 +++
include/linux/badblocks.h | 3 ++
include/linux/genhd.h | 5 ---
include/linux/nd.h | 2 +
include/linux/pmem.h | 17 +++++++--
17 files changed, 241 insertions(+), 120 deletions(-)