[PATCH v2 00/25] replace ioremap_{cache|wt} with memremap
by Dan Williams
Changes since v1 [1]:
1/ Drop the attempt at unifying ioremap() prototypes, just focus on
converting ioremap_cache and ioremap_wt over to memremap (Christoph)
2/ Drop the unrelated cleanups to use %pa in __ioremap_caller (Thomas)
3/ Add support for memremap() attempts on "System RAM" to simply return
the kernel virtual address for that range. ARM depends on this
functionality in ioremap_cache() and ACPI was open coding a similar
solution. (Mark)
4/ Split the conversions of ioremap_{cache|wt} into separate patches per
driver / arch.
5/ Fix bisection breakage and other reports from 0day-kbuild
---
While developing the pmem driver we noticed that the __iomem annotation
on the return value from ioremap_cache() was being mishandled by several
callers. We also observed that all of the call sites expected to be
able to treat the return value from ioremap_cache() as a normal
(non-__iomem) pointer to memory.
This patchset takes the opportunity to clean up the above confusion as
well as a few issues with the ioremap_{cache|wt} interface, including:
1/ Eliminating the possibility of function prototypes differing between
architectures by defining a central memremap() prototype that takes
flags to determine the mapping type (see the usage sketch below).
2/ Returning NULL rather than falling back silently to a different
mapping-type. This allows drivers to be stricter about the
mapping-type fallbacks that are permissible.
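For illustration, a minimal sketch of the intended driver-side usage follows,
assuming flag names along the lines of MEMREMAP_WB / MEMREMAP_WT (treat the
exact spellings as placeholders for this posting):

#include <linux/io.h>
#include <linux/ioport.h>

static void *map_persistent_range(struct resource *res)
{
        void *addr;

        /*
         * Ask for a write-back cacheable mapping.  Unlike the old
         * ioremap_cache() there is no silent fallback: if the requested
         * mapping type cannot be honored we get NULL, and the caller can
         * decide whether e.g. MEMREMAP_WT is an acceptable alternative.
         */
        addr = memremap(res->start, resource_size(res), MEMREMAP_WB);
        if (!addr)
                return NULL;

        return addr;    /* plain pointer, no __iomem annotation */
}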
[1]: http://marc.info/?l=linux-arm-kernel&m=143735199029255&w=2
---
Dan Williams (22):
mm: enhance region_is_ram() to distinguish 'unknown' vs 'mixed'
arch, drivers: don't include <asm/io.h> directly, use <linux/io.h> instead
cleanup IORESOURCE_CACHEABLE vs ioremap()
intel_iommu: fix leaked ioremap mapping
arch: introduce memremap()
arm: switch from ioremap_cache to memremap
x86: switch from ioremap_cache to memremap
gma500: switch from acpi_os_ioremap to ioremap
i915: switch from acpi_os_ioremap to ioremap
acpi: switch from ioremap_cache to memremap
toshiba laptop: replace ioremap_cache with ioremap
memconsole: fix __iomem mishandling, switch to memremap
visorbus: switch from ioremap_cache to memremap
intel-iommu: switch from ioremap_cache to memremap
libnvdimm, pmem: switch from ioremap_cache to memremap
pxa2xx-flash: switch from ioremap_cache to memremap
sfi: switch from ioremap_cache to memremap
fbdev: switch from ioremap_wt to memremap
pmem: switch from ioremap_wt to memremap
arch: remove ioremap_cache, replace with arch_memremap
arch: remove ioremap_wt, replace with arch_memremap
pmem: convert to generic memremap
Toshi Kani (3):
mm, x86: Fix warning in ioremap RAM check
mm, x86: Remove region_is_ram() call from ioremap
mm: Fix bugs in region_is_ram()
arch/arc/include/asm/io.h | 1
arch/arm/Kconfig | 1
arch/arm/include/asm/io.h | 13 +++-
arch/arm/include/asm/xen/page.h | 4 +
arch/arm/mach-clps711x/board-cdb89712.c | 2 -
arch/arm/mach-shmobile/pm-rcar.c | 2 -
arch/arm/mm/ioremap.c | 12 +++-
arch/arm/mm/nommu.c | 11 ++-
arch/arm64/Kconfig | 1
arch/arm64/include/asm/acpi.h | 10 +--
arch/arm64/include/asm/dmi.h | 8 +--
arch/arm64/include/asm/io.h | 8 ++-
arch/arm64/kernel/efi.c | 9 ++-
arch/arm64/kernel/smp_spin_table.c | 19 +++---
arch/arm64/mm/ioremap.c | 20 ++----
arch/avr32/include/asm/io.h | 1
arch/frv/Kconfig | 1
arch/frv/include/asm/io.h | 17 ++---
arch/frv/mm/kmap.c | 6 ++
arch/ia64/Kconfig | 1
arch/ia64/include/asm/io.h | 11 +++
arch/ia64/kernel/cyclone.c | 2 -
arch/m32r/include/asm/io.h | 1
arch/m68k/Kconfig | 1
arch/m68k/include/asm/io_mm.h | 14 +---
arch/m68k/include/asm/io_no.h | 12 ++--
arch/m68k/include/asm/raw_io.h | 4 +
arch/m68k/mm/kmap.c | 17 +++++
arch/m68k/mm/sun3kmap.c | 6 ++
arch/metag/include/asm/io.h | 3 -
arch/microblaze/include/asm/io.h | 1
arch/mn10300/include/asm/io.h | 1
arch/nios2/include/asm/io.h | 1
arch/powerpc/kernel/pci_of_scan.c | 2 -
arch/s390/include/asm/io.h | 1
arch/sh/Kconfig | 1
arch/sh/include/asm/io.h | 20 ++++--
arch/sh/mm/ioremap.c | 10 +++
arch/sparc/include/asm/io_32.h | 1
arch/sparc/include/asm/io_64.h | 1
arch/sparc/kernel/pci.c | 3 -
arch/tile/include/asm/io.h | 1
arch/x86/Kconfig | 1
arch/x86/include/asm/efi.h | 3 +
arch/x86/include/asm/io.h | 17 +++--
arch/x86/kernel/crash_dump_64.c | 6 +-
arch/x86/kernel/kdebugfs.c | 8 +--
arch/x86/kernel/ksysfs.c | 28 ++++-----
arch/x86/mm/ioremap.c | 76 ++++++++++--------------
arch/xtensa/Kconfig | 1
arch/xtensa/include/asm/io.h | 9 ++-
drivers/acpi/apei/einj.c | 9 ++-
drivers/acpi/apei/erst.c | 6 +-
drivers/acpi/nvs.c | 6 +-
drivers/acpi/osl.c | 70 ++++++----------------
drivers/char/toshiba.c | 2 -
drivers/firmware/google/memconsole.c | 7 +-
drivers/gpu/drm/gma500/opregion.c | 2 -
drivers/gpu/drm/i915/intel_opregion.c | 2 -
drivers/iommu/intel-iommu.c | 10 ++-
drivers/iommu/intel_irq_remapping.c | 4 +
drivers/isdn/icn/icn.h | 2 -
drivers/mtd/devices/slram.c | 2 -
drivers/mtd/maps/pxa2xx-flash.c | 4 +
drivers/mtd/nand/diskonchip.c | 2 -
drivers/mtd/onenand/generic.c | 2 -
drivers/nvdimm/Kconfig | 2 -
drivers/pci/probe.c | 3 -
drivers/pnp/manager.c | 2 -
drivers/scsi/aic94xx/aic94xx_init.c | 7 --
drivers/scsi/arcmsr/arcmsr_hba.c | 5 --
drivers/scsi/mvsas/mv_init.c | 15 +----
drivers/scsi/sun3x_esp.c | 2 -
drivers/sfi/sfi_core.c | 4 +
drivers/staging/comedi/drivers/ii_pci20kc.c | 1
drivers/staging/unisys/visorbus/visorchannel.c | 16 +++--
drivers/staging/unisys/visorbus/visorchipset.c | 17 +++--
drivers/tty/serial/8250/8250_core.c | 2 -
drivers/video/fbdev/Kconfig | 2 -
drivers/video/fbdev/amifb.c | 5 +-
drivers/video/fbdev/atafb.c | 5 +-
drivers/video/fbdev/hpfb.c | 6 +-
drivers/video/fbdev/ocfb.c | 1
drivers/video/fbdev/s1d13xxxfb.c | 3 -
drivers/video/fbdev/stifb.c | 1
include/acpi/acpi_io.h | 6 +-
include/asm-generic/io.h | 8 ---
include/asm-generic/iomap.h | 4 -
include/linux/io-mapping.h | 2 -
include/linux/io.h | 9 +++
include/linux/mtd/map.h | 2 -
include/linux/pmem.h | 26 +++++---
include/video/vga.h | 2 -
kernel/Makefile | 2 +
kernel/memremap.c | 74 +++++++++++++++++++++++
kernel/resource.c | 43 +++++++-------
lib/Kconfig | 5 +-
lib/devres.c | 13 +---
lib/pci_iomap.c | 7 +-
tools/testing/nvdimm/Kbuild | 4 +
tools/testing/nvdimm/test/iomap.c | 34 ++++++++---
101 files changed, 482 insertions(+), 398 deletions(-)
create mode 100644 kernel/memremap.c
[PATCH v1 00/10] uuid: convert users to generic UUID API
by Andy Shevchenko
There are a few functions here and there, along with type definitions, that provide
a UUID API. This series consolidates everything under one hood and converts
the current users.
This has been tested internally for a while; however, that doesn't mean we have
covered all possible cases (especially the accuracy of UUID constants after conversion).
So please test this as much as you can and provide your tag. We appreciate the
effort.
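To give a feel for the consolidated API, a rough sketch of a caller is below.
Helper names such as uuid_is_valid() and uuid_le_to_bin() are the ones this
series proposes; treat them as illustrative until the patches are applied:

#include <linux/uuid.h>

static int parse_uuid_string(const char *str, uuid_le *le)
{
        /* validate the textual "xxxxxxxx-xxxx-..." form first */
        if (!uuid_is_valid(str))
                return -EINVAL;

        /* convert the string into the little-endian binary representation */
        return uuid_le_to_bin(str, le);
}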
Andy Shevchenko (10):
lib/vsprintf: simplify UUID printing
lib/uuid: move generate_random_uuid() to uuid.c
lib/uuid: introduce few more generic helpers for UUID
lib/uuid: remove FSF address
ACPI: switch to use generic UUID API
device property: switch to use UUID API
sysctl: drop away useless label
sysctl: use generic UUID library
efi: redefine type, constant, macro from generic code
efivars: use generic UUID library
drivers/acpi/acpi_extlog.c | 8 +-
drivers/acpi/bus.c | 29 +------
drivers/acpi/nfit.c | 34 ++++----
drivers/acpi/nfit.h | 3 +-
drivers/acpi/property.c | 18 ++---
drivers/acpi/utils.c | 4 +-
drivers/char/random.c | 21 +----
drivers/char/tpm/tpm_crb.c | 9 +--
drivers/char/tpm/tpm_ppi.c | 20 ++---
drivers/gpu/drm/i915/intel_acpi.c | 14 ++--
drivers/gpu/drm/nouveau/nouveau_acpi.c | 20 +++--
drivers/gpu/drm/nouveau/nvkm/subdev/mxm/base.c | 9 +--
drivers/hid/i2c-hid/i2c-hid.c | 9 +--
drivers/iommu/dmar.c | 11 ++-
drivers/pci/pci-acpi.c | 11 ++-
drivers/pci/pci-label.c | 4 +-
drivers/thermal/int340x_thermal/int3400_thermal.c | 6 +-
drivers/usb/host/xhci-pci.c | 9 +--
fs/btrfs/volumes.c | 2 +-
fs/efivarfs/inode.c | 40 +---------
fs/ext4/ioctl.c | 1 +
fs/f2fs/file.c | 2 +-
fs/reiserfs/objectid.c | 2 +-
fs/ubifs/sb.c | 2 +-
include/acpi/acpi_bus.h | 10 ++-
include/linux/acpi.h | 2 +-
include/linux/efi.h | 14 +---
include/linux/pci-acpi.h | 2 +-
include/linux/random.h | 1 -
include/linux/uuid.h | 21 +++--
include/uapi/linux/uuid.h | 4 -
kernel/sysctl_binary.c | 30 +++----
lib/uuid.c | 96 +++++++++++++++++++++--
lib/vsprintf.c | 21 ++---
sound/soc/intel/skylake/skl-nhlt.c | 7 +-
35 files changed, 237 insertions(+), 259 deletions(-)
--
2.7.0
[PATCH v4 0/8] Support for transparent PUD pages for DAX files
by Matthew Wilcox
We have customer demand to use 1GB pages to map DAX files. Unlike the 2MB
page support, the Linux MM does not currently support PUD pages, so I have
attempted to add support for the necessary pieces for DAX huge PUD pages.
Filesystems still need work to allocate 1GB pages. With ext4, I can
only get 16MB of contiguous space, although it is aligned. With XFS,
I can get 80MB less than 1GB, and it's not aligned. The XFS problem
may be due to the small amount of RAM in my test machine.
This patch set is against something approximately current -mm. I'd like
to thank Dave Chinner & Kirill Shutemov for their reviews of v1.
The conversion of pmd_fault & pud_fault to huge_fault is thanks to
Dave's poking, and Kirill spotted a couple of problems in the MM code.
Version 2 of the patch set is about 200 lines smaller (1016 insertions,
23 deletions in v1).
I've done some light testing using a program to mmap a block device
with DAX enabled, calling mincore() and examining /proc/smaps and
/proc/pagemap.
v4: Updated to current mmotm
Converted pud_trans_huge_lock to the same calling conventions as
pmd_trans_huge_lock.
Fill in vm_fault ->gfp_flags and ->pgoff, at Jan Kara's suggestion
Replace use of page table lock with pud_lock in __pud_alloc (cosmetic)
Fix compilation problems with various config settings
Convert dax_pmd_fault and dax_pud_fault to take a vm_fault instead of
individual pieces
Add copy_huge_pud() and follow_devmap_pud() so fork() should now work
Fix typo of PMD for PUD
v3: Rebased against current mmotm
v2: Reduced churn in filesystems by switching to ->huge_fault interface
Addressed concerns from Kirill
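For readers who have not followed the ->huge_fault discussion, the net effect on
a filesystem's vm_operations is roughly the sketch below. The handler names are
placeholders, and the exact ->huge_fault signature is whatever patch 2/8 defines:

static const struct vm_operations_struct dax_vm_ops = {
        .fault          = dax_fault_handler,            /* PTE (4K) faults */
        .huge_fault     = dax_huge_fault_handler,       /* PMD (2M) and PUD (1G)
                                                         * faults, replaces the
                                                         * old ->pmd_fault */
        .page_mkwrite   = dax_fault_handler,
        .pfn_mkwrite    = dax_pfn_mkwrite_handler,
};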
Matthew Wilcox (8):
mm: Convert an open-coded VM_BUG_ON_VMA
mm,fs,dax: Change ->pmd_fault to ->huge_fault
mm: Add support for PUD-sized transparent hugepages
mincore: Add support for PUDs
procfs: Add support for PUDs to smaps, clear_refs and pagemap
x86: Add support for PUD-sized transparent hugepages
dax: Support for transparent PUD pages
ext4: Support for PUD-sized transparent huge pages
Documentation/filesystems/dax.txt | 12 +-
arch/Kconfig | 3 +
arch/x86/Kconfig | 1 +
arch/x86/include/asm/paravirt.h | 11 ++
arch/x86/include/asm/paravirt_types.h | 2 +
arch/x86/include/asm/pgtable-2level.h | 19 +++
arch/x86/include/asm/pgtable-3level.h | 31 ++++
arch/x86/include/asm/pgtable.h | 134 +++++++++++++++
arch/x86/include/asm/pgtable_64.h | 13 ++
arch/x86/kernel/paravirt.c | 1 +
arch/x86/mm/pgtable.c | 31 ++++
fs/block_dev.c | 10 +-
fs/dax.c | 295 +++++++++++++++++++++++++---------
fs/ext2/file.c | 27 +---
fs/ext4/file.c | 60 +++----
fs/proc/task_mmu.c | 109 +++++++++++++
fs/xfs/xfs_file.c | 25 ++-
fs/xfs/xfs_trace.h | 2 +-
include/asm-generic/pgtable.h | 74 ++++++++-
include/asm-generic/tlb.h | 14 ++
include/linux/dax.h | 17 --
include/linux/huge_mm.h | 78 ++++++++-
include/linux/mm.h | 48 +++++-
include/linux/mmu_notifier.h | 14 ++
include/linux/pfn_t.h | 8 +
mm/gup.c | 7 +
mm/huge_memory.c | 246 ++++++++++++++++++++++++++++
mm/memory.c | 135 ++++++++++++++--
mm/mincore.c | 13 ++
mm/pagewalk.c | 19 ++-
mm/pgtable-generic.c | 14 ++
31 files changed, 1261 insertions(+), 212 deletions(-)
--
2.7.0.rc3
[RFC 0/2] New MAP_PMEM_AWARE mmap flag
by Boaz Harrosh
Hi all
Recent DAX code fixed the cache-line flushing, i.e. durability, of mmap access
to direct persistent-memory from applications. It uses the per-inode radix-tree
to track the indexes of a file that were page-faulted for
write. Then at m/fsync time it flushes those pages and cleans
the radix-tree for the next round.
Sigh, that is life; for legacy applications this is the price we must
pay. But for NV-aware applications, like the nvml library, we pay an extra
price even if we never actually call m/fsync. For these applications
the extra resources, and especially the extra radix-tree locking
per page-fault, cost a lot, around 3x.
What we propose here is a way for those applications to enjoy the
boost and still not sacrifice any correctness of legacy applications.
Any concurrent access from legacy apps vs nv-aware apps, even to the same
file / same page, will work correctly.
We do that by defining a new mmap flag that is set by the nv-aware
app. This flag is carried by the VMA. In the DAX code we bypass any
radix-tree handling of the page if this flag is set. Pages accessed *without*
this flag will be added to the radix-tree; those accessed with it will not.
At m/fsync time, if the radix-tree is then empty, nothing will happen.
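For reference, the intended usage from an nv-aware application looks roughly
like the sketch below. The flag value is a placeholder for illustration; the
real value comes from the uapi header in patch 1/2:

#include <sys/mman.h>

#ifndef MAP_PMEM_AWARE
#define MAP_PMEM_AWARE  0x40000         /* placeholder value, illustration only */
#endif

/*
 * An nv-aware app maps the DAX file with the new flag and takes
 * responsibility for its own durability (e.g. via the nvml movnt /
 * flush primitives), so the kernel can skip the per-fault radix-tree
 * tracking for these pages.
 */
static void *map_pmem_aware(int fd, size_t len)
{
        return mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_PMEM_AWARE, fd, 0);
}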
These are very simple, non-intrusive patches with minimal risk (I think).
They are based on v4.5-rc5. If you need a rebase on any other tree, please
say so.
Please consider this new flag for those of us who specialize in
persistent-memory setups and want to extract any possible mileage out
of our systems.
Also attached for reference is a 3rd patch, to the nvml library, to use
the new flag. Which brings me to the issue of persistent_memcpy / persistent_flush:
currently this library is x86_64 only, using the movnt instructions. The gcc
compiler should have a per-arch facility for durable memory accesses, so that
applications can be portable across systems.
Please advise?
list of patches:
[RFC 1/2] mmap: Define a new MAP_PMEM_AWARE mmap flag
[RFC 2/2] REVIEWME: dax: Support MAP_PMEM_AWARE for optimal
Two Kernel patches
[RFC 1/1] util: add pmem-aware flag to mmap
A patch for the nvml library
Thanks
Boaz
acpi_nfit_find_poison() question
by Linda Knippers
Hi Vishal,
I was looking at acpi_nfit_find_poison() and if I'm reading this
right, I think it's throwing away some ARS results and re-running
an ARS unnecessarily. More comments below...
-- ljk
> static int acpi_nfit_find_poison(struct acpi_nfit_desc *acpi_desc,
> struct nd_region_desc *ndr_desc)
> {
> struct nvdimm_bus_descriptor *nd_desc = &acpi_desc->nd_desc;
> struct nvdimm_bus *nvdimm_bus = acpi_desc->nvdimm_bus;
> struct nd_cmd_ars_status *ars_status = NULL;
> struct nd_cmd_ars_start *ars_start = NULL;
> struct nd_cmd_ars_cap *ars_cap = NULL;
> u64 start, len, cur, remaining;
> int rc;
>
> ars_cap = kzalloc(sizeof(*ars_cap), GFP_KERNEL);
> if (!ars_cap)
> return -ENOMEM;
>
> start = ndr_desc->res->start;
> len = ndr_desc->res->end - ndr_desc->res->start + 1;
>
> rc = ars_get_cap(nd_desc, ars_cap, start, len);
> if (rc)
> goto out;
>
> /*
> * If ARS is unsupported, or if the 'Persistent Memory Scrub' flag in
> * extended status is not set, skip this but continue initialization
> */
> if ((ars_cap->status & 0xffff) ||
> !(ars_cap->status >> 16 & ND_ARS_PERSISTENT)) {
> dev_warn(acpi_desc->dev,
> "ARS unsupported (status: 0x%x), won't create an error list\n",
> ars_cap->status);
> goto out;
> }
>
> /*
> * Check if a full-range ARS has been run. If so, use those results
> * without having to start a new ARS.
> */
> ars_status = kzalloc(ars_cap->max_ars_out + sizeof(*ars_status),
> GFP_KERNEL);
> if (!ars_status) {
> rc = -ENOMEM;
> goto out;
> }
>
> rc = ars_get_status(nd_desc, ars_status);
> if (rc)
> goto out;
>
> if (ars_status->address <= start &&
> (ars_status->address + ars_status->length >= start + len)) {
> rc = ars_status_process_records(nvdimm_bus, ars_status, start);
> goto out;
> }
The above code will process the records if the ARS ran to completion but
not if the ARS overflowed. It won't process partial results because it's
checking both the start and the length against the total range.
>
> /*
> * ARS_STATUS can overflow if the number of poison entries found is
> * greater than the maximum buffer size (ars_cap->max_ars_out)
> * To detect overflow, check if the length field of ars_status
> * is less than the length we supplied. If so, process the
> * error entries we got, adjust the start point, and start again
> */
This comment seems like the right idea but that's not what it's doing.
> ars_start = kzalloc(sizeof(*ars_start), GFP_KERNEL);
> if (!ars_start)
> return -ENOMEM;
>
> cur = start;
> remaining = len;
If we get here, we're starting over at the beginning, losing the
previous results. Shouldn't we process the previous results and
then enter this loop using
cur = ars_status->address + ars_status->length;
remaining = len - ars_status->length;
?
Or restructure the loop so that the existing results, if any, are
processed before doing an ars_do_start()? Or did I miss something?
> do {
> u64 done, end;
>
> rc = ars_do_start(nd_desc, ars_start, cur, remaining);
> if (rc)
> goto out;
>
> rc = ars_get_status(nd_desc, ars_status);
> if (rc)
> goto out;
>
> rc = ars_status_process_records(nvdimm_bus, ars_status, cur);
> if (rc)
> goto out;
>
> end = min(cur + remaining,
> ars_status->address + ars_status->length);
> done = end - cur;
> cur += done;
> remaining -= done;
> } while (remaining);
>
> out:
> kfree(ars_cap);
> kfree(ars_start);
> kfree(ars_status);
> return rc;
> }
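For illustration, the restructure I have in mind is roughly the fragment below
(untested sketch, reusing the helpers quoted above; it assumes the partial ARS
results end somewhere inside [start, start + len)):

        /* consume the results the earlier, partial ARS already produced */
        rc = ars_status_process_records(nvdimm_bus, ars_status, start);
        if (rc)
                goto out;

        /* ...and resume scrubbing where it stopped, rather than at 'start' */
        cur = ars_status->address + ars_status->length;
        remaining = (start + len) - cur;

        while (remaining) {
                u64 done, end;

                rc = ars_do_start(nd_desc, ars_start, cur, remaining);
                if (rc)
                        goto out;

                rc = ars_get_status(nd_desc, ars_status);
                if (rc)
                        goto out;

                rc = ars_status_process_records(nvdimm_bus, ars_status, cur);
                if (rc)
                        goto out;

                end = min(cur + remaining,
                                ars_status->address + ars_status->length);
                done = end - cur;
                cur += done;
                remaining -= done;
        }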
[PATCH 0/8] nfit, libnvdimm: async address range scrub
by Dan Williams
Given the capacities of next-generation persistent memory devices, a
scrub operation to find all poison may take tens of seconds. We want
this scrub work to be done asynchronously with the rest of system
initialization, so we move it out of line from the NFIT probing, i.e.
acpi_nfit_add().
However, we may want to synchronously wait for that scrubbing to
complete before we probe any pmem devices. Consider the case where
consuming poison triggers a machine check and a reboot. That event will
trigger platform firmware to initiate a scrub. The kernel should
complete any firmware-initiated scrubs, as those likely indicate the
presence of known poison.
When errors are not present and platform firmware did not initiate
scrubbing, we still scrub, but asynchronously. This trades off a risk
of hitting new unknown poison ranges with making the data available
faster after loading the driver.
This async scrub capability is also useful in the future when we
integrate Tony Luck's mcsafe_copy() (or whatever it is
eventually called). After a machine check recovery event we can scrub
the pmem namespace to see if there are any other latent errors and
otherwise update the 'badblocks' list with the new entries.
This passes the libndctl unit test suite, with some minor updates to
account for the fact that when "modprobe nfit_test" returns, not all
regions are registered.
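The async piece itself is just the usual workqueue pattern; a rough sketch is
below (the structure field and function names are placeholders, not necessarily
what the patches end up using):

static struct workqueue_struct *nfit_wq;        /* placeholder */

static void acpi_nfit_scrub_work(struct work_struct *work)
{
        struct acpi_nfit_desc *acpi_desc =
                container_of(work, struct acpi_nfit_desc, scrub_work);

        /* run ARS across the regions, then register them with libnvdimm */
        acpi_nfit_scrub_and_register(acpi_desc);        /* placeholder */
}

static void acpi_nfit_queue_scrub(struct acpi_nfit_desc *acpi_desc)
{
        INIT_WORK(&acpi_desc->scrub_work, acpi_nfit_scrub_work);
        queue_work(nfit_wq, &acpi_desc->scrub_work);
}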
---
Dan Williams (8):
libnvdimm, nfit: centralize command status translation
libnvdimm: protect nvdimm_{bus|namespace}_add_poison() with nvdimm_bus_lock()
libnvdimm: async notification support
nfit, tools/testing/nvdimm: unify common init for acpi_nfit_desc
nfit, libnvdimm: async region scrub workqueue
nfit: scrub and register regions in a workqueue
nfit: disable userspace initiated ars during scrub
tools/testing/nvdimm: expand ars unit testing
drivers/acpi/nfit.c | 761 +++++++++++++++++++++++++++-----------
drivers/acpi/nfit.h | 24 +
drivers/nvdimm/bus.c | 46 ++
drivers/nvdimm/core.c | 110 ++++-
drivers/nvdimm/dimm_devs.c | 6
drivers/nvdimm/nd.h | 2
drivers/nvdimm/pmem.c | 15 +
drivers/nvdimm/region.c | 12 +
include/linux/libnvdimm.h | 5
include/linux/nd.h | 7
tools/testing/nvdimm/test/nfit.c | 133 +++++--
11 files changed, 809 insertions(+), 312 deletions(-)
[PATCH v2 0/3] ACPI 6.1 update for NFIT Control Region Structure
by Toshi Kani
ACPI 6.1, Table 5-133, updates NVDIMM Control Region Structure
as follows.
- Valid Fields, Manufacturing Location, and Manufacturing Date
are added, carved out of the previously reserved range. There is no
change in the structure size.
- IDs defined as SPD values are arrays of bytes. The spec
clarified that they need to be represented as arrays of bytes
in the structure as well.
Patch 1 changes 'struct acpi_nfit_control_region' and the NFIT driver to
comply with ACPI 6.1.
Patch 2 adds a new sysfs file "id" to show NVDIMM ID defined in ACPI 6.1.
Patch 3 changes the nfit test driver.
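For reference, the affected portion of the structure looks roughly like the
excerpt below (a sketch based on the ACPI 6.1 table; see the actbl1.h change in
patch 1/3 for the authoritative layout):

struct acpi_nfit_control_region {
        /* ... */
        u16 subsystem_revision_id;
        u8  valid_fields;               /* new in ACPI 6.1, was reserved */
        u8  manufacturing_location;     /* new in ACPI 6.1, was reserved */
        u16 manufacturing_date;         /* new in ACPI 6.1, was reserved */
        u8  reserved[2];                /* shrunk from the old reserved[6] */
        u32 serial_number;
        u16 code;                       /* region format interface code */
        /* ... */
};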
link: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
---
v2:
- Remove 'mfg_location' and 'mfg_date'. (Dan Williams)
- Rename 'unique_id' to 'id' and make this change as a separate patch.
(Dan Williams)
---
Toshi Kani (3):
1/3 ACPI/NFIT: Update Control Region Structure to comply ACPI 6.1
2/3 ACPI/NFIT: Add NVDIMM ID "id" under sysfs
3/3 nfit_test: Update SPD ID init handlings
---
drivers/acpi/nfit.c | 41 ++++++++++++++++++++-----
include/acpi/actbl1.h | 24 +++++++++------
tools/testing/nvdimm/test/nfit.c | 64 ++++++++++++++++++++++++----------------
3 files changed, 88 insertions(+), 41 deletions(-)
[PATCH] ext2, ext4: Fix issue with missing journal entry
by Ross Zwisler
As currently written, ext4_dax_mkwrite() assumes that the call into
__dax_mkwrite() will not have to do a block allocation, so it doesn't create
a journal entry. For a read that creates a zero page to cover a hole,
followed by a write that actually allocates storage, this is incorrect. The
ext4_dax_mkwrite() -> __dax_mkwrite() -> __dax_fault() path calls
get_blocks() to allocate storage.
Fix this by having the ->page_mkwrite fault handler call ext4_dax_fault()
as this function already has all the logic needed to allocate a journal
entry and call __dax_fault().
Also update the ext2 fault handlers in this same way to remove duplicate
code and keep the logic between ext2 and ext4 the same.
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
---
fs/ext2/file.c | 19 +------------------
fs/ext4/file.c | 19 ++-----------------
2 files changed, 3 insertions(+), 35 deletions(-)
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 2c88d68..c1400b1 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -80,23 +80,6 @@ static int ext2_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
return ret;
}
-static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
-{
- struct inode *inode = file_inode(vma->vm_file);
- struct ext2_inode_info *ei = EXT2_I(inode);
- int ret;
-
- sb_start_pagefault(inode->i_sb);
- file_update_time(vma->vm_file);
- down_read(&ei->dax_sem);
-
- ret = __dax_mkwrite(vma, vmf, ext2_get_block, NULL);
-
- up_read(&ei->dax_sem);
- sb_end_pagefault(inode->i_sb);
- return ret;
-}
-
static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
struct vm_fault *vmf)
{
@@ -124,7 +107,7 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
static const struct vm_operations_struct ext2_dax_vm_ops = {
.fault = ext2_dax_fault,
.pmd_fault = ext2_dax_pmd_fault,
- .page_mkwrite = ext2_dax_mkwrite,
+ .page_mkwrite = ext2_dax_fault,
.pfn_mkwrite = ext2_dax_pfn_mkwrite,
};
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 1126436..d2e8500 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -262,23 +262,8 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
return result;
}
-static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
-{
- int err;
- struct inode *inode = file_inode(vma->vm_file);
-
- sb_start_pagefault(inode->i_sb);
- file_update_time(vma->vm_file);
- down_read(&EXT4_I(inode)->i_mmap_sem);
- err = __dax_mkwrite(vma, vmf, ext4_dax_mmap_get_block, NULL);
- up_read(&EXT4_I(inode)->i_mmap_sem);
- sb_end_pagefault(inode->i_sb);
-
- return err;
-}
-
/*
- * Handle write fault for VM_MIXEDMAP mappings. Similarly to ext4_dax_mkwrite()
+ * Handle write fault for VM_MIXEDMAP mappings. Similarly to ext4_dax_fault()
* handler we check for races agaist truncate. Note that since we cycle through
* i_mmap_sem, we are sure that also any hole punching that began before we
* were called is finished by now and so if it included part of the file we
@@ -311,7 +296,7 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma,
static const struct vm_operations_struct ext4_dax_vm_ops = {
.fault = ext4_dax_fault,
.pmd_fault = ext4_dax_pmd_fault,
- .page_mkwrite = ext4_dax_mkwrite,
+ .page_mkwrite = ext4_dax_fault,
.pfn_mkwrite = ext4_dax_pfn_mkwrite,
};
#else
--
2.5.0
[GIT PULL] dax-fixes for 4.5-rc6
by Ross Zwisler
Hi Linus, please pull from:
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git dax-fixes
This fixes several issues with the current DAX code, including possible data
corruption and kernel OOPSes. This also includes bugs with raw block devices
that never opt in to DAX, so it can affect existing applications and setups.
1) DAX is used by default on raw block devices that are capable of
supporting it. This creates an issue because there are still uses of the
block device that use the page cache, and having one block device user
doing DAX I/O and another doing page cache I/O can lead to data corruption.
2) When S_DAX is set on an inode we assume that if there are pages attached
to the mapping (mapping->nrpages != 0), those pages are clean zero pages
that were used to service reads from holes. This wasn't true in all cases.
3) ext4 online defrag combined with DAX I/O could lead to data corruption.
4) The DAX block/sector zeroing code needs a valid struct block_device,
which it wasn't always getting. This could lead to a kernel OOPS.
5) The DAX writeback code needs a valid struct block_device, which it
wasn't always getting. This could lead to a kernel OOPS.
6) The DAX writeback code needs to be called for sync(2) and syncfs(2).
This could lead to data loss.
I know DAX fixes have historically gone up through Andrew Morton's -mm tree,
but for some reason he's been silent on this series for the last few weeks. I
think that the problems being fixed are important enough that we really
shouldn't wait until v4.6.
Please let me know if you'd like additional justification on why I think these
should be merged, or if you have any questions.
Thanks,
- Ross
----------------------------------------------------------------
Dan Williams (1):
block: disable block device DAX by default
Ross Zwisler (4):
ext2, ext4: only set S_DAX for regular inodes
ext4: Online defrag not supported with DAX
dax: give DAX clearing code correct bdev
dax: move writeback calls into the filesystems
block/Kconfig | 13 +++++++++++++
fs/block_dev.c | 19 +++++++++++++++++--
fs/dax.c | 21 +++++++++++----------
fs/ext2/inode.c | 16 +++++++++++++---
fs/ext4/inode.c | 6 +++++-
fs/ext4/ioctl.c | 5 +++++
fs/xfs/xfs_aops.c | 6 +++++-
fs/xfs/xfs_aops.h | 1 +
fs/xfs/xfs_bmap_util.c | 3 ++-
include/linux/dax.h | 8 +++++---
mm/filemap.c | 12 ++++--------
11 files changed, 81 insertions(+), 29 deletions(-)
commit 1f20410488863337259f528b3210c464c72ee27c
Author: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Date: Sun Feb 7 00:19:13 2016 -0700
dax: move writeback calls into the filesystems
Previously calls to dax_writeback_mapping_range() for all DAX filesystems
(ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
dax_writeback_mapping_range() needs a struct block_device, and it used to
get that from inode->i_sb->s_bdev. This is correct for normal inodes
mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
block devices and for XFS real-time files.
Instead, call dax_writeback_mapping_range() directly from the filesystem
->writepages function so that it can supply us with a valid block
device. This also fixes DAX code to properly flush caches in response to
sync(2).
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Signed-off-by: Jan Kara <jack(a)suse.cz>
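For context, the filesystem side of this change boils down to a ->writepages
hook along these lines (illustrative sketch, not the exact hunk from the patch):

/* fs/ext2/inode.c, sketch */
static int ext2_writepages(struct address_space *mapping,
			   struct writeback_control *wbc)
{
	struct inode *inode = mapping->host;

	/* DAX mappings carry their dirty state as radix-tree exceptional
	 * entries; flush them against the inode's real block device */
	if (dax_mapping(mapping))
		return dax_writeback_mapping_range(mapping,
				inode->i_sb->s_bdev, wbc);

	return mpage_writepages(mapping, wbc, ext2_get_block);
}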
commit dda4dcbdc9242eb600aa2d271d80bf7e1762aa63
Author: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Date: Fri Feb 5 22:07:04 2016 -0700
dax: give DAX clearing code correct bdev
dax_clear_blocks() needs a valid struct block_device and previously it was
using inode->i_sb->s_bdev in all cases. This is correct for normal inodes
on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
block devices and for XFS real-time devices.
Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change its
arguments to take a bdev and a sector instead of an inode and a block.
This better reflects what the function does, and it allows the filesystem
and raw block device code to pass in an appropriate struct block_device.
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Suggested-by: Dan Williams <dan.j.williams(a)intel.com>
Reviewed-by: Jan Kara <jack(a)suse.cz>
commit 0e2dcfb5b46129c01738d610b7a4aa4165800d5e
Author: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Date: Sat Feb 13 21:44:27 2016 -0700
ext4: Online defrag not supported with DAX
Online defrag operations for ext4 are hard coded to use the page cache.
See ext4_ioctl() -> ext4_move_extents() -> move_extent_per_page()
When combined with DAX I/O, which circumvents the page cache, this can
result in data corruption. This was observed with xfstests ext4/307 and
ext4/308.
Fix this by only allowing online defrag for non-DAX files.
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Reviewed-by: Jan Kara <jack(a)suse.cz>
commit 10d08a7339df8a252c7365d8877c72acd2aed109
Author: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Date: Fri Feb 12 18:15:25 2016 -0700
ext2, ext4: only set S_DAX for regular inodes
When S_DAX is set on an inode we assume that if there are pages attached
to the mapping (mapping->nrpages != 0), those pages are clean zero pages
that were used to service reads from holes. Any dirty data associated with
the inode should be in the form of DAX exceptional entries
(mapping->nrexceptional) that is written back via
dax_writeback_mapping_range().
With the current code, though, this isn't always true. For example, ext2
and ext4 directory inodes can have S_DAX set, but have their dirty data
stored as dirty page cache entries. For these types of inodes, having
S_DAX set doesn't really make sense since their I/O doesn't actually happen
through the DAX code path.
Instead, only allow S_DAX to be set for regular inodes for ext2 and ext4.
This allows us to have strict DAX vs non-DAX paths in the writeback code.
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Reviewed-by: Jan Kara <jack(a)suse.cz>
commit 67e8c633958de5c168ed857c94a4573cc0442c97
Author: Dan Williams <dan.j.williams(a)intel.com>
Date: Fri Feb 12 13:08:47 2016 -0800
block: disable block device DAX by default
The recent *sync enabling discovered that we are inserting into the
block_device pagecache, counter to the expectations of the dirty data
tracking for dax mappings. This can lead to data corruption.
We want to support DAX for block devices eventually, but it requires
wider changes to properly manage the pagecache.
[<ffffffff81576d93>] dump_stack+0x85/0xc2
[<ffffffff812b9ee0>] dax_writeback_mapping_range+0x60/0xe0
[<ffffffff812a1d4f>] blkdev_writepages+0x3f/0x50
[<ffffffff811db011>] do_writepages+0x21/0x30
[<ffffffff811cb6a6>] __filemap_fdatawrite_range+0xc6/0x100
[<ffffffff811cb75a>] filemap_write_and_wait+0x4a/0xa0
[<ffffffff812a15e0>] set_blocksize+0x70/0xd0
[<ffffffff812a273d>] sb_set_blocksize+0x1d/0x50
[<ffffffff8132ac9b>] ext4_fill_super+0x75b/0x3360
[<ffffffff81583381>] ? vsnprintf+0x201/0x4c0
[<ffffffff815836d9>] ? snprintf+0x49/0x60
[<ffffffff81263010>] mount_bdev+0x180/0x1b0
[<ffffffff8132a540>] ? ext4_calculate_overhead+0x370/0x370
[<ffffffff8131ad95>] ext4_mount+0x15/0x20
[<ffffffff81263908>] mount_fs+0x38/0x170
Mark the support broken so it's disabled by default, but otherwise still
available for testing.
Cc: Jan Kara <jack(a)suse.cz>
Cc: Jens Axboe <axboe(a)fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox(a)intel.com>
Cc: Al Viro <viro(a)ftp.linux.org.uk>
Reported-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Suggested-by: Dave Chinner <david(a)fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Reviewed-by: Jan Kara <jack(a)suse.cz>
[GIT PULL] libnvdimm, nfit: fixes for 4.5-rc6
by Williams, Dan J
Hi Linus, please pull from:
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm libnvdimm-fixes
...to receive:
1/ Two fixes for compatibility with the ACPI 6.1 specification.
Without these fixes multi-interface DIMMs will fail to be probed, and
address range scrub commands to find memory errors will give results
that the kernel will mis-interpret. For multi-interface DIMMs Linux
will accept either the original 6.0 implementation or 6.1. For address
range scrub we'll only support 6.1 since ACPI formalized this DSM
differently than the original example [1] implemented in v4.2. The
expectation is that production systems will only ever ship the ACPI 6.1
address range scrub command definition.
2/ The wider async address range scrub work targeting 4.6 discovered
that the original synchronous implementation in 4.5 is not sizing its
return buffer correctly.
3/ Arnd caught that my recent fix to the size of the pfn_t flags missed
updating the flags variable used in the pmem driver.
4/ Toshi found that we mishandle the memremap() return value in
devm_memremap().
This branch has received a clean build success notification from the
kbuild robot across 105 configs.
[1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
The following changes since commit 18558cae0272f8fd9647e69d3fec1565a7949865:
Linux 4.5-rc4 (2016-02-14 13:05:20 -0800)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm libnvdimm-fixes
for you to fetch changes up to c45442055dfdeb265cc20c9eeaa9fd11a75fbf51:
nvdimm: use 'u64' for pfn flags (2016-02-23 17:17:20 -0800)
----------------------------------------------------------------
Arnd Bergmann (1):
nvdimm: use 'u64' for pfn flags
Dan Williams (3):
nfit: fix multi-interface dimm handling, acpi6.1 compatibility
libnvdimm, tools/testing/nvdimm: fix 'ars_status' output buffer sizing
nfit: update address range scrub commands to the acpi 6.1 format
Toshi Kani (1):
devm_memremap: Fix error value when memremap failed
drivers/acpi/nfit.c | 90 ++++++++++++++++++++--------------------
drivers/nvdimm/bus.c | 20 ++++-----
drivers/nvdimm/pmem.c | 2 +-
include/linux/libnvdimm.h | 3 +-
include/uapi/linux/ndctl.h | 11 ++++-
kernel/memremap.c | 4 +-
tools/testing/nvdimm/test/nfit.c | 8 +++-
7 files changed, 75 insertions(+), 63 deletions(-)
commit c45442055dfdeb265cc20c9eeaa9fd11a75fbf51
Author: Arnd Bergmann <arnd(a)arndb.de>
Date: Mon Feb 22 22:58:34 2016 +0100
nvdimm: use 'u64' for pfn flags
A recent bugfix changed pfn_t to always be 64-bit wide, but did not
change the code in pmem.c, which is now broken on 32-bit architectures
as reported by gcc:
In file included from ../drivers/nvdimm/pmem.c:28:0:
drivers/nvdimm/pmem.c: In function 'pmem_alloc':
include/linux/pfn_t.h:15:17: error: large integer implicitly truncated to unsigned type [-Werror=overflow]
#define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3))
This changes the intermediate pfn_flags in struct pmem_device to
be 64 bit wide as well, so they can store the flags correctly.
Signed-off-by: Arnd Bergmann <arnd(a)arndb.de>
Fixes: db78c22230d0 ("mm: fix pfn_t vs highmem")
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
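Illustrative sketch of the change (surrounding fields elided):

struct pmem_device {
	/* ... */
	/* was 'unsigned long'; PFN_DEV and friends are defined against
	 * BITS_PER_LONG_LONG, so this must be 64-bit even on 32-bit arches */
	u64	pfn_flags;
	/* ... */
};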
commit 93f834df9c2d4e362dfdc4b05daa0a4e18814836
Author: Toshi Kani <toshi.kani(a)hpe.com>
Date: Sat Feb 20 14:32:24 2016 -0800
devm_memremap: Fix error value when memremap failed
devm_memremap() returns an ERR_PTR() value in case of error.
However, it returns NULL when memremap() failed. This causes
the caller, such as the pmem driver, to proceed and oops later.
Change devm_memremap() to return ERR_PTR(-ENXIO) when memremap()
failed.
Signed-off-by: Toshi Kani <toshi.kani(a)hpe.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: <stable(a)vger.kernel.org>
Reviewed-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
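For drivers, the upshot is that a single IS_ERR() check now covers both
failure paths; a minimal sketch (the wrapper name is hypothetical):

#include <linux/io.h>

static int map_pmem_resource(struct device *dev, struct resource *res,
			     void **addr)
{
	*addr = devm_memremap(dev, res->start, resource_size(res),
			      MEMREMAP_WB);

	/* with this fix a memremap() failure is reported as ERR_PTR(-ENXIO)
	 * instead of NULL, so IS_ERR() catches it too */
	if (IS_ERR(*addr))
		return PTR_ERR(*addr);

	return 0;
}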
commit 4577b0665515e0abc7bc72562d6328d179598815
Author: Dan Williams <dan.j.williams(a)intel.com>
Date: Wed Feb 17 13:08:58 2016 -0800
nfit: update address range scrub commands to the acpi 6.1 format
The original format of these commands from the "NVDIMM DSM Interface
Example" [1] are superseded by the ACPI 6.1 definition of the "NVDIMM Root
Device _DSMs" [2].
[1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
[2]: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
"9.20.7 NVDIMM Root Device _DSMs"
Changes include:
1/ New 'restart' fields in ars_status, unfortunately these are
implemented in the middle of the existing definition so this change
is not backwards compatible. The expectation is that shipping
platforms will only ever support the ACPI 6.1 definition.
2/ New status values for ars_start ('busy') and ars_status ('overflow').
Cc: Vishal Verma <vishal.l.verma(a)intel.com>
Cc: Linda Knippers <linda.knippers(a)hpe.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
commit 747ffe11b440ef9ea752888806d3aac677ca52a4
Author: Dan Williams <dan.j.williams(a)intel.com>
Date: Fri Feb 19 15:21:14 2016 -0800
libnvdimm, tools/testing/nvdimm: fix 'ars_status' output buffer sizing
Use the output length specified in the command to size the receive
buffer rather than the arbitrary 4K limit.
This bug was hiding the fact that the ndctl implementation of
ndctl_bus_cmd_new_ars_status() was not specifying an output buffer size.
Cc: <stable(a)vger.kernel.org>
Cc: Vishal Verma <vishal.l.verma(a)intel.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
commit 6697b2cf69d4363266ca47eaebc49ef13dabc1c9
Author: Dan Williams <dan.j.williams(a)intel.com>
Date: Thu Feb 4 16:51:00 2016 -0800
nfit: fix multi-interface dimm handling, acpi6.1 compatibility
ACPI 6.1 clarified that multi-interface dimms require multiple control
region entries (DCRs) per dimm. Previously we were assuming that a
control region is only present when block-data-windows are present.
This implementation was done with an eye toward compatibility with the
looser ACPI 6.0 interpretation of this table.
1/ When coalescing the memory device (MEMDEV) tables for a single dimm,
coalesce on device_handle rather than control region index.
2/ Whenever we discover a control region with non-zero block windows
re-scan for block-data-window (BDW) entries.
We may need to revisit this if a DIMM ever implements a format interface
outside of blk or pmem, but that is not on the foreseeable horizon.
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>