ACPI 6.1, Table 5-133, updates the NVDIMM Control Region Structure as follows:
- Valid Fields, Manufacturing Location, and Manufacturing Date
are added from the reserved range. No change in the structure size.
- IDs (SPD values) are stored as arrays of bytes (i.e. in big-endian
format). The spec clarifies that they need to be represented
as arrays of bytes as well.
Patch 1 changes the NFIT driver to comply with ACPI 6.1.
Patch 2 adds a new sysfs file "id" to show NVDIMM ID defined in ACPI 6.1.
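For illustration, the new "id" attribute could look roughly like the sketch
below. This is only a sketch of the intended shape: the to_nfit_dcr() helper,
the ACPI_NFIT_CONTROL_MFG_INFO_VALID flag name, and the exact output format
are placeholders pending the ACPICA update, not taken verbatim from the patch.

/* sketch: the SPD-derived fields are byte arrays (big-endian),
 * so they are byte-swapped for display */
static ssize_t id_show(struct device *dev,
                struct device_attribute *attr, char *buf)
{
        struct acpi_nfit_control_region *dcr = to_nfit_dcr(dev);

        if (dcr->valid_fields & ACPI_NFIT_CONTROL_MFG_INFO_VALID)
                return sprintf(buf, "%04x-%02x-%04x-%08x\n",
                                be16_to_cpu(dcr->vendor_id),
                                dcr->manufacturing_location,
                                be16_to_cpu(dcr->manufacturing_date),
                                be32_to_cpu(dcr->serial_number));

        return sprintf(buf, "%04x-%08x\n",
                        be16_to_cpu(dcr->vendor_id),
                        be32_to_cpu(dcr->serial_number));
}
static DEVICE_ATTR_RO(id);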
The patch set applies on top of the linux-pm.git 'acpica' branch.
- Need to coordinate with ACPICA update (Bob Moore, Dan Williams)
- Integrate with ACPICA changes in struct acpi_nfit_control_region.
- Remove 'mfg_location' and 'mfg_date'. (Dan Williams)
- Rename 'unique_id' to 'id' and make this change in a separate patch.
Toshi Kani (2):
1/2 acpi/nfit: Update nfit driver to comply with ACPI 6.1
2/2 acpi/nfit: Add sysfs "id" for NVDIMM ID
drivers/acpi/nfit.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
Currently, copy_from_iter_nocache() uses "nocache" copies only for
iovecs; bvecs and kvecs use normal copies. This requires
x86's arch_copy_from_iter_pmem() to issue flushes for bvecs and kvecs,
which has a negative impact on performance when splice()ing from a pipe
to a pmem-backed file on a DAX-mounted file system.
This patch set enables nocache copies in copy_from_iter_nocache() for
bvecs and kvecs for arches that support it (x86 initially). This provides
a 2-3X improvement in splice() pipe-to-DAX-file throughput.
The first patch introduces memcpy_nocache(), which defaults to just
memcpy(), but for which an x86-specific implementation is provided.
For this patch, I sought to use a static inline function for x86, but
I could not find an obvious header file to put it in.
The build seemed to work when I put it in arch/x86/include/asm/uaccess.h,
but that didn't feel completely right. I also tried
arch/x86/include/asm/pmem.h, but that doesn't feel right either and it
didn't build. So, I offer it here in arch/x86/lib/misc.c for discussion.
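Roughly, the two pieces look like this (a sketch only; the
__HAVE_ARCH_MEMCPY_NOCACHE guard name and the use of
__copy_from_user_inatomic_nocache() in the x86 body are illustrative,
and exactly where each piece should live is the open question above):

/* include/linux/string.h: generic fallback, a plain cached copy */
#ifndef __HAVE_ARCH_MEMCPY_NOCACHE
static inline void *memcpy_nocache(void *dst, const void *src, size_t cnt)
{
        return memcpy(dst, src, cnt);
}
#endif

/* arch/x86/lib/misc.c: x86 override built on the existing
 * non-temporal copy helper, so the stores bypass the cache */
void *memcpy_nocache(void *dst, const void *src, size_t cnt)
{
        __copy_from_user_inatomic_nocache(dst, src, cnt);
        return dst;
}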
The second patch updates copy_from_iter_nocache() to use the new
memcpy_nocache() for bvecs and kvecs.
The third patch removes the flushes from x86's arch_copy_from_iter_pmem().
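With nocache copies guaranteed for iovecs, bvecs, and kvecs alike, the x86
pmem wrapper can collapse to a plain call, roughly as below (a sketch of the
resulting shape, not the literal patch):

/* arch/x86/include/asm/pmem.h after the change: no per-iterator-type
 * write-back flush is needed, because copy_from_iter_nocache() now
 * uses non-temporal stores for bvecs and kvecs as well */
static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
                struct iov_iter *i)
{
        return copy_from_iter_nocache(addr, bytes, i);
}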
For testing, I ran fio with the posixaio, mmap, sync, psync, vsync, pvsync,
and splice engines, against both ext4 and xfs. Only the splice engine
showed any change in performance. For example, for xfs:
Without this patch set:
Run status group 2 (all jobs):
WRITE: io=37602MB, aggrb=641724KB/s, minb=641724KB/s, maxb=641724KB/s, mint=60001msec, maxt=60001msec
Run status group 3 (all jobs):
WRITE: io=36244MB, aggrb=618553KB/s, minb=618553KB/s, maxb=618553KB/s, mint=60001msec, maxt=60001msec
With this patch set:
Run status group 2 (all jobs):
WRITE: io=128055MB, aggrb=2134.3MB/s, minb=2134.3MB/s, maxb=2134.3MB/s, mint=60001msec, maxt=60001msec
Run status group 3 (all jobs):
WRITE: io=122586MB, aggrb=2043.8MB/s, minb=2043.8MB/s, maxb=2043.8MB/s, mint=60001msec, maxt=60001msec
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Brian Boylston <brian.boylston@hpe.com>
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
Reported-by: Oliver Moreno <oliver.moreno@hpe.com>
Changes in v2:
- Split into multiple patches (Toshi Kani)
- Introduce memcpy_nocache() (Al Viro)
- Use nocache for kvecs as well
Brian Boylston (3):
introduce memcpy_nocache()
use a nocache copy for bvecs and kvecs in copy_from_iter_nocache()
x86: remove unneeded flush in arch_copy_from_iter_pmem()
arch/x86/include/asm/pmem.h | 19 +------------------
arch/x86/include/asm/string_32.h | 3 +++
arch/x86/include/asm/string_64.h | 3 +++
arch/x86/lib/misc.c | 12 ++++++++++++
include/linux/string.h | 15 +++++++++++++++
lib/iov_iter.c | 14 +++++++++++---
6 files changed, 45 insertions(+), 21 deletions(-)
[ adding linux-fsdevel and linux-nvdimm ]
On Wed, Sep 7, 2016 at 8:36 PM, Xiao Guangrong wrote:
> However, it is not easy to handle the case where the new VMA overlaps with
> the old VMA already obtained by userspace. I think we have some choices:
> 1: One way is to completely skip the new VMA region, as the current kernel
>    code does, but I do not think this is good as the later VMAs will be
>    dropped.
> 2: Show the non-overlapping portion of the new VMA. In your case, we would
>    just show (0x2000 -> 0x3000); however, this does not work well if the
>    VMA is a new region with different attributes.
> 3: Completely show the new VMA, as this patch does.
> Which one do you prefer?
I don't have a preference, but perhaps this breakage and uncertainty
is a good opportunity to propose a more reliable interface for NVML to
get the information it needs?
My understanding is that it is looking for the VM_MIXEDMAP flag which
is already ambiguous for determining if DAX is enabled even if this
dynamic listing issue is fixed. XFS has arranged for DAX to be a
per-inode capability and has an XFS-specific inode flag. We can make
that a common inode flag, but it seems we should have a way to
interrogate the mapping itself in the case where the inode is unknown
or unavailable. I'm thinking extensions to mincore to have flags for
DAX and possibly whether the page is part of a pte, pmd, or pud
mapping. Just floating that idea before starting to look into the
implementation, comments or other ideas welcome...
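As a strawman, userspace could then probe a mapping it already holds without
knowing the backing inode, along the lines of the sketch below. The
MINCORE_DAX and MINCORE_PMD bit names are made up for illustration only;
today mincore() reports just residency in bit 0 of each vector byte.

#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

/* hypothetical flag bits for the proposed mincore() extension */
#define MINCORE_DAX     0x02    /* page backed by a DAX mapping */
#define MINCORE_PMD     0x04    /* page is part of a pmd-sized mapping */

/* returns >0 if the first page of [addr, addr+len) is DAX-backed,
 * 0 if not, -1 on error */
static int region_is_dax(void *addr, size_t len)
{
        long page = sysconf(_SC_PAGESIZE);
        size_t pages = (len + page - 1) / page;
        unsigned char *vec = malloc(pages);
        int ret = -1;

        if (!vec)
                return -1;
        if (mincore(addr, len, vec) == 0)
                ret = vec[0] & MINCORE_DAX;     /* proposed, not yet implemented */
        free(vec);
        return ret;
}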
This patchset follows from an RFC we posted earlier this year. It
applies cleanly to v4.9-rc1.
Updates since RFC
Included the iopmem driver in the submission.
There have been several attempts to upstream patchsets that enable
DMAs between PCIe peers. These include Peer-Direct and DMA-Buf
style patches. None have been successful to date. Haggai Eran
gives a nice overview of the prior art in this space in his cover
letter.
Motivation and Use Cases
PCIe IO devices are getting faster. It is not uncommon now to find PCIe
network and storage devices that can generate and consume several GB/s.
Almost always these devices have either a high performance DMA engine, a
number of exposed PCIe BARs or both.
Until this patch, any high-performance transfer of information between
two PCIe devices has required the use of a staging buffer in system
memory. With this patch the bandwidth to system memory is not compromised
when high-throughput transfers occur between PCIe devices. This means
that more system memory bandwidth is available to the CPU cores for data
processing and manipulation. In addition, in systems where the two PCIe
devices reside behind a PCIe switch, the datapath avoids the CPU
altogether.
We provide a PCIe device driver in an accompanying patch that can be
used to map any PCIe BAR into a DAX capable block device. For
non-persistent BARs this simply serves as an alternative to using
system memory bounce buffers. For persistent BARs this can serve as an
additional storage device in the system.
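As an illustration of the intended use, a peer-to-peer transfer can be
driven entirely through existing userspace interfaces by mmap()ing a
(preallocated) file on a DAX filesystem created on the iopmem device and
handing that mapping to another device as an O_DIRECT buffer. The device
node, mount point, and file names below are examples only, not something
the patches define.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

#define LEN (16UL << 20)        /* 16 MiB, a multiple of the block size */

int main(void)
{
        /* example paths: a DAX-mounted fs on the iopmem block device,
         * and an NVMe namespace acting as the transfer peer */
        int bar_fd  = open("/mnt/iopmem/buffer", O_RDWR);
        int nvme_fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
        void *buf;

        if (bar_fd < 0 || nvme_fd < 0)
                return 1;

        /* with DAX, this mapping resolves to the PCIe BAR itself,
         * not to a page-cache copy in system memory */
        buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, bar_fd, 0);
        if (buf == MAP_FAILED)
                return 1;

        /* the NVMe controller DMAs directly into the peer's BAR;
         * no bounce buffer in system DRAM is involved */
        if (read(nvme_fd, buf, LEN) < 0)
                perror("read");

        munmap(buf, LEN);
        close(nvme_fd);
        close(bar_fd);
        return 0;
}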
Testing and Performance
We have done a moderate amount of testing of this patch in a QEMU
environment and on real hardware. On real hardware we have observed
peer-to-peer writes of up to 4 GB/s and reads of up to 1.2 GB/s. In
both cases these numbers are limitations of our consumer hardware. In
addition, we have observed that the CPU DRAM bandwidth is not impacted
when using iopmem, which is not the case when a traditional path
through system memory is taken.
For more information on the testing and performance results see the
GitHub site.
Known Issues
1. Address Translation. Suggestions have been made that in certain
architectures and topologies the dma_addr_t passed to the DMA master
in a peer-to-peer transfer will not correctly route to the IO memory
intended. However, in our testing to date we have not found this to be
an issue, even in systems with IOMMUs and PCIe switches. It is our
understanding that an IOMMU only maps system memory and would not
interfere with device memory regions. (It certainly has no opportunity
to do so if the transfer gets routed through a switch).
2. Memory Segment Spacing. This patch has the same limitation as
ZONE_DEVICE in that memory regions must be spaced at least
SECTION_SIZE bytes apart. On x86 this is 128MB, and there are cases where
BARs can be placed closer together than this. Thus ZONE_DEVICE would not
be usable on neighboring BARs. For our purposes, this is not an issue as
we'd only be looking at enabling a single BAR in a given PCIe device.
More exotic use cases may have problems with this.
3. Coherency Issues. When IOMEM is written from both the CPU and a PCIe
peer there is potential for coherency issues and for writes to occur out
of order. This is something that users of this feature need to be
cognizant of. Though really, this isn't much different from the
existing situation with things like RDMA: if userspace sets up an MR
for remote use, they need to be careful about how they use that memory
region themselves.
4. Architecture. Currently this patch is applicable only to x86_64
architectures. The same is true for much of the code pertaining to
PMEM and ZONE_DEVICE. It is hoped that the work will be extended to other
architectures over time.
Logan Gunthorpe (1):
memremap.c : Add support for ZONE_DEVICE IO memory with struct pages.
Stephen Bates (2):
iopmem : Add a block device driver for PCIe attached IO memory.
iopmem : Add documentation for iopmem driver
Documentation/blockdev/00-INDEX | 2 +
Documentation/blockdev/iopmem.txt | 62 +++++++
MAINTAINERS | 7 +
drivers/block/Kconfig | 27 ++++
drivers/block/Makefile | 1 +
drivers/block/iopmem.c | 333 ++++++++++++++++++++++++++++++++++++++
drivers/dax/pmem.c | 4 +-
drivers/nvdimm/pmem.c | 4 +-
include/linux/memremap.h | 5 +-
kernel/memremap.c | 80 ++++++++-
tools/testing/nvdimm/test/iomap.c | 3 +-
11 files changed, 518 insertions(+), 10 deletions(-)
create mode 100644 Documentation/blockdev/iopmem.txt
create mode 100644 drivers/block/iopmem.c