This patch follows from an RFC we did earlier this year. This
patchset applies cleanly to v4.9-rc1.
Updates since RFC
Included the iopmem driver in the submission.
There have been several attempts to upstream patchsets that enable
DMAs between PCIe peers. These include Peer-Direct and DMA-Buf
style patches. None have been successful to date. Haggai Eran
gives a nice overview of the prior art in this space in his cover
Motivation and Use Cases
PCIe IO devices are getting faster. It is not uncommon now to find PCIe
network and storage devices that can generate and consume several GB/s.
Almost always these devices have either a high performance DMA engine, a
number of exposed PCIe BARs or both.
Until this patch, any high-performance transfer of information between
two PCIe devices has required the use of a staging buffer in system
memory. With this patch the bandwidth to system memory is not compromised
when high-throughput transfers occur between PCIe devices. This means
that more system memory bandwidth is available to the CPU cores for data
processing and manipulation. In addition, in systems where the two PCIe
devices reside behind a PCIe switch the datapath avoids the CPU
We provide a PCIe device driver in an accompanying patch that can be
used to map any PCIe BAR into a DAX capable block device. For
non-persistent BARs this simply serves as an alternative to using
system memory bounce buffers. For persistent BARs this can serve as an
additional storage device in the system.
Testing and Performance
We have done a moderate amount of testing of this patch in a QEMU
environment and on real hardware. On real hardware we have observed
peer-to-peer writes of up to 4 GB/s and reads of up to 1.2 GB/s. In
both cases these numbers are limitations of our consumer hardware. In
addition, we have observed that the CPU DRAM bandwidth is not impacted
when using IOPMEM, which is not the case when a traditional path
through system memory is taken.
For more information on the testing and performance results see the
GitHub site.
1. Address Translation. Suggestions have been made that in certain
architectures and topologies the dma_addr_t passed to the DMA master
in a peer-to-peer transfer will not correctly route to the IO memory
intended. However, in our testing to date we have not seen this to be
an issue, even in systems with IOMMUs and PCIe switches. It is our
understanding that an IOMMU only maps system memory and would not
interfere with device memory regions. (It certainly has no opportunity
to do so if the transfer gets routed through a switch).
2. Memory Segment Spacing. This patch has the same limitations that
ZONE_DEVICE does in that memory regions must be spaced at least
SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
BARs can be placed closer together than this. Thus ZONE_DEVICE would not
be usable on neighboring BARs. For our purposes, this is not an issue as
we'd only be looking at enabling a single BAR in a given PCIe device.
More exotic use cases may have problems with this.
3. Coherency Issues. When IOMEM is written from both the CPU and a PCIe
peer there is potential for coherency issues and for writes to occur out
of order. This is something that users of this feature need to be
cognizant of. Though really, this isn't much different than the
existing situation with things like RDMA: if userspace sets up an MR
for remote use, they need to be careful about using that memory region
4. Architecture. Currently this patch is applicable only to x86_64
architectures. The same is true for much of the code pertaining to
PMEM and ZONE_DEVICE. It is hoped that the work will be extended to other
architectures over time.
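
The Memory Segment Spacing limitation in item 2 can be illustrated with a
small userspace sketch. It is a simplified model, not kernel code: it
assumes the 128MB x86 section size quoted above, and struct bar_range,
section_of() and bars_share_section() are hypothetical names made up for
this example. Two BARs that touch a common section are the case that the
spacing rule excludes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 128 MiB sections, the x86 figure quoted in item 2. */
#define SECTION_SIZE_BYTES (UINT64_C(128) << 20)

/* Hypothetical description of one BAR: physical start and length in bytes. */
struct bar_range {
	uint64_t start;
	uint64_t len;
};

/* Index of the section that contains a physical address. */
static uint64_t section_of(uint64_t addr)
{
	return addr / SECTION_SIZE_BYTES;
}

/*
 * Returns true when the two BARs touch at least one common section,
 * i.e. when they are too close together for both to be used.
 */
static bool bars_share_section(struct bar_range a, struct bar_range b)
{
	assert(a.len > 0 && b.len > 0);

	uint64_t a_first = section_of(a.start);
	uint64_t a_last = section_of(a.start + a.len - 1);
	uint64_t b_first = section_of(b.start);
	uint64_t b_last = section_of(b.start + b.len - 1);

	/* Standard interval overlap test on the section indices. */
	return a_first <= b_last && b_first <= a_last;
}
```

With these definitions, two 16MB BARs at 0xf0000000 and 0xf1000000 fall in
the same section, while a BAR at 0xf8000000 lands in the next one.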

Logan Gunthorpe (1):
  memremap.c : Add support for ZONE_DEVICE IO memory with struct pages.

Stephen Bates (2):
  iopmem : Add a block device driver for PCIe attached IO memory.
  iopmem : Add documentation for iopmem driver

 Documentation/blockdev/00-INDEX   |   2 +
 Documentation/blockdev/iopmem.txt |  62 +++++++
 MAINTAINERS                       |   7 +
 drivers/block/Kconfig             |  27 ++++
 drivers/block/Makefile            |   1 +
 drivers/block/iopmem.c            | 333 ++++++++++++++++++++++++++++++++++++++
 drivers/dax/pmem.c                |   4 +-
 drivers/nvdimm/pmem.c             |   4 +-
 include/linux/memremap.h          |   5 +-
 kernel/memremap.c                 |  80 ++++++++-
 tools/testing/nvdimm/test/iomap.c |   3 +-
 11 files changed, 518 insertions(+), 10 deletions(-)
 create mode 100644 Documentation/blockdev/iopmem.txt
 create mode 100644 drivers/block/iopmem.c

This is the fourth revision of my patches to clear dirty bits from the radix tree
of DAX inodes when caches for corresponding pfns have been flushed. This patch
set is significantly larger than the previous version because I'm changing how
->fault, ->page_mkwrite, and ->pfn_mkwrite handlers may choose to handle the
fault so that we don't have to leak details about DAX locking into the generic
code. In principle, these patches enable handlers to easily update PTEs and do
other work necessary to finish the fault without duplicating the functionality
present in the generic code. I'd really like feedback from mm folks on whether
such changes to fault handling code are fine or what they'd do differently.
The patches are based on 4.9-rc1 + Ross' DAX PMD page fault series + ext4
conversion of DAX IO patch to the iomap infrastructure. For testing,
I've pushed out a tree including all these patches and further DAX fixes
The patches pass testing with xfstests on ext4 and xfs on my end. I'd be
grateful for review so that we can push these patches for the next merge
 Posted an hour ago - look for "ext4: Convert ext4 DAX IO to iomap framework"
Changes since v3:
* rebased on top of 4.9-rc1 + DAX PMD fault series + ext4 iomap conversion
* reordered some of the patches
* killed ->virtual_address field in vm_fault structure as requested by
Changes since v2:
* rebased on top of 4.8-rc8 - this involved dealing with new fault_env
* changed calling convention for fault helpers
Changes since v1:
* make sure all PTE updates happen under radix tree entry lock to protect
against races between faults & write-protecting code
* remove information about DAX locking from mm/memory.c
* smaller updates based on Ross' feedback
Background information regarding the motivation:
Currently we never clear dirty bits in the radix tree of a DAX inode. Thus
fsync(2) flushes all the dirty pfns again and again. These patches implement
clearing of the dirty tag in the radix tree so that we issue flush only when
The difficulty with clearing the dirty tag is that we have to protect against
a concurrent page fault setting the dirty tag and writing new data into the
page. So we need a lock serializing page fault and clearing of the dirty tag
and write-protecting PTEs (so that we get another pagefault when pfn is written
to again and we have to set the dirty tag again).
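
The serialization described above can be modelled in userspace as follows.
This is only a toy sketch under stated assumptions: struct toy_entry,
toy_write_fault() and toy_writeback() are hypothetical names, a pthread
mutex stands in for the radix tree entry lock, and two booleans stand in
for the dirty tag and the writable PTE. It is not how fs/dax.c is written.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one radix tree entry; not the kernel's data structure. */
struct toy_entry {
	pthread_mutex_t lock;	/* models the radix tree entry lock */
	bool dirty;		/* models the dirty tag */
	bool writable;		/* models a writable PTE */
};

/* Write fault: set the dirty tag and make the mapping writable, under the lock. */
static void toy_write_fault(struct toy_entry *e)
{
	assert(e != NULL);
	pthread_mutex_lock(&e->lock);
	e->dirty = true;
	e->writable = true;
	pthread_mutex_unlock(&e->lock);
}

/* fsync path: returns true if this call had to issue a flush. */
static bool toy_writeback(struct toy_entry *e)
{
	bool flushed = false;

	assert(e != NULL);
	pthread_mutex_lock(&e->lock);
	if (e->dirty) {
		/* Write-protect first so the next store faults and re-tags. */
		e->writable = false;
		/* The cache flush for the pfn would be issued here. */
		e->dirty = false;
		flushed = true;
	}
	pthread_mutex_unlock(&e->lock);
	return flushed;
}
```

Because both paths take the same lock, a fault cannot slip in between the
write-protect and the clearing of the tag, and a second writeback with no
intervening fault has nothing to flush.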
The effect of the patch set is easily visible:
Writing 1 GB of data via mmap, then fsync twice.
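
A test along these lines could look roughly like the sketch below. It is
not the program used for the numbers that follow; the helper names
timed_fsync_ms() and mmap_fsync_twice() are made up for this example, and
the caller chooses the file path and size.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Milliseconds taken by one fsync() call, or -1.0 if fsync() fails. */
static double timed_fsync_ms(int fd)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (fsync(fd) != 0)
		return -1.0;
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

/*
 * Dirty len bytes of the file at path through a shared mapping, then
 * fsync twice. Stores both timings in ms[] and returns 0 on success,
 * or returns -1 on any failure.
 */
static int mmap_fsync_twice(const char *path, size_t len, double ms[2])
{
	assert(len > 0);

	int fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) != 0) {
		close(fd);
		return -1;
	}

	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return -1;
	}

	memset(p, 0xab, len);		/* dirty every page of the mapping */
	ms[0] = timed_fsync_ms(fd);	/* first fsync: has data to flush */
	ms[1] = timed_fsync_ms(fd);	/* second fsync: nothing new written */

	munmap(p, len);
	close(fd);
	return (ms[0] < 0.0 || ms[1] < 0.0) ? -1 : 0;
}
```

To approximate the test described here, call mmap_fsync_twice() with a
length of 1 GB on a file that lives on a DAX-mounted filesystem and compare
ms[0] with ms[1].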
Before this patch set both fsyncs take ~205 ms on my test machine, after the
patch set the first fsync takes ~283 ms (the additional cost of walking PTEs,
clearing dirty bits etc. is very noticeable), the second fsync takes below
As a bonus, these patches make filesystem freezing for DAX filesystems
reliable because mappings are now properly write-protected while freezing the
This patch set converts ext4 DAX IO paths to the new iomap framework and
removes the old bh-based DAX functions. As a result ext4 gains PMD page
fault support, also some other minor bugs get fixed. The patch set is based
on Ross' DAX PMD page fault support series. It passes xfstests both in
DAX and non-DAX mode.
The question is how we should merge this. If Dave is pulling PMD patches through
XFS tree, then these patches could go there as well (chances for conflicts
with other ext4 stuff are relatively low) or Dave could just export a stable
branch with PMD series which Ted would just pull...
DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking. This series allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled.
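
One way to picture what participating in the locking scheme requires: every
page offset covered by a PMD has to resolve to the same lock. The sketch
below is an assumption drawn from the patch title "dax: coordinate locking
for offsets in PMD range", not a description of the actual code. It assumes
x86_64 sizes (4 KiB pages, 2 MiB PMDs), and all TOY_/toy_ names are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: x86_64 geometry, 4 KiB pages and 2 MiB PMD mappings. */
#define TOY_PAGE_SHIFT		12
#define TOY_PMD_SHIFT		21
#define TOY_PAGES_PER_PMD	(UINT64_C(1) << (TOY_PMD_SHIFT - TOY_PAGE_SHIFT))

/* Round a page index down to the first index covered by its PMD. */
static uint64_t toy_pmd_base_index(uint64_t index)
{
	return index & ~(TOY_PAGES_PER_PMD - 1);
}

/* Two page indices contend on one lock if they fall in the same PMD range. */
static bool toy_same_lock_entry(uint64_t a, uint64_t b)
{
	return toy_pmd_base_index(a) == toy_pmd_base_index(b);
}
```

Under these assumptions indices 0 through 511 share one entry and index 512
starts the next, so a fault at any offset inside a PMD waits on the same
lock as a fault at any other offset in that PMD.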
Previously we had talked about this series going through the XFS tree, but
Jan has a patch set that will need to build on this series and it heavily
modifies the MM code. I think he would prefer that series to go through
Andrew Morton's -MM tree, so it probably makes sense for this series to go
through that same tree.
For reference, here is the series from Jan that I was talking about:
Andrew, can you please pick this up for the v4.10 merge window?
This series is currently based on v4.9-rc3. I tried to rebase onto a -mm
branch or tag, but couldn't find one that contained the DAX iomap changes
that were merged as part of the v4.9 merge window. I'm happy to rebase &
test on a v4.9-rc* based -MM branch or tag whenever they are available.
Changes since v8:
- Rebased onto v4.9-rc3.
- Updated the DAX PMD fault path so that on fallback we always check to see
if we are dealing with a transparent huge page, and if we are we will
split it. This was already happening for one of the fallback cases via a
patch from Toshi, and Jan hit a deadlock in another fallback case where
the same splitting was needed. (Jan & Toshi)
This series has passed all my xfstests testing, including the test that was
hitting the deadlock with v8.
Here is a tree containing my changes:

Ross Zwisler (16):
  ext4: tell DAX the size of allocation holes
  dax: remove buffer_size_valid()
  ext2: remove support for DAX PMD faults
  dax: make 'wait_table' global variable static
  dax: remove the last BUG_ON() from fs/dax.c
  dax: consistent variable naming for DAX entries
  dax: coordinate locking for offsets in PMD range
  dax: remove dax_pmd_fault()
  dax: correct dax iomap code namespace
  dax: add dax_iomap_sector() helper function
  dax: dax_iomap_fault() needs to call iomap_end()
  dax: move RADIX_DAX_* defines to dax.h
  dax: move put_(un)locked_mapping_entry() in dax.c
  dax: add struct iomap based DAX PMD support
  xfs: use struct iomap based DAX PMD fault path
  dax: remove "depends on BROKEN" from FS_DAX_PMD

 fs/Kconfig          |   1 -
 fs/dax.c            | 826 +++++++++++++++++++++++++++++-----------------------
 fs/ext2/file.c      |  35 +--
 fs/ext4/inode.c     |   3 +
 fs/xfs/xfs_aops.c   |  26 +-
 fs/xfs/xfs_aops.h   |   3 -
 fs/xfs/xfs_file.c   |  10 +-
 include/linux/dax.h |  58 +++-
 mm/filemap.c        |   5 +-
 9 files changed, 537 insertions(+), 430 deletions(-)