This patch series implements "virtio pmem".
"virtio pmem" is fake persistent memory (nvdimm) in the guest
which allows the guest to bypass the page cache. It also
implements a VIRTIO-based asynchronous flush mechanism.
This posting shares the guest kernel driver with the
changes suggested in v2, tested with the Qemu-side device
emulation for virtio-pmem.
Details of the project idea for the 'virtio pmem' flushing
interface were shared  & .
The implementation is divided into two parts:
a new virtio pmem guest driver, and Qemu code changes for the new
virtio pmem paravirtualized device.
1. Guest virtio-pmem kernel driver
- Reads the persistent memory range from the paravirt device and
registers it with 'nvdimm_bus'.
- The 'nvdimm/pmem' driver uses this information to allocate a
persistent memory region and set up filesystem operations
on the allocated memory.
- The virtio pmem driver implements an asynchronous flushing
interface to flush from guest to host (sketched below).
2. Qemu virtio-pmem device
- Creates a virtio pmem device and exposes a memory range to
the KVM guest.
- On the host side this is file-backed memory which acts as
persistent memory.
- Qemu-side flush uses the aio thread pool APIs and virtio
for asynchronous handling of multiple guest requests.
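As a rough illustration of that guest-side flush path, here is a
minimal sketch. The device context, request layout, and names
(virtio_pmem, req_vq, host_acked, the flush type encoding) are
assumptions for illustration, not the exact code in this series:

#include <linux/completion.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>
#include <linux/virtio.h>

/* Assumed device context and request layout (illustrative only): */
struct virtio_pmem {
	struct virtio_device *vdev;
	struct virtqueue *req_vq;
};

struct virtio_pmem_request {
	__virtio32 type;                /* 0 == flush (assumed encoding) */
	__virtio32 ret;                 /* filled in by the host: 0 or error */
	struct completion host_acked;   /* completed by the vq callback
					 * (not shown) when the host acks */
};

static int virtio_pmem_flush(struct virtio_pmem *vpmem)
{
	struct virtio_pmem_request *req;
	struct scatterlist sg, ret_sg, *sgs[2];
	int err;

	might_sleep();
	req = kmalloc(sizeof(*req), GFP_KERNEL);
	if (!req)
		return -ENOMEM;
	init_completion(&req->host_acked);
	req->type = 0;

	sg_init_one(&sg, &req->type, sizeof(req->type));
	sgs[0] = &sg;                   /* device-readable: the command */
	sg_init_one(&ret_sg, &req->ret, sizeof(req->ret));
	sgs[1] = &ret_sg;               /* device-writable: the result */

	/* Queue the request and notify the host; locking and
	 * virtqueue-full handling are elided for brevity. */
	err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
	if (!err) {
		virtqueue_kick(vpmem->req_vq);
		wait_for_completion(&req->host_acked);
		err = req->ret ? -EIO : 0;
	}
	kfree(req);
	return err;
}

Per the description above, the host side services the request by
flushing the backing file through the aio thread pool before acking,
which is what makes the flush asynchronous from the guest's view.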
David Hildenbrand (CCed) also posted a modified version of the
Qemu virtio-pmem code based on the updated Qemu memory device API.
Virtio-pmem error handling:
Checked the behaviour of virtio-pmem for the below types of errors.
Need suggestions on the expected behaviour for handling these errors.
- Hardware Errors: Uncorrectable recoverable errors:
a] virtio-pmem:
- As per the current logic, if the error page belongs to the Qemu
process, the host MCE handler isolates (hwpoisons) that page and
sends SIGBUS. The Qemu SIGBUS handler injects an exception into the
KVM guest.
- The KVM guest then isolates the page and sends SIGBUS to the guest
userspace process which has mapped the page.
b] Existing implementation in the ACPI pmem driver:
- Handles such errors with an MCE notifier and creates a list
of bad blocks. Read/direct-access DAX operations return EIO
if the accessed memory page falls in the bad block list.
- It also starts background scrubbing.
- Similar functionality can be reused in virtio-pmem with an MCE
notifier but without scrubbing (no ACPI/ARS)? A sketch of such a
bad-block check follows below. Need inputs to confirm whether this
behaviour is OK or needs any change.
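For reference, a hedged sketch of that bad-block check; vpmem_bb is
an illustrative badblocks instance (populated from an MCE notifier),
not something defined in this series:

#include <linux/badblocks.h>
#include <linux/blkdev.h>

/*
 * Consult a badblocks list before servicing an access, mirroring
 * what the ACPI pmem path does in its rw_bytes handler today.
 */
static int vpmem_check_range(struct badblocks *vpmem_bb,
			     sector_t sector, unsigned int nr_sectors)
{
	sector_t first_bad;
	int num_bad;

	/* badblocks_check() returns non-zero when the requested range
	 * overlaps a recorded bad block. */
	if (badblocks_check(vpmem_bb, sector, nr_sectors,
			    &first_bad, &num_bad))
		return -EIO;    /* fail instead of touching poison */
	return 0;
}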
Changes from PATCH v2: 
- Disable MAP_SYNC for ext4 & XFS filesystems - [Dan]
- Use name 'virtio pmem' in place of 'fake dax'
Changes from PATCH v1: 
- 0-day build test for build dependency on libnvdimm
Changes suggested by - [Dan Williams]
- Split the driver into two parts virtio & pmem
- Move queuing of async block request to block layer
- Add "sync" parameter in nvdimm_flush function
- Use indirect call for nvdimm_flush
- Don't move declarations to a common global header, e.g. nd.h
- nvdimm_flush() return 0 or -EIO if it fails
- Teach nsio_rw_bytes() that the flush can fail
- Rename nvdimm_flush() to generic_nvdimm_flush()
- Use 'nd_region->provider_data' for long dereferencing
- Remove virtio_pmem_freeze/restore functions
- Replace BSD license text with SPDX license text
- Add might_sleep() in virtio_pmem_flush - [Luiz]
- Make spin_lock_irqsave() narrow
Changes from RFC v3
- Rebase to latest upstream - Luiz
- Call ndregion->flush in place of nvdimm_flush - Luiz
- kmalloc return check - Luiz
- virtqueue full handling - Stefan
- Don't map entire virtio_pmem_req to device - Stefan
- Fix request leak, correct sizeof(req) - Stefan
- Move declaration to virtio_pmem.c
Changes from RFC v2:
- Add flush function in the nd_region in place of switching
on a flag - Dan & Stefan
- Add flush completion function with proper locking and wait
for host side flush completion - Stefan & Dan
- Keep userspace API in uapi header file - Stefan, MST
- Use LE fields & New device id - MST
- Indentation & spacing suggestions - MST & Eric
- Remove extra header files & add licensing - Stefan
Changes from RFC v1:
- Reuse existing 'pmem' code for registering persistent
memory and other operations instead of creating an entirely
new block driver.
- Use VIRTIO driver to register memory information with
nvdimm_bus and create region_type accordingly.
- Call VIRTIO flush from existing pmem driver.
Pankaj Gupta (5):
libnvdimm: nd_region flush callback support
virtio-pmem: Add virtio-pmem guest driver
libnvdimm: add nd_region buffered dax_dev flag
ext4: disable map_sync for virtio pmem
xfs: disable map_sync for virtio pmem
drivers/acpi/nfit/core.c | 4 -
drivers/dax/super.c | 17 +++++
drivers/nvdimm/claim.c | 6 +
drivers/nvdimm/nd.h | 1
drivers/nvdimm/pmem.c | 15 +++-
drivers/nvdimm/region_devs.c | 45 +++++++++++++-
drivers/nvdimm/virtio_pmem.c | 84 ++++++++++++++++++++++++++
drivers/virtio/Kconfig | 10 +++
drivers/virtio/Makefile | 1
drivers/virtio/pmem.c | 125 +++++++++++++++++++++++++++++++++++++++
fs/ext4/file.c | 11 +++
fs/xfs/xfs_file.c | 8 ++
include/linux/dax.h | 9 ++
include/linux/libnvdimm.h | 11 +++
include/linux/virtio_pmem.h | 60 ++++++++++++++++++
include/uapi/linux/virtio_ids.h | 1
include/uapi/linux/virtio_pmem.h | 10 +++
17 files changed, 406 insertions(+), 12 deletions(-)
From: Huaisheng Ye <yehs1(a)lenovo.com>
This patch set can be used for dm-writecache when using persistent
memory as the cache data device.
Patches 1 and 2 remove an unused parameter and code which
doesn't actually work.
Patches 3 and 4 address the ctr function failing due to an invalid
magic or version, caused by messy data stored in the pmem
super block (see the sketch below).
Patch 5 outputs the status of seq_count.
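For context, a hedged sketch of the kind of ctr-time validation
involved; the struct and macro names mirror dm-writecache's on-disk
superblock but are reproduced here for illustration only:

/*
 * Sketch of the superblock check done at ctr time: a stale or messy
 * superblock fails the magic/version comparison and the ctr errors
 * out, which is the failure mode patches 3 and 4 address.
 */
static int check_superblock(struct dm_target *ti,
			    struct wc_memory_superblock *sb)
{
	if (le32_to_cpu(sb->magic) != MEMORY_SUPERBLOCK_MAGIC) {
		ti->error = "Invalid magic in the superblock";
		return -EINVAL;
	}
	if (le32_to_cpu(sb->version) != MEMORY_SUPERBLOCK_VERSION) {
		ti->error = "Invalid version in the superblock";
		return -EINVAL;
	}
	return 0;
}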
Changes since v2:
- seq_count is important for flush operations, so output it within
status for debugging and analyzing code behavior.
Huaisheng Ye (5):
dm-writecache: remove unused size to writecache_flush_region
dm-writecache: get rid of memory_data flush to writecache_flush_entry
dm-writecache: expand pmem_reinit for struct dm_writecache
Documentation/device-mapper: add optional parameter reinit
dm-writecache: output seq_count within status
Documentation/device-mapper/writecache.txt | 4 ++++
drivers/md/dm-writecache.c | 23 +++++++++++++----------
2 files changed, 17 insertions(+), 10 deletions(-)
On Wed, Feb 6, 2019 at 5:57 PM Doug Ledford <dledford(a)redhat.com> wrote:
> > > > Dave, you said the FS is responsible to arbitrate access to the
> > > > physical pages..
> > > >
> > > > Is it possible to have a filesystem for DAX that is more suited to
> > > > this environment? Ie designed to not require block reallocation (no
> > > > COW, no reflinks, different approach to ftruncate, etc)
> > >
> > > Can someone give me a real world scenario that someone is *actually*
> > > asking for with this?
> > I'll point to this example. At the 6:35 mark Kodi talks about the
> > Oracle use case for DAX + RDMA.
> > https://youtu.be/ywKPPIE8JfQ?t=395
> Thanks for the link, I'll review the panel.
> > Currently the only way to get this to work is to use ODP capable
> > hardware, or Device-DAX. Device-DAX is a facility to map persistent
> > memory statically through device-file. It's great for statically
> > allocated use cases, but loses all the nice things (provisioning,
> > permissions, naming) that a filesystem gives you. This debate is what
> > to do about non-ODP capable hardware and Filesystem-DAX facility. The
> > current answer is "no RDMA for you".
> > > Are DAX users demanding xfs, or is it just the
> > > filesystem of convenience?
> > xfs is the only Linux filesystem that supports DAX and reflink.
> Is it going to be clear from the link above why reflink + DAX + RDMA is
> a good/desirable thing?
No, unfortunately it will only clarify the DAX + RDMA use case, but
you don't need to look very far to see that the trend for storage
management is more COW / reflink / thin-provisioning etc in more
places. Users want the flexibility to be able to delay, change, and
consolidate physical storage allocation decisions, otherwise
device-dax would have solved all these problems and we would not be
having this conversation.
> > > Do they need to stick with xfs?
> > Can you clarify the motivation for that question?
> I did a little googling and research before I asked that question.
> According to the documentation, other FSes can work with DAX too (namely
> ext2 and ext4). The question was more or less pondering whether or not
> ext2 or ext4 + RDMA + DAX would solve people's problems without the
> issues that xfs brings.
No, ext4 also supports hole punch, and the ext2 support is a toy. We
went through quite a bit of work to solve this problem for the
O_DIRECT pinned page case.
6b2bb7265f0b sched/wait: Introduce wait_var_event()
d6dc57e251a4 xfs, dax: introduce xfs_break_dax_layouts()
69eb5fa10eb2 xfs: prepare xfs_break_layouts() for another layout type
c63a8eae63d3 xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
5fac7408d828 mm, fs, dax: handle layout changes to pinned dax mappings
b1f382178d15 ext4: close race between direct IO and ext4_break_layouts()
430657b6be89 ext4: handle layout changes to pinned DAX mappings
cdbf8897cb09 dax: dax_layout_busy_page() warn on !exceptional
So the fs is prepared to notify RDMA applications of the need to
evacuate a mapping (layout change), and the timeout to respond to that
notification can be configured by the administrator. The debate is
about what to do when the platform owner needs to get a mapping out of
the way in bounded time.
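As a rough sketch of that notification path, loosely modeled on the
xfs_break_dax_layouts() commit listed above (the helper name and the
simplifications are mine; the real code also retries and drops locks
while waiting):

#include <linux/dax.h>
#include <linux/page_ref.h>
#include <linux/wait_bit.h>

static int break_dax_layouts_sketch(struct inode *inode)
{
	/* Any DAX page still pinned, e.g. by an RDMA registration? */
	struct page *page = dax_layout_busy_page(inode->i_mapping);

	if (!page)
		return 0;

	/* Wait (killable) for the pin to go away before letting the
	 * layout-changing operation (truncate/hole-punch) proceed. */
	return wait_var_event_killable(&page->_refcount,
				       page_ref_count(page) == 1);
}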
> > This problem exists
> > for any filesystem that implements an mmap where the physical
> > page backing the mapping is identical to the physical storage location
> > for the file data. I don't see it as an xfs specific problem. Rather,
> > xfs is taking the lead in this space because it has already deployed
> > and demonstrated that leases work for the pnfs4 block-server case, so
> > it seems logical to attempt to extend that case for non-ODP-RDMA.
> > > Are they
> > > really trying to do COW backed mappings for the RDMA targets? Or do
> > > they want a COW backed FS but are perfectly happy if the specific RDMA
> > > targets are *not* COW and are statically allocated?
> > I would expect the COW to be broken at registration time. Only ODP
> > could possibly support reflink + RDMA. So I think this devolves the
> > problem back to just the "what to do about truncate/punch-hole"
> > problem in the specific case of non-ODP hardware combined with the
> > Filesystem-DAX facility.
> If that's the case, then we are back to EBUSY *could* work (despite the
> objections made so far).
I linked it in my response to Jason, but the entire reason ext2,
ext4, and xfs scream "experimental" when DAX is enabled is because DAX
makes typical flows fail that used to work in the page-cache backed
mmap case. The failure of a data space management command like
fallocate(punch_hole) is more risky than just not allowing the memory
registration to happen in the first place. Leases result in a system
that has a chance at making forward progress.
The current state of disallowing RDMA for FS-DAX is one of the "if
(dax) goto fail;" conditions that needs to be solved before filesystem
developers graduate DAX from experimental status.
On Wed, Feb 6, 2019 at 3:41 PM Jason Gunthorpe <jgg(a)ziepe.ca> wrote:
> > You're describing the current situation, i.e. Linux already implements
> > this, it's called Device-DAX and some users of RDMA find it
> > insufficient. The choices are to continue to tell them "no", or say
> > "yes, but you need to submit to lease coordination".
> Device-DAX is not what I'm imagining when I say XFS--.
> I mean more like XFS with all features that require rellocation of
> blocks disabled.
> Forbidding hole punch, reflink, cow, etc, doesn't devolve back to
> device-dax?
True, not all the way, but the distinction loses significance as you
lose fs features.
Filesystems mark DAX functionality experimental precisely because
it forbids otherwise typical operations that work in the nominal page
cache case. An approach that says "let's cement the list of things a
filesystem or a core-memory-management facility can't do because RDMA
finds it awkward" is bad precedent. It's bad precedent because it
abdicates core kernel functionality to userspace and weakens the api
contract in surprising ways.
EBUSY is a horrible status code especially if an administrator is
presented with an emergency situation that a filesystem needs to free
up storage capacity and get established memory registrations out of
the way. The motivation for the current status quo of failing memory
registration for DAX mappings is to help ensure the system does not
get into this situation where forward progress cannot be guaranteed.
Is mmapping a PMEM/DAX/fs file MAP_PRIVATE supported? Is it something
that people are likely to want to do?
If it is supported, suppose I open a file in PMEM/DAX/fs, mmap it
MAP_PRIVATE, read from the memory mapped file (with memory accesses,
not the read syscall) and take a page fault which the kernel satisfies.
At this time, do my page tables for the privately mmapped page(s) point to the
PMEM corresponding to the file and the kernel will wait until
the page(s) is/are altered (either by me or someone else) to
copy on write and give me a different page/mapping?
Or does the kernel avoid this by always mapping a copy of the
page(s) involved in the private mmap in the first place?
In either case, is my private copy going to come from PMEM or is it
an "ordinary" page, or is this "random"? Does the program have
any choice in this (i.e. suppose I want to make sure my copied
page is persistent)?
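To make the scenario concrete, a minimal userspace sketch of the
sequence being asked about; the path is illustrative, and the comments
restate the open questions rather than asserting the kernel's
behaviour:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* Illustrative path: a file on a DAX-mounted filesystem. */
	int fd = open("/mnt/pmem/file", O_RDWR);
	if (fd < 0)
		return 1;

	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	char c = p[0];   /* read fault: does the PTE point at pmem here? */
	p[0] = c + 1;    /* write fault: is the CoW copy pmem-backed or an
			  * ordinary page? */

	munmap(p, 4096);
	close(fd);
	return 0;
}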
Libnvdimm reserves the first 8K of pfn and devicedax namespaces to
store a superblock describing the namespace. This 8K reservation
is contained within the altmap area which the kernel uses for the
vmemmap backing for the pages within the namespace. The altmap
allows for some pages at the start of the altmap area to be reserved
and that mechanism is used to protect the superblock from being
re-used as vmemmap backing.
The number of PFNs to reserve is calculated using:

	PHYS_PFN(SZ_8K)

Which is implemented as:
#define PHYS_PFN(x) ((unsigned long)((x) >> PAGE_SHIFT))
So on systems where PAGE_SIZE is greater than 8K the reservation
size is truncated to zero and the superblock area is re-used as
vmemmap backing. As a result all the namespace information stored
in the superblock (i.e. if it's a PFN or DAX namespace) is lost
and the namespace needs to be re-created to get access to the
contents.
This patch fixes this by using PFN_UP() rather than PHYS_PFN() to ensure
that at least one page is reserved. On systems with a 4K page size this
patch should have no effect.
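To illustrate the truncation with concrete numbers (a worked example,
not part of the patch), take a 64K page size, i.e. PAGE_SHIFT == 16:

	PHYS_PFN(SZ_8K) == 8192 >> 16            == 0  (nothing reserved)
	PFN_UP(SZ_8K)   == (8192 + 65535) >> 16  == 1  (superblock protected)

where PFN_UP(x) is defined as (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT).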
Cc: Dan Williams <dan.j.williams(a)intel.com>
Fixes: ac515c084be9 ("libnvdimm, pmem, pfn: move pfn setup to the core")
Signed-off-by: Oliver O'Halloran <oohall(a)gmail.com>
drivers/nvdimm/pfn_devs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 6f22272e8d80..9b9be83da0e7 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -593,7 +593,7 @@ static unsigned long init_altmap_base(resource_size_t base)
 
 static unsigned long init_altmap_reserve(resource_size_t base)
 {
-	unsigned long reserve = PHYS_PFN(SZ_8K);
+	unsigned long reserve = PFN_UP(SZ_8K);
 	unsigned long base_pfn = PHYS_PFN(base);
 
 	reserve += base_pfn - PFN_SECTION_ALIGN_DOWN(base_pfn);