ACPI 6.1, Table 5-133, updates NVDIMM Control Region Structure as
- Valid Fields, Manufacturing Location, and Manufacturing Date
are added from reserved range. No change in the structure size.
- IDs (SPD values) are stored as arrays of bytes (i.e. big-endian
format). The spec clarifies that they need to be represented
as arrays of bytes as well.
Patch 1 changes the NFIT driver to comply with ACPI 6.1.
Patch 2 adds a new sysfs file "id" to show NVDIMM ID defined in ACPI 6.1.
The patch-set applies on linux-pm.git acpica.
- Need to coordinate with ACPICA update (Bob Moore, Dan Williams)
- Integrate with ACPICA changes in struct acpi_nfit_control_region.
- Remove 'mfg_location' and 'mfg_date'. (Dan Williams)
- Rename 'unique_id' to 'id' and make this change as a separate patch.
Toshi Kani (3):
1/2 acpi/nfit: Update nfit driver to comply with ACPI 6.1
2/3 acpi/nfit: Add sysfs "id" for NVDIMM ID
drivers/acpi/nfit.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
The last patch is what started the series: XFS currently uses the
direct I/O locking strategy for DAX because DAX was overloaded onto
the direct I/O path. For XFS this means that we only take a shared
inode lock instead of the normal exclusive one for writes IFF they
are properly aligned. While this is fine for O_DIRECT which requires
explicit opt-in from the application it's not fine for DAX where we'll
suddenly lose expected and required synchronization of the file system
happens to use DAX undeneath.
Patches 1-7 just untangle the code so that we can deal with DAX on
it's own easily.
this is a second revision of my patches to clear dirty bits from radix tree of
DAX inodes when caches for corresponding pfns have been flushed. This patch set
is significantly larger than the previous version because I'm changing how
->fault, ->page_mkwrite, and ->pfn_mkwrite handlers may choose to handle the
fault so that we don't have to leak details about DAX locking into the generic
code. In principle, these patches enable handlers to easily update PTEs and do
other work necessary to finish the fault without duplicating the functionality
present in the generic code. I'd be really interested in feedback from mm
folks whether such changes to fault handling code are fine or what they'd do
Changes since v1:
* make sure all PTE updates happen under radix tree entry lock to protect
against races between faults & write-protecting code
* remove information about DAX locking from mm/memory.c
* smaller updates based on Ross' feedback
Background information regarding the motivation:
Currently we never clear dirty bits in the radix tree of a DAX inode. Thus
fsync(2) flushes all the dirty pfns again and again. This patches implement
clearing of the dirty tag in the radix tree so that we issue flush only when
The difficulty with clearing the dirty tag is that we have to protect against
a concurrent page fault setting the dirty tag and writing new data into the
page. So we need a lock serializing page fault and clearing of the dirty tag
and write-protecting PTEs (so that we get another pagefault when pfn is written
to again and we have to set the dirty tag again).
The effect of the patch set is easily visible:
Writing 1 GB of data via mmap, then fsync twice.
Before this patch set both fsyncs take ~205 ms on my test machine, after the
patch set the first fsync takes ~283 ms (the additional cost of walking PTEs,
clearing dirty bits etc. is very noticeable), the second fsync takes below
As a bonus, these patches make filesystem freezing for DAX filesystems
reliable because mappings are now properly writeprotected while freezing the
Patches have passed xfstests for both xfs and ext4.
when testing my latest changes to DXA fault handling code I have hit the
following interesting race between the fault and write path (I'll show
function names for ext4 but xfs has the same issue AFAICT).
We have a file 'f' which has a hole at offset 0.
Process 0 Process 1
data = mmap('f');
-> fault, we map a hole page
pwrite('f', buf, len, 0)
- drops hole page from
the radix tree
- allocates block for
data = 1
-> page_mkwrite fault
- creates locked radix tree entry
- maps block into PTE
- removes dax entry from
the radix tree
So we have just lost information that block 0 is mapped and needs flushing
Also the fact that the consistency of data as viewed by mmap and
dax_do_io() relies on invalidate_inode_pages2_range() is somewhat
unexpected to me and we should document it somewhere.
The question is how to best fix this. I see three options:
1) Lock out faults during writes via exclusive i_mmap_sem. That is rather
harsh but should work - we call filemap_write_and_wait() in
generic_file_direct_write() so we flush out all caches for the relevant
area before dropping radix tree entries.
2) Call filemap_write_and_wait() after we return from ->direct_IO before we
call invalidate_inode_pages2_range() and hold i_mmap_sem exclusively only
for those two calls. Lock hold time will be shorter than 1) but it will
require additional flush and we'd probably have to stop using
generic_file_direct_write() for DAX writes to allow for all this special
3) Remodel dax_do_io() to work more like buffered IO and use radix tree
entry locks to protect against similar races. That has likely better
scalability than 1) but may be actually slower in the uncontended case (due
to all the radix tree operations).
Any opinions on this?
Jan Kara <jack(a)suse.com>
SUSE Labs, CR
The "NVDIMM Block Window Driver Writer's Guide":
defines the layout of the block window status register. For the July 2016
version of the spec linked to above, this happens in Figure 4 on page 26.
The only bits defined in this spec are bits 31, 5, 4, 2, 1 and 0. The rest
of the bits in the status register are reserved, and there is a warning
following the diagram that says:
Note: The driver cannot assume the value of the RESERVED bits in the
status register are zero. These reserved bits need to be masked off, and
the driver must avoid checking the state of those bits.
This change ensures that for hardware implementations that set these
reserved bits in the status register, the driver won't incorrectly fail the
Signed-off-by: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
drivers/acpi/nfit.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 1f0e060..375c10f 100644
@@ -1396,11 +1396,12 @@ static u32 read_blk_stat(struct nfit_blk *nfit_blk, unsigned int bw)
struct nfit_blk_mmio *mmio = &nfit_blk->mmio[DCR];
u64 offset = nfit_blk->stat_offset + mmio->size * bw;
+ const u32 STATUS_MASK = 0x80000037;
offset = to_interleave_offset(offset, mmio);
- return readl(mmio->addr.base + offset);
+ return readl(mmio->addr.base + offset) & STATUS_MASK;
static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
Hi Dan and Jerry,
I'm currently looking into SMART data retrieval on HPE NVDIMMs.
After the first obstacle (like getting cat
/sys/class/nd/ndctl0/device/nmem0/commands reutrn smart so ndctl will issue
the ioctl) I ran into a rather nasty problem. According to  HPEDIMMs
need the input buffer specially crafted for SMART data, according to 
Intel DIMMs don't.
Adding translation functions for the DIMMs accepted commands is one thing and
should be more or less trivial for all DIMMs (I'll post an RFC patch as soon
as Linus merged Dan's 4.8 pull request so I can rebase it) but doing this
type of conversation for each and every command for every defined vendor
family for both the input and output buffers will drive us all mad I guess.
Especially from the distribution's POV I'm not to keen on having customers
with some new NVDIMM family and we would need to re-implement all the
translators again. Adding a new ID is one thing but translation tables are a
totally different story.
So the question is have I overlooked something and there is a clean and easy
solution to this problem, or not.
@Jerry have you tested SMART data retrieval with ndctl? Did it work for you?
Johannes Thumshirn Storage
jthumshirn(a)suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Do you want to advertise on facebook? We're here to help.
We wil manually post your product/logo/link on Facebook Groups and will
give you a full report with links of each live post where your advertisement
Unsubscribe option is available on the footer of our website