On Wed, Sep 10, 2014 at 6:23 AM, Boaz Harrosh <openosd(a)gmail.com> wrote:
> On 09/09/2014 07:53 PM, Dan Williams wrote:
>> On Tue, Sep 9, 2014 at 9:23 AM, Boaz Harrosh <boaz(a)plexistor.com> wrote:
>> Hmm this looks like a "ACPI/DeviceTree-over-kernel-command-line"
>> description language. I understand that this is borrowed from the
>> memmap= precedent, but, if I'm not mistaken, that's really more for
>> debug rather than a permanent platform-device descriptor.
> So what is bad about that?
>> Since this looks like firmware why not go ahead and use
>> request_firmware() to request a pmem range descriptor blob
> What? God no! How is this firmware at all? No, this is all bus info; it is
> "where on the bus this device's resource was allocated".
> Firmware is a static, compile-time thing common to all of my devices,
> but this here is where on the bus the device was plugged in. What is your
> suggestion, that I compile a kernel and "make install" it (initrd)
> for every DIMM I insert? Since when? And what tool will tell me
> what to put there?
>> Given you can compile such a blob into a kernel image or provide it
>> in an initrd I think it makes deployment more straightforward, also
>> the descriptor format can be extended going forward
> What? Really, Dan, I think you got confused. The nn@ss thing is just:
> "what connector I stuck my DIMM disk in (and what other empty slots I have)".
> Would you have me compile and install my kernel every time my sdX
> number changes, or my SATA channel moves? I do not think so.
>> whereas the
>> command line approach mandates ever increasingly complicated strings.
> This is the first API. I intend (there is even a TODO in the patchset) to
> also have a dynamic sysfs add/remove/grow API for on-the-fly changes.
> All these can be easily driven by a simple udev rule to plug and play them.
> Usually in the kernel, buses do not directly load device drivers. The
> bus driver sends an "event" for a newly discovered resource, then a user-mode
> udev rule loads an associated driver, which scans for its device or
> receives its identification from somewhere; that can be the sysfs interface
> above, or even the command line.
> Further down the road we might want a kernel API, through to the DIMM-manager,
> where each device's ID is identified in order to manage volumes of DIMMs (and
> what happens to them if they move in physical addressing). But even with a
> DIMM-manager I would have gone through a udev event and pmem probe, because
> this way user mode can add a chain of commands associated with new insertions.
> So since we have a DIMM-manager event going on, why not have the udev rule load
> pmem as well and not have any API? But all this is way down the line.
> Regardless, the command-line API of pmem will always stay, as this is for
> the pmem emulation as we use it now, in accord with memmap=.
> (And it works very nicely in our lab with DDR3 NvDIMMs that need a memmap= as well.)
> Please, we need to start somewhere, no?
Sure, but you used the operative term "start", as in you already
expect to enhance this capability down the road, right?
It's fine to dismiss this request_firmware() based approach, but don't
mis-characterize it in the process. With regard to describing device
boundaries, a bus-descriptor-blob handed to the kernel is a superset
of the capability provided by the kernel command line. It can be
injected statically at compile time, or dynamically loaded from the
initrd or the rootfs. It has the added benefit of being flexible to
change whereas the kernel command line is a more permanent contract
that we will need to maintain compatibility with in perpetuity.
If you already see this bus description as a "starting" point, then I
think we need an interface that is more amenable to ongoing change,
and that is not the kernel command line.
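For reference, the nn@ss notation debated in this exchange follows the memmap= convention of "size@start", with optional K/M/G suffixes. The following is only a minimal userspace sketch of how such a string can be decoded. The function names `parse_size` and `parse_range` are hypothetical, and this is not the parser used by the pmem patch set or by the kernel.

```python
import re

# Binary multipliers for the optional K/M/G suffix (assumed, as in memmap=).
_SUFFIX = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
_SIZE_RE = re.compile(r"(0[xX][0-9a-fA-F]+|[0-9]+)([KMGkmg]?)")


def parse_size(text):
    """Parse 'nn', 'nnK', 'nnM' or 'nnG' (decimal or 0x-prefixed hex) into bytes."""
    match = _SIZE_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"malformed size: {text!r}")
    digits, suffix = match.groups()
    base = 16 if digits[:2].lower() == "0x" else 10
    return int(digits, base) * _SUFFIX[suffix.upper()]


def parse_range(spec):
    """Parse a 'size@start' string and return (start, size) in bytes."""
    size_text, sep, start_text = spec.partition("@")
    if not sep:
        raise ValueError(f"expected size@start, got {spec!r}")
    return parse_size(start_text), parse_size(size_text)
```

With this sketch, `parse_range("4G@8G")` describes a 4 GiB range starting at the 8 GiB physical offset. Anything beyond a single range (several devices, attributes, and so on) would need more syntax, which is the growth in string complexity the reply above is concerned about.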
PMEM is a modified version of the Block RAM Driver, BRD. The major difference
is that BRD allocates its backing store pages from the page cache, whereas
PMEM uses reserved memory that has been ioremapped.
One benefit of this approach is that there is a direct mapping between
filesystem block numbers and virtual addresses. In PMEM, filesystem blocks N,
N+1, N+2, etc. will all be adjacent in the virtual memory space. This property
allows us to set up PMD mappings (2 MiB) for DAX.
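As a rough illustration of the adjacency property, the sketch below models the block-to-address arithmetic in userspace Python. The helper names (`block_to_vaddr`, `pmd_aligned`), the 4 KiB block size, and the example base address are assumptions made for illustration; none of them are taken from the patch set, and alignment is only one of the conditions a real PMD mapping would have to satisfy.

```python
BLOCK_SIZE = 4096            # assumed filesystem block size (4 KiB)
PMD_SIZE = 2 * 1024 * 1024   # 2 MiB, the PMD mapping size mentioned above


def block_to_vaddr(base_vaddr, block_nr, block_size=BLOCK_SIZE):
    """Return the virtual address of filesystem block `block_nr`.

    The whole region is modelled as one contiguous mapping, so the
    address is a plain linear function of the block number.
    """
    return base_vaddr + block_nr * block_size


def pmd_aligned(base_vaddr, block_nr, block_size=BLOCK_SIZE):
    """True if the block starts on a 2 MiB boundary in this model."""
    return block_to_vaddr(base_vaddr, block_nr, block_size) % PMD_SIZE == 0
```

In this model blocks N and N+1 are always exactly one block size apart, and with a 2 MiB-aligned base every 512th 4 KiB block starts on a PMD boundary. A page-cache-backed store such as BRD gives no such guarantee, since its pages are allocated individually.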
This patch set builds upon the work that Matthew Wilcox has been doing for DAX.
Specifically, my implementation of pmem_direct_access() in patch 4/4 uses API
enhancements introduced in Matthew's DAX patch v10 02/21:
Ross Zwisler (4):
  pmem: Initial version of persistent memory driver
  pmem: Add support for getgeo()
  pmem: Add support for rw_page()
  pmem: Add support for direct_access()

 MAINTAINERS            |   6 +
 drivers/block/Kconfig  |  41 ++++++
 drivers/block/Makefile |   1 +
 drivers/block/pmem.c   | 375 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 423 insertions(+)
 create mode 100644 drivers/block/pmem.c
On 09/10/2014 07:10 PM, Dave Hansen wrote:
> On 09/10/2014 03:07 AM, Boaz Harrosh wrote:
>> On 09/09/2014 09:36 PM, Dave Hansen wrote:
>>> On 09/09/2014 08:45 AM, Boaz Harrosh wrote:
>>>> This is for add_persistent_memory that will want a section of pages
>>>> allocated but without any zone associated. This is because belonging
>>>> to a zone will give the memory to the page allocators, but
>>>> persistent_memory belongs to a block device, and is not available for
>>>> regular volatile usage.
>>> I don't think we should be taking patches like this in to the kernel
>>> until we've seen the other side of it. Where is the page allocator code
>>> which will see a page belonging to no zone? Am I missing it in this set?
>> It is not missing. It will never be.
>> These pages do not belong to any allocator. They are not allocatable
>> pages. In fact they are not "memory", they are "storage".
>> These pages belong wholly to a block device. In turn, the block
>> device grants ownership of a partition of these pages to an FS.
>> The loaded FS has its own block allocation scheme, which internally
>> circulates each page's usage around. But the page never goes beyond its
> I'm mostly worried about things that start with an mmap().
> Imagine you mmap() a persistent memory file, fault some pages in, then
> 'cat /proc/$pid/numa_maps'. That code will look at the page to see
> which zone and node it is in.
> Or, consider if you mmap() then put a futex in the page. The page will
> have get_user_pages() called on it by the futex code, and a reference
> taken. The reference can outlast the mmap(). We either have to put the
> file somewhere special and scan the page's reference occasionally, or we
> need to hook something under put_page() to make sure that we keep the
> page out of the normal allocator.
Yes, the block allocator of the pmem FS always holds the final reference on this
page, as long as there is valid data in this block. Even across boots, the
mount code re-initializes the references. The only internal operation that frees
these blocks is truncate, which only then returns these pages to the block
allocator. All this is common practice in filesystems, so the page reference on
these blocks only ever drops to zero after they lose all visibility. And
yes, the block allocator uses special code to drop the count to zero,
not put_page().
So there is no chance these pages will ever be presented to the page allocators
through a put_page().
BTW: There is a hook in place that can be used today, by calling
SetPagePrivate(page) and setting a .release function on page->mapping->a_ops.
If .release() returns false the page is not released (and can be added to an
internal queue for garbage collection).
But with the above scheme this is not needed at all. I have yet to find a test
that keeps my free_block reference above 1, at which time I will exercise
a garbage collection queue.
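The reference-counting scheme described above can be modelled in a few lines of userspace Python. This is a toy sketch under stated assumptions, not code from the pmem FS: `PmemPage`, `BlockAllocator`, and the module-level `get_page()`/`put_page()` helpers are hypothetical stand-ins that only borrow the kernel names, and the garbage collection queue is deliberately left out.

```python
class PmemPage:
    """Toy stand-in for the struct page backing one pmem block."""

    def __init__(self):
        self.refcount = 0


class BlockAllocator:
    """Toy FS block allocator that owns the final reference on every page."""

    def __init__(self, nr_blocks):
        self.pages = [PmemPage() for _ in range(nr_blocks)]
        self.free = set(range(nr_blocks))

    def alloc(self):
        if not self.free:
            raise MemoryError("no free blocks")
        block = min(self.free)
        self.free.remove(block)
        self.pages[block].refcount = 1   # the allocator's own reference
        return block

    def truncate(self, block):
        page = self.pages[block]
        if page.refcount != 1:
            # A real FS would defer this block (e.g. to a GC queue);
            # the sketch simply refuses.
            raise RuntimeError("block is still referenced elsewhere")
        page.refcount = 0                # dropped here, never via put_page()
        self.free.add(block)


def get_page(page):
    """Model of a transient user taking a reference (e.g. via get_user_pages())."""
    page.refcount += 1


def put_page(page):
    """Model of a transient user dropping its reference."""
    page.refcount -= 1
    # Invariant of the scheme: a transient user never drops the last reference.
    assert page.refcount >= 1, "page would leak to the page allocator"
```

In the futex scenario raised above, the extra reference taken by `get_page()` moves the count from 1 to 2 and the later `put_page()` brings it back to 1, so the count reaches zero only inside `truncate()`.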
>>> I see about 80 or so calls to page_zone() in the kernel. How will a
>>> zone-less page look to all of these sites?
>> None of these 80 call sites will be reached! The pages are always used
>> below the FS, e.g. to send them on the network, or to send them to a slower
>> block device via a BIO. I have a full-fledged FS on top of this code
>> and it all works very smoothly and stably. (And fast ;))
> Does the fs support mmap()?
> The idea of layering is a nice one, but mmap() is a big fat layering
> violation. :)
Yes, the FS supports mmap, but through the DAX patchset. Please see
Matthew's DAX patchset for how he implements mmap without using pages
at all, with a direct PFN to virtual address mapping. So these pages do not
get exposed above the FS.
My FS uses exactly his techniques; only when it wants to spill over to a
slower device will it use these pages, copy-less.
On 09/09/2014 09:36 PM, Dave Hansen wrote:
> On 09/09/2014 08:45 AM, Boaz Harrosh wrote:
>> This is for add_persistent_memory that will want a section of pages
>> allocated but without any zone associated. This is because belonging
>> to a zone will give the memory to the page allocators, but
>> persistent_memory belongs to a block device, and is not available for
>> regular volatile usage.
> I don't think we should be taking patches like this in to the kernel
> until we've seen the other side of it. Where is the page allocator code
> which will see a page belonging to no zone? Am I missing it in this set?
It is not missing. It will never be.
These pages do not belong to any allocator. They are not allocatable
pages. In fact they are not "memory", they are "storage".
These pages belong wholly to a block device. In turn, the block
device grants ownership of a partition of these pages to an FS.
The loaded FS has its own block allocation scheme, which internally
circulates each page's usage around. But the page never goes beyond its
> I see about 80 or so calls to page_zone() in the kernel. How will a
> zone-less page look to all of these sites?
None of these 80 call sites will be reached! The pages are always used
below the FS, e.g. to send them on the network, or to send them to a slower
block device via a BIO. I have a full-fledged FS on top of this code
and it all works very smoothly and stably. (And fast ;))
It is up to the pmem-based FS to manage its pages' ref counts so they are
never released outside of its own block allocator.
At the end of the day, struct page has nothing to do with zones,
allocators, and "memory"; as it says in Documentation, struct
page is a facility to track the state of a physical page in the
system. All the other structures are higher in the stack, above
the physical layer. Struct pages, for me, are the upper API of the
physical memory layer, which is shared with pmem. Higher
on the stack, where memory has a zone, pmem has a block device.
Higher, where we have page allocators, pmem has an FS block allocator;
higher still, where we have a slab, pmem has files for user consumption.
pmem is storage, which shares the physical layer with memory, and
this is what this patch describes. There will be no more mm interaction
at all for pmem. The rest of the picture is all there in plain sight as
part of this patchset: the pmem.c driver, then an FS on top of that. What
else do you need to see?