I've merged mptcp_net-next's master branch
(https://github.com/multipath-tcp/mptcp_net-next) with the current
net-next branch. This makes Florian's skb extensions and the new indirect
call wrappers available.
FYI, Florian sent a new version of his patch-set on Netdev:
- https://lwn.net/ml/netdev/[email protected]/
- @Mat: you might be interested in:
Thanks Florian for this very good job! :)
---------- Forwarded message ---------
From: Florian Westphal <fw(a)strlen.de>
Date: Tue, Dec 18, 2018 at 5:13 PM
Subject: [PATCH net-next 0/13] sk_buff: add extension infrastructure
- objdiff shows no change if CONFIG_XFRM=n && BR_NETFILTER=n
- small size reduction when one or both options are set
- no changes in ipsec performance
Changes since v1:
- Allocate entire extension space from a kmem_cache.
- Avoid atomic_dec_and_test operation on skb_ext_put() for refcnt == 1 case.
(similar to kfree_skbmem() fclone_ref use).
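The refcnt == 1 shortcut mentioned above can be pictured with a small userspace sketch. It uses C11 atomics instead of the kernel's refcount helpers, and `struct ext_area`, `ext_alloc()`, `ext_get()` and `ext_put()` are hypothetical names made up for the illustration, not the functions from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the shared extension area. */
struct ext_area {
	atomic_int refcnt;
	unsigned char data[64];
};

static struct ext_area *ext_alloc(void)
{
	struct ext_area *ext = calloc(1, sizeof(*ext));

	if (ext)
		atomic_init(&ext->refcnt, 1);
	return ext;
}

static void ext_get(struct ext_area *ext)
{
	atomic_fetch_add(&ext->refcnt, 1);
}

/*
 * Drop one reference. Returns true if the area was freed.
 * When the caller holds the only reference, nobody else can reach
 * the area, so a plain load is enough and the more expensive atomic
 * read-modify-write is skipped.
 */
static bool ext_put(struct ext_area *ext)
{
	if (atomic_load_explicit(&ext->refcnt, memory_order_acquire) == 1)
		goto free_now;

	/* fetch_sub returns the previous value: 1 means we were last. */
	if (atomic_fetch_sub(&ext->refcnt, 1) != 1)
		return false;
free_now:
	free(ext);
	return true;
}
```

With two holders, the first `ext_put()` takes the atomic path and the second one takes the shortcut.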
This adds an optional extension infrastructure, with ipsec (xfrm) and
bridge netfilter as first users.
The third (future) user is Multipath TCP which is still out-of-tree.
MPTCP needs to map logical mptcp sequence numbers to the tcp sequence
numbers used by individual subflows.
This DSS mapping is read/written from tcp option space on receive and
written to tcp option space on transmitted tcp packets that are part of
an MPTCP connection.
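To make the DSS mapping idea concrete, here is a minimal sketch of how one mapping could translate a subflow sequence number into an MPTCP-level one. The struct layout and the `dss_map_seq()` helper are assumptions for illustration only; they are not taken from the patches or from the wire format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-memory form of one DSS mapping. */
struct dss_mapping {
	uint64_t data_seq;    /* first MPTCP-level sequence number */
	uint32_t subflow_seq; /* first subflow (TCP) sequence number */
	uint16_t data_len;    /* number of bytes covered by the mapping */
};

/*
 * Translate a subflow sequence number into the MPTCP-level one.
 * Returns false if ssn lies outside the mapping.
 */
static bool dss_map_seq(const struct dss_mapping *m, uint32_t ssn,
			uint64_t *dsn)
{
	/* Unsigned subtraction handles 32-bit sequence wrap-around. */
	uint32_t off = ssn - m->subflow_seq;

	if (off >= m->data_len)
		return false;
	*dsn = m->data_seq + off;
	return true;
}
```

The receive side needs this kind of per-skb record, which is why the mapping has to travel with the skb.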
Extending skb_shared_info or adding a private data field to skb fclones
doesn't work for incoming skb, so a different DSS propagation method would
be required for the receive side.
MPTCP has the same requirements as secpath/bridge netfilter:
1. extension memory is released when the sk_buff is free'd.
2. data is shared after cloning an skb (clone inherits extension)
3. adding extension to an skb will COW the extension buffer if needed.
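The three requirements can be sketched in userspace as follows. This is only a model under simplifying assumptions (plain int refcount, no locking, fixed-size buffer); `struct pkt`, `struct ext_buf` and the `pkt_*()` helpers are hypothetical names, not the kernel API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct ext_buf {              /* hypothetical extension buffer */
	int refcnt;
	int values[2];        /* one slot per extension type */
};

struct pkt {                  /* hypothetical stand-in for sk_buff */
	struct ext_buf *ext;  /* NULL when no extension is attached */
};

/* Requirement 2: a clone shares the buffer instead of copying it. */
static void pkt_clone(struct pkt *dst, const struct pkt *src)
{
	dst->ext = src->ext;
	if (dst->ext)
		dst->ext->refcnt++;
}

/* Requirement 1: freeing the packet drops its reference. */
static void pkt_free(struct pkt *p)
{
	if (p->ext && --p->ext->refcnt == 0)
		free(p->ext);
	p->ext = NULL;
}

/* Requirement 3: writing to a shared buffer copies it first (COW). */
static int pkt_ext_set(struct pkt *p, int id, int value)
{
	struct ext_buf *old = p->ext, *copy;

	if (!old || old->refcnt > 1) {
		copy = calloc(1, sizeof(*copy));
		if (!copy)
			return -1;
		if (old) {
			memcpy(copy->values, old->values,
			       sizeof(copy->values));
			old->refcnt--;
		}
		copy->refcnt = 1;
		p->ext = copy;
	}
	p->ext->values[id] = value;
	return 0;
}
```

After a clone, setting a value on one packet leaves the other packet's buffer untouched.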
Two new members are added to sk_buff:
1. 'active_extensions' byte (filling a hole), telling which extensions
are available for this skb.
This has two purposes.
a) avoids the need to initialize the pointer.
b) allows "deleting" an extension by clearing its bit
value in ->active_extensions.
While it would be possible to store the active_extensions byte
in the extension struct instead of sk_buff, there is one problem:
When an extension has to be disabled, we can always clear the
bit in skb->active_extensions. But in case it would be stored in the
extension buffer itself, we might have to COW it first, if
we are dealing with a cloned skb. On kmalloc failure we would
be unable to turn an extension off.
2. extension pointer, located at the end of the sk_buff.
If the active_extensions byte is 0, the pointer is undefined,
it is not initialized on skb allocation.
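A small sketch of why the bit lives in the skb rather than in the extension buffer: turning an extension off only clears a bit in the packet's own byte, so it cannot fail and never touches a buffer shared with a clone. All names here (`struct pkt`, `pkt_ext_find()`, `pkt_ext_del()`, the `EXT_*` ids) are hypothetical, and freeing the buffer once the last bit is cleared is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum ext_id { EXT_A, EXT_B, EXT_NUM };  /* hypothetical extension ids */

struct ext_buf {
	int refcnt;
	int slot[EXT_NUM];
};

struct pkt {
	uint8_t active_extensions;   /* bit n set: extension n present */
	struct ext_buf *extensions;  /* only meaningful if byte != 0 */
};

/* The pointer is never looked at unless the bit is set. */
static int *pkt_ext_find(struct pkt *p, enum ext_id id)
{
	if (!(p->active_extensions & (1u << id)))
		return NULL;
	return &p->extensions->slot[id];
}

/* "Delete": no allocation, no COW, the shared buffer is untouched. */
static void pkt_ext_del(struct pkt *p, enum ext_id id)
{
	p->active_extensions &= (uint8_t)~(1u << id);
}
```

Two packets can point at the same buffer while disagreeing on which extensions are active.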
This adds extra code to skb clone and free paths (to deal with
refcount/free of extension area) but this replaces similar code that
manages skb->nf_bridge and skb->sp structs in the followup patches of
the series.
It is possible to add support for extensions that are not preserved on
clone:
1. define a bitmask of all extensions that need copy/cow on clone
2. change __skb_ext_copy() to check
->active_extensions & SKB_EXT_PRESERVE_ON_CLONE
3. set clone->active_extensions to 0 if test is false.
This isn't done here because all extensions that get added here
need the copy/cow semantics.
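For illustration, the three steps could look roughly like this. It follows the description literally (the clone keeps everything if any preserved bit is set, nothing otherwise); `EXT_PRESERVE_ON_CLONE`, the `EXT_*` ids and `pkt_ext_copy()` are hypothetical names, and the refcount handling is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { EXT_KEEP_A, EXT_KEEP_B, EXT_DROP };  /* hypothetical ids */

/* Step 1: bitmask of all extensions that need copy/cow on clone. */
#define EXT_PRESERVE_ON_CLONE ((1u << EXT_KEEP_A) | (1u << EXT_KEEP_B))

struct pkt {
	uint8_t active_extensions;
	void *extensions;
};

static void pkt_ext_copy(struct pkt *clone, const struct pkt *orig)
{
	/* Step 2: does the original carry anything worth preserving? */
	if (!(orig->active_extensions & EXT_PRESERVE_ON_CLONE)) {
		/* Step 3: no, so the clone starts without extensions. */
		clone->active_extensions = 0;
		return;
	}
	clone->active_extensions = orig->active_extensions;
	clone->extensions = orig->extensions; /* refcount bump omitted */
}
```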
Last patch converts skb->sp, secpath information gets stored as
new SKB_EXT_SEC_PATH, so the 'sp' pointer is removed from skbuff.
Extra code added to skb clone and free paths (to deal with refcount/free
of extension area) replaces the existing code that does the same for
skb->nf_bridge and skb->secpath.
I don't see any other in-tree users that could benefit from this
infrastructure; it doesn't make sense to add an extension just for the sake
of a single flag bit (like skb->nf_trace).
Adding a new extension is a good fit if all of the following are true:
1. Data is related to the skb/packet aggregate
2. Data should be freed when the skb is free'd
3. Data is not going to be relevant/needed in normal case (udp, tcp,
forwarding workloads, ...)
4. There are no fancy action(s) needed on clone/free, such as callbacks
into kernel modules.
Florian Westphal (13):
netfilter: avoid using skb->nf_bridge directly
sk_buff: add skb extension infrastructure
net: convert bridge_nf to use skb extension infrastructure
xfrm: change secpath_set to return secpath struct, not error value
net: move secpath_exist helper to sk_buff.h
net: use skb_sec_path helper in more places
drivers: net: intel: use secpath helpers in more places
drivers: net: ethernet: mellanox: use skb_sec_path helper
drivers: net: netdevsim: use skb_sec_path helper
xfrm: use secpath_exist where applicable
drivers: chelsio: use skb_sec_path helper
xfrm: prefer secpath_set over secpath_dup
net: switch secpath to use skb extension infrastructure
Documentation/networking/xfrm_device.txt | 7
drivers/crypto/chelsio/chcr_ipsec.c | 4
drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 15
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 5
drivers/net/ethernet/intel/ixgbevf/ipsec.c | 15
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2
drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c | 19
drivers/net/netdevsim/ipsec.c | 7
include/linux/netfilter_bridge.h | 33 +
include/linux/skbuff.h | 152 ++++++-
include/net/netfilter/br_netfilter.h | 14
include/net/xfrm.h | 41 --
net/Kconfig | 4
net/bridge/br_netfilter_hooks.c | 39 -
net/bridge/br_netfilter_ipv6.c | 4
net/core/skbuff.c | 201 +++++++++-
net/ipv4/esp4.c | 9
net/ipv4/esp4_offload.c | 15
net/ipv4/ip_output.c | 1
net/ipv4/netfilter/nf_reject_ipv4.c | 6
net/ipv6/esp6.c | 9
net/ipv6/esp6_offload.c | 15
net/ipv6/ip6_output.c | 1
net/ipv6/netfilter/nf_reject_ipv6.c | 10
net/ipv6/xfrm6_input.c | 8
net/netfilter/nf_log_common.c | 20
net/netfilter/nf_queue.c | 50 +-
net/netfilter/nfnetlink_queue.c | 23 -
net/netfilter/nft_meta.c | 2
net/netfilter/nft_xfrm.c | 2
net/netfilter/xt_physdev.c | 2
net/netfilter/xt_policy.c | 2
net/xfrm/Kconfig | 1
net/xfrm/xfrm_device.c | 4
net/xfrm/xfrm_input.c | 76 +--
net/xfrm/xfrm_interface.c | 2
net/xfrm/xfrm_output.c | 7
net/xfrm/xfrm_policy.c | 19
security/selinux/xfrm.c | 4
39 files changed, 564 insertions(+), 286 deletions(-)
Matthieu Baerts | R&D Engineer
Tessares SA | Hybrid Access Solutions
1 Avenue Jean Monnet, 1348 Louvain-la-Neuve, Belgium
Peter and I have been working on this patch set to show how MPTCP
can fit in to the Linux networking stack using these design ideas:
* Applications opt-in to MPTCP using IPPROTO_MPTCP, regular TCP sockets
are still the default. A socket created with
socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP) will attempt to form an
MPTCP connection. IPPROTO_MPTCP == 99 as a placeholder.
* Subflows exist within the kernel as separate sockets, owned by a
MPTCP connection-level socket that is visible to userspace.
* Adds private pointers to struct sk_buff to store MPTCP metadata.
* Adds the CONFIG_MPTCP option to Kconfig.
This now makes use of Florian's skb extensions, as posted to netdev.
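From the application side, opting in could look like the sketch below. The value 99 is the placeholder from this patch set, defined under a local name so it does not clash with system headers; the fallback to regular TCP is an assumption added for the example and is not part of the patches.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Placeholder value used by this patch set, under a local name. */
#define MPTCP_PROTO_PLACEHOLDER 99

/*
 * Try to open an MPTCP socket and fall back to regular TCP when the
 * running kernel rejects the protocol. Returns the fd (or -1) and
 * stores the protocol of the last attempt in *proto.
 */
static int open_stream_socket(int *proto)
{
	int fd = socket(AF_INET, SOCK_STREAM, MPTCP_PROTO_PLACEHOLDER);

	*proto = MPTCP_PROTO_PLACEHOLDER;
	if (fd < 0) {
		fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
		*proto = IPPROTO_TCP;
	}
	return fd;
}
```

The rest of the application (connect(), send(), recv()) stays the same as for a TCP socket.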
The following patches can form an MPTCP connection with the
multipath-tcp.org kernel (tested with v0.94), and send DSS mappings that
are accepted for the initial data packet. It is an early implementation,
and I don't represent it as being upstreamable as-is or being everyone's
idea of what an eventual upstream implementation will necessarily look
like. It has significant limitations:
* Only one subflow is supported, no joins, and only ipv4.
* Does not support DSS checksums. Checksums must be disabled on the
remote stack (for multipath-tcp.org, 'sudo sysctl -w
* Lots of debug statements (although they use dynamic debug and are
disabled by default) and TODOs.
* It's only been tested sending small amounts of data for each send
Hopefully there are some interesting concepts to discuss, and this
code helps us assess how workable the above design principles
are. Thanks in advance for your feedback on the benefits or drawbacks of
this code, how it might be improved, or how other approaches might
The patch set applies to Florian Westphal's skb_ext_both_03 branch
commit id b363a92ec92a7af53e254f1ac0555567fbf790d1.
I have also pushed the commits to:
v5 changes: Rebase on Florian's skb extensions. Fix !CONFIG_MPTCP build.
v4 changes: Refine skb extension (remove copy hook), change rx path to
use skb extension instead of error queue,
v3 changes: Change skb extension technique, change rx path to use error
queue, add foundational code for multiple subflows, and many bug fixes.
v2 changes: Added receive path implementation (last two patches).
Reworked TCP option writing. Miscellaneous bug fixes including
header dependency cleanup.
Mat Martineau (7):
tcp: Add MPTCP option number
tcp: Define IPPROTO_MPTCP
mptcp: Add MPTCP to skb extensions
tcp: Prevent coalesce and collapse when skb->priv is used
tcp: Export low-level TCP functions
mptcp: Write MPTCP DSS headers to outgoing data packets
mptcp: Implement MPTCP receive path
Peter Krystad (10):
mptcp: Add MPTCP socket stubs
mptcp: Handle MPTCP TCP options
tcp: Add IPPROTO_SUBFLOW
tcp: expose tcp routines and structs for MPTCP
mptcp: Create SUBFLOW socket for outgoing connections
mptcp: Create SUBFLOW socket for incoming connections
mptcp: Add key generation and token tree
mptcp: Add shutdown() socket operation
mptcp: Add setsockopt()/getsockopt() socket operations
mptcp: Make connection_list a real list of subflows
include/linux/skbuff.h | 3 +
include/linux/tcp.h | 26 ++
include/net/inet_common.h | 3 +
include/net/mptcp.h | 235 ++++++++++
include/net/tcp.h | 17 +
include/uapi/linux/in.h | 4 +
net/Kconfig | 1 +
net/Makefile | 1 +
net/core/skbuff.c | 7 +
net/ipv4/af_inet.c | 2 +-
net/ipv4/tcp.c | 12 +-
net/ipv4/tcp_input.c | 23 +-
net/ipv4/tcp_ipv4.c | 4 +-
net/ipv4/tcp_output.c | 243 ++++++++++-
net/mptcp/Kconfig | 11 +
net/mptcp/Makefile | 3 +
net/mptcp/crypto.c | 215 +++++++++
net/mptcp/options.c | 301 +++++++++++++
net/mptcp/protocol.c | 895 ++++++++++++++++++++++++++++++++++++++
net/mptcp/subflow.c | 377 ++++++++++++++++
net/mptcp/token.c | 256 +++++++++++
21 files changed, 2612 insertions(+), 27 deletions(-)
create mode 100644 include/net/mptcp.h
create mode 100644 net/mptcp/Kconfig
create mode 100644 net/mptcp/Makefile
create mode 100644 net/mptcp/crypto.c
create mode 100644 net/mptcp/options.c
create mode 100644 net/mptcp/protocol.c
create mode 100644 net/mptcp/subflow.c
create mode 100644 net/mptcp/token.c
We just had our 32nd meeting with Mat, Peter and Ossama (Intel OTC),
Christoph (Apple), Florian (Red Hat) and myself (Tessares).
Thanks again for this new good meeting!
Here are the minutes of the meeting:
indirect call optimization:
- another approach by Paolo Abeni: https://lkml.org/lkml/2018/12/5/1206
- mainly for the fast path
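As a rough idea of what an indirect call wrapper does, assuming the approach is the usual compare-then-call-directly pattern: the function pointer is compared against the most likely target and, on a match, a direct call is issued instead of an indirect one. `CALL_LIKELY_1` is a made-up userspace macro for illustration (it relies on the GCC/Clang `__builtin_expect` builtin), not the kernel macro.

```c
#include <assert.h>

/*
 * If the pointer matches the expected target, call that target
 * directly; otherwise fall back to the indirect call.
 */
#define CALL_LIKELY_1(f, f1, ...) \
	(__builtin_expect((f) == (f1), 1) ? f1(__VA_ARGS__) \
					  : (f)(__VA_ARGS__))

typedef int (*op_fn)(int);

static int add_one(int x)
{
	return x + 1;
}

static int twice(int x)
{
	return 2 * x;
}

/* add_one is assumed to be the common case on the fast path. */
static int run_op(op_fn op, int x)
{
	return CALL_LIKELY_1(op, add_one, x);
}
```

Both targets still work; only the expected one avoids the indirect call.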
Mat & Peter's patch-set:
- how will the MP_JOIN be done?
- see the beginning of the discussion:
- by Christoph: what might help is: giving an example on how the
userspace will act.
- → goal: only one accept() is done by the userspace for the whole
- → tricky bit: the kernel needs to listen + accept new subflows and
link that to the MPTCP connection. In the current code (mptcp_trunk), we
have many "if (LISTEN && !is_meta(sk))" → we want to avoid that.
- → also we want to correctly take into account the backlog (how
many connections per listening sk → subflows should not be taken into
- → could be nice to share some code/pseudocode. @Peter will look at
- Christoph identified a problem, a v2 is needed:
- note: disabling some RPS shows some bugs in MPTCP, looking at that
- other bugs seem linked to mptcp_trunk
- feel free to review (gerrit/mail, see above)
netdev patchset by Florian:
- a lot of feedback from Eric, including some concerns; some of them make sense.
- now making some tests to show that it is fine to have that, hoping
to share that next week.
- no impact on IPsec performance \o/
- Mat already rebased his version on top of Florian's latest patches and
will hopefully share it tomorrow; it would be good to discuss it
on the ML.
- clone might not be needed; tests need to be done. But if it is
just for MPTCP, it is not important to have the clone at this point.
- alignment of TCP options: currently we can have:
NOP NOP [Timestamps] (2 + 10 bytes)
NOP NOP [MPTCP] (without checksum → 2 + 18 bytes).
→ we could then avoid the NOP
But MPTCP options are generated in the MPTCP stack, which assumes
that everything before them is already aligned. → avoiding the NOPs
would require more changes in the TCP stack, which is why it was not
done. Maybe later, if a Generic TCP Options framework
( https://lwn.net/Articles/741859/ ) is in place ;-)
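The arithmetic behind the alignment point, as a small sketch: padding each option to a 32-bit boundary on its own costs 12 + 20 = 32 bytes, while packing the two options back to back needs 10 + 18 = 28 bytes, which is already aligned. The constant and function names are hypothetical; the lengths are the ones quoted above.

```c
#include <assert.h>

#define TS_OPT_LEN        10  /* timestamp option without padding */
#define DSS_OPT_LEN_NOCSUM 18 /* MPTCP option length quoted above */

/* Round up to the 32-bit boundary TCP options are padded to. */
static unsigned int pad4(unsigned int len)
{
	return (len + 3u) & ~3u;
}

/* Each option padded on its own (NOP NOP in front of each). */
static unsigned int opts_len_separate(void)
{
	return pad4(TS_OPT_LEN) + pad4(DSS_OPT_LEN_NOCSUM);
}

/* Options packed back to back and padded once at the end. */
static unsigned int opts_len_packed(void)
{
	return pad4(TS_OPT_LEN + DSS_OPT_LEN_NOCSUM);
}
```

Packing would save 4 bytes of option space in this case.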
Podcast from Monday:
- more questions about the protocol, not a lot about upstreaming
- should be released next year
- Mat and Christoph will be notified anyway.
- We propose to have the next one on Thursday, the 20th of December.
- Usual time: 17:00 UTC (9am PST, 6pm CET)
- Still open to everyone!
Feel free to comment on these points and propose new ones for the next
Talk to you next week,
Matthieu Baerts | R&D Engineer
Tessares SA | Hybrid Access Solutions
1 Avenue Jean Monnet, 1348 Louvain-la-Neuve, Belgium