From: Rao Shoaib <rao.shoaib(a)oracle.com>
The following patches modify the TCP code to enable an implementation of MPTCP. The MPTCP implementation needs to share TCP code, with minor modifications here and there. To keep the TCP code clean and easy to maintain, common code has been moved into new functions that both TCP and MPTCP use. struct tcp_sock now has function pointers, and the appropriate function is called based on the socket type (TCP or MPTCP).
A basic MPTCP implementation that works with IPv4/IPv6 and supports joining additional subflows (MP_JOIN) has been built on top of these changes and tested.
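A minimal user-space sketch of the function-pointer dispatch described above. The type name `struct tcp_sock_ops` and the member name `send_fin` come up later in this thread, but the signatures, the `fin_kind` return value, and the `tcp_specific`/`mptcp_specific`/`sock_send_fin` names are hypothetical, chosen only to make the pattern testable; this is not the layout used in the patches.

```c
#include <assert.h>

/* Hypothetical, simplified model; not the actual patch layout. */
enum fin_kind { FIN_TCP, FIN_MPTCP };

struct tcp_sock;

/* Per-socket-type operations (only one member shown). */
struct tcp_sock_ops {
	enum fin_kind (*send_fin)(struct tcp_sock *tp);
};

struct tcp_sock {
	const struct tcp_sock_ops *ops;	/* selected when the socket is created */
};

static enum fin_kind tcp_send_fin(struct tcp_sock *tp)
{
	(void)tp;
	return FIN_TCP;		/* plain TCP FIN path */
}

static enum fin_kind mptcp_send_fin(struct tcp_sock *tp)
{
	(void)tp;
	return FIN_MPTCP;	/* MPTCP-specific close handling would go here */
}

static const struct tcp_sock_ops tcp_specific = { .send_fin = tcp_send_fin };
static const struct tcp_sock_ops mptcp_specific = { .send_fin = mptcp_send_fin };

/* Shared code calls through the pointer instead of testing the socket type. */
static enum fin_kind sock_send_fin(struct tcp_sock *tp)
{
	assert(tp->ops && tp->ops->send_fin);
	return tp->ops->send_fin(tp);
}
```

Each call through `tp->ops` is an indirect call, which is exactly the kind of fast-path cost that the feedback discussed in this thread pushes back on.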
The changes are being submitted as an RFC to get feedback from the community and to start a discussion on how to move forward.
Rao Shoaib (9):
Modify tcp structures to support function pointers
Introduce MPTCP specific elements that will co-exist with TCP even when MPTCP is not compiled
Introduce MPTCP specific elements that can be under #ifdef
Populate function pointers -- few (5) will be populated later
Switch code to use function pointers
Make TCP options processing abstract
Restructure syncookie code to use pointers
Restructure TCP code so that it can be shared primarily with MPTCP
Add MPTCP specific code to core TCP code
crypto/md5.c | 3 -
include/crypto/md5.h | 2 +
include/linux/tcp.h | 91 ++++++++++++
include/net/inet_common.h | 2 +
include/net/inet_sock.h | 6 +-
include/net/net_namespace.h | 6 +
include/net/secure_seq.h | 9 +-
include/net/sock.h | 1 +
include/net/tcp.h | 321 ++++++++++++++++++++++++++++++++++++++--
include/net/tcp_states.h | 4 +-
include/net/transp_v6.h | 3 -
include/uapi/linux/bpf.h | 4 +-
include/uapi/linux/if.h | 5 +
include/uapi/linux/tcp.h | 1 +
net/core/secure_seq.c | 70 +++++++++
net/ipv4/af_inet.c | 16 +-
net/ipv4/inet_connection_sock.c | 17 ++-
net/ipv4/ip_sockglue.c | 20 +++
net/ipv4/syncookies.c | 112 ++++++++++----
net/ipv4/tcp.c | 221 +++++++++++++++++++++------
net/ipv4/tcp_input.c | 195 ++++++++++++++----------
net/ipv4/tcp_ipv4.c | 112 ++++++++++----
net/ipv4/tcp_minisocks.c | 56 ++++++-
net/ipv4/tcp_output.c | 206 +++++++++++++++-----------
net/ipv4/tcp_timer.c | 55 +++++--
net/ipv6/af_inet6.c | 4 +-
net/ipv6/ipv6_sockglue.c | 14 ++
net/ipv6/syncookies.c | 40 +----
net/ipv6/tcp_ipv6.c | 163 +++++++++++++-------
29 files changed, 1360 insertions(+), 399 deletions(-)
When we met with David Miller and Eric Dumazet about a year ago to
discuss how to include MPTCP in the mainline Linux kernel, they wanted
to have some code that could be evaluated and used as the basis for
The patches that I have submitted are only meant to start that discussion,
nothing else. If the approach is acceptable, we can do more cleanup.
If not, we will have a discussion and hopefully agree on something.
I would like to send the patches to the netdev mailing list, perhaps a
week from now, to start a discussion there and come up with a plan to
as for next steps after the submission of the TCP-option framework to netdev
and DaveM's feedback on it.
Even if the submission got rejected, I think we still have a very useful set
of patches here. The need for a framework might pop up again in the future,
and so these patches could come in handy.
Mat, maybe you can put our latest submission in your kernel.org git repo
just so that we don't lose track of these patches? I can also create a
GitHub repo if you prefer.
As for DaveM's feedback, the main takeaway, as Mat already noted in his other
mail, is that fast-path performance is the highest priority. Branching and
indirect function calls are rarely accepted there.
So, in that spirit, I think we need to work towards making MPTCP less
intrusive to the TCP stack.
* Stop taking the meta-lock when receiving subflow data (all the places where
we check mptcp(tp) and then call bh_lock_sock(meta_sk)).
The reason we do this in today's implementation is that it allows us to
access the meta data structure at any point. If we stop taking the
meta-lock, a few things need to change:
1. Do lockless accesses for selected fields (e.g., for the DATA_ACK).
2. Group the more intrusive accesses to few select points in the TCP-stack
where we then take the meta-lock (e.g., when receiving data).
(this is equivalent to what the TCP-option framework would provide,
so we need to move code to these or similar points in the stack)
3. Sometimes schedule work queues when we need to avoid deadlocks due to
lock-ordering issues (e.g., when we can't take the meta-lock because
it's already held by another thread).
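The three changes above can be sketched in user space as follows. This is an illustration under stated assumptions, not kernel code: C11 atomics stand in for lockless (`READ_ONCE()`/`WRITE_ONCE()`-style) field access, an `atomic_flag` stands in for the meta-socket lock, and a counter stands in for scheduling a work queue. `struct meta_sock` and all function names are hypothetical, and the DATA_ACK comparison ignores sequence-number wraparound for brevity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical meta-level (MPTCP connection) state. */
struct meta_sock {
	_Atomic uint64_t data_ack;	/* (1) accessed without the meta-lock */
	atomic_flag lock;		/* stands in for the meta-socket lock */
	atomic_int deferred;		/* (3) work items punted to a "queue" */
	uint64_t rcv_bytes;		/* (2) only touched with the lock held */
};

static void meta_init(struct meta_sock *meta)
{
	atomic_init(&meta->data_ack, 0);
	atomic_flag_clear(&meta->lock);
	atomic_init(&meta->deferred, 0);
	meta->rcv_bytes = 0;
}

/* (1) Lockless update: DATA_ACK only ever moves forward. */
static void meta_update_data_ack(struct meta_sock *meta, uint64_t ack)
{
	uint64_t cur = atomic_load_explicit(&meta->data_ack, memory_order_relaxed);

	/* A failed CAS reloads cur; stop once cur is already >= ack. */
	while (ack > cur &&
	       !atomic_compare_exchange_weak_explicit(&meta->data_ack, &cur, ack,
						      memory_order_release,
						      memory_order_relaxed))
		;
}

/* (2) Grouped access point that takes the meta-lock, and (3) defer
 * instead of blocking when another thread already holds it. */
static bool meta_deliver_data(struct meta_sock *meta, uint64_t len)
{
	assert(len > 0);
	if (atomic_flag_test_and_set_explicit(&meta->lock, memory_order_acquire)) {
		atomic_fetch_add(&meta->deferred, 1);	/* e.g. schedule work here */
		return false;
	}
	meta->rcv_bytes += len;
	atomic_flag_clear_explicit(&meta->lock, memory_order_release);
	return true;
}
```

Only `meta_deliver_data()` ever touches lock-protected fields, which is the "group the intrusive accesses" idea in miniature; everything else stays lockless or is deferred.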
I think the way to approach this is to work iteratively, starting to
move code in such a way that accesses to the meta-socket are grouped.
Also, we added a few callbacks (cf. struct tcp_sock_ops) to avoid
duplicating code. Let's review those and see if we can get rid of them.
For example, .send_fin could be removed: it is only called from
tcp_shutdown(), which is itself called from the .shutdown callback in
tcp_prot. So if we expose a separate MPTCP socket type with its own
struct proto, we can get rid of the .send_fin callback.
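A user-space sketch of the `.send_fin` example, assuming MPTCP gets its own `struct proto`. The `shutdown` member and the `RCV_SHUTDOWN`/`SEND_SHUTDOWN` values mirror the kernel, but the types are heavily simplified stand-ins; `mptcp_prot`, `mptcp_shutdown()`, `mptcp_send_fin()` and `sock_shutdown()` (standing in for the generic path, `inet_shutdown()` in the kernel) are hypothetical. The real `tcp_shutdown()` also checks the socket state before sending a FIN; that is omitted here.

```c
#include <assert.h>

#define RCV_SHUTDOWN	1	/* same values as include/net/sock.h */
#define SEND_SHUTDOWN	2

struct sock;

struct proto {
	void (*shutdown)(struct sock *sk, int how);
};

/* Heavily simplified stand-in for struct sock. */
struct sock {
	const struct proto *sk_prot;
	int fin_sent;		/* TCP FIN emitted */
	int data_fin_sent;	/* MPTCP DATA_FIN emitted */
};

/* TCP: shutdown() calls its own FIN routine directly. */
static void tcp_send_fin(struct sock *sk) { sk->fin_sent = 1; }

static void tcp_shutdown(struct sock *sk, int how)
{
	if (how & SEND_SHUTDOWN)
		tcp_send_fin(sk);
}

/* MPTCP: a separate shutdown() reaches the MPTCP FIN routine directly,
 * so tcp_shutdown() no longer needs a per-socket .send_fin pointer. */
static void mptcp_send_fin(struct sock *sk) { sk->data_fin_sent = 1; }

static void mptcp_shutdown(struct sock *sk, int how)
{
	if (how & SEND_SHUTDOWN)
		mptcp_send_fin(sk);
}

static const struct proto tcp_prot   = { .shutdown = tcp_shutdown };
static const struct proto mptcp_prot = { .shutdown = mptcp_shutdown };

/* Generic socket layer: the only indirect call is the one that already exists. */
static void sock_shutdown(struct sock *sk, int how)
{
	assert(sk->sk_prot && sk->sk_prot->shutdown);
	sk->sk_prot->shutdown(sk, how);
}
```

The design point is that the per-type dispatch happens once, at the `struct proto` level that socket code goes through anyway, rather than again inside shared TCP code.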
* Investigate whether and how MPTCP could build on KCM (Kernel Connection
Multiplexor) or the ULP (upper-layer protocol) infrastructure.
* There are still open questions around the API, path management, and so on.
Tessares has some experience there, so maybe they can provide some ideas.
* The size of the skb. Well, we have been discussing this for quite a while :)
One option is always a lookup table, as is done for TLS records.
That will hurt performance, but at least it's a step forward.
And we have a bunch of other ideas that might be worth exploring as well.
If I'm not mistaken, Rao had an approach that could work as well, right?
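A sketch of the lookup-table option, assuming the per-skb MPTCP state to move out of the skb is the DSS mapping (a subflow sequence range mapped to a 64-bit data sequence number). `struct dss_map` and `dss_lookup()` are hypothetical; the table is assumed sorted and non-overlapping, and 32-bit subflow sequence wraparound is ignored for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-subflow side table: one entry per DSS mapping,
 * so the 64-bit data sequence number need not live in the skb. */
struct dss_map {
	uint32_t subflow_seq;	/* first subflow sequence number covered */
	uint32_t len;		/* bytes covered by this mapping */
	uint64_t data_seq;	/* data sequence number of subflow_seq */
};

/* Binary-search the mapping that covers @seq and translate it.
 * Returns 1 and fills *data_seq on success, 0 if @seq is unmapped. */
static int dss_lookup(const struct dss_map *tbl, size_t n, uint32_t seq,
		      uint64_t *data_seq)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const struct dss_map *m = &tbl[mid];

		if (seq < m->subflow_seq) {
			hi = mid;
		} else if (seq - m->subflow_seq >= m->len) {
			lo = mid + 1;
		} else {
			*data_seq = m->data_seq + (seq - m->subflow_seq);
			return 1;
		}
	}
	return 0;
}
```

The O(log n) search on every lookup is the performance cost mentioned above, traded for not growing every skb.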
Any other comments, suggestions,...? :-)
On my side, as a first concrete step, I will work towards lockless subflow
establishment. In tcp_v4_rcv, we currently take the meta-level lock
when the socket lookup matches a request socket. Now that TCP supports
lockless listeners, MPTCP should do that as well.
This weekend I was at the MPTCP Hackathon at UCLouvain.
The people from Tessares (Matt and Gregory in CC) told me there that they are
interested in participating and helping on the upstream effort.
You probably know both of them from the mptcp-dev mailing list and some
scientific papers on MPTCP. They have a lot of experience in evolving the
Linux MPTCP implementation and I'm sure they will be of great help.
Besides code, they suggested that they could also contribute in terms of
tooling (syzbot comes to mind), infrastructure and in other ways. I'll let
them chime in with their ideas and suggestions.
In any case, welcome to the list, Gregory and Matt!
I'm sending a separate e-mail to kick off a discussion on next steps.