
Commit 9531ab6

Merge branch 'kcm'
Tom Herbert says:

====================
kcm: Kernel Connection Multiplexor (KCM)

Kernel Connection Multiplexor (KCM) is a facility that provides a
message based interface over TCP for generic application protocols.
The motivation for this is based on the observation that although TCP
is a byte stream transport protocol with no concept of message
boundaries, a common use case is to implement a framed application
layer protocol running over TCP. To date, most TCP stacks offer a byte
stream API for applications, which places the burden of message
delineation, message I/O operation atomicity, and load balancing on the
application. With KCM an application can efficiently send and receive
application protocol messages over TCP using a datagram interface.

In order to delineate messages in a TCP stream for receive in KCM, the
kernel implements a message parser. For this we chose to employ BPF,
which is applied to the TCP stream. BPF code parses application layer
messages and returns a message length. Nearly all binary application
protocols are parsable in this manner, so KCM should be applicable
across a wide range of applications. Other than message length
determination in receive, KCM does not require any other application
specific awareness. KCM does not implement any other application
protocol semantics -- these are provided in userspace or could be
implemented in a kernel module layered above KCM.

KCM implements an NxM multiplexor in the kernel as diagrammed below:

+------------+   +------------+   +------------+   +------------+
| KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
+------------+   +------------+   +------------+   +------------+
      |                 |               |                |
      +-----------+     |               |     +----------+
                  |     |               |     |
               +----------------------------------+
               |           Multiplexor            |
               +----------------------------------+
                 |   |           |           |  |
       +---------+   |           |           |  ------------+
       |             |           |           |              |
+----------+  +----------+  +----------+  +----------+  +----------+
|  Psock   |  |  Psock   |  |  Psock   |  |  Psock   |  |  Psock   |
+----------+  +----------+  +----------+  +----------+  +----------+
      |             |            |            |             |
+----------+  +----------+  +----------+  +----------+  +----------+
| TCP sock |  | TCP sock |  | TCP sock |  | TCP sock |  | TCP sock |
+----------+  +----------+  +----------+  +----------+  +----------+

The KCM sockets provide the datagram interface to applications; Psocks
are the state for each attached TCP connection (i.e. where message
delineation is performed on receive). A description of the APIs and
design can be found in the included Documentation/networking/kcm.txt.

In this patch set:

  - Add MSG_BATCH flag. This is used in sendmsg msg_hdr flags to
    indicate that more messages will be sent on the socket. The stack
    may batch messages up if it is beneficial for transmission.
  - In sendmmsg, set MSG_BATCH in all sub messages except for the last
    one.
  - In order to allow sendmmsg to contain multiple messages with
    SOCK_SEQPACKET we allow each msg_hdr in the sendmmsg to set MSG_EOR.
  - Add KCM module
    - This supports SOCK_DGRAM and SOCK_SEQPACKET.
  - KCM documentation

v2:
  - Added splice and page operations.
  - Assemble receive messages in place on the TCP socket (don't have a
    separate assembly queue).
  - Based on the above, enforce the maximum receive message to be the
    size of the receive socket buffer.
  - Support message assembly timeout. Use the timeout value in
    sk_rcvtimeo on the TCP socket.
  - Tested some with a couple of other production applications; see
    ~5% improvement in application latency.

Testing:

Dave Watson has integrated KCM into Thrift and we intend to put these
changes into open source. An example of this is in:

https://github.com/djwatson/fbthrift/commit/dd7e0f9cf4e80912fdb90f6cd394db24e61a14cc

Some initial KCM Thrift benchmark numbers (comment from Dave):

Thrift by default ties a single connection to a single thread. KCM is
instead able to load balance multiple connections across multiple epoll
loops easily.

A test sending ~5k bytes of data to a kcm thrift server, dropping the
bytes on recv:

                  QPS       Latency / std dev Latency
  without KCM    70336      209/123
  with KCM       70353      191/124

A test sending a small request, then doing work in the epoll thread,
before serving more requests:

                  QPS       Latency / std dev Latency
  without KCM    14282      559/602
  with KCM       23192      344/234

At the high end, there's definitely some additional kernel overhead.
Cranking the pipelining way up, with lots of small requests:

                  QPS       Latency / std dev Latency
  without KCM    1863429    127/119
  with KCM       1337713    192/241

---

So for a "realistic" workload, KCM performs pretty well (second case).
Under extreme conditions of highest tps we still have some work to do.
By its nature a multiplexor will spread work between CPUs, which is
logically good for load balancing but can conflict with the goal of
promoting affinity. Batching messages on both send and receive are the
means to recoup performance.

Future support:

 - Integration with TLS (TLS-in-kernel is a separate initiative).
 - Page operations/splice support.
 - Unconnected KCM sockets. These will be able to attach sockets to
   different destinations; AF_KCM addresses will be used in sendmsg and
   recvmsg to indicate the destination.
 - Explore more utility in performing BPF inline with a TCP data stream
   (setting SO_MARK, rxhash for messages being sent or received on KCM
   sockets).
 - Performance work
   - Diagnose performance issues under high message load.

FAQ (Questions posted on LWN)

Q: Why do this in the kernel?

A: Because the kernel is good at scheduling threads and steering
   packets to threads. KCM fits well into this model since it allows
   the unit of work for scheduling and steering to be the application
   layer messages themselves. KCM should be thought of as generic
   application protocol acceleration. It fits into the philosophy that
   the kernel provides generic and extensible interfaces.

Q: How can adding code in the path yield better performance?

A: It is true that for just sending or receiving a single message there
   would be some performance loss since the code path is longer (for
   instance comparing netperf to KCM). But for real production
   applications performance takes on many dynamics. Parallelism,
   context switching, affinity, granularity of locking, and load
   balancing are all relevant. The theory of KCM is that by providing
   an application-centric interface, the kernel can provide better
   support for these performance characteristics.

Q: Why not use an existing message-oriented protocol such as RUDP,
   DCCP, SCTP, RDS, and others?

A: Because that would entail using a completely new transport protocol.
   Deploying a new protocol at scale is either a huge undertaking or
   fundamentally infeasible. This is true both in the Internet and in
   the data center, due in large part to protocol ossification.
   Besides, we want KCM to work with existing, well deployed
   application protocols that we couldn't change even if we wanted to
   (e.g. http/2).

   KCM simply defines a new interface method; it does not redefine any
   aspect of the transport protocol nor the application protocol, nor
   set any new requirements on these. Neither does KCM attempt to
   implement any application protocol logic other than message
   delineation in the stream. These are fundamental requirements of
   KCM.

Q: How does this affect TCP?

A: It doesn't, not in the slightest. The use of KCM can be one-sided;
   KCM has no effect on the wire.

Q: Why force TCP into doing something it's not designed for?

A: TCP is defined as a transport protocol and there is no standard that
   says the API into TCP must be stream based sockets, or for that
   matter sockets at all (or even that TCP needs to be implemented in a
   kernel). KCM is not inconsistent with the design of TCP just because
   it makes a message based interface over TCP; if it were, then every
   application protocol sending messages over TCP would also be! :-)

Q: What about the problem of connections with a very slow rate of
   incoming data? As a result your application can get storms of very
   short reads. And it actually happens a lot with connections from
   mobile devices and it is a problem for servers handling a lot of
   connections.

A: The storm of short reads will occur regardless of whether KCM is
   used or not. KCM does have one advantage in this scenario, though:
   it will only wake up the application when a full message has been
   received, not for each packet that makes up part of a bigger
   message. If a bunch of small messages are received, the application
   can receive messages in batches using recvmmsg.

Q: Why not just use DPDK, or at least provide KCM-like functionality in
   DPDK?

A: DPDK, or more generally OS bypass (presumably with a TCP stack in
   userland), presents a different model of load balancing than that of
   KCM (and the kernel). KCM implements load balancing of messages
   across the threads of an application, whereas DPDK load balances
   based on queues, which are more static and coarse-grained since
   multiple connections are bound to queues. DPDK works best when
   processing of packets is siloed in a thread on the CPU processing a
   queue, and packet processing (for both the stack and application) is
   fairly uniform. KCM works well for applications where the amount of
   work to process messages varies and application work is commonly
   delegated to worker threads, often on different CPUs. The message
   based interface over TCP is something that could be provided by a
   DPDK or OS bypass library.

Q: I'm not quite seeing this for HTTP. Maybe for HTTP/2, I guess, or
   web sockets?

A: Yes. KCM is most appropriate for message based protocols over TCP
   where it is easy to deduce the message length (e.g. a length field)
   and the protocol implements its own message ordering semantics.
   Fortunately this encompasses many modern protocols.

Q: How is memory limited and controlled?

A: In v2 all data for messages is now kept in socket buffers, either
   those for TCP or KCM, so socket buffer limits are applicable. This
   includes receive message assembly, which is now done on the TCP
   socket buffer instead of a separate queue -- this has the
   consequence that the TCP socket buffer limit provides an enforceable
   maximum message size. Additionally, a timeout may be set for message
   assembly. The value used for this is taken from sk_rcvtimeo of the
   TCP socket.
====================

Signed-off-by: David S. Miller <[email protected]>
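As an illustration of the receive-side batching mentioned in the FAQ, a
minimal userspace sketch follows. It is not part of the patch set; the
kcmfd descriptor, the batch size, and the buffer size are assumptions
made for the example.

  /* Sketch: drain up to a batch of complete KCM messages with one
   * recvmmsg() call. Assumes kcmfd is an AF_KCM socket and that
   * messages fit in 16 KB buffers.
   */
  #define _GNU_SOURCE /* for recvmmsg */
  #include <sys/socket.h>
  #include <string.h>

  #define BATCH 8
  #define MSGLEN 16384

  static int recv_batch(int kcmfd)
  {
        static char bufs[BATCH][MSGLEN];
        struct iovec iov[BATCH];
        struct mmsghdr msgs[BATCH];
        int i;

        memset(msgs, 0, sizeof(msgs));
        for (i = 0; i < BATCH; i++) {
                iov[i].iov_base = bufs[i];
                iov[i].iov_len = MSGLEN;
                msgs[i].msg_hdr.msg_iov = &iov[i];
                msgs[i].msg_hdr.msg_iovlen = 1;
        }

        /* One wakeup can return several complete messages. */
        return recvmmsg(kcmfd, msgs, BATCH, MSG_DONTWAIT, NULL);
  }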
2 parents 26e9093 + 1001659 commit 9531ab6

File tree

16 files changed: +3483, -43 lines changed

Documentation/networking/kcm.txt

Lines changed: 285 additions & 0 deletions
@@ -0,0 +1,285 @@
Kernel Connection Multiplexor
-----------------------------

Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based
interface over TCP for generic application protocols. With KCM an application
can efficiently send and receive application protocol messages over TCP using
datagram sockets.

KCM implements an NxM multiplexor in the kernel as diagrammed below:

+------------+   +------------+   +------------+   +------------+
| KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
+------------+   +------------+   +------------+   +------------+
      |                 |               |                |
      +-----------+     |               |     +----------+
                  |     |               |     |
               +----------------------------------+
               |           Multiplexor            |
               +----------------------------------+
                 |   |           |           |  |
       +---------+   |           |           |  ------------+
       |             |           |           |              |
+----------+  +----------+  +----------+  +----------+  +----------+
|  Psock   |  |  Psock   |  |  Psock   |  |  Psock   |  |  Psock   |
+----------+  +----------+  +----------+  +----------+  +----------+
      |             |            |            |             |
+----------+  +----------+  +----------+  +----------+  +----------+
| TCP sock |  | TCP sock |  | TCP sock |  | TCP sock |  | TCP sock |
+----------+  +----------+  +----------+  +----------+  +----------+

KCM sockets
-----------

The KCM sockets provide the user interface to the multiplexor. All the KCM sockets
bound to a multiplexor are considered to have equivalent function, and I/O
operations in different sockets may be done in parallel without the need for
synchronization between threads in userspace.

Multiplexor
-----------

The multiplexor provides the message steering. In the transmit path, messages
written on a KCM socket are sent atomically on an appropriate TCP socket.
Similarly, in the receive path, messages are constructed on each TCP socket
(Psock) and complete messages are steered to a KCM socket.

TCP sockets & Psocks
--------------------

TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated
for each bound TCP socket; this structure holds the state for constructing
messages on receive as well as other connection specific information for KCM.

Connected mode semantics
------------------------

Each multiplexor assumes that all attached TCP connections are to the same
destination and can use the different connections for load balancing when
transmitting. The normal send and recv calls (including sendmmsg and recvmmsg)
can be used to send and receive messages from the KCM socket.

Socket types
------------

KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types.

Message delineation
-------------------

Messages are sent over a TCP stream with some application protocol message
format that typically includes a header which frames the messages. The length
of a received message can be deduced from the application protocol header
(often just a simple length field).

A TCP stream must be parsed to determine message boundaries. Berkeley Packet
Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a
BPF program must be specified. The program is called at the start of receiving
a new message and is given an skbuff that contains the bytes received so far.
It parses the message header and returns the length of the message. Given this
information, KCM will construct the message of the stated length and deliver it
to a KCM socket.

TCP socket management
---------------------

When a TCP socket is attached to a KCM multiplexor, data ready (POLLIN) and
write space available (POLLOUT) events are handled by the multiplexor. If there
is a state change (disconnection) or other error on a TCP socket, an error is
posted on the TCP socket so that a POLLERR event happens and KCM discontinues
using the socket. When the application gets the error notification for a
TCP socket, it should unattach the socket from KCM and then handle the error
condition (the typical response is to close the socket and create a new
connection if necessary).

KCM limits the maximum receive message size to be the size of the receive
socket buffer on the attached TCP socket (the socket buffer size can be set by
SO_RCVBUF). If the length of a new message reported by the BPF program is
greater than this limit a corresponding error (EMSGSIZE) is posted on the TCP
socket. The BPF program may also enforce a maximum message size and report an
error when it is exceeded.

A timeout may be set for assembling messages on a receive socket. The timeout
value is taken from the receive timeout of the attached TCP socket (this is set
by SO_RCVTIMEO). If the timer expires before assembly is complete an error
(ETIMEDOUT) is posted on the socket.

User interface
==============

Creating a multiplexor
----------------------

A new multiplexor and initial KCM socket is created by a socket call:

  socket(AF_KCM, type, protocol)

  - type is either SOCK_DGRAM or SOCK_SEQPACKET
  - protocol is KCMPROTO_CONNECTED
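A minimal sketch of this call, assuming the program only needs a single
KCM socket to start with (most error handling omitted):

  /* Sketch: create a multiplexor with an initial SOCK_DGRAM KCM
   * socket. AF_KCM and KCMPROTO_CONNECTED come from the headers
   * added by this patch set; older libc headers may lack AF_KCM.
   */
  #include <stdio.h>
  #include <sys/socket.h>
  #include <linux/kcm.h>

  int kcmfd = socket(AF_KCM, SOCK_DGRAM, KCMPROTO_CONNECTED);
  if (kcmfd < 0)
        perror("socket(AF_KCM)");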
Cloning KCM sockets
-------------------

After the first KCM socket is created using the socket call as described
above, additional sockets for the multiplexor can be created by cloning
a KCM socket. This is accomplished by an ioctl on a KCM socket:

  /* From linux/kcm.h */
  struct kcm_clone {
        int fd;
  };

  struct kcm_clone info;

  memset(&info, 0, sizeof(info));

  err = ioctl(kcmfd, SIOCKCMCLONE, &info);

  if (!err)
        newkcmfd = info.fd;

Attach transport sockets
------------------------

Attaching of transport sockets to a multiplexor is performed by calling an
ioctl on a KCM socket for the multiplexor. e.g.:

  /* From linux/kcm.h */
  struct kcm_attach {
        int fd;
        int bpf_fd;
  };

  struct kcm_attach info;

  memset(&info, 0, sizeof(info));

  info.fd = tcpfd;
  info.bpf_fd = bpf_prog_fd;

  ioctl(kcmfd, SIOCKCMATTACH, &info);

The kcm_attach structure contains:

  fd: file descriptor for the TCP socket being attached
  bpf_fd: file descriptor for the compiled BPF program

Unattach transport sockets
--------------------------

Unattaching a transport socket from a multiplexor is straightforward. An
"unattach" ioctl is done with the kcm_unattach structure as the argument:

  /* From linux/kcm.h */
  struct kcm_unattach {
        int fd;
  };

  struct kcm_unattach info;

  memset(&info, 0, sizeof(info));

  info.fd = cfd;

  ioctl(fd, SIOCKCMUNATTACH, &info);

Disabling receive on KCM socket
-------------------------------

A setsockopt is used to disable or enable receiving on a KCM socket.
When receive is disabled, any pending messages in the socket's
receive buffer are moved to other sockets. This feature is useful
if an application thread knows that it will be doing a lot of
work on a request and won't be able to service new messages for a
while. Example use:

  int val = 1;

  setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val))

BPF programs for message delineation
------------------------------------

BPF programs can be compiled using the BPF LLVM backend. For example,
the BPF program for parsing Thrift is:

  #include "bpf.h" /* for __sk_buff */
  #include "bpf_helpers.h" /* for load_word intrinsic */

  SEC("socket_kcm")
  int bpf_prog1(struct __sk_buff *skb)
  {
        return load_word(skb, 0) + 4;
  }

  char _license[] SEC("license") = "GPL";
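In this program, load_word(skb, 0) reads the four byte frame length at
the start of a Thrift framed message, and the added 4 accounts for the
length field itself, so the value returned to KCM is the total length
of the message in the stream.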
Use in applications
===================

KCM accelerates application layer protocols. Specifically, it allows
applications to use a message based interface for sending and receiving
messages. The kernel provides necessary assurances that messages are sent
and received atomically. This relieves much of the burden applications have
in mapping a message based protocol onto the TCP stream. KCM also makes
application layer messages a unit of work in the kernel for the purposes of
steering and scheduling, which in turn allows a simpler networking model in
multithreaded applications.

Configurations
--------------

In an Nx1 configuration, KCM logically provides multiple socket handles
to the same TCP connection. This allows parallelism in I/O
operations on the TCP socket (for instance copyin and copyout of data is
parallelized). In an application, a KCM socket can be opened for each
processing thread and inserted into the epoll (similar to how SO_REUSEPORT
is used to allow multiple listener sockets on the same port).

In an MxN configuration, multiple connections are established to the
same destination. These are used for simple load balancing.

Message batching
----------------

The primary purpose of KCM is load balancing between KCM sockets, and hence
threads, in a nominal use case. Perfect load balancing, that is steering
each received message to a different KCM socket or steering each sent
message to a different TCP socket, can negatively impact performance
since this doesn't allow for affinities to be established. Balancing
based on groups, or batches of messages, can be beneficial for performance.

On transmit, there are three ways an application can batch (pipeline)
messages on a KCM socket (a sketch of the second method follows the list):

  1) Send multiple messages in a single sendmmsg.
  2) Send a group of messages, each with a sendmsg call, where all messages
     except the last have MSG_BATCH in the flags of the sendmsg call.
  3) Create a "super message" composed of multiple messages and send this
     with a single sendmsg.
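The following sketch illustrates the second method. The fallback define
and the caller-prepared msgs array are assumptions of the example:

  /* Sketch: flag all but the last message with MSG_BATCH so the
   * stack may hold back transmission and batch the messages.
   */
  #include <sys/socket.h>

  #ifndef MSG_BATCH
  #define MSG_BATCH 0x40000 /* added by this patch set */
  #endif

  static int send_batch(int kcmfd, struct msghdr *msgs, int nmsgs)
  {
        int i;

        for (i = 0; i < nmsgs; i++) {
                int flags = (i < nmsgs - 1) ? MSG_BATCH : 0;

                if (sendmsg(kcmfd, &msgs[i], flags) < 0)
                        return -1;
        }
        return 0;
  }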
On receive, the KCM module attempts to queue messages received on the
same KCM socket during each TCP ready callback. The targeted KCM socket
changes at each receive ready callback on the KCM socket. The application
does not need to configure this.

Error handling
--------------

An application should include a thread to monitor errors raised on
the TCP connection. Normally, this will be done by placing each
TCP socket attached to a KCM multiplexor in an epoll set for the POLLERR
event. If an error occurs on an attached TCP socket, KCM sets an EPIPE
on the socket, thus waking up the application thread. When the application
sees the error (which may just be a disconnect) it should unattach the
socket from KCM and then close it. It is assumed that once an error is
posted on the TCP socket the data stream is unrecoverable (i.e. an error
may have occurred in the middle of receiving a message).
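A sketch of such a monitoring loop is below. It assumes the attached TCP
sockets were already registered in the epoll set with their fd stored in
ev.data.fd; the reconnect policy is left to the application:

  /* Sketch: watch attached TCP sockets for errors; on error,
   * unattach the socket from the multiplexor and close it.
   */
  #include <sys/epoll.h>
  #include <sys/ioctl.h>
  #include <unistd.h>
  #include <linux/kcm.h>

  static void monitor_tcp_errors(int epfd, int kcmfd)
  {
        struct epoll_event ev;

        for (;;) {
                if (epoll_wait(epfd, &ev, 1, -1) < 1)
                        continue;
                if (ev.events & (EPOLLERR | EPOLLHUP)) {
                        struct kcm_unattach info = { .fd = ev.data.fd };

                        ioctl(kcmfd, SIOCKCMUNATTACH, &info);
                        close(ev.data.fd);
                        /* application may reconnect and re-attach */
                }
        }
  }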
TCP connection monitoring
-------------------------

In KCM there is no means to correlate a message to the TCP socket that
was used to send or receive the message (except in the case there is
only one attached TCP socket). However, the application does retain
an open file descriptor to the socket, so it will be able to get statistics
from the socket which can be used in detecting issues (such as high
retransmissions on the socket).
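For instance, retransmission counts can be read with the standard
TCP_INFO socket option (this is ordinary Linux TCP functionality, not
part of KCM itself):

  /* Sketch: read the total retransmission count of an attached
   * TCP socket for health monitoring.
   */
  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <netinet/tcp.h>

  static unsigned int tcp_retrans(int tcpfd)
  {
        struct tcp_info ti;
        socklen_t len = sizeof(ti);

        if (getsockopt(tcpfd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0)
                return 0;
        return ti.tcpi_total_retrans;
  }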

include/linux/net.h

Lines changed: 1 addition & 0 deletions
@@ -215,6 +215,7 @@ int __sock_create(struct net *net, int family, int type, int proto,
 int sock_create(int family, int type, int proto, struct socket **res);
 int sock_create_kern(struct net *net, int family, int type, int proto, struct socket **res);
 int sock_create_lite(int family, int type, int proto, struct socket **res);
+struct socket *sock_alloc(void);
 void sock_release(struct socket *sock);
 int sock_sendmsg(struct socket *sock, struct msghdr *msg);
 int sock_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,

include/linux/rculist.h

Lines changed: 21 additions & 0 deletions
@@ -318,6 +318,27 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
 	likely(__ptr != __next) ? list_entry_rcu(__next, type, member) : NULL; \
 })
 
+/**
+ * list_next_or_null_rcu - get the next element from a list
+ * @head:	the head for the list.
+ * @ptr:	the list head to take the next element from.
+ * @type:	the type of the struct this is embedded in.
+ * @member:	the name of the list_head within the struct.
+ *
+ * Note that if the ptr is at the end of the list, NULL is returned.
+ *
+ * This primitive may safely run concurrently with the _rcu list-mutation
+ * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
+ */
+#define list_next_or_null_rcu(head, ptr, type, member) \
+({ \
+	struct list_head *__head = (head); \
+	struct list_head *__ptr = (ptr); \
+	struct list_head *__next = READ_ONCE(__ptr->next); \
+	likely(__next != __head) ? list_entry_rcu(__next, type, \
+						  member) : NULL; \
+})
+
 /**
  * list_for_each_entry_rcu - iterate over rcu list of given type
  * @pos:	the type * to use as a loop cursor.
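A brief usage sketch of the new macro follows; the struct foo list is
hypothetical, and the caller is assumed to hold rcu_read_lock():

  #include <linux/rculist.h>

  struct foo {
        int val;
        struct list_head list;
  };

  /* Return the entry after cur, or NULL if cur is the last entry.
   * Must be called under rcu_read_lock().
   */
  static struct foo *next_foo(struct list_head *head, struct foo *cur)
  {
        return list_next_or_null_rcu(head, &cur->list, struct foo, list);
  }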

include/linux/socket.h

Lines changed: 6 additions & 1 deletion
@@ -200,7 +200,9 @@ struct ucred {
 #define AF_ALG		38	/* Algorithm sockets		*/
 #define AF_NFC		39	/* NFC sockets			*/
 #define AF_VSOCK	40	/* vSockets			*/
-#define AF_MAX		41	/* For now.. */
+#define AF_KCM		41	/* Kernel Connection Multiplexor*/
+
+#define AF_MAX		42	/* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC	AF_UNSPEC
@@ -246,6 +248,7 @@ struct ucred {
 #define PF_ALG		AF_ALG
 #define PF_NFC		AF_NFC
 #define PF_VSOCK	AF_VSOCK
+#define PF_KCM		AF_KCM
 #define PF_MAX		AF_MAX
 
 /* Maximum queue length specifiable by listen. */
@@ -274,6 +277,7 @@ struct ucred {
 #define MSG_MORE	0x8000	/* Sender will send more */
 #define MSG_WAITFORONE	0x10000	/* recvmmsg(): block until 1+ packets avail */
 #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
+#define MSG_BATCH	0x40000 /* sendmmsg(): more messages coming */
 #define MSG_EOF		MSG_FIN
 
 #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
@@ -322,6 +326,7 @@ struct ucred {
 #define SOL_CAIF	278
 #define SOL_ALG		279
 #define SOL_NFC		280
+#define SOL_KCM		281
 
 /* IPX options */
 #define IPX_TYPE	1
