Flow 1.0.1
Flow project: Full implementation reference.
low_lvl_io.hpp
Go to the documentation of this file.
1/* Flow
2 * Copyright 2023 Akamai Technologies, Inc.
3 *
4 * Licensed under the Apache License, Version 2.0 (the
5 * "License"); you may not use this file except in
6 * compliance with the License. You may obtain a copy
7 * of the License at
8 *
9 * https://www.apache.org/licenses/LICENSE-2.0
10 *
11 * Unless required by applicable law or agreed to in
12 * writing, software distributed under the License is
13 * distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
14 * CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing
16 * permissions and limitations under the License. */
17
18/// @file
19#pragma once
20
24
25namespace flow::net_flow
26{
27
28/**
29 * The current outgoing packet pacing state, including queue of low-level packets to be sent, for a
30 * given Peer_socket.
31 *
32 * This structure is a data store and not an object. It is a separate `struct`, instead of directly
33 * inside Peer_socket, to make the complex Peer_socket structure's definition shorter and easier to
34 * understand. Accordingly the Node code dealing with pacing (and working on this structure and
35 * related data) is in a separate file (low_lvl_io.cpp instead of peer_socket.cpp or
36 * node.cpp).
37 *
38 * ### Background on packet pacing ###
39 * Empirically we've seen that sending out many UDP datagrams at the same
40 * time may cause extra -- sometimes quite major -- loss somewhere in the network. To combat that,
41 * we've seen simple packet pacing be effective (and I hear Linux kernel does it too). The idea is
42 * to keep datagrams being sent at below a certain dynamically adjusting rate. More specifically,
43 * if we want to (and, subject to the send window, are allowed to) send N equally sized available
44 * packets over T msec, it may prevent loss to send them equally spaced T/N msec apart, as
45 * opposed to sending a burst of N packets immediately. In other words the total send rate over
46 * time period T is the same in either case, but in the latter case this is accomplished smoothly
47 * instead of burstily. This is pacing: minimize burstiness without lowering overall throughput.
48 * The pending packets that cannot be sent immediately are added to a queue to be sent later.
49 *
50 * Suppose we keep the pacing system as above; begin a pacing time slice of length T, and over that time
51 * period send packets spaced at T/N msec apart; once T is over, recompute T and N and repeat. Let T be
52 * decided. We shouldn't use N = [number of packets pending to send], since at any moment many more can come
53 * in and get queued (causing backup of the queue). The natural solution is to use T = round trip time (SRTT)
54 * and N = congestion window (CWND), since those by definition set the maximum rate we will allow packets to
55 * be sent out, because the network cannot support more. Thus, the interval between packets is S = SRTT /
56 * CWND.
57 *
58 * To make this system more responsive, S = SRTT / CWND can be recomputed after queued packet is
59 * sent (instead of recomputing it every SRTT as mentioned above; as SRTT and CWND are continuously
60 * computed as acknowledgments come in, much more frequently than every SRTT).
61 *
62 * More precisely, whenever we send a packet, start a timer for S = SRTT / CWND msec. While the
63 * timer is running any new packet available for sending should be added to the packet queue Q and
64 * not sent. Once tomer fires, send packet at top of Q, if any (and start the timer again,
65 * since packet was sent). If Q is empty, and the timer is not running, and a new packet is
66 * available for sending, it should be sent immediately. Finally, any packet available for sending
67 * until SRTT is known (i.e., before the first acknowledgment comes in) should be sent immediately
68 * (all packets after that are paced per above algorithm).
69 *
70 * There is a technical limitation that complicates the above. The timer period S can easily be
71 * quite low, e.g., S = 100 msec / 100 = 1 msec. In the current implementation (boost.asio) there
72 * is no timer mechanism with that precision (platform-dependent, but see comment on util::Timer alias)
73 * (we can *measure* time with much better precision, but we cannot *schedule* actions). Therefore we
74 * cannot always schedule timer for S. Suppose the timer has a minimum scheduling precision of < R.
75 * Since we mathematically cannot avoid burstiness if S = SRTT / CWND < R, we must minimize it;
76 * namely, in that case, set S = R and, instead of 1 packet max during that period,
77 * allow N = R / (SRTT / CWND) packets during that period. Thus the simple "queue packet each time
78 * it is available, if timer is running" policy changes to the slightly more complex "start counter
79 * K = N when timer starts; send packet and decrement K each time packet is available, until K is
80 * down to zero, then queue packet." Note that the latter reduces to the former when S >= R (i.e.,
81 * timer precision is not a problem).
82 *
83 * Now to formally define the invariants guiding the algorithm: let [T, T + S) be a time slice,
84 * where T is the time point at which the first packet is sent once an SRTT is known. S =
85 * max(floor(SRTT / CWND), R), where SRTT and CWND (which is in units of max-block-sizes) are
86 * determined at time T. Each packet should be sent out as soon as possible after it becomes
87 * available for sending, subject to the following constraint: at most N = S / (SRTT / CWND)
88 * full-sized blocks' worth of packets may be sent during the time slice [T, T + S); any packets
89 * beyond that should be queued on packet queue Q (and sent based on the policy in his sentence, in
90 * order of Q). After the time slice is finished, the next time slice [T, T + S) is re-defined
91 * at the time T the first packet available for sending appears (at which time S and N are
92 * recomputed), and the above policy applies. This continues indefinitely.
93 *
94 * ### Packets other than DATA ###
95 * There is one final consideration. The above discussion implicitly
96 * treats all packets as DATA packets. Since SYN/SYN_ACK/SYN_ACK_ACK packets all occur before the
97 * first SRTT measurement is available (but see below to-do), they are not relevant. However, there
98 * are also ACK and RST packets. First, let's deal with RST; an RST is by definition the very last
99 * packet we can send on a connection. We have no desire for it to be dropped any more than a DATA
100 * packet; so it can be subject to pacing just like DATA and placed onto Q, etc. Since it's the
101 * last packet to be sent, there is nothing more to discuss. However, that would mean the sending
102 * of RST may occur after socket is in state CLOSED. While we could make that work, it introduces
103 * unnecessary complication; we prefer to consider the Peer_socket quite dead once it is in CLOSED.
104 * Therefore I choose to not pace RSTs: send it immediately. This MAY put the RST ahead of other
105 * queued packets (ACK, DATA), but since RST indicates a sudden connection break, this is more or less
106 * acceptable (and those other packets will not be sent, as we stop sending paced packets once in
107 * CLOSED). It may also increase the chance the RST is dropped, but I highly doubt it's significant
108 * in reality.
109 *
110 * What about ACKs? Probably we must place them onto Q, since sending them earlier would at least
111 * mean changing the original order of Node's intended outgoing packet stream. However, how many
112 * bytes, in comparison to (N * max-block-size), is an ACK worth? We could do this a couple of
113 * ways. It could be worth max-block-size, like a typical DATA packet; this is the conservative
114 * choice. However, it's probably not too realistic, since an ACK is much smaller than a full DATA.
115 * It could be worth 0 bytes (i.e., send any ACK as soon as it's at the head of Q, regardless of
116 * time slice), which is closer to the truth but not true (the other way). Or it could be worth ~K
117 * bytes, where K is the approximate size of its payload (not counting the common header data). That
118 * is most realistic, but would throw off a bit the appealing exact math when all DATA packets are
119 * full-sized. What to choose? The answer is probably not too important, because in most
120 * applications there will be no significant mixing of outgoing DATA and ACK packets (since either
121 * one side or the other side is usually sending, not both simultaneously). I choose to go with ACK
122 * = 0 bytes, since it is fairly close to realistic yet also simple to implement.
123 *
124 * ### Thread safety ###
125 * Meant to be accessed from thread W only.
126 *
127 * @todo One way to provide finer pacing, in the situation where we must send more than 1 packet
128 * within the minimal schedulable timer period, is to rely on events other than timer clock ticks
129 * that occur more frequently. Namely, ACKs will possibly arrive faster that the minimal timer
130 * schedulable ticks do. Therefore, when a time slice begins, we can send no packets (instead of
131 * all the packets that must be sent in that time slice, as is the case currently). Then over the
132 * course of the time slice we can opportunistically -- upon each ACK send the proper number of
133 * queued packets up to that point into the time slice (this can be computed accurately, because we
134 * can MEASURE time very accurately -- just can't schedule things as accurately). Finally, once the
135 * time slice does expire, send off any packets still remaining for that slice. This way we can
136 * opportunistically provide finer pacing, where the ACK clock allows it.
137 *
138 * @todo Obtain the first RTT measurement based on SYN-SYN_ACK or SYN_ACK-SYN_ACK_ACK interval; then
139 * SRTT will be available from the start, and packet pacing can be enabled with the very first DATA
140 * packet, avoiding the initial (albeit small) burst of DATA packets. This is pretty easy.
141 *
142 * @todo As noted above we perform packet pacing but currently choose to assign a value of
143 * 0 bytes to an ACK. That is, while we do preserve the order of DATA and ACK packets -- if
144 * both happen to be in the outgoing stream -- we do not delay the sending of the ACK once it is
145 * the next packet to be sent out. However, even so, an ACK's sending may be delayed by the
146 * pacing applied to DATA packets intermixed with it. Therefore the ACK delay measurement we
147 * take here may be incorrect (too low) in that case. This can cause overestimated RTTs on the
148 * sender's side. The to-do is to correct the ACK delay value in a given ACK by adding the
149 * pacing delay (if any) of the ACK to the individual ACK delays within it. Conceptually this
150 * is similar to the `m_sent_when` value being set when choosing to send a DATA packet and then
151 * corrected in the pacing module later. This to-do is not important until we in practice start
152 * mixing sending and receiving at the application layer... but still -- it's worth knowing that
153 * there is a design bug here.
154 *
155 * @todo My quick look into Google BBR, the recent advanced congestion control algorithm, suggests
156 * that the pacing algorithm can be used as part of congestion control rather than something merely
157 * applied after congestion control has made the CWND decision. I believe the idea would be this:
158 * If we can somehow very reliably determine the available bandwidth, which Google BBR does purport to do,
159 * then instead of controlling send rate via CWND, instead one could apply it to this pacing module.
160 * It is a slight modification of the algorithm described above: instead of the rate of sending being
161 * equal to CWND/SRTT, make it equal to the bandwidth determined through Google BBR. If I understand
162 * correctly, Google BBR does maintain a CWND as well, but this is a safety upper bound, in case the
163 * computed bandwidth is too high to be practical (in this case CWND will slow down data being delivered
164 * to the pacing module -- but, normally, this would not be the case, and the pacing module would be,
165 * in practice, the thing controlling the sending rate). Note this is a pretty big conceptual change;
166 * in particular, it would make the queue Q grow much larger than what we see currently.
167 *
168 * ### `Timer` vs util::schedule_task_from_now() ###
169 * In other places we have tended to replace a `Timer` with the far simpler util::schedule_task_from_now() (etc.)
170 * facility (which internally uses a `Timer` but hides its various annoyances and caveats). Why not here?
171 * Answer: These tasks are potentially scheduled *very* frequently; and the firing periods can push the limits of how
172 * small they can effectively be. Therefore, the perf caveats (as listed in util::schedule_task_from_now() doc header)
173 * apply in a big way -- moreover, exactly when the effects would be felt the most (namely at high throughputs -- which
174 * in turn is likeliest to peg process use). We should definitely not risk that in this case.
175 */
177 private boost::noncopyable
178{
179 // Types.
180
181 /// Short-hand for FIFO of low-level packets. `queue<>` dumbly has no `clear()`, so just use `deque<>` directly.
182 using Packet_q = std::deque<Low_lvl_packet::Const_ptr>;
183
184 // Data.
185
186 /**
187 * The time point at which the last pacing time slice began; or epoch if no packets sent so far
188 * (i.e., no time slices yet => undefined). The time period
189 * [#m_slice_start, #m_slice_start + #m_slice_period) defines the range of time points in the time
190 * slice.
191 */
193
194 /**
195 * The length of the current pacing time slice period; this depends on congestion window and SRTT
196 * on the containing socket, at the time point #m_slice_start. Undefined (`zero()`) if #m_slice_start
197 * is undefined.
198 */
200
201 /**
202 * This many bytes worth of DATA packets may still be sent, at this time, within the
203 * time slice defined by #m_slice_start and #m_slice_period; if this is less than the data size of
204 * `m_packet_q.front()`, then no more can be sent, and any packets that need to be sent must be sent
205 * in the next time slice (i.e., at time point `>= m_slice_start + m_slice_period`). Undefined
206 * (zero) if #m_slice_start is undefined.
207 *
208 * Why track it in bytes instead of multiples of max-block-size (as in the `struct` doc header)?
209 * Since the DATA packets in the queue may be of varying sizes, this allows us to more precisely
210 * (proportionally) count them against the #m_bytes_allowed_this_slice budget.
211 */
213
214 /**
215 * Queue of low-level packets to be sent to the remote endpoint, in order in which they are to be
216 * sent, with the head element the one to be sent next. Each new pending packet is to be added at
217 * the tail. Invariant: `m_packet_q.front()`, if it exists, is of type DATA after any
218 * Node::sock_pacing_new_packet_ready() or Node::sock_pacing_process_q() call returns. (This is consistent
219 * with the aforementioned "do not re-order non-DATA packets, but do not pace them" philosophy.)
220 */
222
223 /**
224 * When running, #m_packet_q is non-empty, #m_bytes_allowed_this_slice < data size of
225 * `m_packet_q.front()`, and the timer will fire when this time slice ends, and thus the head packet
226 * in #m_packet_q must be sent. In all other situations the timer is inactive. (Note that if the
227 * timer is not running, that does not mean any packet becoming available can be sent immediately;
228 * if the queue is empty and `m_bytes_allowed_this_slice == N`, the timer is off, but if a packet
229 * comes in with data size > `N` before `m_slice_start + m_slice_period` [within the current time
230 * slice], then #m_slice_timer must be set to fire at time `m_slice_start + m_slice_period`.)
231 */
233
234 // Constructors/destructor.
235
236 /**
237 * Initializes data to initial state (no active time slice).
238 *
239 * @param task_engine
240 * IO service for the timer stored as data member.
241 */
242 explicit Send_pacing_data(util::Task_engine* task_engine);
243}; // struct Send_pacing_data
244
245} // namespace flow::net_flow
Flow module containing the API and implementation of the Flow network protocol, a TCP-inspired stream...
Definition: node.cpp:25
boost::asio::io_service Task_engine
Short-hand for boost.asio event service, the central class of boost.asio.
Definition: util_fwd.hpp:135
boost::asio::basic_waitable_timer< Fine_clock > Timer
boost.asio timer.
Definition: util_fwd.hpp:202
Fine_clock::duration Fine_duration
A high-res time duration as computed from two Fine_time_pts.
Definition: common.hpp:410
Fine_clock::time_point Fine_time_pt
A high-res time point as returned by Fine_clock::now() and suitable for precise time math in general.
Definition: common.hpp:407
The current outgoing packet pacing state, including queue of low-level packets to be sent,...
Definition: low_lvl_io.hpp:178
util::Timer m_slice_timer
When running, m_packet_q is non-empty, m_bytes_allowed_this_slice < data size of m_packet_q....
Definition: low_lvl_io.hpp:232
size_t m_bytes_allowed_this_slice
This many bytes worth of DATA packets may still be sent, at this time, within the time slice defined ...
Definition: low_lvl_io.hpp:212
Send_pacing_data(util::Task_engine *task_engine)
Initializes data to initial state (no active time slice).
std::deque< Low_lvl_packet::Const_ptr > Packet_q
Short-hand for FIFO of low-level packets. queue<> dumbly has no clear(), so just use deque<> directly...
Definition: low_lvl_io.hpp:182
Packet_q m_packet_q
Queue of low-level packets to be sent to the remote endpoint, in order in which they are to be sent,...
Definition: low_lvl_io.hpp:221
Fine_duration m_slice_period
The length of the current pacing time slice period; this depends on congestion window and SRTT on the...
Definition: low_lvl_io.hpp:199
Fine_time_pt m_slice_start
The time point at which the last pacing time slice began; or epoch if no packets sent so far (i....
Definition: low_lvl_io.hpp:192