Flow 1.0.1
Flow project: Full implementation reference.
cong_ctl_util.cpp
/* Flow
 * Copyright 2023 Akamai Technologies, Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in
 * compliance with the License. You may obtain a copy
 * of the License at
 *
 *   https://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in
 * writing, software distributed under the License is
 * distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
 * CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing
 * permissions and limitations under the License. */

/// @file

#include <limits>

namespace flow::net_flow
{
// Static initializations.
/* By setting this to infinity, we disable any limit on per-ack-group CWND incrementing.
 * For reasons why see on_acks(). */
const size_t Congestion_control_classic_data::S_MAX_SLOW_START_CWND_INCREMENT_BLOCKS
  = std::numeric_limits<size_t>::max();

// Implementations.

Congestion_control_classic_data::Congestion_control_classic_data(log::Logger* logger_ptr,
                                                                 Peer_socket::Const_ptr sock) :
  log::Log_context(logger_ptr, Flow_log_component::S_NET_FLOW_CC),
  m_sock(sock),
  // Initialize CWND in a standard way, or use the option's overriding value if provided.
  m_cong_wnd_bytes(sock->max_block_size() * // Initial CWND a multiple of max-block-size always.
                   ((sock->opt(sock->m_opts.m_st_cong_ctl_init_cong_wnd_blocks) == 0)
                      ? ((sock->max_block_size() > 2190) // Not option-specified; use RFC 5681.
                           ? 2 // In (2190, ...).
                           : ((sock->max_block_size() > 1095)
                                ? 3 // In (1095, 2190].
                                : 4)) // In [1, 1095].
                      : sock->opt(sock->m_opts.m_st_cong_ctl_init_cong_wnd_blocks))), // Option-specified.
  m_init_cong_wnd_bytes(m_cong_wnd_bytes),
  // SSTHRESH set to infinity as recommended by RFC 5681-3.1 and others. Will be set to finite value on drop.
  m_slow_start_thresh_bytes(std::numeric_limits<size_t>::max()),
  m_acked_bytes_since_window_update(0)
{
  /* Initial CWND is chosen above according to RFC 5681-3.1 (for TCP) (unless it's overridden with
   * the m_st_cong_ctl_init_cong_wnd_blocks option).
   *
   * I'm blindly following the RFC here. Whatever reasoning led to the above formula is probably
   * about right for us too; IP+UDP+Flow overhead should not be too much different from IP+TCP
   * overhead. Anyway, the initial CWND is just a safe guess and will change quickly regardless. */

  log_state("construction");
}

size_t Congestion_control_classic_data::congestion_window_bytes() const
{
  return m_cong_wnd_bytes;
}

size_t Congestion_control_classic_data::init_congestion_window_bytes() const
{
  return m_init_cong_wnd_bytes;
}

size_t Congestion_control_classic_data::slow_start_threshold_bytes() const
{
  return m_slow_start_thresh_bytes;
}

void Congestion_control_classic_data::on_acks(size_t bytes)
{
  using std::min;

  log_state("ack-handling-before");

  const Peer_socket::Const_ptr sock = socket();

  if (congestion_window_at_limit())
  {
    /* Once we've maxed out CWND, some of the below calculations make no sense and wouldn't result
     * in an increased CWND anyway. If we are in slow start, this is nothing more than a little
     * optimization. If we are in congestion avoidance, then it's a bit more complex:
     * m_acked_bytes_since_window_update state remains unchanged, which may seem like it might have
     * implications on when CWND is no longer at the limit in the future (does congestion avoidance
     * resume as if nothing happened or what?). HOWEVER, any change in CWND that would lower it
     * will have to be done through congestion_event(), in which case
     * m_acked_bytes_since_window_update will reset to zero, so what we do with
     * m_acked_bytes_since_window_update at this point is immaterial. Therefore it's safe to simply
     * do nothing here. */
    FLOW_LOG_TRACE("cong_ctl [" << sock << "] update: cong_wnd already at limit.");
  }
  else if (in_slow_start())
  {
    /* We perform slow start CWND increase based on classic Reno TCP (RFC 5681) with the Appropriate
     * Byte Counting (ABC) modification (RFC 3465-2.2). It says, simply, to increment CWND by the
     * number of bytes acknowledged. By the way, ABC is the reason I ultimately chose to store CWND
     * in bytes, not max-blocks; implementing ABC with a multiples-of-max-block-size CWND is tricky.
     * (Linux TCP does it. The code is difficult to understand.)
     *
     * Well, there is one difficulty. The RFC also says to increase CWND as above but capped by a
     * maximum of L bytes, where L is 2 * max-block-size.
     *
     * The justification is: "A very large L could potentially lead to large line-rate bursts of
     * traffic in the face of a large amount of ACK loss or in the case when the receiver sends
     * 'stretch ACKs' (ACKs for more than the two full-sized segments allowed by the delayed ACK
     * algorithm)." I don't know what the first part means, but let's deal with the second part.
     * For now, let's say we know for a fact that the receiver's net_flow is the same as our net_flow.
     * net_flow does not send stretch acknowledgments; it sends an ACK -- at the latest -- when it
     * has 2 un-acked DATA segments ready to acknowledge. If we assume good behavior (which for now
     * we will), then "stretch ACKs" are not a concern.
     *
     * However, the main reason the limit of 2 makes no sense for us is the following. The RFC,
     * when describing how to increment CWND based on an ACK, is describing a slightly different
     * situation than on_acks(). To illustrate the difference, suppose 10 segments happen to arrive
     * at the same time, each acknowledging a different data segment sent earlier (for this example,
     * suppose each one acks one MSS). The RFC describes what to do for *each* of them. Therefore,
     * for each one CWND will increase by 1 MSS (for 10 MSS total). Now imagine the equivalent
     * situation in Flow protocol: 10 individual acknowledgments arrive at the same time (perhaps
     * even in one ACK, which can happen in high-bandwidth situations). Based on the documented rule
     * for calling on_acks(), it will be called ONCE, with bytes = 10 * max-block-size. If we were to
     * apply the limit of 2 we'd only increase CWND by 2 * max-block-size -- 5x less than even the
     * RFC would recommend! Therefore we should disregard the limit part of the RFC, as it doesn't
     * apply to us. We are perhaps vulnerable to our equivalent of "stretch ACKs" if the other
     * net_flow misbehaves, but let's not sweat it until it's a problem.
     *
     * ...That was difficult to explain in writing. The bottom line is, I do allow for a limit
     * S_MAX_SLOW_START_CWND_INCREMENT_BLOCKS a-la RFC but recommend that, at least for now, we set
     * it to basically infinity, because the model the RFC assumes is not quite the model we use and
     * would thus lead to far-too-slow CWND growth during slow start in a high-bandwidth situation
     * like transfer over localhost. I have seen that first-hand with the value set to 2. */

    const size_t limit = S_MAX_SLOW_START_CWND_INCREMENT_BLOCKS * sock->max_block_size();
    const size_t increment_bytes = min(bytes, limit);

    FLOW_LOG_TRACE("cong_ctl [" << sock << "] update: [slow_start] cong_wnd incrementing "
                   "by [" << sock->bytes_blocks_str(increment_bytes) << "]; "
                   "acked [" << sock->bytes_blocks_str(bytes) << "]; "
                   "max increment [" << sock->bytes_blocks_str(limit) << "].");

    m_cong_wnd_bytes += increment_bytes;
    clamp_congestion_window(); // If exceeded limit, bring it to the limit.
    /* Take opportunity to ensure our arithmetic is working; this field should be zero outside of
     * congestion avoidance. Thus if we just equalled or exceeded SSTHRESH, thus entering
     * congestion avoidance, then this is a clean slate. */
    assert(m_acked_bytes_since_window_update == 0);
  }
  else // if (!in_slow_start())
  {
    /* In congestion avoidance, also follow classic Reno RFC 5681 with the ABC modification (RFC
     * 3465-2.1). For every CWND bytes acked in a row, increase CWND by a block, and then decrease
     * the byte tally by the pre-increase CWND. This causes a linear-ish growth in CWND while in
     * congestion avoidance. We add the feature to make the increment N * block, where N is
     * configurable (the growth is still linear).
     *
     * As always there is a caveat that prevents the below code from literally following the
     * algorithm from the RFC. More precisely, we follow the RFC algorithm and, if after that
     * bytes_acked >= CWND still, we reapply the RFC algorithm. We repeat until bytes_acked < CWND.
     *
     * How can bytes_acked match or exceed CWND after applying the RFC algorithm once? I believe
     * it's impossible assuming the expected handling of Dropped packets in Node. However I
     * cannot/don't want to prove this to myself, especially with edge cases with a very low CWND
     * yet somehow a quite large ack group (large "bytes" arg value). Therefore, to be safe, I
     * added the handling for this case. */

    // Add the newly acked bytes to the tally.
    m_acked_bytes_since_window_update += bytes;

    /* Increase CWND if that has put us over the top. Repeat if we're still over the top after
     * increasing CWND and accordingly decreasing the tally. In the usual case
     * this loop will execute once, reducing the algorithm to the one in the RFC. */
    bool cwnd_changed = false;
    while ((m_acked_bytes_since_window_update >= m_cong_wnd_bytes)
           // Can't be true in 1st iteration (eliminated in if () above) but can become true after 1st iteration:
           && (!congestion_window_at_limit()))
    {
      // Have a full CWND's worth of acked bytes; increase CWND.

      if (!cwnd_changed)
      {
        cwnd_changed = true;
      }
      else // (already is true)
      {
        // This should be rare or impossible, but not an error. An INFO is in order.
        FLOW_LOG_INFO("cong_ctl [" << sock << "] info: [cong_avoid] acknowledgment group brings tally to more "
                      "than one congestion window -- thought to be rare or impossible.");
      }

      /* Extend RFC 5681: in the RFC, this increment is simply 1 block. If they specify the special
       * value 0, use the RFC value; otherwise use the value they give. */
      unsigned int increment_blocks = sock->opt(sock->m_opts.m_st_cong_ctl_cong_avoidance_increment_blocks);
      if (increment_blocks == 0)
      {
        increment_blocks = 1;
      }
      const size_t increment_bytes = increment_blocks * sock->max_block_size();

      FLOW_LOG_TRACE("cong_ctl [" << sock << "] update: [cong_avoid] cong_wnd incrementing "
                     "by [" << sock->bytes_blocks_str(increment_bytes) << "]; "
                     "acked [" << sock->bytes_blocks_str(bytes) << "].");

      // "Convert" the CWND's worth of acked bytes into N-block increase of CWND (as in RFC, though there N=1).
      m_acked_bytes_since_window_update -= m_cong_wnd_bytes;

      m_cong_wnd_bytes += increment_bytes;
      clamp_congestion_window(); // If exceeded limit, bring it to the limit. (while() loop will exit in that case.)
    } // while ((bytes_acked >= CWND) && (CWND < max CWND))

    if (!cwnd_changed)
    {
      // Don't have a full CWND's worth of acked bytes yet; leave CWND alone.
      FLOW_LOG_TRACE("cong_ctl [" << sock << "] update: [cong_avoid] cong_wnd not yet incrementing; "
                     "acked [" << sock->bytes_blocks_str(bytes) << "].");
    }
  } // else if (!in_slow_start())

  log_state("ack-handling-after");
} // void Congestion_control_classic_data::on_acks()

void Congestion_control_classic_data::on_congestion_event(size_t new_slow_start_thresh_bytes,
                                                          size_t new_cong_wnd_bytes)
{
  log_state("event-handling-before");

  /* They are updating CWND and/or SSTHRESH due to some (unspecified) event. For example the
   * classic Reno algorithm (more or less) halves CWND and sets SSTHRESH to this halved CWND,
   * whenever a non-timeout packet loss is detected.
   *
   * First figure out what to do about m_acked_bytes_since_window_update given this change. We may
   * have to reset it to zero. This is fairly important, as it basically resets the congestion
   * avoidance state to that suitable to the beginning of the congestion avoidance phase (which may
   * then happen immediately or after one or more slow start phases, depending on the CWND vs.
   * SSTHRESH comparison).
   *
   * Let us consider the possibilities. (It may help to draw a mental or physical graph of CWND and
   * SSTHRESH over time.) In slow start, where CWND < SSTHRESH, CWND is increasing exponentially
   * (for every incoming full acknowledgment), and m_acked_bytes_since_window_update is not used.
   * In congestion avoidance, where CWND >= SSTHRESH, CWND is increasing linearly and
   * m_acked_bytes_since_window_update is used to keep track of this process. Whenever congestion
   * avoidance ends (in favor of slow start or a different congestion avoidance phase at a lower [or
   * higher?] CWND), m_acked_bytes_since_window_update must be reset to 0. So, the possibilities:
   *
   * 1. They're changing CWND (usually lower, e.g., Reno halves it). This introduces a
   *    "discontinuity" in the CWND value over time, so if any congestion avoidance phase was in
   *    progress, it is now over. Set to 0 to avoid affecting the next congestion avoidance phase
   *    (whether it happens now or after a slow start phase). If no congestion avoidance phase
   *    was in progress, then it is already 0 and can be safely set to 0 again (or it can be left
   *    alone; makes no difference).
   *
   * 2. CWND is staying constant, but SSTHRESH is changing. If we are in slow start, then we're
   *    not in congestion avoidance, so it is already set to 0, and there's no need to do anything.
   *    If we are not in slow start, then we are in congestion avoidance. If we are in congestion
   *    avoidance, and under the new SSTHRESH we remain in congestion avoidance, then the
   *    in-progress congestion avoidance increase can continue undisturbed. Finally, if we are in
   *    congestion avoidance, and the new SSTHRESH would put us into slow start, then this
   *    congestion avoidance session is over, and we must set to 0 (to avoid affecting the next
   *    congestion avoidance phase, if there is one).
   *
   * Most of the above is uncontroversial and is consistent with various implementations of classic
   * TCP congestion control (Reno and off-shoots like Westwood+). Certainly -1- is clear-cut.
   * -2- is generally atypical (usually SSTHRESH changes only when CWND changes), but let's consider
   * it. If CWND stays constant, while SSTHRESH moves from at or below to above CWND, then even if
   * unlikely it clearly means congestion avoidance is over. So that case is not controversial.
   * The remaining case is if SSTHRESH moves but stays below CWND, and the decision is to not touch
   * m_acked_bytes_since_window_update (i.e., let congestion avoidance continue). I find this
   * decision acceptable <comment cuts out>. @todo Where did the rest of the comment go?
   * why is the decision acceptable? Didn't save? */
  if ((new_cong_wnd_bytes != m_cong_wnd_bytes) // Case 1 above.
      // Case 2 above:
      || ((!in_slow_start()) && in_slow_start(new_cong_wnd_bytes, new_slow_start_thresh_bytes)))
  {
    m_acked_bytes_since_window_update = 0;
  }

  // Now the easy part.
  m_slow_start_thresh_bytes = new_slow_start_thresh_bytes;
  m_cong_wnd_bytes = new_cong_wnd_bytes;

  // In case they gave a too-high value for congestion window. (SSTHRESH can be arbitrarily high.)
  clamp_congestion_window();

  log_state("event-handling-after");
} // Congestion_control_classic_data::on_congestion_event()

void Congestion_control_classic_data::on_drop_timeout(size_t new_slow_start_thresh_bytes)
{
  using std::max;

  const Peer_socket::Const_ptr sock = socket();

  /* Node has detected a major loss event in the form of a Drop Timeout. In most congestion
   * control algorithms this causes CWND to be decreased to a low value, as inspired by classic Reno
   * (RFC 5681). On the other hand SSTHRESH is set to a strategy-dependent value, which is the one
   * given to us as an argument. Certainly the specific strategy may also reset other state. */

  // RFC 5681-3.1 (set CWND to a low value; note: not the initial CWND, which can be higher).
  const size_t new_cong_wnd_bytes
    = sock->opt(sock->m_opts.m_st_cong_ctl_cong_wnd_on_drop_timeout_blocks) * sock->max_block_size();
  // ^-- @todo Consistently with some other options, have value 0 mean "use RFC value," which is 1.

  on_congestion_event(new_slow_start_thresh_bytes, new_cong_wnd_bytes);
} // Congestion_control_classic_data::on_drop_timeout()

void Congestion_control_classic_data::on_idle_timeout()
{
  /* Node has detected that nothing has been sent out by us for a while. (Note that this is
   * different from a Drop Timeout. A Drop Timeout causes data to be sent (either retransmitted [if
   * used] or new data), which resets the state back to non-idle for a while.) Per TCP
   * RFC 5681 (and DCCP CCID 2 RFC 4341), in this case we are to simply set CWND back to its initial
   * value (unless that would increase it [e.g., if we had recently hit Drop Timeout as well], in
   * which case just leave it alone). SSTHRESH is to be left alone. Thus we will slow-start much
   * as we would at the beginning of the connection, except that SSTHRESH stays constant and is not
   * set back to infinity.
   *
   * There is at least one RFC with a more moderate policy, but it's not standard, and let's not
   * get into it for now.
   *
   * Certainly the specific strategy may also reset other state. */

  const Peer_socket::Const_ptr sock = socket();
  if (m_cong_wnd_bytes <= m_init_cong_wnd_bytes)
  {
    FLOW_LOG_TRACE("cong_ctl [" << sock << "] update: Idle Timeout event; "
                   "cong_wnd [" << sock->bytes_blocks_str(m_cong_wnd_bytes) << "] is "
                   "already <= initial cong_wnd [" << sock->bytes_blocks_str(m_init_cong_wnd_bytes) << "].");
  }
  else
  {
    // Below call will log details.
    FLOW_LOG_TRACE("cong_ctl [" << socket() << "] update: Idle Timeout event; set cong_wnd to initial cong_wnd.");

    // Don't change SSTHRESH.
    on_congestion_event(m_slow_start_thresh_bytes, m_init_cong_wnd_bytes);
  }
} // Congestion_control_classic_data::on_idle_timeout()

bool Congestion_control_classic_data::congestion_window_at_limit() const
{
  const Peer_socket::Const_ptr sock = socket();
  return m_cong_wnd_bytes
         >= sock->opt(sock->m_opts.m_st_cong_ctl_max_cong_wnd_blocks) * sock->max_block_size();
}

void Congestion_control_classic_data::clamp_congestion_window()
{
  const Peer_socket::Const_ptr sock = socket();
  const size_t limit = sock->opt(sock->m_opts.m_st_cong_ctl_max_cong_wnd_blocks) * sock->max_block_size();
  if (m_cong_wnd_bytes > limit)
  {
    FLOW_LOG_TRACE("cong_ctl [" << sock << "] update: "
                   "clamping cong_wnd [" << sock->bytes_blocks_str(m_cong_wnd_bytes) << "] "
                   "to [" << sock->bytes_blocks_str(limit) << "].");
    m_cong_wnd_bytes = limit;
  }
}

bool Congestion_control_classic_data::in_slow_start() const
{
  return in_slow_start(m_cong_wnd_bytes, m_slow_start_thresh_bytes);
}

bool Congestion_control_classic_data::in_slow_start(size_t cong_wnd_bytes, size_t slow_start_thresh_bytes) // Static.
{
  return cong_wnd_bytes < slow_start_thresh_bytes;
}

void Congestion_control_classic_data::log_state(const std::string& context) const
{
  using std::numeric_limits;

  const Peer_socket::Const_ptr sock = socket();
  FLOW_LOG_TRACE("cong_ctl [" << sock << "] info [" << context <<
                 '|' << (in_slow_start() ? "slow_start" : "cong_avoid") << "]: "
                 "cong_wnd [" << sock->bytes_blocks_str(m_cong_wnd_bytes) << "]; "
                 "sl_st_thresh "
                 "[" << ((m_slow_start_thresh_bytes == numeric_limits<size_t>::max())
                           ? "inf" : sock->bytes_blocks_str(m_slow_start_thresh_bytes)) << "] "
                 "cong_avoid_acked "
                 "[" << sock->bytes_blocks_str(m_acked_bytes_since_window_update) << "].");
}

Peer_socket::Const_ptr Congestion_control_classic_data::socket() const
{
  const Peer_socket::Const_ptr sock = m_sock.lock();
  assert(sock);
  return sock;
}

} // namespace flow::net_flow