Flow 1.0.2
Flow project: Full implementation reference.
In all of Flow, scan for the unintentional use of the period character in auto-brief summaries in many doc headers. A period will be assumed to be the end of the brief summary. We've been disciplined about this, but I (ygoldfel) just realized that, e.g., "boost.asio" has a period and yet is present in at least some doc headers. Fix those in some consistent yet palatable way, so that the Doxygen output for those isn't prematurely truncated. (Check that Doxygen isn't smart enough to figure it out already. If it is, delete this to-do.)
A script/tool/something that would auto-scan the code for violations of the cosmetic conventions described in doc-coding_style.cpp. It need not be perfect or comprehensive. Humans tend to have difficulty following lots of conventions, and the resulting violations then clutter up code reviews.
There are a few little to-dos inside the Doxygen-skipped main body of doc-coding_style.cpp; check them out.
std::string_view is a way to pass things around similarly to const std::string& without requiring that a string be created just for that purpose; it has a highly similar API but can be constructed from any character sequence in memory and internally stores nothing more than a pointer and length. We should use it wherever possible (within reason) instead of const std::string&. Much code now uses String_view; the remaining to-do is: scour the rest of the code for possible const string&s to convert, and indeed convert those to util::String_view.
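For illustration, a minimal sketch of the kind of conversion meant here (function and names hypothetical, written with std::string_view directly; util::String_view is assumed to be an alias of a string_view-like type):

  #include <cstddef>
  #include <string>
  #include <string_view>

  // Before: forces a std::string to exist even for literals and raw char buffers.
  std::size_t count_dots_old(const std::string& s);

  // After: accepts std::string, string literals, and raw char sequences without a copy.
  std::size_t count_dots_new(std::string_view s)
  {
    std::size_t n = 0;
    for (char c : s) { if (c == '.') { ++n; } }
    return n;
  }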
Inline things: Or just use gcc -O3 (and non-gcc equivalents) for prettier/faster-to-compile code? The latter is definitely tempting if it works sufficiently well. So far we've been using gcc -O3 and equivalents, and it seems to be working well (turning off inlining results in huge performance losses). Still, I am not sure if it would be better to explicitly inline functions instead. Not having to do so definitely simplifies the code, so it is my great hope that the answer is no, and we can keep using gcc -O3 and equivalents; in that case delete this paragraph. (To be clear: with gcc -O3, the compiler ignores the inline keyword and anything similar, including inlined method bodies inside class {} and struct {}. Instead it determines what to inline based on its own ideas of what will generate the fastest code (with reasonable code size). gcc -O2, on the other hand, will mostly inline things explicitly declared as such in code (again, via inline or inlining inside class bodies or other techniques).) Update: Now using clang (not gcc) with maximum auto-inlining AND FLTO (link-time optimization, which allows inlining across object-file boundaries) in at least some platforms. This should be close to as good as possible. Update: gcc auto-inlining+FLTO also works.
There are some boost.thread "interruption points" throughout the code, so we should investigate whether we must catch boost::thread_interrupted in those spots, or what...?
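For reference, a minimal sketch (not Flow code) of what catching it at such a spot would look like, assuming a worker that treats interruption as a clean-shutdown request:

  #include <boost/thread/thread.hpp>
  #include <boost/chrono.hpp>

  void worker_body() // hypothetical thread body
  {
    try
    {
      for (;;)
      {
        // sleep_for() is one of boost.thread's documented interruption points.
        boost::this_thread::sleep_for(boost::chrono::milliseconds(100));
        // ... periodic work ...
      }
    }
    catch (const boost::thread_interrupted&)
    {
      // interrupt() was called and reached an interruption point; unwind/clean up here.
    }
  }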
Be more specific (cite date) when writing "as of this writing." I use a rhetorical trick when commenting on the state of something that may not continue to be the case. Though I do avoid writing such things, often it is necessary; in that case I usually write "as of this writing" or something very similarly worded. That's fine and essentially the best one can do. It means technically the statement won't become false, even if the "sub-statement" (the thing that was true when written) does become false. However, obviously, to the reader of the comment at that later time, that's little consolation: they're still reading a possibly false statement and will want to know what the situation is THEN, or "as of the reading," so to speak. In order to at least try to be helpful, in those cases a date (numeric month/year – like 4/2017 – should be sufficient in most cases) should be supplied. The to-do is to convert all existing "as of this writing" instances to include a date – and to always add a date when writing new instances of "as of this writing." The to-do can be removed once the conversion is completed. Example: this to-do has not been completed as of this writing (11/2017). (Side note: possibly goes without saying, but one is free to explain to an arbitrary degree of detail why something is true as of that writing, and how/why/when it might change. This to-do covers those cases where no such explanation is written. It would be impractically verbose to get into speculative detail for every as-of-this-writing instance; so at least a date should thus be inserted.)
The comments (as of this writing, all written by me, ygoldfel) in this library could use an edit to make them briefer. (I've found even a self-edit by me, with that express purpose, often does wonders.) Background: I write very extensive comments. I am quite convinced this is superior (far superior even) to next-to-no comments; and to the average amount of comments one tends to see in production code. That said, the top code review feedback I receive (not here specifically but in general) is that my comments tend to be too "discursive" (consisting of discourse) and/or at times unnecessarily in-depth. Discursive = as if I am talking to the reader (many prefer a terser, more formal style). Too in-depth = tends to go into history, related issues, future work, etc. (these elements can remain but can be cut down significantly without losing much substance). In other words, there should be a happy middle in terms of comment volume, and this can be achieved by a comment edit run by Yuri or someone else (if reviewed by Yuri). To be clear, I believe this middle ground is to be closer to the status quo than to the average amount of comments in other projects.
Possibly document exceptions thrown explicitly via the Doxygen keyword meant for this purpose: "@throws". Currently when we document explicitly throwing an exception, it is ALWAYS a flow::error::Runtime_error encapsulating a flow::Error_code (which is an int-like error code). This is very explicitly documented, but technically Doxygen has a keyword which will generate a special little readout for the exception (similarly as for each parameter, etc.). We don't use that keyword. We probably should, though this isn't totally cut-and-dried. Consider that we already document the exception on the err_code parameter in every case; so no information would really be gained (only arguably nicer formatting). On the other hand, the code would be somewhat more verbose (and boiler-platey, since each already boiler-platey err_code comment snippet would essentially grow in size). Furthermore, if we document this explicit exception, one might say it behooves us to now document all the other possible sources of exceptions such as std::bad_alloc when running out of heap memory. Perhaps then we have to talk about constructor-throwing-exception behavior and other C++ technicalities to do with exceptions. Do we really want to enter that land? I think maybe not; consider just leaving it alone. Though, maybe I'm over-dramatizing the impact of adding a "@throws" section on our various flow::error::Runtime_error-throwing methods. Well, it's a to-do; decide later.
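To make the trade-off concrete, a sketch of what such a doc header might look like (the method and its signature are hypothetical; only the @throws layout is the point):

  /**
   * Sends the given data.  (Hypothetical method, shown only to illustrate the @throws style.)
   *
   * @param err_code
   *        If null, errors are reported by throwing; otherwise set to the error code (or success).
   * @throws flow::error::Runtime_error
   *         Encapsulating the flow::Error_code that would otherwise be emitted via err_code;
   *         thrown only if err_code is null.
   */
  void send_stuff(flow::Error_code* err_code = 0);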
Since Flow gained its first users beyond the original author, some Flow-adjacent code has been written from which Flow can benefit, including a potential io module/namespace for general networking/local I/O. (Flow itself continued to be developed, but some features were added elsewhere for expediency; this is a reminder to factor them out into Flow for the benefit of all.) Some features to migrate here might be: boost.asio extensions to UDP receive APIs to obtain receipt time stamps and destination IP (recvmsg() with ancillary data extensions) and to receive multiple datagrams in one call (recvmmsg()); a boost.asio-adjacent facility to add custom socket options similarly to how boost.asio does it internally; boost.asio support for (local) transmission of native socket handles and security data over stream sockets (SCM_RIGHTS, etc.).
As of this writing the exact nature of where the project will permanently reside (and who will maintain it vs. use it) is in flux. Therefore for now I have removed the section covering certain topics and replaced it with the to-do you're reading. This should be undone when things settle down (obviously ensuring the brought-back section is made accurate). The topics to cover: "@author" (including contact info); GitHub/other address indicating where to browse the project source; link(s) to the latest auto-generated web docs (if any); a section on the history of the project; and licensing info (if deemed needed) or a pointer to it. (Reminder: Also update any similar sections of the historically significant net_flow::Node doc header.)
One space after period, not two: For some reason in this project I've been following the convention – in comments and (I think) log messages – of two spaces after a sentence-ending punctuator (usually period) before the next sentence's first character. I now regret trying this annoying convention. Go back to one space. (For the record, it does look sort of nice, but that's a matter of opinion, and single space looks fine too... AND doesn't confuse various editors' auto-formatting facilities, among other problems.)
We use 0 instead of NULL or nullptr when needing a null pointer; perhaps we should use the latter. NULL is an anachronism from C, so we shouldn't use it. nullptr is at least no worse than 0, other than being less concise. However, the main reason it exists – to avoid ambiguities in function overloading (e.g., when something could take either an int or a char*, nullptr would resolve to the latter, while 0 would probably unintentionally resolve to the former) – is not a situation our style ever invokes, to my knowledge, so using nullptr would not solve any actual problems. However, it could be argued that using it more readily calls attention to the use of a pointer, as opposed to an integer, in the particular context at play. So it's something to consider (but, no doubt, the conversion process would be laborious, as there's no simple search-replace that would work).
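The overload-resolution point, for reference, in a self-contained (non-Flow) example:

  #include <iostream>

  void f(int)   { std::cout << "f(int)\n"; }
  void f(char*) { std::cout << "f(char*)\n"; }

  int main()
  {
    f(0);        // resolves to f(int) -- possibly not what was intended
    f(nullptr);  // resolves to f(char*)
  }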
= default for copy constructors and copy operators is now used in a few places; consider spreading this C++11 feature everywhere it's being done implicitly due to C++03 rules (good documentation practices suggest declaring them explicitly but of course leaving the implementation to the compiler default, gaining the best of both worlds – proper class API docs yet a maintenance-friendly default body).
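A sketch of the intended pattern (class and members hypothetical): the copy operations appear explicitly in the class -- and hence in its generated docs -- while the bodies stay compiler-generated:

  class Widget_options // stand-in class for illustration
  {
  public:
    /// Default constructor.
    Widget_options() = default;
    /// Copy constructor: member-wise copy.
    Widget_options(const Widget_options&) = default;
    /// Copy assignment: member-wise copy.
    Widget_options& operator=(const Widget_options&) = default;
    // ... data members ...
  };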
Consider PIMPL and related topics. Recommend scouring Boost docs, particularly for the smart pointer library, which discuss how to potentially use smart pointers for easy PIMPLing. In general, research the state of the art on the topic of library interface vs. implementation/hiding.
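As a starting point for that research, the basic smart-pointer-based PIMPL shape (names hypothetical) of which the Boost docs discuss variations:

  // pimpl_widget.hpp -- the public header exposes no implementation details.
  #include <memory>

  class Pimpl_widget
  {
  public:
    Pimpl_widget();
    ~Pimpl_widget();           // defined in the .cpp, where Impl is a complete type
    void do_stuff();

  private:
    class Impl;                // defined only in pimpl_widget.cpp
    std::unique_ptr<Impl> m_impl;
  };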
Return-by-copy binary operators of the form T operatorBLAH(const T& x1, const Some_type& x2) should be written as free functions instead of T members. I don't recall at this point why, but this tends to be recommended and done in STL and Boost. Maybe check the Meyers Effective C++ book on the theory; and if it makes sense, find all such operators written as members and change them to be free functions. Should this be avoided if it requires friend, though? Lastly, for Doxygen, use the relatesalso T command to link the free function to the class T in the documentation.
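A sketch of the target shape, with a hypothetical value type standing in for T:

  // Hypothetical value type, for illustration only.
  struct Duration_pair { long m_a; long m_b; };

  /**
   * Returns the member-wise sum of two Duration_pair objects.
   * @relatesalso Duration_pair
   * @param x1  First operand.
   * @param x2  Second operand.
   * @return The sum.
   */
  Duration_pair operator+(const Duration_pair& x1, const Duration_pair& x2)
  {
    return Duration_pair{x1.m_a + x2.m_a, x1.m_b + x2.m_b};
  }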
In many (most) cases we pass shared_ptrs (and their aliases) by value when it would be more performant to do so by const reference; at times it is possibly better to pass by raw pointer. Scott Meyers/Herb Sutter have opined on this to basically indicate that (1) it is often best to use a raw pointer, unless actual copying/ownership status is expected; but failing that (2) it is often best to use a const& when safe; and failing that, passing by value is fine. This need not be a dogmatic change, but we should be more mindful than simply always passing by value. When searching for instances to potentially change, check for shared_ptr, Ptr, and _ptr tokens.
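The three signature styles being weighed, as a quick (hypothetical, non-Flow) reference:

  #include <memory>

  struct Peer { /* ... */ };                  // stand-in type
  using Peer_ptr = std::shared_ptr<Peer>;

  void observe(const Peer& peer);             // (1) no ownership involved: raw ref (or pointer)
  void inspect(const Peer_ptr& peer_ptr);     // (2) ownership matters but is not taken: const reference
  void retain(Peer_ptr peer_ptr);             // (3) callee keeps/shares ownership: by value (copy intended)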
Concurrent_task_loop::schedule_*() APIs: perhaps add a bool in_loop_use_only arg which, if false, will always disable the single_threaded optimization internally. At this time it always enables it if n_threads() == 1, which will cause thread un-safety if the returned handle is touched from outside an in-loop task. void versions of the schedule_*() APIs should be added which would lack this, as in that case there is no handle to misuse outside the loop.

io_context doc page. n_threads_or_zero, est_hw_core_sharing_helps_algo, est_hw_core_pinning_helps_algo) can only be set at construction time even though start() and stop() can be invoked anytime. For instance a non-virtual configure() method could be added to each Concurrent_task_loop subclass, potentially with different signatures. Note, also, that the decision to have stop() exit the actual threads, in addition to making some things easier to code/reason about internally, is also more amenable to these dynamic changes – messing with threading behavior dynamically is easier if one can exit and re-spawn the thread pool.

In Segregated_thread_task_loop, when randomly selecting a thread, consider trying to bias in favor of threads with less load (perhaps measured as the # of tasks enqueued in a given thread's Task_engine); and/or round-robin thread assignment; and/or other heuristics.
In Segregated_thread_task_loop, when dealing with the 2-arg post() (the one taking an async::Op and an async::Task) and similar schedule_*() methods – consider postponing the random thread selection until the last possible moment, as opposed to in create_op(). Given that Segregated_thread_task_loop is legacy-ish in nature, the resulting significant increase in internal complexity may or may not be worth accepting. This to-do also makes little practical difference without the nearby to-do that would bias in favor of less-loaded threads.
Single_thread_task_loop::schedule_*() APIs: perhaps add a bool in_loop_use_only arg which, if false, will always disable the single_threaded optimization internally. At this time it always enables it, which will cause thread un-safety if the returned handle is touched from outside an in-loop task. void versions of the schedule_*() APIs should be added which would lack this, as in that case there is no handle to misuse outside the loop.

Target_ptr is deprecated and shall always be left at its default value in future code; eventually remove it entirely and hard-code the default value internally.

The union used inside the Option_set implementation should be replaced with the type-safe, equally-fast, more-expressive std::variant. The author was not aware of the latter's existence when originally writing this.

Lacking feature: log message rate-limiting. This could actually mean a few things. One is being able to rate-limit the messages produced by a given log call site per unit time; this might be the minimum for this to-do. This could be done by a dirty macro hack; or it could be done in more civilized fashion (while minding perf) by configuring it by message ID (but message IDs are optional and not implemented as of this writing anyway). Another rate-limiting thing is general controls on the frequency of logging; though arguably that is more applicable to individual Logger implementations rather than as a mandatory-general mechanism.
Lacking feature: boost.format-style logging call sites. This is conceptually similar to – and possibly even entirely subsuming, in backwards-compatible fashion – the nearby to-do regarding printf()-style logging. Likely both to-dos should be designed/implemented in one shot. Moreover, it is possible (and would be absolutely delightful if true) that boost.format's format class can be used entirely transparently on top of the ostream-focused existing flow::log API (such as FLOW_LOG_WARNING() and buddies)!
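In particular, boost::format results are themselves ostream-insertable, so -- if the flow::log macros really do just forward their argument to an ostream -- an expression like the following might drop into a FLOW_LOG_WARNING() call site unchanged (a plain-ostream sketch is shown, since that much is certain):

  #include <boost/format.hpp>
  #include <iostream>
  #include <string>

  int main()
  {
    const std::string my_name("Smith");
    // % binds tighter than <<, so the fully-fed format object is what gets streamed.
    std::cout << boost::format("Hello, Mr. %s!  You have %d messages.") % my_name % 3 << '\n';
  }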
Lacking feature: printf()-style logging call sites, in contrast to the currently supported ostream fragment call sites. So, like, FLOW_LOG_WARNING_FMT("Hello, Mr. %s!", my_name.c_str()) could be used analogously to FLOW_LOG_WARNING("Hello, Mr " << my_name << "!!") (with std::string my_name). In terms of implementation, there is more to this than meets the eye perhaps; as ostreams potentially store (formatting) state between totally separate invocations of ostream<<, whereas printf()-style functions normally do not. Internally, this likely allows for specially optimized logging code.
Lacking feature: message IDs. This is discussed a bit more in the log::Msg_metadata doc header.
There are third-party logging systems including boost.log. The to-do is to investigate using boost.log, partially or entirely replacing the manually implemented system. (To be honest I am rather partial to the simple but effective system already implemented, in particular the performance-focused trickery.) Boost is invaluable in many ways, and Flow modules use it extremely extensively, but my intuition says that in this case it may be irrelevant. Meanwhile, boost.log can, of course, be used within user's Logger interface implementation, but that's not what this to-do is about; it's about actually placing it all over Flow's implementation and APIs in place of Logger, log::Sev, etc. (Also, after a 5-minute glance at boost.log, I am noticing that some of its ideas I seem to have independently and unknowingly replicated here. It is, however, simple stuff, not rocket science.) My intuition is I hesitate to force-marry the Flow user to boost.log; the very simple Logger (abstract) interface is seemingly a gentler thing to force the Flow user into, at least relatively speaking.
Lacking feature: compiler hints for optimizing away log filter checks. This is inspired by a certain other proprietary logging API in which we noticed attempts to give hints to the compiler as to how likely or unlikely the verbosity check is to return a true or false value, seemingly so that the compiler might optimize out the check and hence the entire log statement in some conditions; or, in other conditions, always log and skip the check in the optimized code. I omit any details here, but that's the general idea. I don't know how effective this is given that verbosities are potentially configurable dynamically; but look into it.
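The general idea, as a self-contained sketch (the logger type, macro, and member names are hypothetical, not Flow's API; __builtin_expect() is gcc/clang-specific, and C++20 [[likely]]/[[unlikely]] is an alternative):

  #include <iostream>

  struct Tiny_logger
  {
    bool m_trace_enabled = false;
    bool should_log_trace() const { return m_trace_enabled; }
    std::ostream& stream() { return std::cerr; }
  };

  // Hint that the filter check usually fails, so the message-building path is laid out as cold code.
  #define TINY_LOG_TRACE(logger, fragment) \
    do { \
      if (__builtin_expect(static_cast<long>((logger).should_log_trace()), 0L)) \
      { (logger).stream() << fragment << '\n'; } \
    } while (false)

  int main()
  {
    Tiny_logger log;
    TINY_LOG_TRACE(log, "x = " << 42);  // filtered out: trace verbosity is disabled
  }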
Async_file_logger::log_flush_and_reopen(true) is great for flushing, such as in an abort-signal handler, but providing just the flushing part without the reopening might be useful. At the moment we've left it this way, due to the vague feeling that closing the file upon flushing it is somehow more final and thus safer (in terms of accomplishing its goal) in an abort-signal scenario. Feelings aren't very scientific though.

unique_ptr members of other classes) would be slightly nicer as an std::optional<> instead of unique_ptr. optional was not in STL back in the day and either did not exist or did not catch our attention back in the day. unique_ptr in situations like this is fine but uses the heap more.

Loggers, since it is an app-wide function and potentially would want to be reflected in multiple loggers. It could take a pair of iterators or a container (in both cases, of template argument type). Arguably, though, this is overkill; as (1) most code paths deal with only one get_logger()-owning object each; (2) naming of a thread is obvious enough in subsequent logs (hence this message only calls attention to that operation plus notes the thread ID before the nickname assignment, not necessarily the most critical of information). Certainly this zero-to-one-Logger version must continue to be available for syntactic-sugary convenience, even if the to-do is performed.

If Ack_delay_time_unit(1) is a nanosecond, then 32 bits would support a maximum delay of ~4.1 seconds, which is likely fine for most real-world scenarios. This would reduce the size of ACK packets quite a bit.

The sync_*() early-destruction functionality explained in "Thread safety of destructor" above could be extended to async_*(). This is not a huge deal but would be nice for consistency. Not allowing it currently is really a result of internal implementation concerns (it is easier to not account for this corner case). Alternatively, maybe we should go the other way and not support the somewhat-weird corner case in sync_*().
To enable reactor-style async_*() operations, meaning waiting for readability/writability/etc. but not performing the actual operation before calling the user handler, we provide a null_buffers-style interface; like newer boost.asio versions we should deprecate this in favor of simpler async_wait() APIs. This would apply to the net_flow::asio::Peer_socket and net_flow::asio::Server_socket APIs. Similarly consider doing this for the sync_*() operations in the non-asio superclasses net_flow::Peer_socket and net_flow::Server_socket. Note that Event_set::async_wait() and Event_set::sync_wait() already exist; they are powerful but a bit complex to use in these simple situations. Hence the following hypothetical wrappers would be welcome replacements for the deprecated null_buffers and "reactor-style" APIs in these classes: net_flow::Peer_socket::sync_wait(Event_set::Event_type), net_flow::asio::Peer_socket::async_wait(Event_set::Event_type), net_flow::Server_socket::sync_wait(), net_flow::Server_socket::async_wait(). (Timeout-arg versions would be desired also, as they exist now.)
namespace cong_ctl?

const. These should be pointers for max consistency of style with other code in the project (and that style is encouraged for certain good reasons). Scour the rest of the code for other such data members as well.

WAIT_INTERRUPTED and USER_TIMEOUT. In doing so, make sure there's an automated syntactic-sugary way of doing this, so that one need only add a code<->class pair into some internal centralized mapping, automatically making the system emit the proper exception class, defaulting to plain Runtime_error if not specially mapped.

boost::anys each of which stores either a Peer_socket::Ptr or a Server_socket::Ptr; Sockets should be changed to store C++17 std::variants. Performance, both internally and externally, would improve by using this type-safe compile-time mechanism (which is akin to unions but much more pleasant to use). At the time this feature was written, Flow was in C++11, so variants were not available, and the author wanted to play around with anys instead of haxoring old-school unions. variant is much nicer, however, and the dynamic nature of any is entirely unnecessary here.

unordered_set would also work and take somewhat less memory and computational overhead. It would become unordered, instead of ordered chronologically, but that seems like a price possibly worth paying.

using for Low_lvl_packet sub-types to inherit the constructor(s) of Low_lvl_packet. (For some of the sub-types an explicit constructor would still be necessary, but it would just shadow the inherited one, which is fine and still saves lines in the other sub-types.) However, for now I've left it as-is, partially to be friendly to the Doxygen documentation generator, and partially to make the interface easy to read. Still, it may be better the other way.

util namespace or even outside it, as it is used commonly (or boost::asio::const_buffer is used where Const_buffer could be used for readability).

Receive UDP datagrams as soon as possible (avoid internal buffer overflow): The OS UDP net-stack buffers arriving datagrams until they're recv()d by the application layer. This buffer is limited; on my Linux test machine the default appears to buffer ~80 1k datagrams. With a high sender CWND and high throughput (e.g., over loopback), thread W – which both reads off UDP datagrams and handles them, synchronously – cannot always keep up, and the buffer fills up. This introduces Flow loss despite the fact that the datagram actually safely got to the destination; and this is with just ONE sender; in a server situation there could be thousands. In Linux I was able to raise, via setsockopt(), the buffer size to somewhere between 1000 and 2000 1k datagrams. This helps quite a bit. However, we may still overflow the buffer in busy situations (I have seen it, still with only one connection). So, the to-do is to solve this problem. See the below to-dos for ideas. WARNING: UDP buffer overflow can be hard to detect and may just look like loss; plus the user not reading off the Receive buffer quickly enough will also incur similar-looking loss. If there were a way to detect the total # of bytes or datagrams pending on a socket, that would be cool, but sock.available() (where sock is a UDP socket) just gives the size of the first queued datagram. Congestion control (if effective) should prevent this problem, if it is a steady-state situation (i.e., loss or queueing delay resulting from not keeping up with incoming datagrams should decrease CWND). However, if it happens in bursts due to high CPU use in the environment, then that may not help. NOTE 1: In practice a Node with many connections is running on a server and thus doesn't receive that much data but rather sends a lot. This mitigates the UDP-receive-buffer-overflow problem, as the Node receiving tons of data is more likely to be a client and thus have only one or two connections. Hence if we can handle a handful of really fast flows without such loss, we're good. (On the other hand, ACKs are also traffic, and a server may get a substantial amount of them. Much testing is needed.) Packet pacing on the sender side may also avoid this loss problem; on the other hand it may also decrease throughput. NOTE 2: Queue-delay-based congestion control algorithms, such as FastTCP and Vegas, are highly sensitive to accurate RTT (round trip time) readings. Heavy CPU load can delay the recording of the "received" time stamp, because we call UDP recv() and handle the results all in one thread. Any solution, such as the dedicated thread proposed below, would allow one to record the time stamp immediately upon receipt of the packet by the dedicated thread; while W would independently handle everything else. Is that a good idea though? Maybe not. If the CPU load is such that the ACK-receiver side can't get to the time-stamp-saving, RTT-measuring step without tricks like doing it immediately upon some low-level datagram receipt hook, then that CPU-pegged jungle is, in a way, part of the network and should probably be fairly counted as part of the RTT. So perhaps we should continue to take the RTT time stamp while actually handling the individual acknowledgments. Instead we should strive to use multi-core resources efficiently, so that the gap between receipt (on whatever thread) and acknowledgment processing (on whatever thread) is as small as possible. Then RTT is RTT, but we make it smaller via improved performance. Meanwhile, we hopefully also solve the original problem (internal kernel buffer overflowing and dropping datagrams).
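For reference, the setsockopt() step mentioned above can be done through boost.asio without dropping to native handles; a minimal sketch (the 2 MB figure is illustrative, not a tuned value):

  #include <boost/asio.hpp>
  #include <iostream>

  int main()
  {
    namespace asio = boost::asio;
    asio::io_context ctx;
    asio::ip::udp::socket sock(ctx, asio::ip::udp::endpoint(asio::ip::udp::v4(), 0));

    // Equivalent of setsockopt(SO_RCVBUF): enlarge the kernel-side datagram queue.
    sock.set_option(asio::socket_base::receive_buffer_size(2 * 1024 * 1024));

    asio::socket_base::receive_buffer_size actual;
    sock.get_option(actual);
    std::cout << "Effective SO_RCVBUF: " << actual.value() << " bytes\n";  // kernel may clamp/adjust
  }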
Receive UDP datagrams as soon as possible (avoid internal buffer overflow): APPROACH 3: Suppose we take APPROACH 1 (no-brainer) plus APPROACH 2. Certain decisions in the latter were made for certain stated reasons, but let's examine those more closely. First note that the APPROACH 1 part will ensure that, given a burst of incoming datagrams, the first UDP recv() will occur somewhere inside boost.asio, so that's definitely a success. Furthermore, strand S will invoke m_low_lvl_sock.async_send() as soon as the pacing module decides to do so; if I recall correctly, boost.asio will invoke the UDP send() right then and there, synchronously (at least I wrote that unequivocally in a Node::async_low_lvl_packet_send_impl() comment). Again, that's as good as we can possibly want. Finally, for messages 2, 3, ... in that incoming datagram burst, our handler will (indirectly but synchronously) perform the UDP recv()s in strand S2. Here we're somewhat at boost.asio's mercy, but assuming its strand task scheduling is as efficient as possible, it should occur on the thread that's free, and either W1 or W2 should be free given the rest of the design. Still, boost.asio docs even say that different strands' tasks are NOT guaranteed to be invoked concurrently (though common sense implies they will be when possible... but WHAT IF WE'RE WRONG!!!?). Also, we don't know how much computational overhead is involved in making strands work so nicely (again, hopefully it's well written... but WHAT IF!!!?). A negative is the small mutex section around the two m_low_lvl_sock calls; not complex and probably hardly a performance concern, but still, it's a small cost. Finally, using strands – while not ugly – does involve a bit more code, and one has to be careful not to forget to wrap each handler with the appropriate strand (there is no compile- or run-time error if we forget!). So what can we change about APPROACH 2 to avoid those negatives? As stated in that approach's description, we could have thread W not deal with m_low_lvl_sock at all; thread W2 would have a separate Task_engine handling only m_low_lvl_sock (so no mutex needed). W2 would do both sends and receives on the socket; and absolutely nothing else (to ensure it's as efficient as possible at getting datagrams off the kernel buffer, solving the original problem). Yes, the receiving would have to share time with the sends, but assuming nothing else interferes, this feels like not much of a cost (at least compared with all the heavy lifting thread W does today anyway). Each receive would read off all available messages into raw buffers and pass those (sans any copying) on to thread W via post(W). The other negative, also already mentioned, is that once the pacing module (in thread W) decides that a datagram should be sent, the post(W2) for the task that would perform the send introduces a delay between the decision and the actual UDP send() done by boost.asio. Thinking about it now, it is questionable to me how much of a cost that really is. Without CPU contention, we can measure it; I expect it to be quite cheap, but I could be wrong. With CPU contention – which would have to come from many datagrams arriving at the same time – I'm not sure. It wouldn't be overly hard to test; basically flood with UDP traffic over loopback and log the delay between W deciding to send a datagram and W2 calling m_low_lvl_sock.async_send_to() (obviously use Fine_clock!). All in all, if we name the dedicated-thread approach described here APPROACH 3, then APPROACH 3 is appealingly simpler than APPROACH 2; and in most ways appears like it'd be at least as efficient and good at solving the original problem as APPROACH 2. The only danger that worries me is this business with messing up pacing (and no, moving pacing into W2 just endangers the receiving efficiency and introduces thread safety problems and complexity) by having it compete with receiving during incoming-traffic-heavy times. Ultimately, I'd recommend timing this "danger zone" as described a bit earlier (also time the delay introduced by post(W2) without any traffic coming in); and if it looks good, do APPROACH 3. Otherwise spring for APPROACH 2.
Receive UDP datagrams as soon as possible (avoid internal buffer overflow): APPROACH 2: To eliminate the problem to the maximum extent possible, we can dedicate its own thread – call it W2 – to reading m_low_lvl_sock. We could also write to m_low_lvl_sock on W2 (which would also allow us to use a different util::Task_engine from m_task_engine to exclusively deal with W2 and the UDP socket m_low_lvl_sock – simpler in some ways than the strand-based solution described below). There are a couple of problems with this. One, it may delay receive ops due to send ops, which compromises the goals of this project in the first place. Two, send pacing code is in thread W (and moving it to W2 would be complex and unhelpful); and once the decision to REALLY send a given datagram has been made, this send should occur right then and there – queuing that task on W2 may delay it, compromising the quality of send pacing (whose entire nature is about being precise down to the microsecond or even better). Therefore, we'd like to keep m_low_lvl_sock.async_send() in thread W along with all other work (which allows the vast majority of internal state to be accessed without locking, basically just the Send/Receive buffers excluded); except the chain of [m_low_lvl_sock.async_receive() -> post handler of received datagram onto thread W; and immediately m_low_lvl_sock.async_receive() again] would be on W2. AS WRITTEN, this is actually hard or impossible to do with boost.asio because of its design: m_low_lvl_sock must belong to exactly one Task_engine (here, m_task_engine), whereas to directly schedule a specific task onto a specific thread (as the above design requires) would require separate Task_engine objects (1 per thread): boost.asio guarantees a task will run on a thread which is currently executing run() – if 2 threads are executing run() on the same service, it is unknown which thread a given task will land upon, which makes the above design (AS WRITTEN) impossible. (Side note: I'm not sure it's possible in plain C with BSD sockets either. A naive design, at least, might have W select() on m_low_lvl_sock.native() for writability as well as other stuff like timers, while W2 select()s on the same for readability; then the two threads perform UDP send() and recv(), respectively, when so instructed by the select()s. Is it allowed to use select() on the same socket concurrently like that? StackOverflow.com answers are not clear cut, and to me, at least, it seems somewhat dodgy.) However, an equivalent design IS possible (and supported cleanly by boost.asio): In addition to the 2 threads, set up 2 strands, S and S2. All work except the m_low_lvl_sock reads and posting of the handler onto S will be scheduled with strand S. All work regarding m_low_lvl_sock reads and posting of its handler onto S will be scheduled with strand S2. Recall that executing tasks in 1 strand, with 2+ threads executing run(), guarantees that no 2+ of those tasks will execute simultaneously – always in series. This is actually – in terms of efficiency and thread safety – equivalent to the above W/W2 design. Since the S tasks will always execute serially, no locking is necessary to prevent concurrent execution; thus what we know today as thread W tasks (which need no locking against each other) will be equally thread safe; and the same holds for the new S2 tasks (which are considerably simpler and fewer in number). So that's the thread safety aspect; but why is efficiency guaranteed? The answer becomes clear if one pictures the original thread W/W2 design; basically little task blocks serially pepper the thread W timeline; and same for W2. By doing it with strands S and S2 (running on top of threads W and W2), the only thing that changes is that the little blocks might at random be swapped between threads. So the series of tasks T1 -> T2 -> T3 -> T4 meant for S might actually jump between W and W2 randomly; but whenever thread W2 is chosen instead of thread W, that leaves an empty "space" in thread W, which will be used by the S2 task queue if it needs to do work at the same time. So tasks may go on either thread, but the strands ensure that both are used with maximum efficiency while following the expected concurrency constraints (that each strand's tasks are to be executed in series). Note, however, that m_low_lvl_sock (the socket object) is not itself safe for concurrent access, so we WILL need a lock to protect the tiny/short calls m_low_lvl_sock.async_receive() and m_low_lvl_sock.async_send(): we specifically allow that a read and write may be scheduled to happen simultaneously, since the two are independent of each other and supposed to occur as soon as humanly possible (once the desire to perform either one is expressed by the code – either in the pacing module in strand S or the read handler in S2). In terms of nomenclature, if we do this, it'd be more fair to call the threads W1 and W2 (as they become completely equal in this design). (In general, for any further extensions of this nature (if we want still more mostly-independent task queues to use available processor cores efficiently), we would add 1 strand and 1 worker thread per each such new queue. So basically there's a thread pool of N threads for N mostly-independent queues, and N strands are used to use that pool efficiently without needing to lock any data that are accessed exclusively by at most 1 queue's tasks only. Resources accessed by 2 or more task queues concurrently would need explicit locking (e.g., m_low_lvl_sock in this design).) So then where we start thread W today, we'd start the thread pool of 2 threads W1, W2, with each executing m_task_engine.run(). Before the run()s execute, the initial tasks (each wrapped in strand S or S2, as appropriate) need to be posted onto m_task_engine; this can even occur in the user thread U in the constructor, before W1 and W2 are created. The destructor would m_task_engine.stop() (like today), ending each thread's run() and triggering the imminent end of that thread; at which point the destructor can W1.join() and W2.join() (today it's just W.join()).
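A bare-bones sketch of the strand mechanics described above (one shared Task_engine, two threads running it, two strands whose tasks never self-overlap); this is generic boost.asio, not the actual Node code:

  #include <boost/asio.hpp>
  #include <iostream>
  #include <thread>

  int main()
  {
    namespace asio = boost::asio;
    asio::io_context task_engine;
    auto work_guard = asio::make_work_guard(task_engine);

    auto strand_s  = asio::make_strand(task_engine);  // "everything else" queue (S)
    auto strand_s2 = asio::make_strand(task_engine);  // low-level-receive queue (S2)

    asio::post(strand_s,  []{ std::cout << "S task (protocol logic)\n"; });
    asio::post(strand_s2, []{ std::cout << "S2 task (UDP receive handling)\n"; });

    std::thread w1([&]{ task_engine.run(); });
    std::thread w2([&]{ task_engine.run(); });

    work_guard.reset();  // let run() return once the posted work is done
    w1.join();
    w2.join();
  }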
Receive UDP datagrams as soon as possible (avoid internal buffer overflow): APPROACH 1 (CO-WINNER!): One approach is to note that, as of this writing, we call m_low_lvl_sock.async_receive(null_buffers); the null_buffers value for the buffers arg means that the handler is called without any actual UDP receive being performed by boost.asio; our handler is called once there is at least 1 message TO read; and then indeed our handler does read it (and any more messages that may also have arrived). Well, if we pass in an actual buffer instead, then boost.asio will read 1 (and no more, even if there are more) message into that buffer and have it ready in the handler. Assuming the mainstream case involves only 1 message being ready, and/or assuming that reading at least 1 message each time ASAP would help significantly, this may be a good step toward relieving the problem, when it exists. The code becomes a tiny bit less elegant, but that's negligible. This seems like a no-brainer that should be included in any solution, but it by itself may not be sufficient, since more than 1 datagram may be waiting, and datagrams 2, 3, ... would still have to be read off by our code in the handler. So other approaches should still be considered.
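A sketch of the mechanical change involved (member names here are illustrative, not the actual Node data members): hand boost.asio a real buffer so datagram #1 is already read when the handler runs, then drain any further queued datagrams as today:

  #include <boost/asio.hpp>
  #include <array>

  namespace asio = boost::asio;

  void arm_receive(asio::ip::udp::socket& low_lvl_sock,
                   std::array<char, 65536>& packet_buf,
                   asio::ip::udp::endpoint& low_lvl_remote_endpoint)
  {
    low_lvl_sock.async_receive_from(
      asio::buffer(packet_buf), low_lvl_remote_endpoint,
      [&](const boost::system::error_code& err, std::size_t n_rcvd)
      {
        if (err == asio::error::operation_aborted)
        {
          return;  // shutting down; do not re-arm
        }
        if (!err)
        {
          // packet_buf[0..n_rcvd) already holds datagram #1; handle it, then drain datagrams
          // 2, 3, ... synchronously (e.g., non-blocking receive loop), as the handler does today.
        }
        arm_receive(low_lvl_sock, packet_buf, low_lvl_remote_endpoint);  // re-arm for the next one
      });
  }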
More uniform diagnostic logging: There is much diagnostic logging in the implementation (FLOW_ERROR*(), etc.), but some of it lacks useful info like sock or serv (i.e., the ostream representations of Peer_socket and Server_socket objects, which include the UDP/Flow endpoints involved in each socket). The messages that do include these have to do so explicitly. Provide some macros to automatically insert this info, then convert the code to use the macros in most places. Note that existing logging is rather exhaustive, so this is not the biggest of deals but would be nice for ease of coding (and detailed logging).
Based on some outside experience, there may be problems – particularly considering the to-do regarding dual-stack IPv6/v4 support – in servers listening in multiple-IP situations; make sure we support these seamlessly. For example consider a machine with a WAN IP address and a LAN (10.x.x.x) IP address (and possibly IPv6 versions of each also) that (as is typical) binds on all of them at ANY:P (where P is some port; and ANY is the IPv6 version, with dual-stack mode ensuring V4 datagrams are also received). If a client connects to a LAN IP, while in our return datagrams we set the source IP to the default, does it work? Outside experience shows it might not, depending, plus even if in our protocol it does, it might be defeated by some firewall... the point is it requires investigation (e.g., mimic TCP itself; or look into what IETF or Google QUIC does; and so on).
flow::net_flow should use flow::cfg for its socket-options mechanism. It is well suited for that purpose, and it uses some ideas from those socket-options in the first place but is generic and much more advanced. Currently net_flow socket-options are custom-coded from long before flow::cfg existed.
ostream output operators for Node and asio::Node should exist. Also scour through all types; possibly some others could use the same. (I have been pretty good at already implementing these as-needed for logging; but I may have "missed a spot.")
Some of the ostream<< operators we have take X* instead of const X&; this should be changed to the latter for various minor reasons and for consistency.
Actively support IPv6 and IPv4, particularly in dual-stack mode (wherein net_flow::Server_socket would bind to an IPv6 endpoint but accept incoming V4 and V6 connections alike). It already supports it nominally, in that one can indeed listen on either type of address and connect to either as well, but how well it works is untested, and from some outside experience it might involve some subtle provisions internally.
Receive UDP datagrams as soon as possible (avoid internal buffer overflow): APPROACH 4: It now occurs to me how to solve the main questionable part about APPROACH 3. If we fear that the reads and writes in thread W2 may compete for CPU, especially the reads delaying timing-sensitive paced writes, then we can eliminate the problem by taking W2's own util::Task_engine (which would be separate from m_task_engine) and having two equal threads W2' and W2'' start up and then each call Task_engine::run(). Without having to use any strand, this will essentially (as documented in boost.asio's doc overview) establish a thread pool of 2 threads and then perform the receive and send tasks at random on whichever thread(s) is/are available at a particular time. Since there's no strand(s) to worry about, the underlying overhead in boost.asio is probably small, so there's nothing to fear about efficiency. In fact, in this case we've now created 3 separate threads W, W2', W2'', all doing things independently of each other, which is an excellent feature in terms of using multiple cores. Do note we will now need a mutex and very short critical sections around the calls to m_low_lvl_sock.async_receive() and m_low_lvl_sock.async_send(), but as noted before this seems extremely unlikely to have any real cost due to the shortness of the critical sections in both threads. If this is APPROACH 4, then I'd say just time how much absolute delay is introduced by a post(W2') or post(W2'') of the async send call compared to directly making such a call on thread W, as is done today. I suspect it's small, in which case the action is to go for APPROACH 4... finally.
Receive UDP datagrams as soon as possible (avoid internal buffer overflow): APPROACH 5 (CO-WINNER!): Actually, the thing I've been dismissing in approaches 2-4, which was to combine the pacing logic with the actual m_low_lvl_sock.async_send() (like today) but in their own dedicated thread, now seems like the best way to solve the one questionable aspect of APPROACH 4. So, APPROACH 4, but: Move the pacing stuff into the task queue associated with threads W2' and W2''. So then these 2 threads/cores will be available for 2 task queues: one for pacing timers + datagram sending over m_low_lvl_sock (with mutex); the other for receiving over m_low_lvl_sock (with the same mutex). Now, the only "delay" is moved to before pacing occurs: whatever minimal time cost exists of adding a queue from thread W to thread W2' or W2'' occurs just before the pacing logic, after which chances are the datagram will be placed on a pacing queue anyway and sent off somewhat later; intuitively this is better than the delay occurring between the pacing logic and the actual UDP send. Note, also, that the timing-sensitive pacing logic now gets its own thread/core and should thus work better vs. today in situations when thread W is doing a lot of work. This is even more logical than APPROACH 4 in that sense; the pacing and sending are concern 1 and get their own thread (essentially; really they get either W2' or W2'' for each given task); the receiving is concern 2 and gets its own thread (same deal); and all the rest is concern 3 and remains in its own thread W (until we can think of ways to split that into concerns; but that is another to-do). Only one mutex with 2 very small critical sections, as in APPROACH 4, is used. The only subtlety regarding concurrent data access is in Node::mark_data_packet_sent(), which is called just before m_low_lvl_sock.async_send(), and which finalizes certain aspects of Peer_socket::Sent_packet::Sent_when; most notably Peer_socket::Sent_packet::Sent_when::m_sent_time (used in RTT calculation upon ACK receipt later). This is stored in Peer_socket::m_snd_flying_pkts_by_sent_when, which today is not protected by mutex due to only being accessed from thread W; and which is extremely frequently accessed. So either we protect the latter with a mutex (out of the question: it is too frequently accessed and would quite possibly reduce performance) or something else. Currently I think Node::mark_data_packet_sent() should just be placed onto m_task_engine (thread W) via post() but perhaps take all or most of the items to update Sent_when with as arguments, so that they (especially Sent_when::m_sent_time) could be determined in thread W2' or W2'' but written thread-safely in W. (There is no way some other thread W task would mess with this area of Peer_socket::m_snd_flying_pkts_by_sent_when before the proposed mark_data_packet_sent() was able to run; thread W had just decided to send that packet over the wire in the first place; so there's no reason to access it until ACK – much later – or some kind of socket-wide catastrophe.) All that put together I dub APPROACH 5. Thus, APPROACH 1 + APPROACH 5 seems like the best idea of all, distilling all the trade-offs into the fastest yet close to simplest approach.
It may be desirable to use not boost.asio's out-of-the-box UDP receive routines but rather extensions capable of some advanced features, such as recvmsg() – which can obtain kernel receipt time stamps and the destination IP address via the cmsg feature. This would tie into various other to-dos listed around here. There is, as of this writing, also a to-do in the top-level flow namespace doc header about bringing some code into a new io namespace/Flow module; this includes the aforementioned recvmsg() wrapper.
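For reference, a minimal POSIX/Linux sketch (not Flow code) of the recvmsg()-with-ancillary-data technique for kernel receipt time stamps; destination-IP retrieval via IP_PKTINFO would follow the same cmsg pattern:

  #include <sys/types.h>
  #include <sys/socket.h>
  #include <sys/uio.h>
  #include <sys/time.h>
  #include <cstddef>
  #include <cstring>

  // Returns true and fills *kernel_time if a timestamp cmsg accompanied the datagram.
  bool recv_with_kernel_timestamp(int udp_fd, char* buf, std::size_t buf_size,
                                  std::size_t* n_rcvd, timeval* kernel_time)
  {
    const int on = 1;
    setsockopt(udp_fd, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof on);  // in practice: once per socket

    iovec iov{buf, buf_size};
    alignas(cmsghdr) char cmsg_buf[CMSG_SPACE(sizeof(timeval))];
    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cmsg_buf;
    msg.msg_controllen = sizeof cmsg_buf;

    const ssize_t rc = recvmsg(udp_fd, &msg, 0);
    if (rc < 0) { return false; }
    *n_rcvd = static_cast<std::size_t>(rc);

    for (cmsghdr* c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c))
    {
      if ((c->cmsg_level == SOL_SOCKET) && (c->cmsg_type == SCM_TIMESTAMP))
      {
        std::memcpy(kernel_time, CMSG_DATA(c), sizeof(timeval));
        return true;
      }
    }
    return false;  // datagram read, but no timestamp cmsg present
  }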
It may be desirable to further use recvmmsg() for UDP input; this allows one to read multiple UDP datagrams with one call, for performance.
By the same token, wrapping sendmsg() and sendmmsg() may allow for further perf and feature improvements – in some ways potentially symmetrically to recvmsg() and recvmmsg() respectively. However, as of this writing, I (ygoldfel) see this more as an opportunistic "look into it" thing and not something of active value; whereas recv[m]msg() bring actual features we actively desire for practical reasons.
Send and Receive buffer max sizes: These are set to some constants right now. That's not optimal. There are two competing factors: throughput and RAM. If buffer is too small, throughput can suffer in practice, if the Receiver can't read the data fast enough (there are web pages that show this). Possibly with today's CPUs it's no longer true, but I don't know. If buffer is too large and with a lot of users, a LOT of RAM can be eaten up (however note that a server will likely be mostly sending, not receiving, therefore it may need smaller Receive buffers). Therefore, as in Linux 2.6.17+, the buffer sizes should be adaptively sized. It may be non-trivial to come up with a good heuristic, but we may be able to copy Linux. The basic idea would probably be to use some total RAM budget and divide it up among the # of sockets (itself a constant estimate, or adaptive based on the number of sockets at a given time?). Also, buffer size should be determined on the Receive side; the Send side should make its buffer to be of equal size. Until we implement some sensible buffer sizing, it might be a good idea (for demo purposes with few users) to keep the buffers quite large. However, flow control (receive window) is now implemented and should cope well with momentary Receive buffer exhaustion. Related facts found on the web: In Linux, since a long time ago, Send buffer size is determined by other side's Receive buffer size (probably sent over in the SYN or SYN-ACK as the receive window). Also, in older Linuxes, Receive buffer defaults to 128k but can be manually set. Supposedly the default can lead to low throughput in high-speed (gigabit+) situations. Thus Linux 2.6.17+ apparently made the Receive buffer size adaptive.
Drop Acknowledgments: DCCP, a somewhat similar UDP-based protocol, uses the concept of Data-Dropped acknowledgments. If a packet gets to the receiver, but the receiver is forced to drop it (for example, no Receive buffer space; not sure if there are other reasons in Flow protocol), then the sender will only find out about this by inferring the drop via Drop Timeout or getting acknowledgments for later data. That may take a while, and since receiver-side drops can be quite large, it would be more constructive for the receiver to send an un-ACK of sorts: a Data-Dropped packet informing the sender that specific data were dropped. The sender can then update his congestion state (and retransmit if we enable that). See RFC 4340 and 4341.
Add extra-thread-safe convention for setting options: We can provide a thread-safe (against other user threads doing the same thing) macro to set a given option. set_options() has a documented, albeit in practice not usually truly problematic, thread safety flaw if one calls options(), modifies the result, then calls set_options(). Since another thread may modify the Node's options between the two calls, the latter call may unintentionally revert an option's value. Macro would take an option "name" (identifier for the Node_options member), a Node, and a target value for the option and would work together with a helper method template to obtain the necessary lock, make the assignment to the internal option, and give up the lock. The implementation would probably require Node to expose its internal stored Node_options for the benefit of this macro only. Anyway, this feature is not super-important, as long as the user is aware that modifying options from multiple threads simultaneously may result in incorrect settings being applied.
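A standalone sketch of the proposed mechanism, with heavily simplified stand-ins for Node/Node_options (these are not the real classes, and the member names are made up):

  #include <mutex>

  struct Node_options  // stand-in: two made-up options
  {
    unsigned int m_st_max_block_size = 1024;
    bool m_dyn_guarantee_one_low_lvl_in_flight = true;
  };

  class Node  // stand-in
  {
  public:
    // The helper method template the macro relies on: lock, assign one member, unlock.
    template<typename Value>
    void set_one_option(Value Node_options::* opt_member, const Value& new_val)
    {
      std::lock_guard<std::mutex> lock(m_opts_mutex);
      m_opts.*opt_member = new_val;
    }
  private:
    std::mutex m_opts_mutex;
    Node_options m_opts;
  };

  // The syntactic sugar: name the option by its Node_options member identifier.
  #define NODE_SET_OPTION(node, opt_member, new_val) \
    (node).set_one_option(&Node_options::opt_member, (new_val))

  // Usage: NODE_SET_OPTION(node, m_st_max_block_size, 2048u);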
The preceding to-do regarding Node_options applies to Peer_socket_options stored in Peer_socket in an analogous way.
Consider removing Flow ports and even Server_socket: As explained above, we add the concept of a large set of available Flow ports within each Node, and each Node itself has a UDP port all to itself. So, for example, I could bind a Node to UDP port 1010, and within that listen() on Flow ports 1010 (unrelated to UDP port 1010!) and 1011. In retrospect, though, is that complexity necessary? We could save quite a few lines of code, particularly in the implementation (class Port_space, for example) and the protocol (extra bytes for Flow source and target ports, for example). (They're fun and pretty lines, but the absence of lines is arguably even prettier albeit less fun. On the other hand, bugs aren't fun, and more code implies a higher probability of bugs, maintenance errors, etc.) The interface would also be a bit simpler; and not just due to fewer items in Remote_endpoint (which would in fact reduce to util::Udp_endpoint and cease to exist). Consider Server_socket; currently listen() takes a flow_port_t argument and returns a Server_socket which is listening; calling accept() (etc.) on the latter yields Peer_socket, as the other side connects. Without Flow ports, there is no argument to listen(); in fact, Server_socket itself is not strictly necessary and could be folded into Node, with listen() becoming essentially something that turns on the "are we listening?" Boolean state, while stop_listening() would turn it off (instead of something like Server_socket::close()). (Important note: That was not an endorsement of removing Server_socket. Arguably it is still a nice abstraction. Removing it would certainly remove some boiler-plate machinery to do with Server_socket's life cycle, on the other hand. Perhaps it's best to take a two-step approach: remove Flow ports first; then after a long time, assuming it becomes clear that nothing like them is going to come back, remove Server_socket as well.) A key question is, of course, what would we lose? At first glance, Flow ports allow multiple connections on a single UDP-port-taking Flow server, including multiple connections from one client (e.g., with differing connection parameters such as reliability levels among the different connections, or "channels")... but actually that'd still work without Flow ports, assuming the "one client's" multiple connections can bind to different (presumably ephemeral) UDP ports; since the tuple (source host, source UDP port) is still enough to distinguish between the 2+ "channels" of the same "client" connecting to the one Flow Node (vs. today's tuple: source host, source UDP port, source Flow port, destination Flow port; see struct Socket_id). However, without Flow ports, it is not possible for one Node to connect to another Node twice, as each Node by definition is on one port. Is this important? Maybe, maybe not; for NAT purposes it can be important to use only 1 port; but that typically applies only to the server, while the client can send packets from anywhere. However, gaming applications can be very demanding, and for the most restrictive NAT types they might desire only a single port used on both sides. So whether to remove Flow ports is somewhat questionable, now that they exist; but arguably they didn't need to be added in the first place, until they were truly needed. I'd probably leave them alone, since they do exist.
Multi-core/multi-threading: The existing implementation already has a nice multi-threaded property, namely that each Node (object that binds to a single UDP endpoint/port) is entirely independent of any other such object – they have entirely separate data, and each one does all its work on a separate thread. So to make use of multiple cores/processors, one can set up multiple Node objects. (Obviously this only makes sense for apps where using multiple ports is acceptable or even desired. E.g., a server could listen on N different UDP ports, where N = # of cores.) However, it would be nice for a single Node to be as multi-core/processor-friendly as possible. This is partially addressed by the "Dedicated thread to receive UDP datagrams ASAP" to-do item elsewhere. We could go further. boost.asio lets one easily go from 1 thread to multiple threads by simply starting more worker threads like W (W1, W2, ...) and executing m_task_engine.run() in each one – note that m_task_engine is shared (sans lock). Then subsequent handlers (timer-fired handlers, ack-received handlers, data-received handlers, and many more) would be assigned evenly to available threads currently executing run(). However, then all data these handlers access would need to be protected by a mutex or mutexes, which would be a significant increase in complexity and maintenance headaches compared to existing code, which features mutexes for data accessed both by W and the user thread(s) U – which excludes most Node, Server_socket, Peer_socket state – essentially the user-visible "state" enums, and the Receive and Send buffers; but hugely complex things like the scoreboards, etc. etc., needed no mutex protection, whereas with this change they would need it. Actually, if the implementation essentially uses one mutex M, and every handler locks it for the entirety of its execution, then one isn't really making use of multi-cores/etc. anyway. One could make use of boost.asio "strands" to avoid the need for the mutex – just wrap every handler in a shared strand S, and no locking is needed; boost.asio will never execute two handlers simultaneously in different threads. While this would arguably make the code simpler, in terms of performance it wouldn't make any difference anyway, as it is functionally equivalent to the lock-M-around-every-operation solution (in fact, internally, it might even amount to exactly that anyway). So that's probably not worth it. We need to have more mutexes or strands, based on some other criterion/criteria. After a datagram is demuxed, the vast majority of work is done on a particular socket independently of all others. Therefore we could add a mutex (or use an equivalent) into the socket object and then lock on that mutex. Multiple threads could then concurrently handle multiple sockets. However, for this to be used, one would have to use a single Node (UDP endpoint) with multiple sockets at the same time. Without any changes at all, one can get the same concurrency by instead setting up multiple Node objects. Other than a bit of lost syntactic sugar (arguably) – multiple Node objects needed, each one having to initialize with its own set of options, for example – this is particularly cost-free on the client side, as each Node can just use its own ephemeral UDP port. On the server side the network architecture has to allow for multiple non-ephemeral ports, and the client must know to (perhaps randomly) connect to one of N UDP ports/endpoints on the server, which is more restrictive than on the client. So perhaps there are some reasons to add the per-socket concurrency – but I would not put a high priority on it. IMPORTANT UPDATE: Long after the preceding text was written, the flow::async Flow module was created containing the flow::async::Concurrent_task_loop interface. That looks at the general problem of multi-tasking thread pools and what's the user-friendliest-yet-most-powerful way of doing it. While the preceding discussion in this to-do has been left unchanged, one must first familiarize oneself with flow::async; and then read the above, both because some of those older ideas might need reevaluation; and because some of those ideas may have been implemented by flow::async and are now available easily.
In Node::low_lvl_packet_sent(), the UDP async_send()
handler, there is an inline to-do about specially treating the corner case of the would_block
and try_again
boost.asio errors being reported (a/k/a POSIX EAGAIN
/EWOULDBLOCK
). Please see that inline comment for details.
Class Node private
section is very large. It's so large that the implementations of the methods therein are split up among different files such as flow_peer_socket.cpp
, flow_event_set.cpp
, flow_low_lvl_io.cpp
, etc. Consider the following scheme to both better express this separation and enforce which of a given method group's methods are meant to be called by outside code vs. which are merely helpers thereof: Introduce static
-method-only inner classes (and, conceivably, even classes within those classes) to enforce this grouping (public
methods and private
methods enforcing what is a "public" helper vs. a helper's helper).
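A hypothetical sketch of that grouping scheme (group and method names invented purely for illustration) could look like:

  // Each file-sized method group becomes a static-method-only inner class whose
  // public methods are the group's intended entry points and whose private
  // methods are helpers-of-helpers.
  class Node
  {
  private:
    // Group implemented in flow_low_lvl_io.cpp:
    class Low_lvl_io
    {
    public:
      static void async_low_lvl_recv(Node* node);         // Meant for other Node code.
    private:
      static void handle_incoming(Node* node /*, ... */); // Helper of the above only.
    };

    // Group implemented in flow_peer_socket.cpp:
    class Peer_socket_ops
    {
    public:
      static void handle_data_to_send(Node* node /*, ... */);
    private:
      static void split_into_packets(Node* node /*, ... */);
    };

    friend class Low_lvl_io;       // Nested classes already have access to Node's
    friend class Peer_socket_ops;  // privates since C++11, so these may be redundant.
  };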
We are now on Boost 1.75; the use of asio's null_buffers
semantics is deprecated and should be changed to the replacement mechanism suggested in Boost docs – the async_wait()
method.
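The migration itself is mechanical; a hedged sketch using the standard boost.asio API (the function and variable names here are illustrative, not Flow's actual ones):

  // Sketch (not Flow code) of the suggested migration: instead of the deprecated
  //   sock.async_receive(boost::asio::null_buffers(), handler);
  // use the wait-based API available in modern Boost (certainly by 1.75).
  #include <boost/asio.hpp>

  void async_wait_for_datagram(boost::asio::ip::udp::socket& low_lvl_sock)
  {
    low_lvl_sock.async_wait(
      boost::asio::ip::udp::socket::wait_read,
      [](const boost::system::error_code& err)
      {
        if (err)
        {
          return; // E.g., operation_aborted at shutdown.
        }
        // Socket is readable: now perform the actual non-blocking receive_from(),
        // exactly as the old null_buffers handler did.
      });
  }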
Make use of flow::async::Concurrent_task_loop or flow::async::Single_thread_task_loop, instead of manually setting up a thread and util::Task_engine, for m_worker. I, Yuri, wrote the constructor, worker_run(), destructor, and related setup/teardown code as my very first boost.asio activity ever. It's solid, but flow::async just makes it easier and more elegant to read/maintain; plus this would increase Flow-wide consistency. It would almost certainly reduce the number of methods and even, nicely, some state (such as m_event_loop_ready).
opts
now that we've switched to C++11 or better? Provide another method to shut down everything gracefully?
Server_socket objects closed as if by what?
@internal
(Doxygen command) to hide from generated public documentation, and that works, but really they should not be visible in the publicly-exported (not in detail/) header source code; so this should be reorganized for cleanliness. The prototypes like this one can be moved to a detail/ header or maybe directly into .cpp that uses them (for those where it's only one). <<
and >>
operators should use the flow::util::istream_to_enum() pattern which is much easier than the overwrought old thing in there now involving two map
s. Perhaps add a generic enum_to_strings()
to provide the body for net_flow::Congestion_control_selector::get_ids(). Closing connection considerations. May implement closing only via timeouts at first (as opposed to explicit closing). Below text refers to close_final()
and close_start()
, but those are just ideas and may be replaced with timeout, or nothing. At this time, the only closing supported is abrupt close due to error or abrupt close via close_abruptly().
Rename State
and Open_sub_state
to Phase
and Open_sub_phase
respectively; and Int_state
to State
. Explain difference between phases (application-layer, user-visible and used close to application layer) and states (transport layer, internal).
Look into a way to defeat the need for boiler-plate trickery – with low but non-zero perf cost – involving *_socket
-vs-Node
circular references in method templates, such as the way Peer_socket::send() and Peer_socket::receive() internally make Function<>
s before forwarding to the core in Node. Can this be done with .inl
files? Look into how Boost internally uses .inl
files; this could inspire a solution... or not.
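For reference, one generic shape the .inl approach could take – the file layout and every name below are invented for illustration, and whether this actually fits Flow's build is exactly what the to-do asks to investigate:

  // --- peer_socket.hpp (sketch) ---
  #include <cstddef>
  class Node; // Forward declaration suffices to *declare* the method template.
  class Peer_socket
  {
  public:
    template<typename Const_buffer_sequence>
    std::size_t send(const Const_buffer_sequence& data); // Body deferred to peer_socket.inl.
  private:
    friend class Node;
    Node* m_node = nullptr;
  };

  // --- node.hpp (sketch): full Node definition; at its very end: #include "peer_socket.inl" ---
  class Node
  {
  public:
    template<typename Const_buffer_sequence>
    std::size_t send_core(Peer_socket* sock, const Const_buffer_sequence& data); // Hypothetical.
  };

  // --- peer_socket.inl (sketch): included only after Node's full definition, so the
  // template body can call into Node directly -- no Function<> indirection and no
  // circular-include trouble. ---
  template<typename Const_buffer_sequence>
  std::size_t Peer_socket::send(const Const_buffer_sequence& data)
  {
    return m_node->send_core(this, data);
  }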
Currently this close_abruptly() is the only way for the user to explicitly close one specified socket. All other ways are due to error (or other side starting graceful shutdown, once we implement that). Once we implement graceful close, via close_start()
and close_final()
, use of close_abruptly() should be discouraged, or it may even be deprecated (e.g., Node
s lack a way to initiate an abrupt close for a specific socket).
Should close_abruptly() return bool (false on failure)?
boost::weak_ptr<Node>
would be ideal for this, but of course then Node would have to (only?) be available via shared_ptr<>. struct
instance. Since a typical pattern might include 1 lost packet followed by 100 received packets, we'd be able to cut down memory use by a factor of about 100 in that case (and thus worry much less about the limit). Of course the code would get more complex and potentially slower (but not necessarily significantly). Can we make m_sent_cwnd_bytes and some of its peers const
?
Why is m_sent_cwnd_bytes in Sent_when struct
and not directly in Sent_packet
? Or maybe it should stay here, but Sent_when should be renamed Send_attempt
(or Send_try
for brevity)? Yeah, I think that's it. Then Sent_packet::m_sent_when – a container of Sent_when objects – would become m_send_tries
, a container of Send_try
objects. That makes more sense (sentce?!) than the status quo which involves the singular-means-plural strangeness.
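A sketch of the proposed renaming (the exact fields are illustrative; only m_sent_cwnd_bytes is taken from the discussion above):

  #include <chrono>
  #include <cstddef>
  #include <vector>

  struct Send_try // Formerly Sent_when: one record per send attempt of a packet.
  {
    std::chrono::steady_clock::time_point m_sent_time; // Illustrative field name.
    std::size_t m_sent_cwnd_bytes;
  };

  struct Sent_packet
  {
    std::vector<Send_try> m_send_tries; // Formerly m_sent_when (singular-means-plural).
    // ...other per-packet metadata...
  };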
Node
-wide value; we should probably have a Node_info
class for such things. boost::filesystem::path
) in ~Port_space()
. This wouldn't work if Port_space crashed, however. To solve that, we could easily start a thread in Port_space() (and join it in ~Port_space()
) that would save that state every N seconds. If that's not enough, we can instead have that thread run a boost.asio io_service::run()
event loop and io_service::post()
the save operation onto it (from arbitrary threads) each time a new ephemeral port is reserved (with some kind of protection against incessant disk writing built into this worker thread; e.g., don't save if saved in the last N msec). There is a sub-case of the immediately preceding to-do that may be performed first without much controversy. That is the case where there is some member method of some type, that is then called by an extremely thin wrapper free function (not even a friend
). In fact, hash_value()
and Remote_endpoint::hash() are such a pair. Assuming one does not forget Doxygen's relatesalso
command, it would be easy and concise to just get rid of the member and simply move its implementation directly into the free function. After all, people are meant to use the free function anyway, so why the middle-man method? In this example, hash_value()
would directly compute the hash, and Remote_endpoint::hash() would not exist; but the former would include a Doxygen relatesalso Remote_endpoint
command to ensure properly linked generated documentation.
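A hedged sketch of that end state (the fields combined here are assumptions – Remote_endpoint's real members and hash inputs may differ):

  #include <boost/asio.hpp>
  #include <boost/functional/hash.hpp>
  #include <cstddef>

  struct Remote_endpoint // Assumed shape, for illustration only.
  {
    boost::asio::ip::udp::endpoint m_udp_endpoint;
    unsigned int m_flow_port = 0;
  };

  /**
   * Free function computing the hash directly -- no member middle-man.
   * @relatesalso Remote_endpoint
   */
  inline std::size_t hash_value(const Remote_endpoint& endpoint)
  {
    // The real implementation would presumably also fold in the IP address.
    std::size_t val = 0;
    boost::hash_combine(val, endpoint.m_udp_endpoint.port());
    boost::hash_combine(val, endpoint.m_flow_port);
    return val;
  }

With this in place, boost::hash<Remote_endpoint> finds hash_value() via argument-dependent lookup, so unordered containers keyed on Remote_endpoint keep working without the member method.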
The respected Scott Meyers ("Effective C++") would recommend that Remote_endpoint::hash() and Remote_endpoint::operator==() be written as free functions instead of as struct
members (see http://www.drdobbs.com/cpp/how-non-member-functions-improve-encapsu/184401197). The same would apply to a potentially great number of other non-virtual
methods (including operators) of other struct
s and classes that could be implemented without friend
, so really pursuing this approach could touch quite a few things. I won't repeat Meyers' reasoning; see the link. I find the reasoning mostly compelling. To summarize his final suggestions: An operation on a type should be a free function if ALL of the following hold: it is not virtual
by nature; AND: it is <<
or >>
, and/or its would-be left operand would require type conversions, and/or it can be implemented entirely via the type's publicly exposed interface. We already follow the advice for <<
and >>
(if only because we apply it to stream ops, and for that it just makes sense, <<
especially, since the stream is the left-most operand there, so it has to be a free function in that case – and thus also for >>
for consistency). The type conversion is not a common thing for this library; so that leaves non-virtual
operations (which is most of them) that can be implemented via public APIs only (which is probably common for struct
s like Remote_endpoint, though we generally try not to have lots of struct
methods in the first place... usually). Before rushing headlong into this project, consider a few more things, though. 1, Doxygen wouldn't pick up the relation between a non-friend free function dealing with struct
or class C
and C
itself; so verbosity/error-proneness would increase by having to add a Doxygen relatesalso
special command to each free function; otherwise documentation becomes less readable (and there's no way to enforce this by having Doxygen fail without this being done, somehow). 2, in the C
-returning-static-member-of-C
pattern, usually in our code it would call a private C
constructor, meaning it would require friend
to make it a free function, meaning it breaks Meyers' own rule and thus should be left a member. 3, Meyers tends to place very little value on brevity as its own virtue. If you find that following the above rule in some case seems to be significantly increasing code complexity/size, maybe it's best to leave it alone. (I am thinking of Low_lvl_packet and its sub-types like Ack_packet: they are struct
s but not typical ones, with complex APIs; making those APIs free functions when non-virtual
sounds like a rather hairy change that "feels" like it would reduce simplicity and may increase size, at least due to all the necessary 'splaining in the comments.) All that said, this is perhaps worth pursuing (in the pursuit of stylistic perfection) – but surgically and first-do-no-harm-edly. Oh, also, constructors don't count for this and should continue to remain constructors and not some free functions stuff (should be obvious why, but just in case you get any ideas...).
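Continuing the hypothetical Remote_endpoint sketch from the hash_value() item above, the same rule applied to operator==() would yield simply:

  /// @relatesalso Remote_endpoint
  inline bool operator==(const Remote_endpoint& lhs, const Remote_endpoint& rhs)
  {
    // Needs only the public interface, is not virtual by nature, and has no
    // left-operand conversion concerns: per the cited rule, a non-friend free function.
    return (lhs.m_flow_port == rhs.m_flow_port)
           && (lhs.m_udp_endpoint == rhs.m_udp_endpoint);
  }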
One way to provide finer pacing, in the situation where we must send more than 1 packet within the minimal schedulable timer period, is to rely on events other than timer clock ticks that occur more frequently. Namely, ACKs will possibly arrive faster than the minimal schedulable timer ticks do. Therefore, when a time slice begins, we can send no packets (instead of all the packets that must be sent in that time slice, as is the case currently). Then over the course of the time slice we can opportunistically – upon each ACK – send the number of queued packets appropriate to that point within the time slice (this can be computed accurately, because we can MEASURE time very accurately – we just can't schedule things as accurately). Finally, once the time slice does expire, send off any packets still remaining for that slice. This way we opportunistically provide finer pacing, where the ACK clock allows it.
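A hedged sketch of that opportunistic scheme (all names invented; the real pacing state lives elsewhere and is more involved):

  #include <algorithm>
  #include <chrono>
  #include <cstddef>
  #include <queue>

  struct Pacing_slice
  {
    using Clock = std::chrono::steady_clock;

    Clock::time_point m_slice_start;
    Clock::duration m_slice_period;   // Minimal schedulable timer period.
    std::size_t m_slice_budget = 0;   // Packets allowed in this slice.
    std::size_t m_sent_in_slice = 0;  // Packets sent so far in this slice.
    std::queue<int> m_queue;          // int stands in for a queued packet handle.

    // Called on every ACK arrival, and once more when the slice's timer finally
    // fires: send only as many packets as the measured elapsed time justifies.
    void opportunistic_send(Clock::time_point now)
    {
      const double elapsed_fraction
        = std::min(1.0, std::chrono::duration<double>(now - m_slice_start)
                          / std::chrono::duration<double>(m_slice_period));
      const auto allowed_by_now
        = static_cast<std::size_t>(elapsed_fraction * m_slice_budget);
      while ((m_sent_in_slice < allowed_by_now) && !m_queue.empty())
      {
        /* ...send m_queue.front()... */
        m_queue.pop();
        ++m_sent_in_slice;
      }
    }
  };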
Obtain the first RTT measurement based on SYN-SYN_ACK or SYN_ACK-SYN_ACK_ACK interval; then SRTT will be available from the start, and packet pacing can be enabled with the very first DATA packet, avoiding the initial (albeit small) burst of DATA packets. This is pretty easy.
As noted above we perform packet pacing but currently choose to assign a value of 0 bytes to an ACK. That is, while we do preserve the order of DATA and ACK packets – if both happen to be in the outgoing stream – we do not delay the sending of the ACK once it is the next packet to be sent out. However, even so, an ACK's sending may be delayed by the pacing applied to DATA packets intermixed with it. Therefore the ACK delay measurement we take here may be incorrect (too low) in that case. This can cause overestimated RTTs on the sender's side. The to-do is to correct the ACK delay value in a given ACK by adding the pacing delay (if any) of the ACK to the individual ACK delays within it. Conceptually this is similar to the m_sent_when
value being set when choosing to send a DATA packet and then corrected in the pacing module later. This to-do is not important until we in practice start mixing sending and receiving at the application layer... but still – it's worth knowing that there is a design bug here.
My quick look into Google BBR, the recent advanced congestion control algorithm, suggests that the pacing algorithm can be used as part of congestion control rather than something merely applied after congestion control has made the CWND decision. I believe the idea would be this: If we can somehow very reliably determine the available bandwidth, which Google BBR does purport to do, then instead of controlling send rate via CWND, instead one could apply it to this pacing module. It is a slight modification of the algorithm described above: instead of the rate of sending being equal to CWND/SRTT, make it equal to the bandwidth determined through Google BBR. If I understand correctly, Google BBR does maintain a CWND as well, but this is a safety upper bound, in case the computed bandwidth is too high to be practical (in this case CWND will slow down data being delivered to the pacing module – but, normally, this would not be the case, and the pacing module would be, in practice, the thing controlling the sending rate). Note this is a pretty big conceptual change; in particular, it would make the queue Q grow much larger than what we see currently.
double
or a big-integer type from boost.multiprecision? Implement Server_socket::close()
functionality – at least the equivalent of Peer_socket::close_abruptly().
Limit Server_socket listen queue/set length.
Rename State
to Phase
, as with similar to-do in class Peer_socket doc header.
const
? util::Blob::emplace_copy()
and in the other direction. In that case util::Blob::assign_copy()
and all single-buffer util::Blob methods should probably be similarly generalized (overloaded). const std::string&
to String_view
in all of flow::perf where applicable. Look into string&&
also. S_REAL_HI_RES
would always be preferable. Nevertheless it would be interesting to "officially" see its characteristics including in particular (1) resolution and (2) its own perf cost especially vs. S_REAL_HI_RES
which we know is quite fast itself. This may also help a certain to-do listed as of this writing in the doc header of flow::log FLOW_LOG_WITHOUT_CHECKING() (the main worker bee of the log system, the one that generates each log time stamp). operator*=(Duration_set)
by a potentially negative number; same for division. timed_function() overload exists for a single Clock_type
, but simultaneous multi-clock timing using the perf::Clock_types_subset paradigm (as used, e.g., in Checkpointing_timer) would be a useful and consistent API. E.g., one could measure user and system elapsed time simultaneously. As of this writing this does not exist only due to time constraints: a perf-niggardly version targeting one clock type was necessary.
timed_function(), when operating on an atomic<duration_rep_t>
, uses +=
for accumulation which may be lock-free but uses strict ordering; a version that uses fetch_add()
with relaxed ordering may be desirable for extra performance at the cost of not-always-up-to-date accumulation results in all threads. As of this writing this can be done by the user by providing a custom type that defines +=
as explicitly using fetch_add()
with relaxed ordering; but we could provide an API for this.
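A minimal sketch of that user-side workaround (invented name; not an existing Flow type):

  #include <atomic>
  #include <cstdint>

  class Relaxed_accumulator
  {
  public:
    void operator+=(std::uint64_t delta)
    {
      // Relaxed is enough for a timing accumulator: we only need the total to be
      // correct eventually, not to synchronize any other memory with readers.
      m_total.fetch_add(delta, std::memory_order_relaxed);
    }

    std::uint64_t total() const
    {
      return m_total.load(std::memory_order_relaxed);
    }

  private:
    std::atomic<std::uint64_t> m_total{0};
  };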
Tight_blob<Allocator, bool>
, which would be identical to Basic_blob but forgo the framing features, namely size() and start(), thus storing only the RAII array pointer data() and capacity(); rewrite Basic_blob in terms of this Tight_blob
. This simple container type has had some demand in practice, and Basic_blob can and should be cleanly built on top of it (perhaps even as an IS-A subclass). bind()
and others described above this line in the source code. Key
in m_value_list (in addition to the mandatory one in the lookup table m_keys_into_list_map). Perhaps the key copy would be replaced by an iterator back into m_value_list. A custom iterator class would be necessary to properly dereference this (this is non-trivial given that operator*()
would have to return a reference to a pair which is no longer stored anywhere in this hypothetical design). Moreover, iterators exposed to the user would become invalid the same way an unordered_map<>
iterator does due to seemingly unrelated changes. Finally, the memory savings would not even exist for Key
types roughly the size of a pointer. All in all, not a slam-dunk.... Lock_guard<Mutex_non_recursive>
possible. Remove it and all (outside) uses eventually. Lock_guard<Mutex_noop_shared_non_recursive>
possible. Remove it and all (outside) uses eventually. Shared_lock_guard<Mutex_noop_shared_non_recursive>
possible. Remove it and all (outside) uses eventually. Lock_guard<Mutex_recursive>
possible. Remove it and all (outside) uses eventually. Lock_guard<Mutex_shared_non_recursive>
possible. Remove it and all (outside) uses eventually. Shared_lock_guard<Mutex_shared_non_recursive>
possible. Remove it and all (outside) uses eventually. boost::shared_mutex
to std::shared_mutex
, as the former has a long-standing unresolved ticket about its impl being slow and/or outdated (as of Boost-1.80). However see also the note on util::Mutex_shared_non_recursive that explains why it might be best to avoid this type of mutex altogether in the first place (in most cases). boost::random::random_number_generator
, and should be written to conform to it, so as to be usable in std::random_shuffle()
, and also to gain its appealing extra features without losing any of the already-available simplicity. Note, however, that std::random_shuffle()
is deprecated (and gone in C++17) in favor of std::shuffle()
, which is available from C++11 on, and which works directly with a formal URNG such as Rnd_gen_uniform_range_base::Random_generator. So this to-do is less about that and more about gaining those other features while being suitable for random_shuffle()
, which some code does still use out there. We could eliminate schedule_task_from_now()'s potential limitation versus Timer, wherein each call constructs (internally) a new Timer. A pool of Timer
s can be internally maintained to implement this. This may or may not be worth the complexity, but if the API can remain identically simple while cleanly eliminating the one perf-related reason to choose Timer over this simpler facility, then that is a clean win from the API user's point of view. By comparison, other possible improvements mentioned complicate the API which makes them less attractive.
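A hedged sketch of such an internal pool (invented names; real integration with schedule_task_from_now() would be more involved):

  #include <boost/asio.hpp>
  #include <memory>
  #include <vector>

  class Timer_pool_sketch
  {
  public:
    explicit Timer_pool_sketch(boost::asio::io_context& ctx) : m_ctx(ctx) {}

    // Hand out an idle timer if one exists; otherwise construct a fresh one.
    std::unique_ptr<boost::asio::steady_timer> acquire()
    {
      if (m_idle.empty())
      {
        return std::make_unique<boost::asio::steady_timer>(m_ctx);
      }
      auto timer = std::move(m_idle.back());
      m_idle.pop_back();
      return timer;
    }

    // Once a scheduled task has fired (or been canceled), return its timer for reuse.
    void release(std::unique_ptr<boost::asio::steady_timer> timer)
    {
      m_idle.push_back(std::move(timer));
    }

  private:
    boost::asio::io_context& m_ctx;
    std::vector<std::unique_ptr<boost::asio::steady_timer>> m_idle;
  };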
schedule_task_from_now() and the surrounding API provide an easy way to schedule a thing into the future, but it is built on top of boost.asio util::Timer directly; an intermediate wrapper class around this would be quite useful in its own right, so that all boost.asio features, including its perf-friendliness, would be retained while eliminating its annoyances (around canceling-but-not-really and similar). Then schedule_task_from_now() would be internally even simpler, while a non-annoying util::Timer would become available for more advanced use cases. echan may have such a class (in a different project) ready to adapt (called Serial_task_timer
). I believe it internally uses integer "task ID" to distinguish between scheduled tasks issued in some chronological order, so that boost.asio firing a task after it has been canceled/pre-fired can be easily detected.
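A hedged sketch of that task-ID detection idea (invented names; assumes the single-threaded event-loop use typical here):

  #include <boost/asio.hpp>
  #include <chrono>
  #include <cstdint>
  #include <functional>

  class Serial_task_timer_sketch
  {
  public:
    explicit Serial_task_timer_sketch(boost::asio::io_context& ctx) : m_timer(ctx) {}

    // Schedules task; logically supersedes anything scheduled earlier.
    void schedule(std::chrono::milliseconds from_now, std::function<void()> task)
    {
      const std::uint64_t this_task_id = ++m_task_id; // New generation.
      m_timer.expires_after(from_now);
      m_timer.async_wait(
        [this, this_task_id, task](const boost::system::error_code&)
        {
          // If cancel() or a newer schedule() happened meanwhile, the ID no longer
          // matches, so this stale firing is simply ignored -- no guessing based on
          // operation_aborted alone.
          if (this_task_id == m_task_id)
          {
            task();
          }
        });
    }

    void cancel()
    {
      ++m_task_id;      // Invalidate any in-flight handler...
      m_timer.cancel(); // ...and try to stop the timer itself too.
    }

  private:
    boost::asio::steady_timer m_timer;
    std::uint64_t m_task_id = 0; // Not thread-safe; fine for a single-threaded loop.
  };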
flow::util::Shared_ptr_alias_holder Const_target_ptr
is deprecated and shall always be left at its default value in future code; eventually remove it entirely and hard-code the default value internally.
Add example usage snippets in Shared_ptr_alias_holder doc header, illustrating the pattern.
alt_sstream
in boost.format; it doesn't seem to have public documentation but isn't in a detail
directory either. So that's interesting. It might have better performance than the implementation here (probably by being more "direct"). Then again it might not. Upgrade to newer Boost and keep testing timer resolutions on the above and other OS versions. Update: As of this writing we are on Boost 1.74 now. Needless to say, it is time to reassess the above, as it has been years since 1.50 – and we had been on 1.63 and 1.66 for long periods of time, let alone 1.74. Update: No changes with Boost 1.75, the newest version as of this writing.
An alternative copy-free implementation of the asynchronous FLOW_LOG_DO_LOG() work-flow is possible. The basic idea of using a thread-local non-reallocating work string is just fine in the non-logs_asynchronously()
(i.e., synchronous) Logger flow. In the asynchronous flow, however, it involves an added message copy. Instead – as done in certain major server software the author is familiar with – one could (perhaps in the async flow only) allocate a new string in each FLOW_LOG_DO_LOG() (in rare cases reallocating, even repeatedly, if more space is needed); then pass that pointer around, until it is asynchronously written out by the Logger impl; then deallocate it. Thus, a copy is eliminated in the async work-flow. A complicating factor is that the current system maintains format state across separate log call sites in a given thread; this change would (if naively implemented at least) eliminate that feature – but this could be considered acceptable. (Still, do realize that, for example, in today's feature set one can set the chrono
I/O formatting to show short unit names like 2ms
instead of 2 milliseconds
; and one need do it only once; but with this change one would need to do it in every log call site. That would be, I can attest, rather annoying. Additionally, make sure the behavior is consistent in the sync and async work-flows.) A related wrinkle is that if we add special support for printf()
-style log call sites (a to-do in flow::log doc header as of this writing), then since there is no expectation of such format statefulness there in the first place, that particular concern doesn't apply to that flow. (Sub-to-do: If one did this, an extra-optimized further idea is to avoid the repeated allocs/deallocs by maintaining a pool of already-allocated buffers to be reused opportunistically.) Bottom line: I claim the existing thing is pretty good; the extra copy is IMO unlikely to affect real performance, because (1) it's only one copy in the context of quite a bit of similar copying and other ops going on while writing out the string; and (2) if the message is so verbose as to affect performance at all, then it will probably affect it regardless of the extra copy (in other words, its verbosity must be increased, or the filter verbosity must be decreased – avoiding this extra internal copy feels in my [ygoldfel] personal opinion like rearranging deck chairs on the Titanic). So, this to-do should probably be to-done at some point, but it doesn't feel urgent. And since there are quite a few subtleties involved, as shown above, it's natural to resist doing it until truly necessary.
Time stamp source: The current implementation uses the system clock to generate time stamps (a/k/a POSIX time), but consider optionally or mandatorily using the high-resolution clock instead (or additionally?). This has pros and cons; all or most time stamps elsewhere use the system clock also, so this allows for easy cross-referencing against other systems and logs. There's also the question of how to express an absolute time, as usually the high-resolution clock starts at system startup – not as humanly useful as a "calendar" (UTC) time, which – while useful humanly – is not steady/monotonic/etc. There is no reasonable conversion between Fine_clock::now()
and a calendar time (boost.chrono docs say this unequivocally which is a confirmation). The pros include: (1) higher precision and resolution; (2) that time always flows forward and at a uniform rate without possibility of time going back or forward due to human/otherwise clock sets or rare events like daylight savings and leap seconds; or (3) to summarize, something more suited as a quick-and-dirty way to measure how long things take in the program, like an always-on, log-integrated version of perf::Checkpointing_timer. As of this writing all out-of-the-box log::Logger implementations and log::Config allow the output of human-readable as well as sec.usec-from-Epoch time stamps. One approach might be to replace the latter only with the high-rez clock's time stamps, perhaps optionally, while leaving the human-readable one alone. Note: There is an important test to perform here, which is the time cost of obtaining either time stamp. E.g., the high-rez time stamp might be much slower – or maybe the opposite! To test this, (1) add the POSIX-time clock into the perf::Clock_type enum
, with all associated (fairly straightforward) changes in flow::perf
; and (2) test the perf characteristics of this new clock. Certain code exists outside of Flow itself that already automatically tests all Clock_type
s, so it would quickly give the answer. (Secondary to-do: Be less vague about where this program resides, in this comment. I, ygoldfel, have the answer at any rate and am only omitting it here for boring process reasons.)
Time stamp subtlety: It might not be crazy to log not just the time stamp of entry to this macro but also some indication how long it took to evaluate the rest and finally output it to the ultimate device/whatever. It could be an optional differential or something, like "timestamp+diff," with diff in microseconds or something.
<<
and by implementing this as a variadic function template (C++11 feature). get_logger()
and get_log_component()
return values in such a way as to be reused by the FLOW_LOG_WITHOUT_CHECKING() invoked by the former macro if should_log() == true
. As it stands, they are called again inside the latter macro. In the most-common case, wherein Log_context is used for those two expressions, this should get inline-optimized to be maximally fast anyway. With FLOW_LOG_SET_*()
, though, it might be a bit slower than that. Technically, one can make their own get_logger
and get_log_component
identifiers that might do something slower still – though it's unlikely (and as of this writing unprecedented). So I would not call this pressing, but on the other hand... just do it! The implementation code will be somewhat hairier though.