Flow 1.0.2
Flow project: Full implementation reference.
Todo List
File doc-coding_style.cpp

In all of Flow, scan for the unintentional use of the period character in auto-brief summaries in many doc headers. A period will be assumed to be the end of the brief summary. We've been disciplined about this, but I (ygoldfel) just realized that, e.g., "boost.asio" has a period and yet is present in at least some doc headers. Fix those in some consistent yet palatable way, so that the Doxygen output for those isn't prematurely truncated. (Check that Doxygen isn't smart enough to figure it out already. If it is, delete this to-do.)

A script/tool/something that would auto-scan the code for violations of cosmetic conventions described in doc-coding_style.cpp. It need not be perfect or comprehensive. Humans tend to have difficulty following lots of conventions, and violations then clutter up subsequent code reviews.

There are a few little to-dos inside the Doxygen-skipped main body of doc-coding_style.cpp; check them out.

Namespace flow

std::string_view is a way to pass things around similarly to const std::string& without requiring that a string be created just for that purpose; it has a highly similar API but can be constructed from any character sequence in memory and internally stores nothing more than a pointer and length; we should use it wherever possible (within reason) instead of const std::string&. Much code now uses String_view; the remaining to-do is: scour the rest of the code for possible const string&s to convert and indeed convert those to util::String_view.
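For illustration, the conversion pattern being requested – a minimal sketch in which log_name() and the alias stand-in are hypothetical (util::String_view is the real Flow alias):

    #include <cstddef>
    #include <string>
    #include <string_view>
    namespace util { using String_view = std::string_view; }  // stand-in for Flow's real alias

    void log_name(util::String_view name);  // after: was void log_name(const std::string& name)

    void caller()
    {
      const char* buf = "character sequence, possibly not NUL-terminated";
      std::size_t buf_len = 9;
      log_name("literal");                        // no std::string temporary constructed
      log_name(std::string("dynamic"));           // std::string converts implicitly
      log_name(util::String_view(buf, buf_len));  // arbitrary pointer + length works too
    }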

Inline things: Or just use gcc -O3 (and non-gcc equivalents) for prettier/faster-to-compile code? The latter is definitely tempting if it works sufficiently well. So far we've been using gcc -O3 and equivalents, and it seems to be working well (turning off inlining results in huge performance losses). Still, I am not sure if it would be better to explicitly inline functions instead. Not having to do so definitely simplifies the code, so it is my great hope that the answer is no, and we can keep using gcc -O3 and equivalents. In that case delete this paragraph. (To be clear: gcc -O3 means that it ignores inline keyword and anything similar, including inlined method bodies inside class {} and struct {}. Instead it determines what to inline based on its own ideas on what will generate the fastest code (with reasonable code size). gcc -O2, on the other hand, will mostly inline things explicitly declared as such in code (again, via inline or inlining inside class bodies or other techniques).) Update: Now using clang (not gcc) with maximum auto-inlining AND FLTO (link-time optimization will allow inlining across object file boundaries) in at least some platforms. This should be close to as good as possible. Update: gcc auto-inlining+FLTO also works.

There are some boost.thread "interruption points" throughout the code, so we should investigate whether we must catch boost::thread_interrupted in those spots, or what...?

Be more specific (cite date) when writing "as of this writing." I use a rhetorical trick when commenting on the state of something that may not continue to be the case. Though I do avoid writing such things, often it is necessary; in that case I usually write "as of this writing" or something very similarly worded. That's fine and essentially the best one can do. It means technically the statement won't become false, even if the "sub-statement" (the thing that was true when written) does become false. However, obviously, to the reader of the comment at that later time, that's little consolation: they're still reading a possibly false statement and will want to know what the situation is THEN, or "as of the reading," so to speak. In order to at least try to be helpful, in those cases a date (numeric month/year – like 4/2017 – should be sufficient in most cases) should be supplied. The to-do is to convert all existing "as of this writing" instances to include a date – and to always add a date when writing new instances of "as of this writing." The to-do can be removed once the conversion is completed. Example: this to-do has not been completed as of this writing (11/2017). (Side note: possibly goes without saying, but one is free to explain to an arbitrary degree of detail why something is true as of that writing, and how/why/when it might change. This to-do covers those cases where no such explanation is written. It would be impractically verbose to get into speculative detail for every as-of-this-writing instance; so at least a date should thus be inserted.)

The comments (as of this writing, all written by me, ygoldfel) in this library could use an edit to make them briefer. (I've found even a self-edit by me, with that express purpose, often does wonders.) Background: I write very extensive comments. I am quite convinced this is superior (far superior even) to next-to-no comments; and to the average amount of comments one tends to see in production code. That said, the top code review feedback I receive (not here specifically but in general) is that my comments tend to be too "discursive" (consisting of discourse) and/or at times unnecessarily in-depth. Discursive = as if I am talking to the reader (many prefer a terser, more formal style). Too in-depth = tends to go into history, related issues, future work, etc. (these elements can remain but can be cut down significantly without losing much substance). In other words, there should be a happy middle in terms of comment volume, and this can be achieved by a comment edit run by Yuri or someone else (if reviewed by Yuri). To be clear, I believe this middle ground is to be closer to the status quo than to the average amount of comments in other projects.

Possibly document exceptions thrown explicitly via the Doxygen keyword meant for this purpose: "@throws". Currently when we document explicitly throwing an exception, it is ALWAYS a flow::error::Runtime_error encapsulating a flow::Error_code (which is an int-like error code). This is very explicitly documented, but technically Doxygen has a keyword which will generate a special little readout for the exception (similarly as for each parameter, etc.). We don't use that keyword. We probably should, though this isn't totally cut-and-dried. Consider that we already document the exception on the err_code parameter in every case; so no information would really be gained (only arguably nicer formatting). On the other hand, the code would be somewhat more verbose (and boiler-platey, since each already boiler-platey err_code comment snippet would essentially grow in size). Furthermore, if we document this explicit exception, one might say it behooves us to now document all the other possible sources of exceptions such as std::bad_alloc when running out of heap memory. Perhaps then we have to talk about constructor-throwing-exception behavior and other C++ technicalities to do with exceptions. Do we really want to enter that land? I think maybe not; consider just leaving it alone. Though, maybe I'm over-dramatizing the impact of adding a "@throws" section on our various flow::error::Runtime_error-throwing methods. Well, it's a to-do; decide later.
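For concreteness, the extra verbosity in question would look roughly as follows (method name and wording hypothetical):

    /**
     * ...rest of doc header...
     * @param err_code
     *        See flow::Error_code docs for error reporting semantics. <the existing boiler-plate>
     * @throws flow::error::Runtime_error
     *         If and only if `err_code` is null and an error occurs; the exception encapsulates
     *         the flow::Error_code that would otherwise have been emitted via `*err_code`.
     */
    void some_operation(Error_code* err_code = 0);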

Since Flow gained its first users beyond the original author, some Flow-adjacent code has been written from which Flow can benefit, including a potential io module/namespace for general networking/local I/O. (Flow itself continued to be developed, but some features were added elsewhere for expediency; this is a reminder to factor them out into Flow for the benefit of all.) Some features to migrate here might be: boost.asio extensions to UDP receive APIs to obtain receipt time stamps and destination IP (recvmsg() with ancillary data extensions) and to receive multiple datagrams in one call (recvmmsg()); boost.asio-adjacent facility to add custom socket options similarly to how boost.asio does it internally; boost.asio support for (local) transmission of native socket handles and security data over stream sockets (SCM_RIGHTS, etc.).

As of this writing the exact nature of where the project will permanently reside (and who will maintain it vs. use it) is in flux. Therefore for now I have removed the section covering certain topics and replaced it with the to-do you're reading. This should be undone when things settle down (obviously ensuring the brought-back section is made accurate). The topics to cover: "@author" (including contact info); GitHub/other address indicating where to browse the project source; link(s) to the latest auto-generated web docs (if any); a section on the history of the project; and licensing info (if deemed needed) or pointer to it. (Reminder: Also update any similar sections of the historically significant net_flow::Node doc header.)

One space after period, not two: For some reason in this project I've been following the convention – in comments and (I think) log messages – of two spaces after a sentence-ending punctuator (usually period) before the next sentence's first character. I now regret trying this annoying convention. Go back to one space. (For the record, it does look sort of nice, but that's a matter of opinion, and single space looks fine too... AND doesn't confuse various editors' auto-formatting facilities, among other problems.)

We use 0 instead of NULL or nullptr when needing a null pointer; perhaps we should use the latter. NULL is an anachronism from C, so we shouldn't use it. nullptr is at least no worse than 0, other than being less concise. However, the main reason it exists – to avoid ambiguities in function overloading (e.g., when something could take either an int or a char*, nullptr would resolve to the latter, while 0 probably unintentionally to the former) – is not a situation our style ever invokes, to my knowledge, so using nullptr would not solve any actual problems. However, it could be argued that using it more readily calls attention to the use of a pointer, as opposed to an integer, in the particular context at play. So it's something to consider (but, no doubt, the conversion process would be laborious, as there's no simple search-replace that would work).
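The overload-ambiguity scenario referenced above, for the record:

    void f(int i);    // (a)
    void f(char* p);  // (b)

    void caller()
    {
      f(0);        // compiles but calls (a) -- even if a null pointer was intended
      f(nullptr);  // unambiguously calls (b)
      f(NULL);     // depends on how NULL is defined; possibly ambiguous -- another reason to avoid NULL
    }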

= default for copy constructors and copy operators is now used in a few places; consider spreading this C++11 feature everywhere it's being done implicitly due to C++03 rules (good documentation practices suggest declaring them explicitly but of course leave the implementation to the compiler default, gaining best of both worlds – proper class API docs yet maintenance-friendly default body).
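A minimal sketch of the advocated pattern (class name hypothetical):

    /// Some Flow-style class whose special member functions deserve doc headers.
    class Widget
    {
    public:
      /// Copy constructor: present in the generated docs, yet the compiler writes the body.
      Widget(const Widget& src) = default;

      /// Copy assignment: same best-of-both-worlds deal.
      /// @return `*this`.
      Widget& operator=(const Widget& src) = default;
    };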

Consider PIMPL and related topics. Recommend scouring Boost docs, particularly for the smart pointer library, which discuss how to potentially use smart pointers for easy PIMPLing. In general, research the state of the art on the topic of library interface vs. implementation/hiding.
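As a reminder of the basic shape under discussion – a minimal PIMPL sketch (names hypothetical), with a smart pointer along the lines the Boost docs discuss:

    // widget.hpp: interface header; no implementation details leak out of it.
    #include <memory>
    class Widget
    {
    public:
      Widget();
      ~Widget();  // must be defined in widget.cpp, where Impl is a complete type
      void do_stuff();
    private:
      class Impl;                    // forward-declared here; defined only in widget.cpp
      std::unique_ptr<Impl> m_impl;  // a shared_ptr instead would buy copyability, at a cost
    };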

Return-by-copy binary operators of the form T operatorBLAH(const T& x1, const Some_type& x2) should be written as free functions instead of T members. I don't recall at this point why, but this tends to be recommended and done in STL and Boost. Maybe check Meyers' Effective C++ book on the theory; and if it makes sense find all such operators written as members and change them to be free functions. Should this be avoided if it requires friend, though? Lastly, for Doxygen, use the relatesalso T command to link the free function to the class T in the documentation.
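A sketch of the target form (T, Some_type, and the choice of operator+ are placeholders); note that implementing on top of a public compound-assignment member sidesteps the friend question:

    /// Free function, not a member of T.
    /// @return A new object equal to `x1` combined with `x2`.
    /// @relatesalso T
    T operator+(const T& x1, const Some_type& x2)
    {
      T result(x1);
      result += x2;  // delegate to the member T::operator+=(), which has whatever access it needs
      return result;
    }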

In many (most) cases we pass shared_ptrs (and their aliases) by value when it would be more performant to pass by const reference; at times it is possibly better still to pass by raw pointer. Scott Meyers/Herb Sutter have opined on this to basically indicate that (1) it is often best to use a raw pointer, unless actual copying/ownership status is expected; but failing that (2) it is often best to use a const& when safe; and failing that passing by value is fine. This need not be a dogmatic change, but we should be more mindful than simply always passing by value. When searching for instances to potentially change, check for shared_ptr, Ptr, and _ptr tokens.
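That guidance, expressed as signatures (Widget is hypothetical; Ptr stands for the usual Flow-style shared_ptr alias):

    void observe(Widget* widget);                 // (1) no ownership involvement: raw pointer (or reference)
    void maybe_retain(const Widget::Ptr& widget); // (2) copies the pointer only conditionally: const ref, no refcount churn
    void retain(Widget::Ptr widget);              // (3) unconditionally takes shared ownership: by value, then move into place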

Namespace flow::async
The thread-to-core optimizations provided at this time are, at least, a good start, but more advanced logic can be devised with more low-level experience and/or by using certain open-source libraries. It's possible that a more knowledgeable person would devise more or better knobs and/or require less manual specification of values. The following background reading may help devise more advanced logic and/or knobs: [ https://eli.thegreenplace.net/2016/c11-threads-affinity-and-hyperthreading/ | https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.2016.07.31a.pdf | https://lwn.net/Articles/255364/ | "hwloc" library (portable lib for detailed hardware topology info) | libNUMA ].
Member flow::async::Concurrent_task_loop::schedule_from_now (const Fine_duration &from_now, Scheduled_task &&task)=0
Deal with the scheduled-tasks-affected-from-outside-loop corner case of the Concurrent_task_loop::schedule_*() APIs. Perhaps add bool in_loop_use_only arg which, if false, will always disable the single_threaded optimization internally. At this time it always enables it if n_threads() == 1 which will cause thread un-safety if the returned handle is touched from outside an in-loop task. void versions of the schedule_*() APIs should be added which would lack this, as in that case there is no handle to misuse outside the loop.
Member flow::async::Concurrent_task_loop::start (Task &&init_task_or_empty=Task(), const Thread_init_func &thread_init_func_or_empty=Thread_init_func())=0
Concurrent_task_loop::start() has an optional thread-initializer-function arg; it could be reasonable to add a thread-finalizer-function arg symmetrically. As of this writing there is no use case, but it's certainly conceivable.
Member flow::async::Concurrent_task_loop::stop ()=0
boost.asio has advanced features that might help to mark certain tasks as "must-run if already queued, even if one `stop()`s"; consider providing user-friendly access to these features, perhaps in the context of the existing Concurrent_task_loop::stop() API. These features are documented as of this writing at least in the io_context doc page.
Class flow::async::Cross_thread_task_loop
Add dynamic configurability of low-level thread/core behavior of Cross_thread_task_loop, Segregated_thread_task_loop, and the Concurrent_task_loop interface generally, as these parameters (including n_threads_or_zero, est_hw_core_sharing_helps_algo, est_hw_core_pinning_helps_algo) can only be set at construction time even though start() and stop() can be invoked anytime. For instance a non-virtual configure() method could be added to each Concurrent_task_loop subclass, potentially with different signatures. Note, also, that the decision to have stop() exit the actual threads, in addition to making some things easier to code/reason about internally, is also more amenable to these dynamic changes – messing with threading behavior dynamically is easier if one can exit and re-spawn the thread pool.
Member flow::async::optimize_pinning_in_thread_pool (flow::log::Logger *logger_ptr, const std::vector< util::Thread * > &threads_in_pool, bool est_hw_core_sharing_helps_algo, bool est_hw_core_pinning_helps_algo, bool hw_threads_is_grouping_collated)
For the Darwin/Mac platform only: There is likely a bug in optimize_pinning_in_thread_pool() regarding certain low-level pinning calls, the effect of which is that this function is probably effectively a no-op for now in Macs. The bug is explained inside the body of the function.
Member flow::async::S_ASYNC_AND_AWAIT_CONCURRENT_COMPLETION
Much like the promise/future mechanism provides optional timed wait functionality, it might make sense to provide the API ability to set an optional time limit for any wait invoked by Synchronicity::S_ASYNC_AND_AWAIT_CONCURRENT_COMPLETION or Synchronicity::S_ASYNC_AND_AWAIT_CONCURRENT_START. Probably best to add this only once a need clearly arises though.
Class flow::async::Segregated_thread_task_loop

In Segregated_thread_task_loop, when randomly selecting a thread, consider trying to bias in favor of threads with less load (perhaps measured as the # of tasks enqueued in a given thread's Task_engine); and/or round-robin thread assignment; and/or other heuristics.

In Segregated_thread_task_loop, when dealing with the 2-arg post() (the one taking an async::Op and an async::Task) and similar schedule_*() methods – consider postponing the random thread selection until the last possible moment, as opposed to in create_op(). Given that Segregated_thread_task_loop is legacy-ish in nature, the resulting significant increase in internal complexity may or may not be worth accepting. This to-do also makes little practical difference without the nearby to-do that would bias in favor of less-loaded threads.

Member flow::async::Single_thread_task_loop::schedule_from_now (const Fine_duration &from_now, Scheduled_task &&task)
Deal with the scheduled-tasks-affected-from-outside-loop corner case of the Single_thread_task_loop::schedule_*() APIs. Perhaps add bool in_loop_use_only arg which, if false, will always disable the single_threaded optimization internally. At this time it always enables it which will cause thread un-safety if the returned handle is touched from outside an in-loop task. void versions of the schedule_*() APIs should be added which would lack this, as in that case there is no handle to misuse outside the loop.
Class flow::cfg::Config_manager< S_d_value_set >
Dynamic config support lacks crash rejection, though this feature should be an incremental improvement around the existing code. By crash rejection I mean something like: config file X comes in; we rename it to X.unparsed; we try to parse it; program crashes – versus it's fine, so we rename it X.parsed, and next time X.parsed is what we presumably-safely parse after any restart. Similarly invalid config may not cause a crash but still shouldn't be repeatedly re-parsed (etc.). Exact design TBD and will be informed by previous work on other projects (by echan and ygoldfel at least).
Member flow::cfg::Config_manager< S_d_value_set >::apply_static (const fs::path &static_cfg_path, const typename Final_validator_func< Value_set >::Type &... final_validator_func, bool commit=true)
Add support for command line as a config source in addition to file(s), for static config in cfg::Config_manager and cfg::Static_config_manager.
Class flow::cfg::Dynamic_cfg_context< Root, Target, Target_ptr >
flow::cfg::Dynamic_cfg_context Target_ptr is deprecated and shall always be left at its default value in future code; eventually remove it entirely and hard-code the default value internally.
Class flow::cfg::Option_set< Value_set >
Add individual-option-changed detection API(s) in addition to the existing Option_set overall-value-set-changed detection API. This is contingent on a use case needing it. The existing code already detects this internally and logs a message for each changed option (in canonicalize_candidate()).
Member flow::cfg::Option_set< Value_set >::Declare_options_func_args::m_args
union used inside Option_set implementation should be replaced with the type-safe, equally-fast, more-expressive std::variant. The author was not aware of the latter's existence when originally writing this.
Member flow::cfg::Static_config_manager< Value_set >::apply (const fs::path &cfg_path, const typename Final_validator_func< Value_set >::Type &final_validator_func)
Add support for command line as a config source in addition to file(s), for static config in cfg::Config_manager.
Namespace flow::log

Lacking feature: log message rate-limiting. This could actually mean a few things. One is being able to rate-limit the messages produced by a given log call site per unit time; this might be the minimum for this to-do. This could be done by a dirty macro hack; or it could be done in more civilized fashion (while minding perf) by configuring it by message ID (but message IDs are optional and not implemented as of this writing anyway). Another rate-limiting thing is general controls on frequency of logging; though arguably that is more applicable to individual Logger implementations rather than as a mandatory-general mechanism.

Lacking feature: boost.format-style logging call sites. This is conceptually similar to – and possibly even entirely subsuming in backwards-compatible fashion – the to-do just above for printf()-style logging. Likely both to-dos should be designed/implemented in one shot. Moreover, it is possible (and would be absolutely delightful if true) that boost.format's format class can be used entirely transparently on top of the ostream-focused existing flow::log API (such as FLOW_LOG_WARNING() and buddies)!
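Regarding the transparent-use hypothesis: boost::format objects are ostream-insertable, so something like the following may already work today with zero flow::log changes; verifying that would be a natural first step (my_name being a std::string):

    #include <boost/format.hpp>
    FLOW_LOG_WARNING(boost::format("Hello, Mr. %s!") % my_name);  // the format result streams into the ostream fragment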

Lacking feature: printf()-style logging call sites, in contrast to the currently supported ostream fragment call sites. So, like, FLOW_LOG_WARNING_FMT("Hello, Mr. %s!", my_name.c_str()) could be used analogously to FLOW_LOG_WARNING("Hello, Mr. " << my_name << "!") (with std::string my_name). In terms of implementation, there is more to this than meets the eye perhaps; as ostreams potentially store (formatting) state between totally separate invocations of ostream<<, whereas printf()-style functions normally do not. Internally, this likely allows for specially optimized logging code.

Lacking feature: message IDs. This is discussed a bit more in the log::Msg_metadata doc header.

There are third-party logging systems including boost.log. The to-do is to investigate using boost.log, partially or entirely replacing the manually implemented system. (To be honest I am rather partial to the simple but effective system already implemented, in particular the performance-focused trickery.) Boost is invaluable in many ways, and Flow modules use it extremely extensively, but my intuition says that in this case it may be irrelevant. Meanwhile, boost.log can, of course, be used within user's Logger interface implementation, but that's not what this to-do is about; it's about actually placing it all over Flow's implementation and APIs in place of Logger, log::Sev, etc. (Also, after a 5-minute glance at boost.log, I am noticing that some of its ideas I seem to have independently and unknowingly replicated here. It is, however, simple stuff, not rocket science.) My intuition is I hesitate to force-marry the Flow user to boost.log; the very simple Logger (abstract) interface is seemingly a gentler thing to force the Flow user into, at least relatively speaking.

Lacking feature: compiler hints for optimizing away log filter checks. This is inspired by a certain other proprietary logging API in which we noticed attempts to give hints to the compiler as to how likely or unlikely the verbosity check is to return a true or false value, seemingly so that the compiler might optimize out the check and hence the entire log statement in some conditions; or always log others and skip the check in the optimized code. I omit any details here, but that's the general idea. I don't know how effective this is given that verbosities are configurable dynamically potentially; but look into it.
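A hedged sketch of the general idea using the gcc/clang __builtin_expect intrinsic (the macro and filter-check names are invented for illustration; C++20's [[likely]]/[[unlikely]] attributes are the standardized equivalent):

    #define HYPOTHETICAL_LOG_TRACE(logger, ...) \
      do \
      { \
        /* Hint: at TRACE verbosity the filter check usually fails; keep the emit path cold. */ \
        if (__builtin_expect(bool((logger)->should_log_trace()), false)) \
        { \
          /* ...format and emit the message... */ \
        } \
      } while (false)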

Class flow::log::Async_file_logger
Lacking feature: Compress-as-you-log in Async_file_logger. So, optionally, when characters are actually written out to the file system, gzip/zip/whatever them instead of writing plain text. (This is possible at least for gzip.) Background: It is commonplace to compress a log file after it has been rotated (e.g., around rotation time: F.log.1.gz -> F.log.2.gz, F.log -> F.log.1 -> F.log.1.gz). It is more space-efficient (at least), however, to write to F.log.gz directly already in compressed form; then rotation requires only renaming (e.g.: F.log.1.gz -> F.log.2.gz, F.log.gz [already gzipped from the start] -> F.log.1.gz).
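The gzip half, at least, is straightforward via zlib's gz* API; a proof-of-concept sketch (helper name hypothetical; error handling omitted):

    #include <string>
    #include <zlib.h>

    void append_compressed_line(const std::string& log_line)
    {
      gzFile out = gzopen("F.log.gz", "ab");  // compressed from the start; rotation becomes a mere rename
      gzwrite(out, log_line.data(), unsigned(log_line.size()));
      gzflush(out, Z_SYNC_FLUSH);  // costs some compression ratio but keeps the file readable mid-stream
      gzclose(out);
    }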
Member flow::log::Async_file_logger::Async_file_logger (Logger *backup_logger_ptr, Config *config, const fs::path &log_path, bool capture_rotate_signals_internally)
Consider adding Async_file_logger constructor option to overwrite the file instead of appending.
Member flow::log::Async_file_logger::log_flush_and_reopen (bool async=true)
Async_file_logger::log_flush_and_reopen(true) is great for flushing, such as in an abort-signal handler, but providing just the flushing part without the reopening might be useful. At the moment we've left it this way, due to the vague feeling that closing the file upon flushing it is somehow more final and thus safer (in terms of accomplishing its goal) in an abort-signal scenario. Feelings aren't very scientific though.
Member flow::log::Async_file_logger::m_serial_logger
Async_file_logger::m_serial_logger (and likely a few other similar unique_ptr members of other classes) would be slightly nicer as an std::optional<> instead of unique_ptr. optional was not in STL back in the day and either did not exist or did not catch our attention. unique_ptr in situations like this is fine but uses the heap more.
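The gist of the change, with the member shown in isolation (initialization would change from make_unique-style to emplace-style):

    std::unique_ptr<Serial_file_logger> m_serial_logger;  // today: separate heap allocation + pointer indirection
    std::optional<Serial_file_logger> m_serial_logger;    // desired: in-place storage, same delayed-construction ability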
Class flow::log::Config
Class Config doc header is wordy and might be hard to follow; rewrite for clarity/flow/size.
Member flow::log::Config::Config (Config &&)=delete
Reconsider providing a Config move constructor. I just didn't need to deal with it.
Member flow::log::Config::operator= (Config &&)=delete
Reconsider providing a Config move assignment. I just didn't need to deal with it.
Member flow::log::Config::operator= (const Config &)=delete
Reconsider providing a Config copy assignment. I just didn't need to deal with it.
Member flow::log::Logger::set_thread_info (std::string *call_thread_nickname, flow::util::Thread_id *call_thread_id)
It would be more consistent to rename set_thread_info() to this_thread_set_info(), since it operates in thread-local fashion. This was a naming convention oversight.
Member flow::log::Logger::set_thread_info_in_msg_metadata (Msg_metadata *msg_metadata)
It would be more consistent to rename set_thread_info_in_msg_metadata() to this_thread_set_info_in_msg_metadata(), since it operates in thread-local fashion. This was a naming convention oversight.
Member flow::log::Logger::this_thread_set_logged_nickname (util::String_view thread_nickname=util::String_view(), Logger *logger_ptr=0, bool also_set_os_name=true)
this_thread_set_logged_nickname() could take multiple Loggers, since it is an app-wide function and potentially would want to be reflected in multiple loggers. It could take a pair of iterators or a container (in both cases, of template argument type). Arguably, though, this is overkill; as (1) most code paths deal with only one get_logger()-owning object each; (2) naming of a thread is obvious enough in subsequent logs (hence this message only calls attention to that operation plus notes the thread ID before the nickname assignment, not necessarily the most critical of information). Certainly this zero-to-one-Logger version must continue to be available for syntactic-sugary convenience, even if the to-do is performed.
Class flow::log::Msg_metadata
Add support in Msg_metadata for a message ID which could more straightforwardly help the human log reader to map a log line to the originating log call site in source code. One approach, then, might be to output that message ID when available; else output m_msg_src_file, m_msg_src_line, m_msg_src_function; or maybe both.
Class flow::log::Serial_file_logger
Consider having Serial_file_logger internally buffer any attempted log requests that it couldn't write to file due to I/O error. The logic already attempts re-open repeatedly but doesn't attempt to re-log failed lines.
Class flow::net_flow::Ack_packet
Conceivably, since Ack_packet actually stores 1+ acknowledgments, it should become Acks_packet, of type ACKS (not ACK). Many comments and log messages would become clearer, as no one would assume an individual packet's acknowledgment when reading "an ACK" or similar phrase.
Member flow::net_flow::Ack_packet::ack_delay_t
Reconsider the encoding width. If Ack_delay_time_unit(1) is a nanosecond, then 32 bits would support a maximum delay of ~4.3 seconds (2^32 ns), which is likely fine for most real-world scenarios. This would reduce the size of ACK packets quite a bit.
Class flow::net_flow::asio::Node

The sync_*() early-destruction functionality explained in "Thread safety of destructor" above could be extended to async_*(). This is not a huge deal but would be nice for consistency. Not allowing it currently is really a result of internal implementation concerns (it is easier to not account for this corner case). Alternatively, maybe we should go the other way and not support the somewhat-weird corner case in sync_*().

To enable reactor-style async_*() operations, meaning waiting for readability/writability/etc. but not performing the actual operation before calling the user handler, we provide a null_buffers-style interface; like newer boost.asio versions we should deprecate this in favor of simpler async_wait() APIs. This would apply to net_flow::asio::Peer_socket and net_flow::asio::Server_socket APIs. Similarly consider doing this for the sync_*() operations in the non-asio superclasses net_flow::Peer_socket and net_flow::Server_socket. Note that Event_set::async_wait() and Event_set::sync_wait() already exist; they are powerful but a bit complex to use in these simple situations. Hence the following hypothetical wrappers would be welcome replacements for the deprecated null_buffers and "reactor-style" APIs in these classes: net_flow::Peer_socket::sync_wait(Event_set::Event_type), net_flow::asio::Peer_socket::async_wait(Event_set::Event_type), net_flow::Server_socket::sync_wait(), net_flow::Server_socket::async_wait(). (Timeout-arg versions would be desired also, as they exist now.)

Class flow::net_flow::Congestion_control_strategy
Tuck away all congestion control-related symbols into new namespace cong_ctl?
Class flow::net_flow::Drop_timer
Drop_timer has data members that are references to non-const. These should be pointers for max consistency of style with other code in the project (and that style is encouraged for certain good reasons). Scour rest of code for other such data members as well.
Member flow::net_flow::error::Code
Consider adding specific subclasses of flow::error::Runtime_error for the key codes, like WAIT_INTERRUPTED and USER_TIMEOUT. In doing so, make sure there's an automated syntactic-sugary way of doing this, so that one need only add a code<->class pair into some internal centralized mapping, automatically making the system emit the proper exception class, defaulting to plain Runtime_error if not specially-mapped.
Member flow::net_flow::Event_set::emit_result_sockets (Sockets *target_set, Event_type ev_type, Error_code *err_code=0)
Event_set::emit_result_sockets() sets a Sockets structure which stores boost::anys, each of which stores either a Peer_socket::Ptr or a Server_socket::Ptr; Sockets should be changed to store C++17 std::variants. Performance, both internally and externally, would improve by using this type-safe compile-time mechanism (which is akin to unions but much more pleasant to use). At the time this feature was written, Flow was in C++11, so variants were not available, and the author wanted to play around with anys instead of haxoring old-school unions. variant is much nicer, however, and the dynamic nature of any is entirely unnecessary here.
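A sketch of the proposed change (socket types per the above; alias shapes assumed for illustration):

    // Today: dynamically typed; mistakes surface only at run time via a failed any_cast.
    using Socket_as_any = boost::any;  // holds Peer_socket::Ptr or Server_socket::Ptr by convention only

    // Proposed: a closed, compile-time-checked set of alternatives.
    using Socket_as_variant = std::variant<Peer_socket::Ptr, Server_socket::Ptr>;

    // Access becomes exhaustive visitation instead of any_cast guesswork:
    Socket_as_variant sock;  // default-constructs to the first alternative (a null Peer_socket::Ptr)
    std::visit([](const auto& sock_ptr) { /* handle either socket type generically */ }, sock);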
Member flow::net_flow::Event_set::Event_handler
Perhaps async_wait() and other APIs of Event_set taking handlers should be templated on handler type for syntactic sugar. Though, currently, this would be no faster (it is internally stored as this type anyway and must be so), nor would it actually improve call code, which needs no explicit cast (an object of this type will implicitly be substituted as a conversion from whatever compatible-with-this-signature construction they used). So on balance, the current non-template approach appears superior. After all, writing non-template bodies is easier/nicer.
Member flow::net_flow::Event_set::Mutex
This doc header for Event_set::Mutex should specify what specific behavior requires mutex reentrance, so that for example one could reevaluate whether there's a sleeker code pattern that would avoid it.
Member flow::net_flow::Event_set::Sockets
Is it necessary to have Event_set::Sockets be aliased to util::Linked_hash_set? unordered_set would also work and take somewhat less memory and computational overhead. It would become unordered, instead of ordered chronologically, but that seems like a price possibly worth paying.
Class flow::net_flow::Low_lvl_packet
With C++11, some lines of code could be eliminated by having Low_lvl_packet sub-types inherit the constructor(s) of Low_lvl_packet via using-declarations. (For some of the sub-types an explicit constructor would still be necessary, but it would just shadow the inherited one, which is fine and still saves lines in the other sub-types.) However, for now I've left it as-is, partially to be friendly to the Doxygen documentation generator, and partially to make the interface easy to read. Still, it may be better the other way.
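The feature in question, sketched (the sub-type is a stand-in):

    struct Low_lvl_packet
    {
      explicit Low_lvl_packet(log::Logger* logger_ptr);
      // ...
    };

    struct Some_packet : Low_lvl_packet  // stand-in for an actual sub-type
    {
      using Low_lvl_packet::Low_lvl_packet;  // inherit ctor(s): no hand-written forwarding constructor
    };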
Member flow::net_flow::Low_lvl_packet::Const_buffer
Consider moving alias Low_lvl_packet::Const_buffer to util namespace or even outside it, as it is used commonly (or boost::asio::const_buffer is used where Const_buffer could be used for readability).
Class flow::net_flow::Net_env_simulator
One thing we should probably add for a bit of realism is the following. If a latency range [A, B] is specified, an individual packet's simulated latency will be a uniformly random number in that range. Because of this there will be frequent re-ordering of packets: if the range is [50, 100] ms, and we physically get packet X at time 0 and packet Y at time 5 ms, and the simulated latencies are 80, 60 ms, respectively, then Y will "arrive" 80 - 60 - 5 = 15 ms BEFORE X despite being sent later (assuming perfect actual network conditions under the simulation). In reality, packets aren't re-ordered all that often. So if we wanted to simulate a high latency without re-ordering, we'd have to set the range to [A, A] for some A. To avoid having to do that, we can add some internal memory into Net_env_simulator that would put a floor on a randomly generated latency. In the previous example, the range for packet Y would be chosen as [75, 100] (instead of [50, 100]) to guarantee that packet Y arrives at least a bit later than packet X. Of course that might skew the latency range in bursty traffic, but that might be an OK behavior. We could add a knob for how often to ignore this new logic and allow a re-ordering, so that that is also simulated.
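A sketch of that floor-keeping logic (member and helper names hypothetical; Fine_clock/Fine_duration/Fine_time_pt are the real Flow chrono aliases):

    Fine_duration simulated_latency(Fine_duration lo, Fine_duration hi)  // [lo, hi] = configured latency range
    {
      const Fine_time_pt now = Fine_clock::now();
      // m_last_arrival: simulated arrival time of the most recently "sent" packet.
      const Fine_duration floor = m_last_arrival - now;  // latency needed to land just after the predecessor
      if (floor > lo) { lo = floor; }                    // e.g., [50, 100] ms effectively becomes [75, 100] ms
      const Fine_duration latency = uniform_random_duration(lo, hi);  // uniformly random in [lo, hi]
      m_last_arrival = now + latency;
      return latency;
    }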
Class flow::net_flow::Node

Receive UDP datagrams as soon as possible (avoid internal buffer overflow): The OS UDP net-stack buffers arriving datagrams until they're recv()d by the application layer. This buffer is limited; on my Linux test machine the default appears to buffer ~80 1k datagrams. With a high sender CWND and high throughput (e.g., over loopback), thread W – which both reads off UDP datagrams and handles them, synchronously – cannot always keep up, and the buffer fills up. This introduces Flow loss despite the fact that the datagram actually safely got to the destination; and this is with just ONE sender; in a server situation there could be thousands. In Linux I was able to raise, via setsockopt(), the buffer size to somewhere between 1000 and 2000 1k datagrams. This helps quite a bit. However, we may still overflow the buffer in busy situations (I have seen it, still with only one connection). So, the to-do is to solve this problem. See below to-dos for ideas. WARNING: UDP buffer overflow can be hard to detect and may just look like loss; plus the user not reading off the Receive buffer quickly enough will also incur similar-looking loss. If there were a way to detect the total # of bytes or datagrams pending on a socket, that would be cool, but sock.available() (where sock is a UDP socket) just gives the size of the first queued datagram. Congestion control (if effective) should prevent this problem, if it is a steady-state situation (i.e., loss or queueing delay resulting from not keeping up with incoming datagrams should decrease CWND). However, if it happens in bursts due to high CPU use in the environment, then that may not help. NOTE 1: In practice a Node with many connections is running on a server and thus doesn't receive that much data but rather sends a lot. This mitigates the UDP-receive-buffer-overflow problem, as the Node receiving tons of data is more likely to be a client and thus have only one or two connections. Hence if we can handle a handful of really fast flows without such loss, we're good. (On the other hand, ACKs are also traffic, and a server may get a substantial amount of them. Much testing is needed.) Packet pacing on the sender side may also avoid this loss problem; on the other hand it may also decrease throughput. NOTE 2: Queue delay-based congestion control algorithms, such as FastTCP and Vegas, are highly sensitive to accurate RTT (round trip time) readings. Heavy CPU load can delay the recording of the "received" time stamp, because we call UDP recv() and handle the results all in one thread. Any solution, such as the dedicated thread proposed below, would allow one to record the time stamp immediately upon receipt of the packet by the dedicated thread; while W would independently handle everything else. Is that a good idea though? Maybe not. If the CPU load is such that the ACK-receiver side can't get to the time-stamp-saving, RTT-measuring step without tricks like doing it immediately upon some low-level datagram receipt hook, then that CPU-pegged jungle is, in a way, part of the network and should probably be fairly counted as part of the RTT. So perhaps we should continue to take the RTT time stamp while actually handling the individual acknowledgments. Instead we should strive to use multi-core resources efficiently, so that the gap between receipt (on whatever thread) and acknowledgment processing (on whatever thread) is as small as possible. Then RTT is RTT, but we make it smaller via improved performance.
Meanwhile, we hopefully also solve the original problem (internal kernel buffer overflowing and dropping datagrams).

Receive UDP datagrams as soon as possible (avoid internal buffer overflow): APPROACH 3: Suppose we take APPROACH 1 (no-brainer) plus APPROACH 2. Certain decisions in the latter were made for certain stated reasons, but let's examine those more closely. First note that the APPROACH 1 part will ensure that, given a burst of incoming datagrams, the first UDP recv() will occur somewhere inside boost.asio, so that's definitely a success. Furthermore, strand S will invoke m_low_lvl_sock.async_send() as soon as the pacing module decides to do so; if I recall correctly, boost.asio will invoke the UDP send() right then and there, synchronously (at least I wrote that unequivocally in a Node::async_low_lvl_packet_send_impl() comment). Again, that's as good as we can possibly want. Finally, for messages 2, 3, ... in that incoming datagram burst, our handler will (indirectly but synchronously) perform the UDP recv()s in strand S2. Here we're somewhat at boost.asio's mercy, but assuming its strand task scheduling is as efficient as possible, it should occur on the thread that's free, and either W1 or W2 should be free given the rest of the design. Still, boost.asio docs even say that different strands' tasks are NOT guaranteed to be invoked concurrently (though common sense implies they will be when possible... but WHAT IF WE'RE WRONG!!!?). Also, we don't know how much computational overhead is involved in making strands work so nicely (again, hopefully it's well written... but WHAT IF!!!?). A negative is the small mutex section around the two m_low_lvl_sock calls; not complex and probably hardly a performance concern, but still, it's a small cost. Finally, using strands – while not ugly – does involve a bit more code, and one has to be careful not to forget to wrap each handler with the appropriate strand (there is no compile- or runtime error if we forget!). So what can we change about APPROACH 2 to avoid those negatives? As stated in that approach's description, we could have thread W not deal with m_low_lvl_sock at all; thread W2 would have a separate Task_engine handling only m_low_lvl_sock (so no mutex needed). W2 would do both sends and receives on the socket; and absolutely nothing else (to ensure it's as efficient as possible at getting datagrams off the kernel buffer, solving the original problem). Yes, the receiving would have to share time with the sends, but assuming nothing else interferes, this feels like not much of a cost (at least compared with all the heavy lifting thread W does today anyway). Each receive would read off all available messages into raw buffers and pass those (sans any copying) on to thread W via post(W). The other negative, also already mentioned, is that once the pacing module (in thread W) decides that a datagram should be sent, the post(W2) for the task that would perform the send introduces a delay between the decision and the actual UDP send() done by boost.asio. Thinking about it now, it is questionable to me how much of a cost that really is. Without CPU contention, we can measure it; I expect it to be quite cheap, but I could be wrong. With CPU contention – which would have to come from many datagrams arriving at the same time – I'm not sure. It wouldn't be overly hard to test; basically flood with UDP traffic over loopback and log the delay between W deciding to send a datagram and W2 calling m_low_lvl_sock.async_send_to() (obviously use Fine_clock!).
All in all, if we name the dedicated thread approach described here as APPROACH 3, then APPROACH 3 is appealingly simpler than APPROACH 2; and in most ways appears like it'd be at least as efficient and good at solving the original problem as APPROACH 2. The only danger that worries me is this business with messing up pacing (and no, moving pacing into W2 just endangers the receiving efficiency and introduces thread safety problems and complexity) by having it compete with receiving during incoming-traffic-heavy times. Ultimately, I'd recommend timing this "danger zone" as described a bit earlier (also time delay introduced by post(W2) without any traffic coming in); and if it looks good, do APPROACH 3. Otherwise spring for APPROACH 2.

Receive UDP datagrams as soon as possible (avoid internal buffer overflow): APPROACH 2: To eliminate the problem to the maximum extent possible, we can dedicate its own thread – call it W2 – to reading m_low_lvl_sock. We could also write to m_low_lvl_sock on W2 (which would also allow us to use a different util::Task_engine from m_task_engine to exclusively deal with W2 and the UDP socket m_low_lvl_sock – simpler in some ways than the strand-based solution described below). There are a couple of problems with this. One, it may delay receive ops due to send ops, which compromises the goals of this project in the first place. Two, send pacing code is in thread W (and moving it to W2 would be complex and unhelpful); and once the decision to REALLY send a given datagram has been made, this send should occur right then and there – queuing that task on W2 may delay it, compromising the quality of send pacing (whose entire nature is about being precise down to the microsecond or even better). Therefore, we'd like to keep m_low_lvl_sock.async_send() in thread W along with all other work (which allows the vast majority of internal state to be accessed without locking, basically just the Send/Receive buffers excluded); except the chain of [m_low_lvl_sock.async_receive() -> post handler of received datagram onto thread W; and immediately m_low_lvl_sock.async_receive() again] would be on W2. AS WRITTEN, this is actually hard or impossible to do with boost.asio because of its design: m_low_lvl_sock must belong to exactly one Task_engine (here, m_task_engine), whereas to directly schedule a specific task onto a specific thread (as the above design requires) would require separate Task_engine objects (1 per thread): boost.asio guarantees a task will run on a thread which is currently executing run() – if 2 threads are executing run() on the same service, it is unknown which thread a given task will land upon, which makes the above design (AS WRITTEN) impossible. (Side note: I'm not sure it's possible in plain C with BSD sockets either. A naive design, at least, might have W select() on m_low_lvl_sock.native() for writability as well other stuff like timers, while W2 select()s on same for readability; then the two threads perform UDP send() and recv(), respectively, when so instructed by select()s. Is it allowed to use select() on the same socket concurrently like that? StackOverflow.com answers are not clear cut, and to me, at least, it seems somewhat dodgy.) However, an equivalent design IS possible (and supported cleanly by boost.asio): In addition to the 2 threads, set up 2 strands, S and S2. All work except the m_low_lvl_sock reads and posting of the handler onto S will be scheduled with a strand S. All work regarding m_low_lvl_sock reads and posting of its handler onto S will be scheduled with a strand S2. Recall that executing tasks in 1 strand, with 2+ threads executing run(), guarantees that no 2+ of those tasks will execute simultaneously – always in series. This is actually – in terms of efficiency and thread safety – equivalent to the above W/W2 design. Since the S tasks will always execute serially, no locking is necessary to prevent concurrent execution; thus what we know today as thread W tasks (which need no locking against each other) will be equally thread safe; and same holds for the new S2 tasks (which are considerably simpler and fewer in number). So that's the thread safety aspect; but why is efficiency guaranteed?
The answer becomes clear if one pictures the original thread W/W2 design; basically little task blocks serially pepper the thread W timeline; and same for W2. By doing it with strands S and S2 (running on top of threads W and W2), the only thing that changes is that the little blocks might at random be swapped between threads. So the series of tasks T1 -> T2 -> T3 -> T4 meant for S might actually jump between W and W2 randomly; but whenever thread W2 is chosen instead of thread W, that leaves an empty "space" in thread W, which will be used by the S2 task queue if it needs to do work at the same time. So tasks may go on either thread, but the strands ensure that both are used with maximum efficiency while following the expected concurrency constraints (that each strand's tasks are to be executed in series). Note, however, that m_low_lvl_sock (the socket object) is not itself safe for concurrent access, so we WILL need a lock to protect the tiny/short calls m_low_lvl_sock.async_receive() and m_low_lvl_sock.async_send(): we specifically allow that a read and write may be scheduled to happen simultaneously, since the two are independent of each other and supposed to occur as soon as humanly possible (once the desire to perform either one is expressed by the code – either in the pacing module in strand S or the read handler in S2). In terms of nomenclature, if we do this, it'd be more fair to call the threads W1 and W2 (as they become completely equal in this design). (In general, for any further extensions of this nature (if we want still more mostly-independent task queues to use available processor cores efficiently), we would add 1 strand and 1 worker thread for each such new queue. So basically there's a thread pool of N threads for N mostly-independent queues, and N strands are used to use that pool efficiently without needing to lock any data that are accessed exclusively by at most 1 queue's tasks only. Resources accessed by 2 or more task queues concurrently would need explicit locking (e.g., m_low_lvl_sock in this design).) So then where we start thread W today, we'd start the thread pool of 2 threads W1, W2, with each executing m_task_engine.run(). Before the run()s execute, the initial tasks (each wrapped in strand S or S2, as appropriate) need to be posted onto m_task_engine; this can even occur in the user thread U in the constructor, before W1 and W2 are created. The destructor would call m_task_engine.stop() (like today), ending each thread's run() and triggering the imminent end of that thread; at which point the destructor can call W1.join() and W2.join() (today it's just W.join()).

Receive UDP datagrams as soon as possible (avoid internal buffer overflow): APPROACH 1 (CO-WINNER!): One approach is to note that, as of this writing, we call m_low_lvl_sock.async_receive(null_buffers); the null_buffers value for the buffers arg means that the handler is called without any actual UDP receive being performed by boost.asio; our handler is called once there is at least 1 message TO read; and then indeed our handler does read it (and any more messages that may also have arrived). Well, if we pass in an actual buffer instead, then boost.asio will read 1 (and no more, even if there are more) message into that buffer and have it ready in the handler. Assuming the mainstream case involves only 1 message being ready, and/or assuming that reading at least 1 message each time ASAP would help significantly, this may be a good step toward relieving the problem, when it exists. The code becomes a tiny bit less elegant, but that's negligible. This seems like a no-brainer that should be included in any solution, but it by itself may not be sufficient, since more than 1 datagram may be waiting, and datagrams 2, 3, ... would still have to be read off by our code in the handler. So other approaches should still be considered.
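In boost.asio terms the APPROACH 1 change is roughly this (m_low_lvl_sock per the above; the buffer member is hypothetical):

    // Today: null_buffers => handler runs once the socket is readable, having read nothing yet.
    m_low_lvl_sock.async_receive(boost::asio::null_buffers(), on_readable_handler);

    // Proposed: supply a real buffer; datagram 1 is already off the kernel queue when the handler runs.
    m_low_lvl_sock.async_receive(boost::asio::buffer(m_packet_buf),  // m_packet_buf: pre-allocated, max-datagram-sized
                                 [this](const boost::system::error_code& err, std::size_t n_rcvd)
    {
      // Handle the n_rcvd-byte datagram now in m_packet_buf; then drain datagrams 2, 3, ... as today.
    });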

More uniform diagnostic logging: There is much diagnostic logging in the implementation (FLOW_ERROR*(), etc.), but some of it lacks useful info like sock or serv (i.e., the ostream representations of Peer_socket and Server_socket objects, which include the UDP/Flow endpoints involved in each socket). The messages that do include these have to do so explicitly. Provide some macros to automatically insert this info, then convert the code to use the macros in most places. Note that existing logging is rather exhaustive, so this is not the biggest of deals but would be nice for ease of coding (and detailed logging).

Based on some outside experience, there may be problems – particularly considering the to-do regarding dual-stack IPv6/v4 support – in servers listening in multiple-IP situations; make sure we support these seamlessly. For example consider a machine with a WAN IP address and a LAN (10.x.x.x) IP address (and possibly IPv6 versions of each also) that (as is typical) binds on all of them at ANY:P (where P is some port; and ANY is the IPv6 version, with dual-stack mode ensuring V4 datagrams are also received). If a client connects to a LAN IP, while in our return datagrams we set the source IP to the default, does it work? Outside experience shows it might not, depending, plus even if in our protocol it does, it might be defeated by some firewall... the point is it requires investigation (e.g., mimic TCP itself; or look into what IETF or Google QUIC does; and so on).

flow::net_flow should use flow::cfg for its socket-options mechanism. It is well suited for that purpose, and it uses some ideas from those socket-options in the first place but is generic and much more advanced. Currently net_flow socket-options are custom-coded from long before flow::cfg existed.

ostream output operators for Node and asio::Node should exist. Also scour through all types; possibly some others could use the same. (I have been pretty good at already implementing these as-needed for logging; but I may have "missed a spot.")

Some of the ostream<< operators we have take X* instead of const X&; this should be changed to the latter for various minor reasons and for consistency.

Actively support IPv6 and IPv4, particularly in dual-stack mode (wherein net_flow::Server_socket would bind to an IPv6 endpoint but accept incoming V4 and V6 connections alike). It already supports it nominally, in that one can indeed listen on either type of address and connect to either as well, but how well it works is untested, and from some outside experience it might involve some subtle provisions internally.

Receive UDP datagrams as soon as possible (avoid internal buffer overflow): APPROACH 4: It now occurs to me how to solve the main questionable part about APPROACH 3. If we fear that the reads and writes in thread W2 may compete for CPU, especially the reads delaying timing-sensitive paced writes, then we can eliminate the problem by taking W2's own util::Task_engine (which would be separate from m_task_engine) and having two equal threads W2' and W2'' start up and then each call Task_engine::run(). Without having to use any strand, this will essentially (as documented in boost.asio's doc overview) establish a thread pool of 2 threads and then perform the receive and send tasks at random on whichever thread(s) is/are available at a particular time. Since there's no strand(s) to worry about, the underlying overhead in boost.asio is probably small, so there's nothing to fear about efficiency. In fact, in this case we've now created 3 separate threads W, W2', W2'', all doing things independently of each other, which is an excellent feature in terms of using multiple cores. Do note we will now need a mutex and very short critical sections around the calls to m_low_lvl_sock.async_receive() and m_low_lvl_sock.async_send(), but as noted before this seems extremely unlikely to have any real cost due to the shortness of the critical sections in both threads. If this is APPROACH 4, then I'd say just time how much absolute delay is introduced by a post(W2') or post(W2'') of the async send call compared to directly making such a call on thread W, as is done today. I suspect it's small, in which case the action is to go for APPROACH 4... finally.

Receive UDP datagrams as soon as possible (avoid internal buffer overflow): APPROACH 5 (CO-WINNER!): Actually, the thing I've been dismissing in approaches 2-4, which was to combine the pacing logic with the actual m_low_lvl_sock.async_send() (like today) but in their own dedicated thread, now seems like the best way to solve the one questionable aspect of APPROACH 4. So, APPROACH 4, but: Move the pacing stuff into the task queue associated with threads W2' and W2''. So then these 2 threads/cores will be available for 2 task queues: one for pacing timers + datagram sending over m_low_lvl_sock (with mutex); the other for receiving over m_low_lvl_sock (with same mutex). Now, the only "delay" is moved to before pacing occurs: whatever minimal time cost exists of posting a task from thread W onto thread W2' or W2'' occurs just before the pacing logic, after which chances are the datagram will be placed on a pacing queue anyway and sent off somewhat later; intuitively this is better than the delay occurring between the pacing logic and the actual UDP send. Note, also, that the timing-sensitive pacing logic now gets its own thread/core and should thus work better vs. today in situations when thread W is doing a lot of work. This is even more logical than APPROACH 4 in that sense; the pacing and sending are concern 1 and get their own thread (essentially; really they get either W2' or W2'' for each given task); the receiving is concern 2 and gets its own thread (same deal); and all the rest is concern 3 and remains in its own thread W (until we can think of ways to split that into concerns; but that is another to-do). Only one mutex with 2 very small critical sections, as in APPROACH 4, is used. The only subtlety regarding concurrent data access is in Node::mark_data_packet_sent(), which is called just before m_low_lvl_sock.async_send(), and which finalizes certain aspects of Peer_socket::Sent_packet::Sent_when; most notably Peer_socket::Sent_packet::Sent_when::m_sent_time (used in RTT calculation upon ACK receipt later). This is stored in Peer_socket::m_snd_flying_pkts_by_sent_when, which today is not protected by mutex due to only being accessed from thread W; and which is extremely frequently accessed. So either we protect the latter with a mutex (out of the question: it is too frequently accessed and would quite possibly reduce performance) or something else. Currently I think Node::mark_data_packet_sent() should just be placed onto m_task_engine (thread W) via post() but perhaps take all or most of the items to update Sent_when with as arguments, so that they (especially Sent_when::m_sent_time) could be determined in thread W2' or W2'' but written thread-safely in W. (There is no way some other thread W task would mess with this area of Peer_socket::m_snd_flying_pkts_by_sent_when before the proposed mark_data_packet_sent() was able to run; thread W had just decided to send that packet over the wire in the first place; so there's no reason to access it until ACK – much later – or some kind of socket-wide catastrophe.) All that put together I dub APPROACH 5. Thus, APPROACH 1 + APPROACH 5 seems like the best idea of all, distilling all the trade-offs into the fastest yet close to simplest approach.

It may be desirable to use not boost.asio's out-of-the-box UDP receive routines but rather extensions capable of some advanced features, such as recvmsg() – which can obtain kernel receipt time stamps and destination IP address via the cmsg feature. This would tie into various other to-dos listed around here. There is, as of this writing, also a to-do in the top-level flow namespace doc header about bringing some code into a new io namespace/Flow module; this includes the aforementioned recvmsg() wrapper.
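
To make the idea concrete, a rough Linux-flavored sketch of such a recvmsg() wrapper, obtaining both ancillary items via cmsg (error handling elided; the setsockopt() names shown are Linux's; built as C++ with glibc, which defines _GNU_SOURCE):

  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <ctime>
  #include <cstring>

  // Beforehand, enable the ancillary data on socket fd:
  //   int on = 1;
  //   setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPNS, &on, sizeof on); // kernel receipt times
  //   setsockopt(fd, IPPROTO_IP, IP_PKTINFO, &on, sizeof on);     // destination IP

  void recv_one_with_cmsg(int fd)
  {
    char data[65536];
    char cmsg_buf[512];
    iovec iov{data, sizeof data};
    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cmsg_buf;
    msg.msg_controllen = sizeof cmsg_buf;

    if (recvmsg(fd, &msg, 0) < 0) { return; }

    for (cmsghdr* c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c))
    {
      if ((c->cmsg_level == SOL_SOCKET) && (c->cmsg_type == SCM_TIMESTAMPNS))
      {
        timespec ts;
        std::memcpy(&ts, CMSG_DATA(c), sizeof ts); // kernel receipt time stamp
      }
      else if ((c->cmsg_level == IPPROTO_IP) && (c->cmsg_type == IP_PKTINFO))
      {
        in_pktinfo info;
        std::memcpy(&info, CMSG_DATA(c), sizeof info); // info.ipi_addr = destination IP
      }
    }
  }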

It may be desirable to further use recvmmsg() for UDP input; this allows reading multiple UDP datagrams with a single call, for performance.
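
A Linux-only sketch of the batching idea (recvmmsg() is a Linux extension; buffer counts/sizes illustrative):

  #include <sys/socket.h>
  #include <cstddef>

  namespace
  {
    constexpr int N_BUFS = 32;
    constexpr std::size_t BUF_SZ = 65536;
  }

  int recv_batch(int fd)
  {
    static char bufs[N_BUFS][BUF_SZ];
    iovec iovs[N_BUFS];
    mmsghdr msgs[N_BUFS] = {};
    for (int i = 0; i != N_BUFS; ++i)
    {
      iovs[i] = {bufs[i], BUF_SZ};
      msgs[i].msg_hdr.msg_iov = &iovs[i];
      msgs[i].msg_hdr.msg_iovlen = 1;
    }
    // On success returns # of datagrams read; msgs[i].msg_len = i-th datagram's size.
    return recvmmsg(fd, msgs, N_BUFS, MSG_DONTWAIT, nullptr);
  }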

By the same token, wrapping sendmsg() and sendmmsg() may allow for further perf and feature improvements – in some ways potentially symmetrically to recvmsg() and recvmmsg() respectively. However, as of this writing, I (ygoldfel) see this as more of an opportunistic "look into it" thing and not something of active value; whereas recv[m]msg() bring actual features we actively desire for practical reasons.

Send and Receive buffer max sizes: These are set to some constants right now. That's not optimal. There are two competing factors: throughput and RAM. If the buffer is too small, throughput can suffer in practice if the receiver can't read the data fast enough (there are web pages that show this). Possibly with today's CPUs it's no longer true, but I don't know. If the buffer is too large, then with a lot of users a LOT of RAM can be eaten up (however, note that a server will likely be mostly sending, not receiving, therefore it may need smaller Receive buffers). Therefore, as in Linux 2.6.17+, the buffers should be adaptively sized. It may be non-trivial to come up with a good heuristic, but we may be able to copy Linux. The basic idea would probably be to use some total RAM budget and divide it up among the # of sockets (itself a constant estimate, or adaptive based on the number of sockets at a given time?). Also, buffer size should be determined on the Receive side; the Send side should make its buffer of equal size. Until we implement some sensible buffer sizing, it might be a good idea (for demo purposes with few users) to keep the buffers quite large. However, flow control (receive window) is now implemented and should cope well with momentary Receive buffer exhaustion. Related facts found on the web: In Linux, since long ago, the Send buffer size is determined by the other side's Receive buffer size (probably sent over in the SYN or SYN-ACK as the receive window). Also, in older Linuxes, the Receive buffer defaults to 128k but can be manually set. Supposedly the default can lead to low throughput in high-speed (gigabit+) situations. Thus Linux 2.6.17+ apparently made the Receive buffer size adaptive.

Drop Acknowledgments: DCCP, a somewhat similar UDP-based protocol, uses the concept of Data-Dropped acknowledgments. If a packet gets to the receiver, but the receiver is forced to drop it (for example, no Receive buffer space; not sure if there are other reasons in Flow protocol), then the sender will only find out about this by inferring the drop via Drop Timeout or getting acknowledgments for later data. That may take a while, and since receiver-side drops can be quite large, it would be more constructive for the receiver to send an un-ACK of sorts: a Data-Dropped packet informing the sender that specific data were dropped. The sender can then update his congestion state (and retransmit if we enable that). See RFC 4340 and 4341.

Add extra-thread-safe convention for setting options: We can provide a thread-safe (against other user threads doing the same thing) macro to set a given option. set_options() has a documented, albeit in practice not usually truly problematic, thread safety flaw if one calls options(), modifies the result, then calls set_options(). Since another thread may modify the Node's options between the two calls, the latter call may unintentionally revert an option's value. The macro would take an option "name" (identifier for the Node_options member), a Node, and a target value for the option and would work together with a helper method template to obtain the necessary lock, make the assignment to the internal option, and give up the lock. The implementation would probably require Node to expose its internal stored Node_options for the benefit of this macro only. Anyway, this feature is not super-important, as long as the user is aware that modifying options from multiple threads simultaneously may result in incorrect settings being applied.
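
A possible shape for that macro + helper pair – a generic sketch with stand-in types, since the real thing would live inside Node with m_opts/m_opts_mutex private (those member names are assumptions for illustration, not actual Node internals):

  #include <mutex>

  struct Node_options { unsigned int m_st_max_block_size; /* ...etc... */ };

  class Node
  {
  public:
    template<typename Opt_val>
    void set_option_thread_safely(Opt_val Node_options::* opt, const Opt_val& val)
    {
      std::lock_guard<std::mutex> lock(m_opts_mutex); // obtain the lock...
      m_opts.*opt = val;                              // ...assign just the one option...
    }                                                 // ...give up the lock.

  private:
    std::mutex m_opts_mutex;
    Node_options m_opts;
  };

  // The macro adds only the option-"name" sugar:
  #define FLOW_NODE_SET_OPTION(ARG_node, ARG_opt_name, ARG_val) \
    (ARG_node)->set_option_thread_safely(&Node_options::ARG_opt_name, ARG_val)

  // Usage: FLOW_NODE_SET_OPTION(&node, m_st_max_block_size, 8192u);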

The preceding to-do regarding Node_options applies, in an analogous way, to Peer_socket_options stored in Peer_socket.

Consider removing Flow ports and even Server_socket: As explained above, we add the concept of a large set of available Flow ports within each Node, and each Node itself has a UDP port all to itself. So, for example, I could bind a Node to UDP port 1010, and within that listen() on Flow ports 1010 (unrelated to UDP port 1010!) and 1011. In retrospect, though, is that complexity necessary? We could save quite a few lines of code, particularly in the implementation (class Port_space, for example) and the protocol (extra bytes for Flow source and target ports, for example). (They're fun and pretty lines, but the absence of lines is arguably even prettier albeit less fun. On the other hand, bugs aren't fun, and more code implies a higher probability of bugs, maintenance errors, etc.) The interface would also be a bit simpler; and not just due to fewer items in Remote_endpoint (which would in fact reduce to util::Udp_endpoint and cease to exist). Consider Server_socket; currently listen() takes a flow_port_t argument and returns a Server_socket which is listening; calling accept() (etc.) on the latter yields Peer_socket, as the other side connects. Without Flow ports, there is no argument to listen(); in fact, Server_socket itself is not strictly necessary and could be folded into Node, with listen() becoming essentially something that turns on the "are we listening?" Boolean state, while stop_listening() would turn it off (instead of something like Server_socket::close()). (Important note: That was not an endorsement of removing Server_socket. Arguably it is still a nice abstraction. Removing it would certainly remove some boiler-plate machinery to do with Server_socket's life cycle, on the other hand. Perhaps it's best to take a two-step approach: remove Flow ports first; then, after a long time, assuming it becomes clear that nothing like them is going to come back, remove Server_socket as well.) A key question is, of course, what would we lose? At first glance, Flow ports allow multiple connections on a single UDP-port-taking Flow server, including multiple connections from one client (e.g., with differing connection parameters such as reliability levels among the different connections, or "channels")... but actually that'd still work without Flow ports, assuming the "one client's" multiple connections can bind to different (presumably ephemeral) UDP ports; since the tuple (source host, source UDP port) is still enough to distinguish the 2+ "channels" of the same "client" connecting to the one Flow Node (vs. today's tuple: source host, source UDP port, source Flow port, destination Flow port; see struct Socket_id). However, without Flow ports, it is not possible for one Node to connect to another Node twice, as each Node by definition is on one port. Is this important? Maybe, maybe not; for NAT purposes it can be important to use only 1 port; but that typically applies only to the server, while the client can send packets from anywhere. However, gaming applications can be very demanding, and for the most restrictive NAT types they might desire only a single port used on both sides. So whether to remove Flow ports is somewhat questionable, now that they exist; but arguably they didn't need to be added in the first place, until they were truly needed. I'd probably leave them alone, since they do exist.

Multi-core/multi-threading: The existing implementation already has a nice multi-threaded property, namely that each Node (object that binds to a single UDP endpoint/port) is entirely independent of any other such object – they have entirely separate data, and each one does all its work on a separate thread. So to make use of multiple cores/processors, one can set up multiple Node objects. (Obviously this only makes sense for apps where using multiple ports is acceptable or even desired. E.g., a server could listen on N different UDP ports, where N=# of cores.) However, it would be nice for a single Node to be as multi-core/processor-friendly as possible. This is partially addressed by the "Dedicated thread to receive UDP datagrams ASAP" to-do item elsewhere. We could go further. boost.asio lets one easily go from 1 thread to multiple threads by simply starting more worker threads like W (W1, W2, ...) and executing m_task_engine.run() in each one – note that m_task_engine is shared (sans lock). Then subsequent handlers (timer-fired handlers, ack-received handlers, data-received handlers, and many more) would be assigned evenly to available threads currently executing run(). However, then all data these handlers access would need to be protected by a mutex or mutexes, which would be a significant increase in complexity and maintenance headaches compared to existing code, which features mutexes only for data accessed by both W and the user thread(s) U – essentially the user-visible "state" enums and the Receive and Send buffers, which excludes most Node, Server_socket, Peer_socket state; hugely complex things like the scoreboards, etc., need no mutex protection today, but with this change they would need it. Actually, if the implementation essentially uses one mutex M, and every handler locks it for the entirety of its execution, then one isn't really making use of multiple cores/etc. anyway. One could make use of boost.asio "strands" to avoid the need for the mutex – just wrap every handler in a shared strand S, and no locking is needed; boost.asio will never execute two handlers simultaneously in different threads. While this would arguably make the code simpler, in terms of performance it wouldn't make any difference anyway, as it is functionally equivalent to the lock-M-around-every-operation solution (in fact, internally, it might even amount to exactly that). So that's probably not worth it. We need to have more mutexes or strands, based on some other criterion/criteria. After a datagram is demuxed, the vast majority of work is done on a particular socket independently of all others. Therefore we could add a mutex (or equivalent strand) to the socket object and then lock on that. Multiple threads could then concurrently handle multiple sockets. However, for this to matter, one would have to use a single Node (UDP endpoint) with multiple sockets at the same time. Without any changes at all, one can get the same concurrency by instead setting up multiple Node objects. Other than a bit of lost syntactic sugar (arguably) – multiple Node objects needed, each one having to initialize with its own set of options, for example – this is particularly cost-free on the client side, as each Node can just use its own ephemeral UDP port. On the server side the network architecture has to allow for multiple non-ephemeral ports, and the client must know to (perhaps randomly) connect to one of N UDP ports/endpoints on the server, which is more restrictive than on the client.
So perhaps there are some reasons to add the per-socket concurrency – but I would not put a high priority on it. IMPORTANT UPDATE: Long after the preceding text was written, the flow::async Flow module was created, containing the flow::async::Concurrent_task_loop interface. That looks at the general problem of multi-tasking thread pools and what's the user-friendliest-yet-most-powerful way of doing it. While the preceding discussion in this to-do has been left unchanged, one must first familiarize oneself with flow::async and then reread the above – both because some of those older ideas might need reevaluation, and because some of them may have since been implemented by flow::async and are now easily available.
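
For reference, the per-socket-strand flavor discussed above might look roughly like this in modern boost.asio (illustrative only; flow::async's Concurrent_task_loop may already subsume this pattern):

  #include <boost/asio.hpp>
  #include <thread>
  #include <vector>

  namespace asio = boost::asio;

  int main()
  {
    asio::io_context task_engine;
    auto work = asio::make_work_guard(task_engine);

    // One strand per socket: a given socket's handlers are serialized (no mutex),
    // while different sockets' handlers may run truly in parallel.
    auto sock_a_strand = asio::make_strand(task_engine);
    auto sock_b_strand = asio::make_strand(task_engine);
    asio::post(sock_a_strand, []{ /* touch only socket A's state here */ });
    asio::post(sock_b_strand, []{ /* touch only socket B's state here */ });

    std::vector<std::thread> pool; // W1, W2, ...: all run() the shared engine
    for (int i = 0; i != 4; ++i) { pool.emplace_back([&]{ task_engine.run(); }); }
    work.reset(); // let run() return once the queue drains
    for (auto& t : pool) { t.join(); }
  }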

In Node::low_lvl_packet_sent(), the UDP async_send() handler, there is an inline to-do about specially treating the corner case of the would_block and try_again boost.asio errors being reported (a/k/a POSIX EAGAIN/EWOULDBLOCK). Please see that inline comment for details.

Class Node's private section is very large. It's so large that the implementations of the methods therein are split up among different files such as flow_peer_socket.cpp, flow_event_set.cpp, flow_low_lvl_io.cpp, etc. Consider the following scheme to both better express this separation and enforce which of a given method group's methods are meant to be called by outside code vs. being helpers thereof: introduce static-method-only inner classes (and, conceivably, even classes within those classes) to enforce this grouping (public methods vs. private methods enforcing what is a "public" helper vs. a helper's helper).
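
A tiny sketch of the enforcement mechanism – note that, since an enclosing class has no special access to a nested class's private members, the compiler really would enforce the helper-of-helper distinction (Node reduced to a stub; names illustrative):

  class Node
  {
  private:
    // One method group, e.g., the stuff currently in flow_low_lvl_io.cpp:
    class Low_lvl_io
    {
    public:
      static void async_receive(Node* node) { setup_buffers(node); /* ... */ } // the group's "API"
    private:
      static void setup_buffers(Node*) { /* ... */ } // helper's helper: hidden from other groups
    };

    void method_from_some_other_group()
    {
      Low_lvl_io::async_receive(this);    // OK.
      // Low_lvl_io::setup_buffers(this); // Compile error: the grouping is enforced.
    }
  };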

We are now on Boost 1.75; the use of asio's null_buffers semantics is deprecated and should be changed to the replacement mechanism suggested in Boost docs – the async_wait() method.
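
The migration would look roughly like this (sketch; the real code would use m_low_lvl_sock and Flow's handler conventions):

  #include <boost/asio.hpp>

  namespace asio = boost::asio;

  // Old, deprecated style: sock.async_receive(asio::null_buffers(), handler);
  // New style: wait for readability, then do non-blocking synchronous read(s).
  void arm_read(asio::ip::udp::socket& sock)
  {
    sock.async_wait(asio::ip::udp::socket::wait_read,
                    [&sock](const boost::system::error_code& err)
    {
      if (!err)
      {
        // Socket is readable: non-blocking receive_from() call(s) here; then re-arm:
        arm_read(sock);
      }
    });
  }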

Make use of flow::async::Concurrent_task_loop or flow::async::Single_thread_task_loop, instead of manually setting up a thread and util::Task_engine, for m_worker. I, Yuri, wrote the constructor, worker_run(), destructor, and related setup/teardown code as my very first boost.asio activity ever. It's solid, but flow::async just makes it easier and more elegant to read/maintain; plus this would increase Flow-wide consistency. It would almost certainly reduce the number of methods and, nicely, even some state (such as m_event_loop_ready).
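
A rough sketch of the target state – hedged, in that the exact Single_thread_task_loop signatures (ctor taking a Logger* and a nickname; start()/post()) and header path should be double-checked against the flow::async docs:

  #include <flow/async/single_thread_task_loop.hpp>

  void start_worker(flow::log::Logger* logger)
  {
    // Replaces the hand-rolled thread + util::Task_engine + m_event_loop_ready dance:
    flow::async::Single_thread_task_loop worker(logger, "nf_wkr");
    worker.start();                 // spawns and readies the one thread
    worker.post([]() { /* ...what worker_run() does today... */ });
  } // Destruction stops the loop and joins the thread.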

Member flow::net_flow::Node::close_connection_immediately (const Socket_id &socket_id, Peer_socket::Ptr sock, const Error_code &err_code, bool defer_delta_check)
Graceful close not yet implemented w/r/t close_connection_immediately().
Member flow::net_flow::Node::send_worker (Peer_socket::Ptr sock, bool defer_delta_check)
Are there other states where sending DATA packets is OK? If so, it would be during graceful termination, if we implement it. See send_worker() for context for this to-do.
Member flow::net_flow::Node::socket_id (Peer_socket::Const_ptr sock)
Could make it a Socket_id constructor instead.
Member flow::net_flow::Node::validate_options (const Node_options &opts, bool init, Error_code *err_code) const
Is it necessary to return opts now that we've switched to C++11 or better?
Member flow::net_flow::Node::~Node () override

Provide another method to shut down everything gracefully?

Server_socket objects closed as if by what?

Member flow::net_flow::operator<< (std::ostream &os, Peer_socket::Int_state state)
There are a few guys like this which are marked @internal (Doxygen command) to hide from generated public documentation, and that works, but really they should not be visible in the publicly-exported (not in detail/) header source code; so this should be reorganized for cleanliness. The prototypes like this one can be moved to a detail/ header or maybe directly into .cpp that uses them (for those where it's only one).
Member flow::net_flow::operator>> (std::istream &is, Peer_socket_options::Congestion_control_strategy_choice &strategy_choice)
Peer_socket_options::Congestion_control_strategy_choice stream insertion << and extraction >> operators should use the flow::util::istream_to_enum() pattern, which is much easier than the overwrought old thing in there now involving two maps. Perhaps add a generic enum_to_strings() to provide the body for net_flow::Congestion_control_selector::get_ids().
Class flow::net_flow::Peer_socket

Closing connection considerations. May implement closing only via timeouts at first (as opposed to explicit closing). The text below refers to close_final() and close_start(), but those are just ideas and may be replaced with a timeout, or nothing. At this time, the only closing supported is abrupt close due to error or abrupt close via close_abruptly().

Rename State and Open_sub_state to Phase and Open_sub_phase respectively; and Int_state to State. Explain difference between phases (application-layer, user-visible and used close to application layer) and states (transport layer, internal).

Look into a way to defeat the need for boiler-plate trickery – with low but non-zero perf cost – involving *_socket-vs-Node circular references in method templates, such as the way Peer_socket::send() and Peer_socket::receive() internally make Function<>s before forwarding to the core in Node. Can this be done with .inl files? Look into how Boost internally uses .inl files; this could inspire a solution... or not.

Member flow::net_flow::Peer_socket::close_abruptly (Error_code *err_code=0)

Currently this close_abruptly() is the only way for the user to explicitly close one specified socket. All other ways are due to error (or other side starting graceful shutdown, once we implement that). Once we implement graceful close, via close_start() and close_final(), use of close_abruptly() should be discouraged, or it may even be deprecated (e.g., Nodes lack a way to initiate an abrupt close for a specific socket).

Should close_abruptly() return bool (false on failure)?

Member flow::net_flow::Peer_socket::info () const
Provide a similar info() method that loads an existing structure (for structure reuse).
Member flow::net_flow::Peer_socket::Int_state
Peer_socket::Int_state will also include various states on way to a graceful close, once we implement that.
Member flow::net_flow::Peer_socket::m_node
boost::weak_ptr<Node> would be ideal for this, but of course then Node would have to (only?) be available via shared_ptr<>.
Member flow::net_flow::Peer_socket::m_rcv_packets_with_gaps
The memory use of this structure could be greatly optimized if, instead of storing each individual received packet's metadata separately, we always merged contiguous sequence number ranges. So for example if packet P1, P2, P3 (contiguous) all arrived in sequence, after missing packet P0, then we'd store P1's first sequence number and the total data size of P1+P2+P3, in a single struct instance. Since a typical pattern might include 1 lost packet followed by 100 received packets, we'd be able to cut down memory use by a factor of about 100 in that case (and thus worry much less about the limit). Of course the code would get more complex and potentially slower (but not necessarily significantly).
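
A sketch of the merged-run bookkeeping, with a plain integer standing in for Sequence_number and overlap-free arrivals assumed (names illustrative):

  #include <cstdint>
  #include <iterator>
  #include <map>

  class Rcv_run_set
  {
  public:
    using Seq = std::uint64_t; // stand-in for Sequence_number; lengths are byte counts

    // Record arrival of [seq, seq + size); assumes no overlap with existing runs.
    void record(Seq seq, Seq size)
    {
      auto next = m_runs.lower_bound(seq);
      if ((next != m_runs.end()) && (seq + size == next->first)) // coalesce with following run
      {
        size += next->second;
        next = m_runs.erase(next);
      }
      if (next != m_runs.begin())
      {
        const auto prev = std::prev(next);
        if (prev->first + prev->second == seq) // coalesce with preceding run
        {
          prev->second += size;
          return;
        }
      }
      m_runs.emplace(seq, size); // isolated run (so far)
    }

  private:
    std::map<Seq, Seq> m_runs; // first seq # of a contiguous run -> total bytes in run
  };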
Member flow::net_flow::Peer_socket::m_snd_next_seq_num
Possibly m_snd_next_seq_num will apply to other packet types than DATA, probably anything to do with connection termination.
Member flow::net_flow::Peer_socket::Mutex
This doc header for Peer_socket::Mutex should specify what specific behavior requires mutex reentrance, so that for example one could reevaluate whether there's a sleeker code pattern that would avoid it.
Member flow::net_flow::Peer_socket::options () const
Provide a similar options() method that loads an existing structure (for structure reuse).
Member flow::net_flow::Peer_socket::Sent_packet::Sent_when::m_order_num
Can we make Sent_when::m_order_num and some of its peers const?
Member flow::net_flow::Peer_socket::Sent_packet::Sent_when::m_sent_cwnd_bytes

Can we make m_sent_cwnd_bytes and some of its peers const?

Why is m_sent_cwnd_bytes in Sent_when struct and not directly in Sent_packet? Or maybe it should stay here, but Sent_when should be renamed Send_attempt (or Send_try for brevity)? Yeah, I think that's it. Then Sent_packet::m_sent_when – a container of Sent_when objects – would become m_send_tries, a container of Send_try objects. That makes more sense (sentce?!) than the status quo which involves the singular-means-plural strangeness.

Member flow::net_flow::Peer_socket_info::m_low_lvl_max_buf_size
This is really a Node-wide value; we should probably have a Node_info class for such things.
Member flow::net_flow::Peer_socket_info::m_snd_cong_ctl_in_flight_bytes
Does m_snd_cong_ctl_in_flight_bytes count data queued in the pacing module or truly In-flight data only?
Member flow::net_flow::Peer_socket_options::m_st_cong_ctl_max_cong_wnd_blocks
Reconsider this value after Receive window feature is implemented.
Member flow::net_flow::Peer_socket_options::m_st_max_block_size
max-block-size should be dynamically determined (but overridable by this option). That is a complex topic, however, with considerations such as MTU discovery, ICMP errors, and who knows what else.
Member flow::net_flow::Peer_socket_send_stats::m_init_time
Peer_socket_receive_stats also independently stores a similar value, so to save memory put m_init_time elsewhere.
Member flow::net_flow::Peer_socket_send_stats_accumulator::low_lvl_packet_xfer_requested (const std::type_info &type)
It may make sense to serialize a packet as early as possible, so that the size is known at the stage of Peer_socket_send_stats_accumulator::low_lvl_packet_xfer_requested() call. Possibly knowing the serialized size this early can help the pacing module make smarter decisions.
Class flow::net_flow::Port_space
While the ephemeral port reuse avoidance works well within a single Port_space lifetime, it does not carry over across different Port_spaces. I.e., if Port_space is destructed and then constructed, the container of recently used ports starts out empty, so a port may (with some low but sizable probability) be reused. This could be avoided by persisting (e.g., via boost.serialization) those structures to disk (perhaps at a user-supplied boost::filesystem::path) in ~Port_space(). This wouldn't work if Port_space crashed, however. To solve that, we could easily start a thread in Port_space() (and join it in ~Port_space()) that would save that state every N seconds. If that's not enough, we can instead have that thread run a boost.asio io_service::run() event loop and io_service::post() the save operation onto it (from arbitrary threads) each time a new ephemeral port is reserved (with some kind of protection against incessant disk writing built into this worker thread; e.g., don't save if saved in the last N msec).
Class flow::net_flow::Remote_endpoint

There is a sub-case of the immediately preceding to-do that may be performed first without much controversy. That is the case where there is some member method of some type, that is then called by an extremely thin wrapper free function (not even a friend). In fact, hash_value() and Remote_endpoint::hash() are such a pair. Assuming one does not forget Doxygen's relatesalso command, it would be easy and concise to just get rid of the member and simply move its implementation directly into the free function. After all, people are meant to use the free function anyway, so why the middle-man method? In this example, hash_value() would directly compute the hash, and Remote_endpoint::hash() would not exist; but the former would include a Doxygen relatesalso Remote_endpoint command to ensure properly linked generated documentation.
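
A sketch of the result for this particular pair; the Remote_endpoint member names here are assumptions from the surrounding text and should be double-checked, and the stand-in struct merely mirrors the documented contents:

  #include <boost/asio.hpp>
  #include <boost/functional/hash.hpp>
  #include <cstddef>

  struct Remote_endpoint
  {
    boost::asio::ip::udp::endpoint m_udp_endpoint;
    unsigned short m_flow_port; // flow_port_t stand-in
    // The middle-man method Remote_endpoint::hash() would simply no longer exist.
  };

  /// @relatesalso Remote_endpoint
  std::size_t hash_value(const Remote_endpoint& endpoint)
  {
    std::size_t seed = 0;
    boost::hash_combine(seed, endpoint.m_udp_endpoint.address().to_string());
    boost::hash_combine(seed, endpoint.m_udp_endpoint.port());
    boost::hash_combine(seed, endpoint.m_flow_port);
    return seed;
  }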

The respected Scott Meyers ("Effective C++") would recommend that Remote_endpoint::hash() and Remote_endpoint::operator==() be written as free functions instead of as struct members (see http://www.drdobbs.com/cpp/how-non-member-functions-improve-encapsu/184401197). The same would apply to a potentially great number of other non-virtual methods (including operators) of other structs and classes that could be implemented without friend, so really pursuing this approach could touch quite a few things. I won't repeat Meyers' reasoning; see the link. I find the reasoning mostly compelling. To summarize his final suggestions: An operation on a type should be a free function if ALL of the following holds: it is not virtual by nature; AND: it is << or >>, and/or its would-be left operand would require type conversions, and/or it can be implemented entirely via the type's publicly exposed interface. We already follow the advice for << and >> (if only because we apply it to stream ops, and for that it just makes sense, << especially, since the stream is the left-most operand there, so it has to be a free function in that case – and thus also for >> for consistency). The type conversion is not a common thing for this library; so that leaves non-virtual operations (which is most of them) that can be implemented via public APIs only (which is probably common for structs like Remote_endpoint, though we generally try not to have lots of struct methods in the first place... usually). Before rushing headlong into this project, consider a few more things, though. 1, Doxygen wouldn't pick up the relation between a non-friend free function dealing with struct or class C and C itself; so verbosity/error-proneness would increase by having to add a Doxygen relatesalso special command to each free function; otherwise documentation becomes less readable (and there's no way to enforce this by having Doxygen fail without this being done, somehow). 2, in the C-returning-static-member-of-C pattern, usually in our code it would call a private C constructor, meaning it would require friend to make it a free function, meaning it breaks Meyers' own rule and thus should be left a member. 3, Meyers tends to place very little value on brevity as its own virtue. If you find that following the above rule in some case seems to be significantly increasing code complexity/size, maybe it's best to leave it alone. (I am thinking of Low_lvl_packet and its sub-types like Ack_packet: they are structs but not typical ones, with complex APIs; making those APIs free functions when non-virtual sounds like a rather hairy change that "feels" like it would reduce simplicity and may increase size, at least due to all the necessary 'splaining in the comments.) All that said, this is perhaps worth pursuing (in the pursuit of stylistic perfection) – but surgically and first-do-no-harm-edly. Oh, also, constructors don't count for this and should continue to remain constructors and not some free-function stuff (should be obvious why, but just in case you get any ideas...).

Class flow::net_flow::Send_bandwidth_estimator
Look into Google BBR, a recent (as of 2016) state-of-the-art congestion control algorithm that aims to estimate bandwidth among other things. Depending on what one decides, another bandwidth estimator class could be written.
Class flow::net_flow::Send_pacing_data

One way to provide finer pacing, in the situation where we must send more than 1 packet within the minimal schedulable timer period, is to rely on events other than timer clock ticks that occur more frequently. Namely, ACKs will possibly arrive faster than the minimal schedulable timer ticks do. Therefore, when a time slice begins, we can send no packets (instead of all the packets that must be sent in that time slice, as is the case currently). Then over the course of the time slice we can opportunistically – upon each ACK – send the proper number of queued packets up to that point into the time slice (this can be computed accurately, because we can MEASURE time very accurately – we just can't schedule things as accurately). Finally, once the time slice does expire, send off any packets still remaining for that slice. This way we can opportunistically provide finer pacing, where the ACK clock allows it.
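
The arithmetic involved is simple; a sketch, with all names and the clock choice purely illustrative (called upon each ACK arrival, and once more when the slice expires):

  #include <chrono>
  #include <cstddef>

  using Clock = std::chrono::steady_clock;

  std::size_t n_owed_now(Clock::time_point slice_start, Clock::duration slice_period,
                         std::size_t slice_budget,  // packets to send over the whole slice
                         std::size_t sent_so_far)   // packets already sent in this slice
  {
    const auto elapsed = Clock::now() - slice_start;
    // Fraction of the slice elapsed, applied to the slice's packet budget:
    const std::size_t owed = (elapsed >= slice_period)
      ? slice_budget // slice over: whatever remains is due now
      : std::size_t((slice_budget * std::size_t(elapsed.count()))
                      / std::size_t(slice_period.count()));
    return (owed > sent_so_far) ? (owed - sent_so_far) : 0;
  }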

Obtain the first RTT measurement based on SYN-SYN_ACK or SYN_ACK-SYN_ACK_ACK interval; then SRTT will be available from the start, and packet pacing can be enabled with the very first DATA packet, avoiding the initial (albeit small) burst of DATA packets. This is pretty easy.

As noted above we perform packet pacing but currently choose to assign a value of 0 bytes to an ACK. That is, while we do preserve the order of DATA and ACK packets – if both happen to be in the outgoing stream – we do not delay the sending of the ACK once it is the next packet to be sent out. However, even so, an ACK's sending may be delayed by the pacing applied to DATA packets intermixed with it. Therefore the ACK delay measurement we take here may be incorrect (too low) in that case. This can cause overestimated RTTs on the sender's side. The to-do is to correct the ACK delay value in a given ACK by adding the pacing delay (if any) of the ACK to the individual ACK delays within it. Conceptually this is similar to the m_sent_when value being set when choosing to send a DATA packet and then corrected in the pacing module later. This to-do is not important until we in practice start mixing sending and receiving at the application layer... but still – it's worth knowing that there is a design bug here.

My quick look into Google BBR, the recent advanced congestion control algorithm, suggests that the pacing algorithm can be used as part of congestion control rather than something merely applied after congestion control has made the CWND decision. I believe the idea would be this: If we can somehow very reliably determine the available bandwidth, which Google BBR does purport to do, then instead of controlling send rate via CWND, instead one could apply it to this pacing module. It is a slight modification of the algorithm described above: instead of the rate of sending being equal to CWND/SRTT, make it equal to the bandwidth determined through Google BBR. If I understand correctly, Google BBR does maintain a CWND as well, but this is a safety upper bound, in case the computed bandwidth is too high to be practical (in this case CWND will slow down data being delivered to the pacing module – but, normally, this would not be the case, and the pacing module would be, in practice, the thing controlling the sending rate). Note this is a pretty big conceptual change; in particular, it would make the queue Q grow much larger than what we see currently.

Member flow::net_flow::Sequence_number::seq_num_delta_t
Allowing overflow by using this definition of seq_num_delta_t is somewhat unlike the rest of the implementation of Sequence_number which avoids overflow like the plague. Should we do something like use double or a big-integer type from boost.multiprecision?
Class flow::net_flow::Server_socket

Implement Server_socket::close() functionality – at least the equivalent of Peer_socket::close_abruptly().

Limit Server_socket listen queue/set length.

Rename State to Phase, as with similar to-do in class Peer_socket doc header.

Member flow::net_flow::Server_socket::accept (Error_code *err_code=0)
Reconsider allowing successful accept() in State::S_CLOSING state?
Member flow::net_flow::Server_socket::m_local_port
Make m_local_port const?
Member flow::net_flow::Server_socket::Mutex
This doc header for Server_socket::Mutex should specify what specific behavior requires mutex reentrance, so that for example one could reevaluate whether there's a sleeker code pattern that would avoid it.
Member flow::net_flow::Socket_buffer::copy_bytes_from_buf_seq (typename Const_buffer_sequence::const_iterator *cur_buf_it, size_t *pos_in_buf, size_t to_copy, util::Blob *dest_buf, util::Blob::Iterator dest)
It would not be crazy to move copy_bytes_from_buf_seq() and copy_bytes_to_buf_seq() into util::Blob to make it more widely available, though as of this writing there is no demand for this: perhaps a buffer-sequence version of util::Blob::emplace_copy() and in the other direction. In that case util::Blob::assign_copy() and all single-buffer util::Blob methods should probably be similarly generalized (overloaded).
Member flow::net_flow::Socket_buffer::Queue
Investigate pros and cons of deque vs. list empirically.
Member flow::net_flow::Syn_ack_packet::m_rcv_wnd
We could be similar to TCP by opportunistically sending rcv_wnd in other packet types, namely Data_packet. However this would only help in connections with heavy 2-way traffic. Personally I would then prefer to create a new packet type instead, Rcv_wnd_packet, and also implement some generalized "packet combo" scheme which would allow piggy-backing arbitrary packet types together into a single packet; and then we'd dissociate ACK/SYN_ACK/SYN_ACK_ACK from rcv_wnd.
Member flow::net_flow::Syn_packet::m_serialized_metadata
Possibly eliminate this, since NetFlow is reliable; the metadata can just be sent explicitly by the user once the connection is established. However, for now reliability is optional.
Class flow::perf::Checkpointing_timer
Convert const std::string& to String_view throughout flow::perf where applicable. Look into std::string&& also.
Member flow::perf::Clock_type
Consider adding a system-calendar-clock (a/k/a POSIX time) type to perf::Clock_type. It would be a cousin of Clock_type::S_REAL_HI_RES. It would certainly be inferior in terms of resolution/monotonicity/etc., and one would think S_REAL_HI_RES would always be preferable. Nevertheless it would be interesting to "officially" see its characteristics including in particular (1) resolution and (2) its own perf cost especially vs. S_REAL_HI_RES which we know is quite fast itself. This may also help a certain to-do listed as of this writing in the doc header of flow::log FLOW_LOG_WITHOUT_CHECKING() (the main worker bee of the log system, the one that generates each log time stamp).
Member flow::perf::operator*= (Duration_set &target, uint64_t mult_scale)
Maybe allow multiplying a Duration_set, via operator*=, by a potentially negative number; same for division.
Member flow::perf::timed_function (Clock_type clock_type, Accumulator *accumulator, Func &&function)

timed_function() overload exists for a single Clock_type, but simultaneous multi-clock timing using the perf::Clock_types_subset paradigm (as used, e.g., in Checkpointing_timer) would be a useful and consistent API. E.g., one could measure user and system elapsed time simultaneously. As of this writing this does not exist only due to time constraints: a perf-frugal version targeting one clock type was necessary.

timed_function(), when operating on an atomic<duration_rep_t>, uses += for accumulation which may be lock-free but uses strict ordering; a version that uses fetch_add() with relaxed ordering may be desirable for extra performance at the cost of not-always-up-to-date accumulation results in all threads. As of this writing this can be done by the user by providing a custom type that defines += as explicitly using fetch_add() with relaxed ordering; but we could provide an API for this.
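
Until such an API exists, the custom-type workaround might look like this sketch (uint64_t standing in for duration_rep_t):

  #include <atomic>
  #include <cstdint>

  // Accumulator whose += maps to a relaxed fetch_add; timed_function() would then
  // accumulate into this type instead of a bare atomic.
  struct Relaxed_accumulator
  {
    std::atomic<std::uint64_t> m_raw{0};

    void operator+=(std::uint64_t duration_delta)
    {
      // Relaxed ordering: the total stays correct; other threads may merely observe it late.
      m_raw.fetch_add(duration_delta, std::memory_order_relaxed);
    }
  };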

Namespace flow::util
Since Flow gained its first users beyond the original author, some Flow-adjacent code has been written from which Flow can benefit, including additions to flow::util module. (Flow itself continued to be developed, but some features were added elsewhere for expediency; this is a reminder to factor them out into Flow for the benefit of all.) Some features to migrate here might be: conversion between boost.chrono and std.chrono types; (add more here).
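
For instance, the boost.chrono/std.chrono conversion might reduce to a one-liner along these lines (sketch; name illustrative):

  #include <boost/chrono.hpp>
  #include <chrono>

  // Transplant rep and period from a boost.chrono duration into the analogous std.chrono one.
  template<typename Rep, typename Period>
  std::chrono::duration<Rep, std::ratio<Period::num, Period::den>>
    to_std_duration(boost::chrono::duration<Rep, Period> dur)
  {
    return std::chrono::duration<Rep, std::ratio<Period::num, Period::den>>(dur.count());
  }

  // E.g.: to_std_duration(boost::chrono::milliseconds(5)) == std::chrono::milliseconds(5).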
Class flow::util::Basic_blob< Allocator, S_SHARING_ALLOWED >
Write a class template, perhaps Tight_blob<Allocator, bool>, which would be identical to Basic_blob but forego the framing features, namely size() and start(), thus storing only the RAII array pointer data() and capacity(); rewrite Basic_blob in terms of this Tight_blob. This simple container type has had some demand in practice, and Basic_blob can and should be cleanly built on top of it (perhaps even as an IS-A subclass).
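
A minimal sketch of the Tight_blob core (illustrative only; the real thing would mirror Basic_blob's allocator handling, template parameters, and API conventions):

  #include <cstddef>
  #include <cstdint>
  #include <memory>

  template<typename Allocator = std::allocator<std::uint8_t>>
  class Tight_blob
  {
  public:
    explicit Tight_blob(std::size_t capacity) :
      m_capacity(capacity),
      m_data(std::allocator_traits<Allocator>::allocate(m_alloc, capacity))
    {}
    ~Tight_blob() { std::allocator_traits<Allocator>::deallocate(m_alloc, m_data, m_capacity); }
    Tight_blob(const Tight_blob&) = delete;
    Tight_blob& operator=(const Tight_blob&) = delete;

    std::uint8_t* data() { return m_data; }
    std::size_t capacity() const { return m_capacity; }
    // Note: no size()/start() framing; Basic_blob would add those on top.

  private:
    Allocator m_alloc;
    std::size_t m_capacity;
    std::uint8_t* m_data; // the RAII array pointer
  };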
Namespace flow::util::bind_ns
Investigate deeply the cause of the strange behavior with unqualified bind() and others described above this line in the source code.
Member flow::util::Linked_hash_map< Key, Mapped, Hash, Pred >::m_value_list
It is probably possible to cut down on the memory cost of storing, for each element, a copy of the Key in m_value_list (in addition to the mandatory one in the lookup table m_keys_into_list_map). Perhaps the key copy would be replaced by an iterator back into m_value_list. A custom iterator class would be necessary to properly dereference this (this is non-trivial given that operator*() would have to return a reference to a pair which is no longer stored anywhere in this hypothetical design). Moreover, iterators exposed to the user would become invalid the same way an unordered_map<> iterator does due to seemingly unrelated changes. Finally, the memory savings would not even exist for Key types roughly the size of a pointer. All in all, not a slam-dunk....
Member flow::util::Lock_guard_non_recursive
Lock_guard_non_recursive is deprecated, now that C++11 made the more flexible Lock_guard<Mutex_non_recursive> possible. Remove it and all (outside) uses eventually.
Member flow::util::Lock_guard_noop_shared_non_recursive_ex
Lock_guard_noop_shared_non_recursive_ex is deprecated, now that C++11 made the more flexible Lock_guard<Mutex_noop_shared_non_recursive> possible. Remove it and all (outside) uses eventually.
Member flow::util::Lock_guard_noop_shared_non_recursive_sh
Lock_guard_noop_shared_non_recursive_sh is deprecated, now that C++11 made the more flexible Shared_lock_guard<Mutex_noop_shared_non_recursive> possible. Remove it and all (outside) uses eventually.
Member flow::util::Lock_guard_recursive
Lock_guard_recursive is deprecated, now that C++11 made the more flexible Lock_guard<Mutex_recursive> possible. Remove it and all (outside) uses eventually.
Member flow::util::Lock_guard_shared_non_recursive_ex
Lock_guard_shared_non_recursive_ex is deprecated, now that C++11 made the more flexible Lock_guard<Mutex_shared_non_recursive> possible. Remove it and all (outside) uses eventually.
Member flow::util::Lock_guard_shared_non_recursive_sh
Lock_guard_shared_non_recursive_sh is deprecated, now that C++11 made the more flexible Shared_lock_guard<Mutex_shared_non_recursive> possible. Remove it and all (outside) uses eventually.
Member flow::util::Mutex_shared_non_recursive
Consider changing util::Mutex_shared_non_recursive from boost::shared_mutex to std::shared_mutex, as the former has a long-standing unresolved ticket about its impl being slow and/or outdated (as of Boost-1.80). However see also the note on util::Mutex_shared_non_recursive that explains why it might be best to avoid this type of mutex altogether in the first place (in most cases).
Class flow::util::Rnd_gen_uniform_range< range_t >
Rnd_gen_uniform_range and Rnd_gen_uniform_range_mt, on reflection, are very similar to boost::random::random_number_generator, and should be written to conform to it, so as to be usable in std::random_shuffle(), and also to gain its appealing extra features without losing any of the already-available simplicity. Note, however, that std::random_shuffle() is deprecated (and gone in C++17) in favor of std::shuffle(), which is available from C++11 on, and which works directly with a formal URNG such as Rnd_gen_uniform_range_base::Random_generator. So this to-do is less about that and more about gaining those other features while being suitable for random_shuffle(), which some code does still use out there.
Member flow::util::schedule_task_from_now (log::Logger *logger_ptr, const Fine_duration &from_now, bool single_threaded, Task_engine *task_engine, Scheduled_task_handler &&task_body_moved)

We could eliminate schedule_task_from_now() potential limitation versus Timer wherein each call constructs (internally) a new Timer. A pool of Timers can be internally maintained to implement this. This may or may not be worth the complexity, but if the API can remain identically simple while cleanly eliminating the one perf-related reason to choose Timer over this simpler facility, then that is a clean win from the API user's point of view. By comparison, other possible improvements mentioned complicate the API which makes them less attractive.

schedule_task_from_now() and the surrounding API provide an easy way to schedule a thing into the future, but they are built directly on top of the boost.asio util::Timer; an intermediate wrapper class around the latter would be quite useful in its own right, so that all boost.asio features including its perf-friendliness would be retained, while its annoyances (around canceling-but-not-really and similar) would be eliminated. Then schedule_task_from_now() would be internally even simpler, while a non-annoying util::Timer would become available for more advanced use cases. echan may have such a class (in a different project) ready to adapt (called Serial_task_timer). I believe it internally uses an integer "task ID" to distinguish between scheduled tasks issued in some chronological order, so that boost.asio firing a task after it has been canceled/pre-fired can be easily detected.

Member flow::util::setup_auto_cleanup (const Cleanup_func &func)
setup_auto_cleanup() should take a function via move semantics.
Class flow::util::Shared_ptr_alias_holder< Target_ptr, Const_target_ptr >

flow::util::Shared_ptr_alias_holder Const_target_ptr is deprecated and shall always be left at its default value in future code; eventually remove it entirely and hard-code the default value internally.

Add example usage snippets in Shared_ptr_alias_holder doc header, illustrating the pattern.

Class flow::util::String_ostream
Consider using alt_sstream in boost.format; it doesn't seem to have public documentation but isn't in a detail directory either. So that's interesting. It might have better performance than the implementation here (by being more "direct" probably). Then again it might not.
Member flow::util::Timer

Test macOS timer fidelity.

Upgrade to newer Boost and keep testing timer resolutions on the above and other OS versions. Update: As of this writing we are on Boost 1.74 now. Needless to say, it is time to reassess the above, as it has been years since 1.50 – and we had been on 1.63 and 1.66 for long period of time, let alone 1.74. Update: No changes with Boost 1.75, the newest version as of this writing.

Member flow::util::to_mbit_per_sec (N_items items_per_time, size_t bits_per_item=8)
boost.units "feels" like it would do this for us in some amazingly pithy and just-as-fast way. Because Boost.
Member FLOW_LOG_DO_LOG (ARG_logger_ptr, ARG_component, ARG_sev, ARG_file_view, ARG_line, ARG_func_view, ARG_time_stamp, ARG_call_thread_nickname_str_moved, ARG_call_thread_id, ARG_stream_fragment)

An alternative copy-free implementation of the asynchronous FLOW_LOG_DO_LOG() work-flow is possible. The basic idea of using a thread-local non-reallocating work string is just fine in the non-logs_asynchronously() (i.e., synchronous) Logger flow. In the asynchronous flow, however, it involves an added message copy. Instead – as done in certain major server software the author is familiar with – one could (perhaps in the async flow only) allocate a new string in each FLOW_LOG_DO_LOG() (in rare cases reallocating, even repeatedly, if more space is needed); then pass that pointer around, until it is asynchronously written out by the Logger impl; then deallocate it. Thus, a copy is eliminated in the async workflow. A complicating factor is that the current system maintains a format state across separate log call sites in a given thread; this change would (if naively implemented at least) eliminate that feature – but this could be considered acceptable. (Still, do realize that, for example, in today's feature set one can set the chrono I/O formatting to show short unit names like 2ms instead of 2 milliseconds; and one need do it only once; but with this change one would need to do it in every log call site. That would be, I can attest, rather annoying. Additionally, make sure the behavior is consistent in the sync and async work-flows.) A related wrinkle is that if we add special support for printf()-style log call sites (a to-do in flow::log doc header as of this writing), then in that case since there is no expectation of such format statefulness in the first place, in that flow that particular concern isn't a concern. (Sub-to-do: If one did this, an extra-optimized further idea is to avoid the repeated allocs/deallocs by maintaining a pool of already-allocated buffers to be reused opportunistically.) Bottom line: I claim the existing thing is pretty good; the extra copy is IMO unlikely to affect real performance, because (1) it's only one copy in the context of quite a bit of similar copying and other ops going on while writing out the string; and (2) if the message is so verbose as to affect performance at all, then it will probably affect it regardless of the extra copy (in other words, its verbosity must be reduced, or the filter verbosity must be decreased – avoiding this extra internal copy feels in my [ygoldfel] personal opinion like rearranging deck chairs on the Titanic). So, this to-do should probably be to-done at some point, but it doesn't feel urgent. And since there are quite a few subtleties involved, as shown above, it's natural to resist doing it until truly necessary.

Time stamp source: The current implementation uses the system clock to generate time stamps (a/k/a POSIX time), but consider optionally or mandatorily using the high-resolution clock instead (or additionally?). This has pros and cons; all or most time stamps elsewhere use the system clock also, so this allows for easy cross-referencing against other systems and logs. There's also the question of how to express an absolute time, as usually the high-resolution clock starts at system startup – not as humanly useful as a "calendar" (UTC) time, which – while useful humanly – is not steady/monotonic/etc. There is no reasonable conversion between Fine_clock::now() and a calendar time (boost.chrono docs say this unequivocally which is a confirmation). The pros include: (1) higher precision and resolution; (2) that time always flows forward and at a uniform rate without possibility of time going back or forward due to human/otherwise clock sets or rare events like daylight savings and leap seconds; or (3) to summarize, something more suited as a quick-and-dirty way to measure how long things take in the program, like an always-on, log-integrated version of perf::Checkpointing_timer. As of this writing all out-of-the-box log::Logger implementations and log::Config allow the output of human-readable as well as sec.usec-from-Epoch time stamps. One approach might be to replace the latter only with the high-rez clock's time stamps, perhaps optionally, while leaving the human-readable one alone. Note: There is an important test to perform here, which is the time cost of obtaining either time stamp. E.g., the high-rez time stamp might be much slower – or maybe the opposite! To test this, (1) add the POSIX-time clock into the perf::Clock_type enum, with all associated (fairly straightforward) changes in flow::perf; and (2) test the perf characteristics of this new clock. Certain code exists outside of Flow itself that already automatically tests all Clock_types, so it would quickly give the answer. (Secondary to-do: Be less vague about where this program resides, in this comment. I, ygoldfel, have the answer at any rate and am only omitting it here for boring process reasons.)

Time stamp subtlety: It might not be crazy to log not just the time stamp of entry to this macro but also some indication how long it took to evaluate the rest and finally output it to the ultimate device/whatever. It could be an optional differential or something, like "timestamp+diff," with diff in microseconds or something.

Member FLOW_LOG_WARNING (ARG_stream_fragment)
We can avoid using macros for this and similar APIs by requiring the user to use commas instead of the usual << and by implementing this as a variadic function template (C++11 feature).
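
A sketch of that idea – with the caveat noted in the trailing comment, which is probably why the macro exists in the first place (Logger usage schematic; the expansion trick shown is C++11-compatible):

  #include <sstream>
  #include <string>
  #include <utility>

  template<typename... Args>
  std::string build_msg(Args&&... args)
  {
    std::ostringstream os;
    // Left-to-right insertion of each comma-separated argument, sans fold expressions:
    int dummy[] = {0, ((os << std::forward<Args>(args)), 0)...};
    (void)dummy;
    return os.str();
  }

  // Usage idea: logger->log(Sev::S_WARNING, build_msg("Bad size: ", n_bytes, '.'));
  // Caveat: unlike the macro, a function evaluates its arguments even when the message
  // would be filtered out by verbosity; the macro's should_log() check avoids that cost.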
Member FLOW_LOG_WITH_CHECKING (ARG_sev, ARG_stream_fragment)
In FLOW_LOG_WITH_CHECKING(), save get_logger() and get_log_component() return values in such a way as to be reused by the FLOW_LOG_WITHOUT_CHECKING() invoked by the former macro if should_log() == true. As it stands, they are called again inside the latter macro. In the most-common case, wherein Log_context is used for those two expressions, this should get inline-optimized to be maximally fast anyway. With FLOW_LOG_SET_*(), though, it might be a bit slower than that. Technically, one can make their own get_logger and get_log_component identifiers that might do something slower still – though it's unlikely (and as of this writing unprecedented). So I would not call this pressing, but on the other hand... just do it! The implementation code will be somewhat hairier though.
Member FLOW_UTIL_WHERE_AM_I_STR ()
See if FLOW_UTIL_WHERE_AM_I_STR() can be coaxed into a compile-time expression after all. It is used in quite a lot of frequently executed code, namely at the top of most Flow (and Flow-like) APIs that can emit errors. See notes inside flow::util::get_where_am_i_str() on this as of this writing.