internals.rst revision eea99896
1############
2Library Guts
3############
4
5.. highlight:: c
6
7Introduction
8************
9
10
11lsquic inception dates back to the fall of 2016. Since that time, lsquic
12underwent several major changes. Some of those had to do with making the
13library more performant; others were needed to add important new
14functionality (for example, IETF QUIC and HTTP/3). Throughout this time,
15one of the main principles we embraced is that **performance trumps
16everything else**, including code readability and maintainability. This
17focus drove code design decisions again and again and it explains some
18of the hairiness that we will come across in this document.
19
20Code Version
21============
22
23The code version under discussion is v2.29.6.
24
25High-Level Structure
26********************
27
28At a high level, the lsquic library can be used to instantiate an engine
29(or several engines). An engine manages connections; each connection has
30streams. Engine, connection, and stream objects are exposed to the user
31who interacts with them using the API (see :doc:`apiref`). All other data
32structures are internal and are hanging off, in one way or another, from
33the engine, connection, or stream objects.
34
35Coding Style
36************
37
38Spacing and Cuddling
39====================
40
41lsquic follows the LiteSpeed spacing and cuddling conventions:
42
43-  Two empty lines between function definitions
44
45-  Four-space indentation
46
47-  Ifs and elses are not cuddled
48
49Function Name Alignment
50=======================
51
52In function definitions, the name is always left-aligned, for example:
53
54::
55
56    static void
57    check_flush_threshold (lsquic_stream_t *stream)
58
59
60Naming Conventions
61==================
62
63-  Struct members usually have prefixes derived from the struct name.
64   For example members of ``struct qpack_dec_hdl`` begin with ``qdh_``,
65   members of ``struct cid_update_batch`` begin with ``cub_``, and so on.
66   This is done to reduce the need to memorize struct member names as
67   vim's autocomplete (Ctrl-P) functionality makes it easy to fill in
68   the needed identifier.
69
70-  Non-static functions all begin with ``lsquic_``.
71
72-  Functions usually begin with a module name (e.g. ``lsquic_engine_`` or
73   ``stream_``) which is then followed by a verb (e.g.
74   ``lsquic_engine_connect`` or ``stream_activate_hq_frame``). If a
75   function does not begin with a module name, it begins with a verb
76   (e.g. ``check_flush_threshold`` or ``maybe_remove_from_write_q``).
77
78-  Underscores are used to separate words (as opposed to, for example,
79   theCamelCase).
80
81Typedefs
82========
83
84Outside of user-facing API, structs, unions, and enums are not
85typedefed. On the other hand, some integral types are typedefed.
86
87List of Common Terms
88********************
89
90-  **gQUIC** Google QUIC. The original lsquic supported only Google
91   QUIC. gQUIC is going to become obsolete. (Hopefully soon).
92
93-  **HQ** This stands for "HTTP-over-QUIC", the original name of HTTP/3.
94   The code predates the official renaming to HTTP/3 and thus there
95   are many types and names with some variation of ``HQ`` in them.
96
97-  **iQUIC** This stands for IETF QUIC. To differentiate between gQUIC
98   and IETF QUIC, we use ``iquic`` in some names and types.
99
100Engine
101******
102
103*Files: lsquic_engine.c, lsquic_engine_public.h, lsquic.h*
104
105Data Structures
106===============
107
108out_batch
109---------
110
111::
112
113    /* The batch of outgoing packets grows and shrinks dynamically */
114    /* Batch sizes do not have to be powers of two */
115    #define MAX_OUT_BATCH_SIZE 1024
116    #define MIN_OUT_BATCH_SIZE 4
117    #define INITIAL_OUT_BATCH_SIZE 32
118
119    struct out_batch
120    {
121        lsquic_conn_t           *conns  [MAX_OUT_BATCH_SIZE];
122        struct lsquic_out_spec   outs   [MAX_OUT_BATCH_SIZE];
123        unsigned                 pack_off[MAX_OUT_BATCH_SIZE];
124        lsquic_packet_out_t     *packets[MAX_OUT_BATCH_SIZE * 2];
125        struct iovec             iov    [MAX_OUT_BATCH_SIZE * 2];
126    };
127
128The array of struct lsquic_out_specs -- outs above -- is what gets
129passed to the user callback ``ea_packets_out()``. ``conns`` array corresponds
130to the spec elements one to one.
131
132``pack_off`` records which packet in ``packets`` corresponds to which
133connection in ``conns``. Because of coalescing, an element in ``outs`` can
134correspond (logically) to more than one packet. (See how the batch is
135constructed in `Batching packets`_.) On the
136other hand, ``packets`` and ``iov`` arrays have one-to-one correspondence.
137
138There is one instance of this structure per engine: the whole thing is
139allocated as part of `struct lsquic_engine <#lsquic-engine>`__.
140
141
142cid_update_batch
143----------------
144
145::
146
147    struct cid_update_batch
148    {
149        lsquic_cids_update_f    cub_update_cids;
150        void                   *cub_update_ctx;
151        unsigned                cub_count;
152        lsquic_cid_t            cub_cids[20];
153        void                   *cub_peer_ctxs[20];
154    };
155
156This struct is used to batch CID updates.
157
158There are three user-defined CID liveness callbacks: ``ea_new_scids``,
159``ea_live_scids``, and ``ea_old_scids``. These functions all have the same
160signature, ``lsquic_cids_update_f``. When the batch reaches the count of
16120 (kept in ``cub_count``), the callback is called.
162
163The new SCIDs batch is kept in `struct
164lsquic_engine <#lsquic-engine>`__. Other batches are allocated on the
165stack in different functions as necessary.
166
16720 is an arbitrary number.
168
169lsquic_engine_public
170--------------------
171
172This struct, defined in lsquic_engine_public.h, is the "public"
173interface to the engine. ("Public" here means accessible by other
174modules inside lsquic, not that it's a public interface like the
175:doc:`apiref`.) Because there are many things in the engine object that
176are accessed by other modules, this struct is used to expose those
177(``public``) parts of the engine.
178
179``lsquic_engine_struct`` is the first member of
180`lsquic_engine <#lsquic-engine>`__. The functions declared in
181lsquic_engine_public.h take a pointer to lsquic_engine_public as the
182first argument, which is then case to lsquic_engine.
183
184This is somewhat ugly, but it's not too bad, as long as one remembers
185that the two pointers are interchangeable.
186
187lsquic_engine
188-------------
189
190This is the central data structure. The engine instance is the root of
191all other data structures. It contains:
192
193-  Pointers to connections in several lists and hashes (see `Connection Management <#connection-management>`__)
194
195-  Memory manager
196
197-  Engine settings
198
199-  Token generator
200
201-  CID Purgatory
202
203-  Server certificate cache
204
205-  Transport parameter cache
206
207-  Packet request queue
208
209-  `Outgoing packet batch <#out-batch>`__
210
211-  And several other things
212
213Some of the members above are stored in the ``pub`` member of type
214`lsquic_engine_public <#lsquic-engine-public>`__. These are accessed
215directly from other parts of lsquic.
216
217The engine is instantiated via ``lsquic_engine_new()`` and destroyed via
218``lsquic_engine_destroy()``
219
220Connection Management
221=====================
222
223Lifetime
224--------
225
226There are several `connection types`_. All types of
227connections begin their life inside the engine module, where their
228constructors are called. They all also end their life here as well: this
229is where the destructors are called.
230
231The connection constructors are all different function calls:
232
233-  lsquic_ietf_full_conn_client_new
234
235-  lsquic_gquic_full_conn_client_new
236
237-  lsquic_ietf_full_conn_server_new
238
239-  lsquic_gquic_full_conn_server_new
240
241-  lsquic_mini_conn_ietf_new
242
243-  lsquic_mini_conn_new
244
245-  lsquic_prq_new_req
246
247-  lsquic_prq_new_req_ext
248
249(See `Evanescent Connection`_ for information about the last two.)
250
251After a connection is instantiated, all further interactions with it,
252including destruction, are done via the `Common Connection Interface`_.
253
254Refcounting Model
255-----------------
256
257Each connection is referenced by at least one of the following data
258structures:
259
2601. CID-to-connection hash. This hash is used to find connections in
261   order to dispatch an incoming packet. Connections can be hashed by
262   CIDs or by address. In the former case, each connection has one or
263   more mappings in the hash table. IETF QUIC connections have up to
264   eight (in our implementation) source CIDs and each of those would
265   have a mapping. In client mode, depending on QUIC versions and
266   options selected, it is may be necessary to hash connections by
267   address, in which case incoming packets are delivered to
268   connections based on the address.
269
2702. Outgoing queue. This queue holds connections that have packets to
271   send.
272
2733. `Tickable Queue`_. This queue holds connections
274   that `can be ticked now <#tickability>`__.
275
2764. `Advisory Tick Time Queue`_.
277
2785. Closing connections queue. This is a transient queue -- it only
279   exists for the duration of
280   `process_connections() <#processing-connections>`__ function call.
281
2826. Ticked connections queue. Another transient queue, similar to the
283   above.
284
285The idea is to destroy the connection when it is no longer referenced.
286For example, a connection tick may return TICK_SEND|TICK_CLOSE. In that
287case, the connection is referenced from two places: (2) and (5). After
288its packets are sent, it is only referenced in (5), and at the end of
289the function call, when it is removed from (5), reference count goes to
290zero and the connection is destroyed. (See function ``destroy_conn``.) If
291not all packets can be sent, at the end of the function call, the
292connection is referenced by (2) and will only be removed once all
293outgoing packets have been sent.
294
295.. image:: lsquic-engine-conns.png
296
297In the diagram above, you can see that the CID-to-connection hash has
298several links to the same connection. This is because an IETF QUIC
299connection has more than one Source Connection IDs (SCIDs), any of which
300can be included by the peer into the packet. See ``insert_conn_into_hash``
301for more details.
302
303References from each of these data structures are tracked inside the
304connection object by bit flags:
305
306::
307
308    #define CONN_REF_FLAGS  (LSCONN_HASHED          \
309                            |LSCONN_HAS_OUTGOING    \
310                            |LSCONN_TICKABLE        \
311                            |LSCONN_TICKED          \
312                            |LSCONN_CLOSING         \
313                            |LSCONN_ATTQ)
314
315Functions ``engine_incref_conn`` and ``engine_decref_conn`` manage setting
316and unsetting of these flags.
317
318
319Notable Code
320============
321
322Handling incoming packets
323-------------------------
324
325Incoming UDP datagrams are handed off to the lsquic library using the
326function ``lsquic_engine_packet_in``. Depending on the engine mode --
327client or server -- the appropriate `packet
328parsing <#parsing-packets>`__ function is selected.
329
330Because a UDP datagram can contain more than one QUIC packet, the
331parsing is done in a loop. If the first part of packet parsing is
332successful, the internal ``process_packet_in`` function is called.
333
334There, most complexity is contained in ``find_or_create_conn``, which gets
335called for the server side. Here, parsing of the packet is finished, now
336via the version-specific call to ``pf_parse_packet_in_finish``. If
337connection is not found, it may need to be created. Before that, the
338following steps are performed:
339
340-  Check that engine is not in the cooldown mode
341
342-  Check that the maximum number of mini connections is not exceeded
343
344-  Check that the (D)CID specified in the packet is not in the `CID Purgatory`_
345
346-  Check that the packet can be used to create a mini conn: it contains
347   version information and the version is supported
348
349-  Depending on QUIC version, perform token verification, if necessary
350
351Only then does the mini connection constructor is called and the
352connection is inserted into appropriate structures.
353
354Processing connections
355----------------------
356
357Connections are processed in the internal function
358``process_connections``. There is the main connection processing loop and
359logic.
360
361All connections that the iterator passed to this function returns are
362processed in the first while loop. The ``ci_tick()`` call is what causes
363the underlying connection to do all it needs to (most importantly,
364dispatch user events and generate outgoing packets). The return value
365dictates what lists -- global and local to the function -- the
366connection will be placed upon.
367
368Note that mini connection promotion happens inside this loop. Newly
369created full connections are processed inside the same while loop. For a
370short time, a mini and a full connection object exist that are
371associated with the same logical connection.
372
373After all connections are ticked, outgoing packets, if there are any,
374`are sent out <#batching-packets>`__.
375
376Then, connections that were closed by the first while loop above are
377finally closed.
378
379Connections that were ticked (and not closed) are either:
380
381-  Put back onto the ``tickable`` queue;
382
383-  Added to the `Advisory Tick Time Queue`_; or
384
385-  Left unqueued. This can happen when both idle and ping timer are
386   turned off. (This should not happen for the connections that we
387   expect to process, though.)
388
389And lastly, CID liveness updates are reported to the user via the
390optional SCIDs callbacks: ``ea_new_scids`` etc.
391
392Tickable Queue Cycle
393--------------------
394
395When a connection is ticked, it is removed from the `Tickable
396Queue <#tickable-queue>`__ and placed onto the transient Ticked Queue.
397After outgoing packets are sent and some connections are closed, the
398Ticked Queue is examined: the engine queries each remaining connection
399again whether it's tickable. If it is, back onto the Tickable Queue it
400goes. This should not happen often, however. It may occur when RTT is
401low and there are many connections to process. In that case, once all
402connections have been processed, the pacer now allows to send another
403packet because some time has passed.
404
405Batching packets
406----------------
407
408Packet-sending entry point is the function ``send_packets_out``. The main
409idea here is as follows:
410
411Iterate over connections that have packets to send (those are on the
412Outgoing queue in the engine). For each connection, ask it for the next
413outgoing packet, encrypt it, and place it into the batch. When the batch
414is full, `send the batch <#sending-a-batch>`__.
415
416The outgoing packets from all connections are interleaved. For example,
417if connections A, B, and C are on the Outgoing queue, the batch will
418contain packets A1, B1, C1, A2, B2, C2, A3, B3, C3, ��� and so on. This is
419done to ensure fairness. When a connection runs out of packets to send,
420it returns NULL and is removed from the iterator.
421
422The idea is simple, but the devil is in the details. The code may be
423difficult to read. There are several things going on:
424
425Conns Out Iterator
426^^^^^^^^^^^^^^^^^^
427
428This iterator, ``conns_out_iter``, sends packets from connections on the
429Outgoing queue and packets on the Packet Request queue. (The latter
430masquerade as `Evanescent Connections <#evanescent-connection>`__ so that they
431are simple to use.) First, the Outgoing queue (which is a min-heap) is
432drained. Then, packets from the Packet Request queue are sent, if there
433are any. Then, remaining connections from the first pass are returned in
434the round-robin fashion.
435
436After sending is completed, the connections that still have outgoing
437packets to send are placed back onto the Outgoing queue.
438
439
440Packet Coalescing
441^^^^^^^^^^^^^^^^^
442
443Some IETF QUIC packets can be coalesced. This reduces the number of UDP
444datagrams that need to be sent during the handshake. To support this, if
445a packet matches some parameters, the same connection is queried for
446another packet, which, if it returns, is added to the current batch
447slot's iov.
448
449::
450
451    if ((conn->cn_flags & LSCONN_IETF)
452        && ((1 << packet_out->po_header_type)
453          & ((1 << HETY_INITIAL)|(1 << HETY_HANDSHAKE)|(1 << HETY_0RTT)))
454        && (engine->flags & ENG_COALESCE)
455        && iov < batch->iov + sizeof(batch->iov) / sizeof(batch->iov[0]))
456    {
457        const struct to_coal to_coal = {
458            .prev_packet = packet_out,
459            .prev_sz_sum = iov_size(packet_iov, iov),
460        };
461        packet_out = conn->cn_if->ci_next_packet_to_send(conn, &to_coal);
462        if (packet_out)
463            goto next_coa;
464    }
465    batch->outs   [n].iovlen = iov - packet_iov;
466
467*With some debug code removed for simplicity*
468
469Also see the description of the batch in `out_batch`_.
470
471Note that packet coalescing is only done during the handshake of an IETF
472QUIC connection. Non-handshake and gQUIC packets cannot be coalesced.
473
474Sending and Refilling the Batch
475^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
476
477When the batch is sent inside the while loop, and the whole batch was
478sent successfully, the batch pointers are reset, the batch potentially
479grows larger, and the while loop continues.
480
481Batch Resizing
482^^^^^^^^^^^^^^
483
484When all datagrams in the batch are sent successfully, the batch may
485grow -- up to the hardcoded maximum value of ``MAX_OUT_BATCH_SIZE``. When
486not all datagrams are sent, the batch shrinks. The batch size survives
487the call into the library: when packets are sent again, the same batch
488size is used to begin the sending.
489
490Deadline Checking
491^^^^^^^^^^^^^^^^^
492
493This is a rather old safety check dating back to the summer of 2017,
494when we first shipped QUIC support. The way we send packets has changed
495since then -- there is high possibility that this code can be removed
496with no ill effect.
497
498Sending a batch
499---------------
500
501When the batch is filled, it is handed off to the function ``send_batch``,
502which calls the user-supplied callback to send packets out. The
503high-level logic is as follows:
504
505-  Update each packet's ``sent`` time
506
507-  Call the "send packets out" callback
508
509-  For packets that were sent successfully, call ``ci_packet_sent``
510
511-  For packets that were not sent, call ``ci_packet_not_sent``. This is
512   important: all packets returned by ``ci_next_packet_to_send`` must
513   be returned to the connection via either these two calls above or
514   via ``ci_packet_too_large`` (see below).
515
516-  Return the number of packets sent
517
518Because of support for coalescing, we have to map from outgoing spec to
519packets via ``batch->pack_off``. This is done in several places in this
520function.
521
522To handle the case when a PMTU probe is too large (stuff happens!), the
523code checks for EMSGSIZE and returns the packet back to the connection
524via ``ci_packet_too_large``. Because this error is of our own making, this
525does not count as inability to send. The too-large packet is skipped and
526sending of the datagrams in the batch continues.
527
528Growing min-heaps
529-----------------
530
531The Outgoing and Tickable connection queues are actually min-heaps. The
532number of elements in these min-heaps never exceeds the number of
533connections. As optimization, allocation of the underlying arrays is
534done not in the min-heap module itself but in the engine module in the
535function ``maybe_grow_conn_heaps``. The engine knows how many connections
536there are and it grows the arrays as necessary.
537
538As an additional optimization, the two arrays use a single memory region
539which is allocated once.
540
541The min-heap arrays are never shrunk.
542
543Connection
544**********
545
546*Files: lsquic_conn.h, lsquic_conn.c -- others are covered in dedicated
547chapters*
548
549The connection represents the QUIC connection. Connections are `managed
550by the engine <#connection-management>`__. A connection, in turn,
551manages `streams <#stream>`__.
552
553Connection Types
554================
555
556lsquic supports two different QUIC protocols: Google QUIC and IETF QUIC.
557Each of these has a separate implementation, which includes connection
558logic, parsing/generating mechanism, and encryption.
559
560Each of the QUIC connection types on the server begin their life as a
561``mini`` connection. This connection type is used while handshake is
562proceeding. Once the handshake has completed, the mini connection is
563``promoted`` to a ``full`` connection. (See `Mini vs Full
564Connection <#mini-vs-full-connections>`__ for more.)
565
566In addition to the above, an "evanescent" connection type is used to
567manage replies to incoming packets that do not result in connection
568creation. These include version negotiation, stateless retry, and
569stateless reset packets.
570
571Each of the five connection types above are covered in their own
572dedicated chapters elsewhere in this document:
573
574-  `Mini gQUIC Connection <#mini-gquic-connection>`__
575
576-  `Full gQUIC Connection <#connection-public-interface>`__
577
578-  `Mini IETF QUIC Connection <#mini-ietf-connection>`__
579
580-  `Full IETF QUIC Connection <#mini-ietf-connection>`__
581
582-  `Evanescent Connection <#evanescent-connection>`__
583
584lsquic_conn
585===========
586
587All connection types expose the same connection interface via a pointer
588to ``struct lsquic_conn``. (This is the same type pointer to which is
589exposed to the user, but the user can only treat the connection as an
590opaque pointer.)
591
592This structure contains the following elements:
593
594Pointers to Crypto Implementation
595---------------------------------
596
597The crypto session pointer, ``cn_enc_session``, points to a type-specific
598(gQUIC or iQUIC) instance of the encryption session. This session
599survives `connection promotion <#connection-promotion>`__.
600
601The two types of crypto session have a set of common functionality; it
602is pointed to by ``cn_esf_c`` (where ``c`` stands for ``common``). Each of
603them also has its own, type-specific functionality, which is pointed to
604by ``cn_esf.g`` and ``cn_esf.i``
605
606Pointer to Common Connection Interface
607--------------------------------------
608
609``cn_if`` points to the set of functions that implement the Common
610Connection Interface (`see below <#common-connection-interface>`__).
611
612Pointer to Parsing Interface
613----------------------------
614
615The parsing interface is version-specific. It is pointed to by ``cn_pf``.
616
617Various list and heap connectors
618--------------------------------
619
620A connection may be pointed to by one or several queues and heaps (see
621"\ `Connection Management <#connection-management>`__\ "). There are
622several struct members that make it possible: all the \*TAILQ_ENTRYs,
623``cn_attq_elem``, and ``cn_cert_susp_head``.
624
625Version
626-------
627
628``cn_version`` is used to make some decisions in several parts of the
629code.
630
631Flags
632-----
633
634The flags in ``cn_flags`` specify which lists the connection is on and
635some other properties of the connection which need to be accessible by
636other modules.
637
638Stats
639-----
640
641``cn_last_sent`` and ``cn_last_ticked`` are used to determine the
642connection's place on the outgoing queue (see `Batching
643Packets <#batching-packets>`__) and on the `Advisory Tick Time
644Queue <#alarm-set>`__.
645
646List of SCIDs
647-------------
648
649IETF QUIC connections have one or more SCIDs (Source Connection IDs),
650any one of which can be used by the peer as the DCID (Destination CID)
651in the packets it sends. Each of the SCIDs is used to hash the
652connection so it can be found. ``cn_cces`` points to an array of size
653``cn_n_cces`` which is allocated internally inside each connection type.
654
655Google QUIC connections use only one CID (same for source and
656destination). In order not to modify old code, the macro ``cn_cid`` is
657used.
658
659Common Connection Interface
660===========================
661
662The struct ``conn_iface`` defines the common connection interface. All
663connection types implement all or some of these functions.
664
665Some of these functions are used by the engine; others by other modules
666(for example, to abort a connection); yet others are for use by the
667user, e.g. ``lsquic_conn_close`` and others in lsquic.h. In that case,
668these calls are wrapped in lsquic_conn.c.
669
670Tickability
671===========
672
673A connection is processed when it is tickable. More precisely, the
674connection is placed onto the `Tickable Queue <#tickable-queue>`__,
675which is iterated over when `connections are
676processed <#processing-connections>`__. A connection reports its own
677tickability via the ``ci_is_tickable`` method.
678
679In general, a connection is tickable if it has productive user callbacks
680to dispatch (that is, user wants to read and there is data to read or
681user wants to write and writing is possible), if there are packets to
682send or generate, or if its advisory tick time is in the past. (The
683latter is handled in ``lsquic_engine_process_conns()`` when expired
684connections from the `Advisory Tick Time Queue`_ are added
685to the Tickable Queue.)
686
687Stream
688******
689
690*Files: lsquic_stream.h, lsquic_stream.c*
691
692Overview
693========
694
695The lsquic stream is the conduit for data. This object is accessible by
696the user via any of the ``lsquic_stream_*`` functions declared in
697lsquic.h. The stream is bidirectional; in our user code, it represents
698the HTTP request and response. The client writes its request to the
699stream and the server reads the request in its corresponding instance of
700the stream. The server sends its response using the same stream, which
701the client reads from the stream.
702
703Besides streams exposed to the application, connections use streams
704internally:
705
706-  gQUIC has the HANDSHAKE and HEADERS streams
707
708-  IETF QUIC has up to four HANDSHAKE streams
709
710-  HTTP/3 has at least three unidirectional streams:
711
712   -  Settings stream
713
714   -  QPACK encoder stream
715
716   -  QPACK decoder stream
717
718In addition, HTTP/3 push promises use unidirectional streams. In the
719code, we make a unidirectional stream simply by closing one end in the
720constructor.
721
722All of the use cases above are handled by the single module,
723lsquic_stream. The differences in behavior -- gQUIC vs IETF QUIC, HTTP
724vs non-HTTP -- are handled either by explicit conditionals or via
725function pointers.
726
727The streams hang off full connections via stream ID-to-stream hashes and
728in various queues. This is similar to the way the connections hang off
729the engine.
730
731Streams are only used in the full connections; mini connections use
732their own, minimalistic, code to handle streams.
733
734.. _data-structures-1:
735
736Data Structures
737===============
738
739stream_hq_frame
740---------------
741
742This structure is used to keep information about an HTTP/3 frame that is
743being, or is about to be, written. In our implementation, frame headers
744can be two or three bytes long: one byte is HTTP/3 frame type and the
745frame length is encoded in 1 or 2 bytes, giving us the maximum payload
746size of 2\ :sup:`14` - 1 bytes. You will find literal ``2`` or ``3`` values
747in code that deals with writing HQ frames.
748
749If the HQ frame's size is known in advance (SHF_FIXED_SIZE) -- which is
750the case for HEADERS and PUSH_PROMISE frames -- then the HQ header
751contents are written immediately. Otherwise, ``shf_frame_ptr`` points to
752the bytes in the packet where the HQ header was written, to be filled in
753later.
754
755See `Writing HTTP/3 Streams`_ for more information.
756
757hq_filter
758---------
759
760This structure is used to read HTTP/3 streams. A single instance of it
761is stored in the stream in ``sm_hq_filter``. The framing is removed
762transparently (see `Reading HTTP/3 Streams`_).
763
764Frame type and length are read into ``hqfi_vint2_state``. Due to greasing,
765the reader must be able to support arbitrary frame types and so the code
766is pretty generic: varints of any size are supported.
767
768``hqfi_flags`` and ``hqfi_state`` contain information needed to resume
769parsing the frame header, as only partial data may have arrived.
770
771``hqfi_hist_buf`` and ``hqfi_hist_idx`` are used to record the last few
772incoming headers. This information is used to check for validity, as
773some sequences of HTTP/3 frames are invalid.
774
775stream_filter_if
776----------------
777
778This struct is used to specify functionality required to strip arbitrary
779framing when reading from the stream. At the moment (and for the
780foreseeable future) only one mechanism is used: that to strip the HTTP/3
781framing. At the time the code was written, however, the idea was to
782future-proof it in case we needed to support more than one framing format
783at a time.
784
785lsquic_stream
786-------------
787
788This struct is the stream object. It contains many members that deal
789with
790
791-  Reading data
792
793-  Writing data
794
795-  Maintaining stream list memberships
796
797-  Enforcing flow control
798
799-  Dispatching read and write events
800
801-  Calling various user callbacks
802
803-  Interacting with HEADERS streams
804
805The stream has an ID (``id``). It is used to hash the stream.
806
807A stream can be on one or more lists: see ``next_send_stream``,
808``next_read_stream``, and so on.
809
810Incoming data is stored in ``data_in``. Outgoing data is packetized
811immediately or buffered in ``sm_buf``.
812
813HTTP/3 frames that are being actively written are on the ``sm_hq_frames``
814list.
815
816A note on naming: newer members of the stream begin with ``sm_`` for
817simplicity. Originally, the structure members lacked a prefix.
818
819progress
820--------
821
822This structure is used to determine whether the user callback has made
823any progress during an ``on_write`` or ``on_read`` event loop. If progress
824is not made for a number of calls, the callback is interrupted, breaking
825out of a suspected infinite loop. (See ``es_progress_check`` setting.)
826
827
828frame_gen_ctx
829-------------
830
831This structure holds function pointers to get user data and write it to
832packets. ``fgc_size``, ``fgc_fin``, and ``fgc_read`` are set based on framing
833requirements. This is a nice abstraction that gets passed to several
834packetization functions and allows them not to care about how or whether
835framing is performed.
836
837pwritev_ctx
838-----------
839
840Used to aid ``lsquic_stream_pwritev``. ``hq_arr`` is used to roll back
841HTTP/3 framing if necessary. (The rollback is the most complicated part
842of the ``pwritev`` functionality).
843
844Event Dispatch
845==============
846
847The "on stream read" and "on stream write" callbacks are part of the
848lsquic API. These callbacks are called when the user has registered
849interest in reading from or writing to the stream and reading or writing
850is possible.
851
852Calling ``lsquic_stream_wantwrite`` and ``lsquic_stream_wantread`` places
853the stream on the corresponding "want to write" and "want to read" list.
854These lists are processed by a connection when it's ticked. For each
855stream on the list, the internal function
856``lsquic_stream_dispatch_read_events`` or
857``lsquic_stream_dispatch_write_events``, whichever may be the case.
858
859Dispatching read events is simple. When ``es_rw_once`` is set, the "on
860stream read" callback is called once -- if the stream is readable.
861Otherwise, the callback is called in a loop as long as:
862
863-  The stream is readable;
864
865-  The user wants to read from it; and
866
867-  Progress is being made
868
869Dispatching write events is more complicated due to the following
870factors:
871
872-  In addition to calling the "on stream write" callback, the flushing
873   mechanism also works by using the "want to write" list.
874
875-  When writing occurs, the stream's position on the list may change
876
877STREAM frames in
878================
879
880The data gets in from the transport into the stream via
881``lsquic_stream_frame_in`` function. The connection calls this function
882after parsing a STREAM frame.
883
884The data from the STREAM frame is stored in one of the two "data in"
885modules: ``di_nocopy`` and ``di_hash``. The two are abstracted out behind
886``stream->data_in``.
887
888The "data in" module is used to store incoming stream data. The data is
889read from this module using the ``di_get_frame`` function. See the next
890section.
891
892Reading Data
893============
894
895There are three user-facing stream-reading functions; two of them are
896just wrappers around ``"lsquic_stream_readf``. This function performs some
897checks (we will cover HTTP mode separately) and calls
898``lsquic_stream_readf``, which also performs some checks and calls
899``read_data_frames``. This is the only function in the stream module where
900data is actually read from the "data in" module.
901
902Writing Data
903============
904
905There are four user-facing functions to write to stream, and all of them
906are wrappers around ``stream_write``. (``lsquic_stream_pwritev`` is a bit
907more involved than the other three, but it's pretty well-commented --
908and the complexity is in the rollback, not writing itself.)
909
910Small writes get buffered. If the write size plus whatever is buffered
911already exceeds the threshold -- which is the size of the largest STREAM
912frame that could be fit into a single outgoing packet -- the data is
913packetized instead by calling ``stream_write_to_packets``. See the next
914section.
915
916Packetization
917=============
918
919``stream_write_to_packets`` is the only function through which user data
920makes it into outgoing packets. There are three ways to write STREAM
921frames:
922
9231. ``stream_write_to_packet_hsk``
924
9252. ``stream_write_to_packet_std``
926
9273. ``stream_write_to_packet_crypto``
928
929The particular function is selected based on connection and stream type
930when the stream is first created.
931
932stream_write_to_packets
933-----------------------
934
935Depending on the need to frame data, a reader is selected. The job of
936the reader is to copy user data into the outgoing STREAM frame. In
937HTTP/3 mode, HTTP/3 framing is added transparently -- see `Writing
938HTTP/3 Streams`_ for more information.
939
940The while loop is entered if there is user data to be copied or if the
941end of the stream has been reached and FIN needs to be written. Note the
942threshold check: when writing data from a user call, the threshold is
943set and frames smaller than the full packet are not generated. This is
944to allow for usage like "write 8KB", "write 8KB", "write 8KB" not to
945produce jagged STREAM frames. This way, we utilize the bandwidth most
946effectively. When flushing data, the threshold is not set, so even a
9471-byte data gets packetized.
948
949The call ``stream->sm_write_to_packet`` writes data to a single packet.
950This packet is allocated by the `Send Controller <#send-controller>`__.
951(Depending on when writing is performed, the returned packet may be
952placed onto the scheduled queue immediately or it may be a "buffered"
953packet. The stream code is oblivious to that.) If the send controller
954does not give us a packet, STOP is returned and the while loop exits. An
955ERROR should never happen -- this indicates a bug or maybe failure to
956allocate memory -- and so the connection is aborted in that case. If
957everything is OK, the while loop goes on.
958
959The ``seen_ok`` check is used to place the connection on the tickable list
960on the first successfully packetized STREAM frame. This is so that if
961the packet is buffered (meaning that the writing is occurring outside of
962the callback mechanism), the connection will be processed (ticked) and
963the packets will be scheduled and sent out.
964
965After the while loop, we conditionally close an outstanding HTTP/3
966frame, save any leftover data, schedule STREAM_BLOCKED or BLOCKED frames
967to be sent out if needed, and return the number of user-provided bytes
968that were copied into the outgoing packets and into the internal stream
969buffer (leftovers).
970
971Write a single STREAM frame
972---------------------------
973
974We will examine ``stream_write_to_packet_std`` as it is the most
975complicated of these three functions.
976
977First, we flush the headers stream if necessary -- this is because we
978want the HTTP (gQUIC or HTTP/3) headers to be sent before the payload.
979
980Then, the number of bytes needed to generate a STREAM frame is
981calculated. This value depends on the QUIC version, whether we need to
982generate HTTP/3 framing, and whether the data to write exists (or we
983just need to write an empty STREAM frame with the FIN bit set).
984
985(Note that the framing check is made to overshoot the estimate for
986simplicity. For one, we might not need 3 bytes for the DATA frame, but
987only 2. Secondly, there may already be an open HTTP/3 frame in one of
988the previous packets and so we don't need to write it at all.)
989
990Then, a packet is allocated and ``write_stream_frame`` is called. It is in
991this function that we finally make the call to generate the STREAM frame
992and to copy the data from the user. The function ``pf_gen_stream_frame``
993returns the number of bytes actually written to the packet: this
994includes both the STREAM frame header and the payload (which may also
995include HTTP/3 frame).
996
997The fact that this frame type has been written is added to
998``po_frame_types`` and the STREAM frame location, type, and size are
999recorded. This information is necessary to be able to elide the frame
1000from the packet in case the stream is reset.
1001
1002``PO_STREAM_END`` is set if the STREAM frame extends to the end of the
1003packet. This is done to prevent this packet from being used again to
1004append frames to it (after, for example, some preceding frames are
1005elided from it). This is because both in gQUIC and IETF QUIC the STREAM
1006frame header is likely to omit the ``length`` field and instead use the
1007"extends to the end of the packet" field. If frames are shifted, the
1008packet cannot be appended to because it will lead to data loss and
1009corruption.
1010
1011Writing HTTP/3 Streams
1012======================
1013
1014HTTP/3 streams use framing. In most cases, a single HEADERS frame is
1015followed by zero or more DATA frames. The user code does not know this:
1016both gQUIC and IETF QUIC streams appear to behave in exactly the same
1017manner. This makes lsquic simple to use.
1018
1019The drawback is internal complexity. To make the code both easy to use
1020and performant, HTTP/3 framing is generated on-the-fly, as data is being
1021written to packets (as opposed to being buffered and then written). (OK,
1022*mostly* on-the-fly: the HEADERS frame payload is generated and then
1023copied.)
1024
1025On the high level, the way it works is as follows:
1026
1027-  When a write call is made, a variable-size (that is, unknown size;
1028   it's called variable-size because the size of the DATA header may
1029   be 2 or 3 bytes; it's not the best name in the world) frame is
1030   opened/activated.
1031
1032-  When data is written to stream, the DATA header placeholder bytes are
1033   written to the stream transparently and a pointer is saved to this
1034   location.
1035
1036-  The active frame header is closed when
1037
1038   -  It reaches its maximum size; or
1039
1040   -  The data we are writing runs out.
1041
1042-  When the header is closed, the number of bytes that follows is now
1043   written to the location we saved when the header was activated.
1044
1045This mechanism allows us to create a DATA frame that spans several
1046packets before we know how many packets there will be in a single write.
1047(As outgoing packet allocation is governed by the `Send Controller`_.)
1048This is done to minimize the goodput overhead incurred by the DATA frame header.
1049
1050.. image:: stream-http3-framing.png
1051
1052There are a couple of things that do not fit into this model:
1053
10541. The HEADERS frame is fixed size [1]_. It is generated separately
1055   (written by QPACK encoder into a buffer on the stack) and later
1056   copied into the stream. (See the ``send_headers_ietf`` function.) It
1057   can happen that the whole buffer cannot be written. In that case,
1058   a rather complicated dance of buffering the unwritten HEADERS
1059   frame bytes is performed. Here, the "on stream write" callback is
1060   replaced with an internal callback (see the ``select_on_write``
1061   function) and user interaction is prohibited until the whole of
1062   the HEADERS frame is written to the stream.
1063
10642. Push promise streams are even weirder. In addition to the HEADERS
1065   handling above, the push promise stream must begin with a
1066   variable-integer Push ID. To make this fit into the framed stream
1067   model, the code makes up the concept of a "phantom" HTTP/3 frame.
1068   This type of frame's header is not written. This allows us to
1069   treat the Push ID as the payload of a regular HTTP/3 frame.
1070
1071The framing code has had its share of bugs. Because of that, there is a
1072dedicated unit test program just for the framing code,
1073*tests/test_h3_framing.c*. In addition to manually-written tests, the
1074program has a "fuzzer driver" mode, in which the American Fuzzy Lop
1075fuzzer drives the testing of the HTTP/3 framing mechanism. The advantage
1076of this approach is that AFL tries to explore all the code paths.
1077
1078
1079Debates regarding DATA framing raged in 2018 on the QUIC mailing list.
1080Some of the discussion is quite interesting: for example, the debate about
1081"optimizing" DATA frames and `calculations of the header
1082cost <https://lists.w3.org/Archives/Public/ietf-http-wg/2018OctDec/0236.html>`__.
1083
1084Reading HTTP/3 Streams
1085======================
1086
1087HTTP/3 frame headers are stripped out transparently -- they are never
1088seen by the user. From the user's perspective, the lsquic stream
1089represents the payload of HTTP message; a dedicated call is made first
1090to get at the HTTP headers.
1091
1092To accomplish this, the stream implements a generic deframing mechanism.
1093The `stream_filter_if`_ interface allows one to
1094specify functions to a) check whether the stream is readable, b) strip
1095header bytes from a data frame fetched from "data in" module; and c)
1096update byte count in the filter once bytes have been read:
1097
1098hq_filter_readable
1099------------------
1100
1101This function tests for availability of non-frame-header data, stripping
1102frame headers from the stream transparently. Note how it calls
1103``read_data_frames`` with its own callback, ``hq_read``. It is inside this
1104callback that the HEADERS frame is fed to the QPACK decoder.
1105
1106hq_filter_df
1107------------
1108
1109This function's job is to strip framing from data frames returned by the
1110"data in" module inside the ``read_data_frames`` function. It, too, calls
1111the ``hq_read`` function. This allows the two functions that read from
1112stream (this one) and the readability-checking function
1113(``hq_filter_readable``) to share the same state. This is crucial:
1114Otherwise this approach is not likely to work well.
1115
1116hq_decr_left
1117------------
1118
1119This function is needed to update the filter state. Once all payload
1120bytes from the frame have been consumed, the filter is readied to strip
1121the next frame header again.
1122
1123.. _notable-code-1:
1124
1125Notable Code
1126============
1127
1128frame_hq_gen_read
1129-----------------
1130
1131This is where HTTP/3 frame headers are generated. Note the use of
1132``shf_frame_ptr`` to record the memory location to which the correct frame
1133size will be written by a different function.
1134
1135Parsing
1136*******
1137
1138Parsing Packets
1139===============
1140
1141Parsing Frames
1142==============
1143
1144Mini vs Full Connections
1145************************
1146
1147Mini Purpose
1148============
1149
1150The reason for having a mini connection is to conserve resources: a mini
1151connection allocates a much smaller amount of memory. This protects the
1152server from a potential DoS attack. The mini connection's job is to get
1153the handshake to succeed, after which the connection is
1154`promoted <#connection-promotion>`__.
1155
1156Mini/Full Differences
1157=====================
1158
1159Besides their size, the two connection types differ in the following
1160ways:
1161
1162Mini connections' lifespan is limited. If the handshake does not succeed
1163within 10 seconds (configurable), the mini connection is destroyed.
1164
1165A mini connection is only `tickable <#tickability>`__ if it has unsent
1166packets.
1167
1168Mini connections do not process packets that carry application (as
1169opposed to handshake) data. The 0-RTT packet processing is deferred;
1170these packets are stashed and handed over to the full connection during
1171promotion.
1172
1173Connection Promotion
1174====================
1175
1176A mini connection is promoted when the handshake succeeds. The mini
1177connection reports this via the return status of ``ci_tick`` by setting
1178the ``TICK_PROMOTE`` bit. The engine creates a new connection object and
1179calls the corresponding server constructor. The latter copies all the
1180relevant state information from mini to full connection.
1181
1182For a time, two connection objects -- one mini and one full -- exist at
1183the same state. Most of the time, the mini connection is destroyed
1184within the same function call to ``process_connections()``. If, however,
1185the mini connection has unsent packets, it will remain live until those
1186packets are sent successfully. Because the mini connection is by then
1187removed from the CID-to-connection hash (``engine->conns_hash``), it will
1188not receive any more incoming packets.
1189
1190Also see `Connection Processing <#processing-connections>`__.
1191
1192Mini gQUIC Connection
1193*********************
1194
1195*Files: lsquic_mini_conn.h, lsquic_mini_conn.c*
1196
1197.. _overview-1:
1198
1199Overview
1200========
1201
1202The original version of ``struct mini_conn`` fit into paltry 128 bytes.
1203The desire to fit into 128 bytes [2]_ led to, for example,
1204``mc_largest_recv`` -- in effect, a 3-byte integer! Since that time,
1205the mini conn has grown to over 512 bytes.
1206
1207Looking at the struct, we can see that a lot of other data structures
1208are squeezed into small fields:
1209
1210Received and sent packet history is each packed into a 64-bit integer,
1211``mc_received_packnos`` and ``mc_sent_packnos``, respectively. The HEADERS
1212stream offsets are handled by the two two-byte integers ``mc_read_off``
1213and ``mc_write_off``.
1214
1215.. _notable-code-2:
1216
1217Notable Code
1218============
1219
1220continue_handshake
1221------------------
1222
1223This function constructs a contiguous buffer with all the HANDSHAKE
1224stream chunks in order and passes it to ``esf_handle_chlo()``. This is
1225done because the gQUIC crypto module does not buffer anything: it's all
1226or nothing.
1227
1228The code has been written in a generic way, so that even
1229many small packets can be reconstructed into a CHLO. The lsquic client
1230can be made to split the CHLO by setting the max packet size
1231sufficiently low.
1232
1233sent/unsent packets
1234-------------------
1235
1236To conserve space, only a single outgoing packet header exists in the
1237mini connection struct, ``mc_packets_out``. To differentiate between
1238packets that are to be sent and those that have already been sent, the
1239``PO_SENT`` flag is used.
1240
1241Mini IETF Connection
1242********************
1243
1244*Files: lsquic_mini_conn_ietf.h, lsquic_mini_conn_ietf.c*
1245
1246.. _overview-2:
1247
1248Overview
1249========
1250
1251The IETF QUIC mini connection has the same idea as the gQUIC mini
1252connection: use as little memory as possible. This is more difficult to
1253do with the IETF QUIC, however, as there are more moving parts in this
1254version of the protocol.
1255
1256.. _data-structures-2:
1257
1258Data Structures
1259===============
1260
1261mini_crypto_stream
1262------------------
1263
1264This structure is a minimal representation of a stream. The IETF QUIC
1265protocol uses up to four HANDSHAKE streams (one for each encryption
1266level) during the handshake and we need to keep track of them. Even a
1267basic event dispatch mechanism is supported.
1268
1269packno_set_t
1270------------
1271
1272This bitmask is used to keep track of sent, received, and acknowledged
1273packet numbers. It can support up to 64 packet numbers: 0 through 63. We
1274assume that the server will not need to send more than 64 packets to
1275complete the handshake.
1276
1277imc_recvd_packnos
1278-----------------
1279
1280Because the client is allowed to start its packet number sequence with
1281any number in the [0, 2\ :sup:`32`-1] range, the received packet history
1282must be able to accommodate numbers larger than 63. To do that, the
1283receive history is a union. If all received packet numbers are 63 or
1284smaller, the packno_set_t bitmask is used. Otherwise, the receive
1285history is kept in `Tiny Receive History <#tiny-receive-history>`__
1286(trechist). The flag ``IMC_TRECHIST`` indicates which data structure is
1287used.
1288
1289ietf_mini_conn
1290--------------
1291
1292This structure is similar to the gQUIC mini conn. It is larger, though,
1293as it needs to keep track of several instances of things based on
1294encryption level or packet number space.
1295
1296``imc_cces`` can hold up to three SCIDs: one for the original DCID from
1297the client, one for SCID generated by the server, and one for when
1298preferred address transport parameter is used. (The preferred address
1299functionality is not compiled by default.)
1300
1301ietf_mini_rechist
1302-----------------
1303
1304The receive history is in the header file because, in addition to
1305generating the ACK frames in the IETF mini conn, it is used to migrate
1306the receive history during promotion.
1307
1308.. _notable-code-3:
1309
1310Notable Code
1311============
1312
1313Switching to trechist
1314---------------------
1315
1316The switch to the Tiny Receive History happens when the incoming packet
1317number does not fit into the bitmask anymore -- see
1318``imico_switch_to_trechist()``. To keep the trechist code exercised, about
1319one in every 16 mini connection uses trechist unconditionally -- see
1320``lsquic_mini_conn_ietf_new()``.
1321
1322crypto_stream_if
1323----------------
1324
1325A set of functions to drive reading and writing CRYPTO frames to move
1326the handshake along is specified. It is passed to the crypto session.
1327After promotion, the full connection installs its own function pointers.
1328
1329imico_read_chlo_size
1330--------------------
1331
1332This function reads the first few bytes of the first CRYPTO frame on the
1333first HANDSHAKE stream to figure out the size of ClientHello. The
1334transport parameters will not be read until the full ClientHello is
1335available.
1336
1337
1338Duplicated Code
1339---------------
1340
1341Some code has been copied from gQUIC mini connection. This was done on
1342purpose, with the expectation that gQUIC is going away.
1343
1344ECN Blackhole Detection
1345-----------------------
1346
1347ECN blackhole at the beginning of connection is guessed at when none of
1348packets sent in the initial batch were acknowledged. This is done by
1349``imico_get_ecn()``. ``lsquic_mini_conn_ietf_ecn_ok()`` is also used during
1350promotion to check whether to use ECN.
1351
1352Connection Public Interface
1353***************************
1354
1355*Files: lsquic_conn_public.h*
1356
1357TODO
1358
1359Full gQUIC Connection
1360*********************
1361
1362*Files: lsquic_full_conn.h, lsquic_full_conn.c*
1363
1364.. _overview-3:
1365
1366Overview
1367========
1368
1369The full gQUIC connection implements the Google QUIC protocol, both
1370server and client side. This is where a large part of the gQUIC protocol
1371logic is contained and where everything -- engine, streams, sending,
1372event dispatch -- is tied together.
1373
1374Components
1375==========
1376
1377In this section, each member of the ``full_conn`` structure is documented.
1378
1379fc_conn
1380-------
1381
1382The first member of the struct is the common connection object,
1383`lsquic_conn`_.
1384
1385It must be first in the struct because the two pointer are cast to each
1386other, depending on circumstances.
1387
1388fc_cces
1389-------
1390
1391This array holds two connection CID elements.
1392
1393The reason for having two elements in this array instead of one (even
1394though gQUIC only uses one CID) is for the benefit of the client: In
1395some circumstances, the client connections are hashed by the port
1396number, in which case the second element is used to hash the port value.
1397The relevant code is in lsquic_engine.c
1398
1399fc_rechist
1400----------
1401
1402This member holds the `packet receive history <#receive-history>`__. It
1403is used to generate ACK frames.
1404
1405fc_stream_ifs
1406-------------
1407
1408This three-element array holds pointers to stream callbacks and the
1409stream callback contexts.
1410
1411From the perspective of lsquic, Google QUIC has three stream types:
1412
14131. HANDSHAKE stream;
1414
14152. HEADERS stream; and
1416
14173. Regular (message, or request/response) streams.
1418
1419The user provides stream callbacks and the context for the regular
1420streams (3) in ``ea_stream_if`` and ``ea_stream_if_ctx``.
1421
1422The other two stream types are internal. The full connection specifies
1423internal callbacks for those streams. One set handles the handshake and
1424the other handles reading and writing of HTTP/2 frames: SETTINGS,
1425HEADERS, and so on.
1426
1427fc_send_ctl
1428-----------
1429
1430This is the `Send Controller <#send-controller>`__. It is used to
1431allocate outgoing packets, control sending rate, and process
1432acknowledgements.
1433
1434fc_pub
1435------
1436
1437This member holds the `Connection Public
1438Interface <#connection-public-interface>`__.
1439
1440fc_alset
1441--------
1442
1443This is the `Alarm Set <#alarm-set>`__. It is used to set various timers
1444in the connection and the send controller.
1445
1446fc_closed_stream_ids
1447--------------------
1448
1449The two sets in this array hold the IDs of closed streams.
1450
1451There are two of them because of the uneven distribution of stream IDs.
1452It is more efficient to hold even and odd stream IDs in separate
1453structures.
1454
1455fc_settings
1456-----------
1457
1458Pointer to the engine settings.
1459
1460This member is superfluous -- the settings can be fetched from
1461``fc_enpub->enp_settings``.
1462
1463fc_enpub
1464--------
1465
1466This points to the `engine's public interface <#lsquic-engine-public>`__.
1467
1468fc_max_ack_packno
1469-----------------
1470
1471Recording the maximum packet number that contained an ACK allows us to
1472ignore old ACKs.
1473
1474fc_max_swf_packno
1475-----------------
1476
1477This is the maximum packet number that contained a STOP_WAITING frame.
1478It is used to ignore old STOP_WAITING frames.
1479
1480fc_mem_logged_last
1481------------------
1482
1483This timestamp is used to limit logging the amount of memory used to
1484most once per second.
1485
1486fc_cfg
1487------
1488
1489This structure holds a few important configuration parameters. (Looks
1490like ``max_conn_send`` is no longer used���)
1491
1492fc_flags
1493--------
1494
1495The flags hold various boolean indicators associated with the full
1496connections. Some of them, such as ``FC_SERVER``, never change, while
1497others change all the time.
1498
1499fc_n_slack_akbl
1500---------------
1501
1502This is the number of ackable (or, in the new parlance, *ack-eliciting*)
1503packets received since the last ACK was sent.
1504
1505This counter is used to decide whether an ACK should be sent (or, more
1506precisely, queued to be sent) immediately or whether to wait.
1507
1508fc_n_delayed_streams
1509--------------------
1510
1511Count how many streams have been delayed.
1512
1513When ``lsquic_conn_make_stream()`` is called, a stream may not be created
1514immediately. It is delayed if creating a stream would go over the
1515maximum number of stream allowed by peer.
1516
1517fc_n_cons_unretx
1518----------------
1519
1520Counts how many consecutive unretransmittable packets have been sent.
1521
1522
1523fc_last_stream_id
1524-----------------
1525
1526ID of the last created stream.
1527
1528Used to assign ID to streams created by this side of the connection.
1529Clients create odd-numbered streams, while servers initiate
1530even-numbered streams (push promises).
1531
1532fc_max_peer_stream_id
1533---------------------
1534
1535Maximum value of stream ID created by peer.
1536
1537fc_goaway_stream_id
1538-------------------
1539
1540Stream ID received in the GOAWAY frame.
1541
1542This ID is used to reset locally-initiated streams with ID larger than
1543this.
1544
1545fc_ver_neg
1546----------
1547
1548This structure holds the version negotiation state.
1549
1550This is used by the client to negotiate with the server.
1551
1552
1553With gQUIC going away, it is probably not very important anymore.
1554
1555fc_hsk_ctx
1556----------
1557
1558Handshake context for the HANDSHAKE stream.
1559
1560Client and server have different HANDSHAKE stream handlers -- and
1561therefore different contexts.
1562
1563fc_stats
1564--------
1565
1566Connection stats
1567
1568fc_last_stats
1569-------------
1570
1571Snapshot of connection stats
1572
1573This is used to log the changes in counters between calls to
1574``ci_log_stats()``. The calculation is straightforward in
1575``lsquic_conn_stats_diff()``.
1576
1577fc_stream_histories and fc_stream_hist_idx
1578------------------------------------------
1579
1580Rolling log of histories of closed streams
1581
1582
1583fc_errmsg
1584---------
1585
1586Error message associated with connection termination
1587
1588This is set when the connection is aborted for some reason. This error
1589message is only set once. It is used only to set the error message in
1590the call to ``ci_status()``
1591
1592fc_recent_packets
1593-----------------
1594
1595Dual ring-buffer log of packet history
1596
1597The first element is for incoming packets, the second is for outgoing
1598packets. Each entry holds received or sent time and frame information.
1599
1600This can be used for debugging. It is only compiled into debug builds.
1601
1602fc_stream_ids_to_reset
1603----------------------
1604
1605List of stream ID to send STREAM_RESET for
1606
1607These STREAM_RESET frames are associated with streams that are not
1608allowed to be created because we sent a GOAWAY frame. (There is a period
1609when GOAWAY is in transit, but the peer keeps on creating streams). To
1610queue the reset frames for such a stream, an element is added to this
1611list.
1612
1613fc_saved_ack_received
1614---------------------
1615
1616Timestamp of the last received ACK.
1617
1618This is used for `ACK merging <#ack-merging>`__.
1619
1620fc_path
1621-------
1622
1623The network path -- Google QUIC only has one network path.
1624
1625fc_orig_versions
1626----------------
1627
1628List (as bitmask) of original versions supplied to the client
1629constructor.
1630
1631Used for version negotiation. See `fc_ver_neg`_ for more
1632coverage of this topic.
1633
1634fc_crypto_enc_level
1635-------------------
1636
1637Latest crypto level
1638
1639This is for Q050 only, which does away with the HANDSHAKE stream and
1640uses CRYPTO frames instead. (This was part of Google's plan to move
1641Google QUIC protocol closer to IETF QUIC.)
1642
1643fc_ack
1644------
1645
1646Saved ACK -- latest or merged
1647
1648This ACK structure is used in `ACK merging <#ack-merging>`__.
1649
1650Instantiation
1651=============
1652
1653The largest difference between the server and client mode of the full
1654connection is in the way it is created. The client creates a brand-new
1655connection, performs version negotiation, and runs the handshake before
1656dispatching user events. The server connection, on the other hand, gets
1657created from a mini connection during `connection
1658promotion <#connection-promotion>`__. By that time, both version
1659negotiation and handshake have already completed.
1660
1661Common Initialization
1662---------------------
1663
1664The ``new_conn_common()`` function contains initialization common to both
1665server and client. Most full connection's internal data structures are
1666initialized or allocated here, among them `Send
1667Controller <#send-controller>`__, `Receive
1668History <#receive-history>`__, and `Alarm Set <#alarm-set>`__.
1669
1670The HEADERS stream is created here, if necessary. (Throughout the code,
1671you can see checks whether the connection is in HTTP mode or not. Even
1672though gQUIC means that HTTP is used, our library supports a non-HTTP
1673mode, in which there is no HEADERS stream. This was done for testing
1674purposes and made possible the echo and md5 client and server programs.)
1675
1676Server
1677------
1678
1679After initializing the common structures in ``new_conn_common()``,
1680server-specific initialization continues in
1681``lsquic_gquic_full_conn_server_new()``.
1682
1683The HANDSHAKE stream is created. The handler (see
1684``lsquic_server_hsk_stream_if``) simply throws out data that it reads from
1685the client.
1686
1687Outgoing packets are inherited -- they will be sent during the next tick
1688-- and deferred incoming packets are processed.
1689
1690Client
1691------
1692
1693The client's initialization takes place in
1694``lsquic_gquic_full_conn_client_new()``. Crypto session is created and the
1695HANDSHAKE stream is initialized. The handlers in
1696``lsquic_client_hsk_stream_if`` drive the handshake process.
1697
1698Incoming Packets
1699================
1700
1701The entry point for incoming packets is ``ci_packet_in()``, which is
1702implemented by ``full_conn_ci_packet_in``. Receiving a packet restarts the
1703idle timer.
1704
1705The function ``process_incoming_packet`` contains some client-only logic
1706for processing version negotiation and stateless retry packets. In the
1707normal case, ``process_regular_packet()`` is called. This is where the
1708incoming process is decrypted, the `Receive
1709History <#receive-history>`__ is updated, ``parse_regular_packet()`` is
1710called, and some post-processing takes place (most importantly,
1711scheduling an ACK to be sent).
1712
1713The function ``parse_regular_packet`` is simple: It iterates over the
1714whole decrypted payload of the incoming packet and parses out frames one
1715by one. An error aborts the connection.
1716
1717ACK Merging
1718===========
1719
1720Processing ACKs is `expensive <#handling-acks>`__. When sending data, a
1721batch of incoming packets is likely to contain an ACK frame each. The
1722ACK frame handler, ``process_ack_frame()``, merges consecutive ACK frames
1723and stores the result in `fc_ack`_. The ACK is processed
1724during the `next tick <#ticking>`__. If the two ACK cannot be merged
1725(which is unlikely), the cached ACK is processed immediately and the new
1726ACK is cached.
1727
1728Caching an ACK has a non-trivial memory cost: the 4KB-plus data
1729structure ``ack_info`` accounts for more than half of the size of the
1730``full_conn`` struct. Nevertheless, the tradeoff is well worth it. ACK
1731merging reduces the number of calls to ``lsquic_send_ctl_got_ack()`` by a
1732factor of 10 or 20 in some high-throughput scenarios.
1733
1734Ticking
1735=======
1736
1737When a `connection is processed by the
1738engine <#processing-connections>`__, the engine calls the connection's
1739``ci_tick()`` method. This is where most of the connection logic is
1740exercised. In the full gQUIC connection, this method is implemented by
1741``full_conn_ci_tick()``.
1742
1743The following steps are performed:
1744
1745-  A cached ACK, if it exists, is processed
1746
1747-  Expired alarms are rung
1748
1749-  Stream read events are dispatched
1750
1751-  An ACK frame is generated if necessary
1752
1753-  Other control frames are generated if necessary
1754
1755-  Lost packets are rescheduled
1756
1757-  More control frames and stream resets are generated if necessary
1758
1759-  HEADERS stream is flushed
1760
1761-  Outgoing packets that carry stream data are scheduled in four steps:
1762
1763   a. High-priority `buffered packets <#buffered-queue>`__ are scheduled
1764
1765   b. Write events are dispatched for high-priority streams
1766
1767   c. Non-high-priority buffered packets are scheduled
1768
1769   d. Write events are dispatched for non-high-priority streams
1770
1771-  Connection close or PING frames are generated if necessary
1772
1773-  Streams are serviced (closed, freed, created)
1774
1775.. _notable-code-4:
1776
1777Notable Code
1778============
1779
1780TODO
1781
1782Full IETF Connection
1783********************
1784
1785*Files: lsquic_full_conn_ietf.h, lsquic_full_conn_ietf.c*
1786
1787Overview
1788========
1789
1790This module implements IETF QUIC
1791`Transport <https://tools.ietf.org/html/draft-ietf-quic-transport-34>`_
1792and
1793`HTTP/3 <https://tools.ietf.org/html/draft-ietf-quic-http-34>`_ logic,
1794plus several QUIC extensions.  To attain an overall grasp of the code,
1795at least some familiarity with these protocols is required.  To understand
1796the code in detail, especially *why* some things are done, a closer reading
1797of the specification may be in order.
1798
1799In some places, the code contains comments with references to the
1800specification, e.g.
1801
1802::
1803
1804    if (conn->ifc_flags & IFC_SERVER)
1805    {   /* [draft-ietf-quic-transport-34] Section 19.7 */
1806        ABORT_QUIETLY(0, TEC_PROTOCOL_VIOLATION,
1807                            "received unexpected NEW_TOKEN frame");
1808        return 0;
1809    }
1810
1811(A search for "[draft-ietf" will reveal over one hundred places in the
1812code thus commented.)
1813
1814The Full IETF Connection module is similar in structure to the `Full gQUIC
1815Connection`_ module, from which it originated.  Some code is quite similar
1816as well, including logic for `ACK Merging`_ and `Ticking`_.
1817
1818Components
1819==========
1820
1821In this section, each member of ``ietf_full_conn`` is documented.
1822
1823ifc_conn
1824--------
1825
1826The first member of the struct is the common connection object,
1827`lsquic_conn`_.
1828
1829It must be first in the struct because the two pointer are cast to each
1830other, depending on circumstances.
1831
1832ifc_cces
1833--------
1834
1835This array holds eight connection CID elements.
1836See `Managing SCIDs`_.
1837
1838ifc_rechist
1839-----------
1840
1841This member holds the `packet receive history <#receive-history>`__.
1842The receive history is used to generate ACK frames.
1843
1844ifc_max_ackable_packno_in
1845-------------------------
1846
1847This value is used to detect holes in incoming packet number sequence.
1848This information is used to queue ACK frames.
1849
1850ifc_send_ctl
1851------------
1852
1853This is the `Send Controller`_. It is used to
1854allocate outgoing packets, control sending rate, and process
1855acknowledgements.
1856
1857ifc_pub
1858-------
1859
1860This member holds the `Connection Public Interface`_
1861
1862ifc_alset
1863---------
1864
1865This is the `Alarm Set`_. It is used to set various timers
1866in the connection and the send controller.
1867
1868ifc_closed_stream_ids
1869---------------------
1870
1871The two sets in this array hold the IDs of closed streams.
1872
1873There are two of them because of the uneven distribution of stream IDs.
1874The set data structure is meant to hold sequences without gaps.
1875It is more efficient to hold stream IDs for each stream type in
1876separate structures.
1877
1878ifc_n_created_streams
1879---------------------
1880
1881Counters for locally initiated streams.  Used to generate next
1882stream ID.
1883
1884ifc_max_allowed_stream_id
1885-------------------------
1886
1887Maximum allowed stream ID for each of the four (``N_SITS``) stream types.
1888This is used all over the place.
1889
1890ifc_closed_peer_streams
1891-----------------------
1892
1893Counts how many remotely-initiated streams have been closed.  Because the
1894protocol mandates that the stream IDs be assigned in order, this allows us
1895to make some logical inferences in the code.
1896
1897ifc_max_streams_in
1898------------------
1899
1900Maximum number of open streams the peer is allowed to initiate.
1901
1902ifc_max_stream_data_uni
1903-----------------------
1904
1905Initial value of the maximum amount of data locally-initiated unidirectional
1906stream is allowed to send.
1907
1908ifc_flags
1909---------
1910
1911All kinds of flags.
1912
1913ifc_mflags
1914----------
1915
1916More flags!
1917
1918ifc_send_flags
1919--------------
1920
1921The send flags keep track of which control frames are queued to be sent.
1922
1923ifc_delayed_send
1924----------------
1925
1926Some send flags are delayed.
1927
1928We stop issuing streams credits if peer stops opening QPACK decoder window.
1929This addresses a potential attack whereby client can cause the server to keep
1930allocating memory.  See `Security Considerations in the QPACK Internet-Draft
1931<https://tools.ietf.org/html/draft-ietf-quic-qpack-21#section-7.3>`__.
1932
1933ifc_send
1934--------
1935
1936This is the `Send Controller`_. It is used to allocate outgoing packets,
1937control sending rate, and process acknowledgements.
1938
1939ifc_error
1940---------
1941
1942This struct records which type of error has occurred (transport or application)'
1943and the error code.
1944
1945ifc_n_delayed_streams
1946---------------------
1947
1948Count how many streams have been delayed.
1949
1950When ``lsquic_conn_make_stream()`` is called, a stream may not be created
1951immediately. It is delayed if creating a stream would go over the
1952maximum number of stream allowed by peer.
1953
1954ifc_n_cons_unretx
1955-----------------
1956
1957Counts how many consecutive unretransmittable packets have been sent.
1958
1959Enough unretransittable sent packets in a row causes a PING frame to
1960be sent.  This forces the peer to send an ACK.
1961
1962ifc_pii
1963-------
1964
1965Points to the selected priority iterator.
1966
1967The IETF Full Connection supports two priority mechanisms: the original
1968Google QUIC priority mechanism and the `HTTP/3 Extensible Priorities
1969<https://tools.ietf.org/html/draft-ietf-httpbis-priority-03>`__.
1970
1971ifc_errmsg
1972----------
1973
1974Holds dynamically generated error message string.
1975
1976Once set, the error string does not change until the connection is
1977destroyed.
1978
1979ifc_enpub
1980---------
1981
1982This points to the `engine's public interface <#lsquic-engine-public>`__.
1983
1984ifc_settings
1985------------
1986
1987Pointer to the engine settings.
1988
1989This member is superfluous -- the settings can be fetched from
1990``ifc_enpub->enp_settings``.
1991
1992ifc_stream_ids_to_ss
1993--------------------
1994
1995Holds a queue of STOP_SENDING frames to send as response to remotely
1996initiated streams that came in after we sent a GOAWAY frame.
1997
1998ifc_created
1999-----------
2000
2001Time when the connection was created.  This is used for the Timestamp
2002and Delayed ACKs extensions.
2003
2004ifc_saved_ack_received
2005----------------------
2006
2007Time when cached ACK frame was received.  See `ACK Merging`_.
2008
2009ifc_max_ack_packno
2010------------------
2011
2012Holding the maximum packet number containing an ACK frame allows us
2013to ignore old ACK frames.  One value per Packet Number Space is kept.
2014
2015ifc_max_non_probing
2016-------------------
2017
2018Maximum packet number of a received non-probing packets.  This is used
2019for path migration.
2020
2021ifc_cfg
2022-------
2023
2024Local copy of a couple of transport parameters.  We could get at them
2025with a function call, but these are used often enough to optimize
2026fetching them.
2027
2028ifc_process_incoming_packet
2029---------------------------
2030
2031The client goes through version negotiation and the switches to the
2032fast function.  The server begins to use the fast function immediately.
2033
2034ifc_n_slack_akbl
2035----------------
2036
2037Number ackable packets received since last ACK was sent.  A count is
2038kept for each Packet Number Space.
2039
2040ifc_n_slack_all
2041---------------
2042
2043Count of all packets received since last ACK was sent.  This is only
2044used in the Application PNS (Packet Number Space).  (This is regular
2045PNS after the handshake completes).
2046
2047ifc_max_retx_since_last_ack
2048---------------------------
2049
2050This number is the maximum number of ack-eliciting packets to receive
2051before an ACK must be sent.
2052
2053The default value is 2.  When the Delayed ACKs extension is used, this
2054value gets modified by peer's ACK_FREQUENCY frames.
2055
2056ifc_max_ack_delay
2057-----------------
2058
2059Maximum amount of allowed after before an ACK is sent if the threshold
2060defined by ifc_max_retx_since_last_ack_ has not yet been reached.
2061
2062The default value is 25 ms.  When the Delayed ACKs extension is used, this
2063value gets modified by peer's ACK_FREQUENCY frames.
2064
2065ifc_ecn_counts_in
2066-----------------
2067
2068Incoming ECN counts in each of the Packet Number Spaces.  These counts
2069are used to generate ACK frames.
2070
2071ifc_max_req_id
2072--------------
2073
2074Keeps track of the maximum ID of bidirectional stream ID initiated by the
2075peers.  It is used to construct the GOAWAY frame.
2076
2077ifc_hcso
2078--------
2079
2080State for outgoing HTTP/3 control stream.
2081
2082ifc_hcsi
2083--------
2084
2085State for incoming HTTP/3 control stream.
2086
2087ifc_qeh
2088-------
2089
2090QPACK encoder streams handler.
2091
2092The handler owns two unidirectional streams: a) locally-initiated QPACK
2093encoder stream, to which it writes; and b) peer-initiated QPACK decoder
2094stream, from which it reads.
2095
2096ifc_qdh
2097-------
2098
2099QPACK decoder streams handler.
2100
2101The handler owns two unidirectional streams: a) peer-initiated QPACK
2102encoder stream, from which it reads; and b) locally-initiated QPACK
2103decoder stream, to which it writes.
2104
2105ifc_peer_hq_settings
2106--------------------
2107
2108Peer's HTTP/3 settings.
2109
2110ifc_dces
2111--------
2112
2113List of destination connection ID elements (DCEs).  Each holds a DCID
2114and the associated stateless reset token.  When lsquic uses a DCID, it
2115inserts the stateless reset token into a hash so that stateless resets
2116can be found.
2117
2118Outside of the initial migration, the lsquic client code does not switch
2119DCIDs.  One idea (suggested in the drafts somewhere) is to switch DCIDs
2120after a period of inactivity.
2121
2122ifc_to_retire
2123-------------
2124
2125List of DCIDs to retire.
2126
2127ifc_scid_seqno
2128--------------
2129
2130Sequence generator for SCIDs generated by the endpoint.
2131
2132ifc_scid_timestamp
2133------------------
2134
2135List of timestamps for the generated SCIDs.
2136
2137This list is used in the SCID rate-limiting mechanism.
2138
2139
2140ifc_incoming_ecn
2141----------------
2142
2143History indicating presence of ECN markings on most recent incoming packets.
2144
2145ifc_cur_path_id
2146---------------
2147
2148Current path ID -- indexes `ifc_paths`_.
2149
2150ifc_used_paths
2151--------------
2152
2153Bitmask of which paths in `ifc_paths`_ are being used.
2154
2155ifc_mig_path_id
2156---------------
2157
2158Path ID of the path being migrated to.
2159
2160ifc_active_cids_limit
2161---------------------
2162
2163This is the maximum number of CIDs at any one time this
2164endpoint is allowed to issue to peer.  If the TP value exceeds ``cn_n_cces``,
2165it is reduced to it.
2166
2167ifc_active_cids_count
2168---------------------
2169
2170This value tracks how many CIDs have been issued.  It is decremented
2171each time a CID is retired.
2172
2173ifc_first_active_cid_seqno
2174--------------------------
2175
2176Another piece of the SCID rate-limiting mechanism.
2177
2178ifc_ping_unretx_thresh
2179----------------------
2180
2181Once the number consecutively sent non-ack-elicing packets
2182(`ifc_n_cons_unretx`_) exceeds this value, this endpoint will send
2183a PING frame to force the peer to respond with an ACK.
2184
2185The threshold begins at 20 and then made to fluctuate randomly between
218612 and 27.
2187
2188ifc_last_retire_prior_to
2189------------------------
2190
2191Records the maximum value of ``Retire Prior To`` value of the
2192`NEW_CONNECTION_ID frame
2193<https://tools.ietf.org/html/draft-ietf-quic-transport-34#section-19.15>`_.
2194
2195ifc_ack_freq_seqno
2196------------------
2197
2198Sequence number generator for ACK_FREQUENCY frames generated by this
2199endpoint.
2200
2201ifc_last_pack_tol
2202-----------------
2203
2204Last value of the ``Packet Tolerance`` field sent in the last
2205``ACK_FREQUENCY`` frame generated by this endpoint.
2206
2207ifc_last_calc_pack_tol
2208----------------------
2209
2210Last *calculated* value of the ``Packet Tolerance`` field.
2211
2212ifc_min_pack_tol_sent
2213---------------------
2214
2215Minimum value of the ``Packet Tolerance`` field sent.  Only used for
2216statistics display.
2217
2218ifc_max_pack_tol_sent
2219---------------------
2220
2221Maximum value of the ``Packet Tolerance`` field sent.  Only used for
2222statistics display.
2223
2224ifc_max_ack_freq_seqno
2225----------------------
2226
2227Maximum seen sequence number of incoming ``ACK_FREQUENCY`` frame.  Used
2228to discard old frames.
2229
2230ifc_max_udp_payload
2231-------------------
2232
2233Maximum UDP payload.  This is the cached value of the transport parameter.
2234
2235ifc_last_live_update
2236--------------------
2237
2238Last time ``ea_live_scids()`` was called.
2239
2240ifc_paths
2241---------
2242
2243Array of network paths.  Most of the time, only one path is used when the
2244peer migrates.  The array has four elements as a safe upper limit.
2245
2246ifc_u.cli
2247---------
2248
2249Client-specific state.  This is where pointers to "crypto streams" are
2250stored; they are not in the ``ifc_pub.all_streams`` hash.
2251
2252ifc_u.ser
2253---------
2254
2255The server-specific state is only about push promises.
2256
2257ifc_idle_to
2258-----------
2259
2260Idle timeout.
2261
2262ifc_ping_period
2263---------------
2264
2265Ping period.
2266
2267ifc_bpus
2268--------
2269
2270A hash of buffered priority updates.  It is used when a priority update
2271(part of the Extensible HTTP Priorities extension) arrives before the
2272stream it is prioritizing.
2273
2274ifc_last_max_data_off_sent
2275--------------------------
2276
2277Value of the last MAX_DATA frame sent.  This is used to limit the number
2278of times we send the MAX_DATA frame in response to a DATA_BLOCKED frame.
2279
2280ifc_min_dg_sz
2281-------------
2282
2283Minimum size of the DATAGRAM frame.  Used by the eponymous extension.
2284
2285ifc_max_dg_sz
2286-------------
2287
2288Maximum size of the DATAGRAM frame.  Used by the eponymous extension.
2289
2290ifc_pts
2291-------
2292
2293PTS stands for "Packet Tolerance Stats".  Information collected here
2294is used to calculate updates to the packet tolerance advertised to the
2295peer via ACK_FREQUENCY frames.  Part of the Delayed ACKs extension.
2296
2297ifc_stats
2298---------
2299
2300Cumulative connection stats.
2301
2302ifc_last_stats
2303--------------
2304
2305Copy of `ifc_stats`_ last time ``ci_log_stats()`` was called.  Used
2306to calculate the difference.
2307
2308ifc_ack
2309-------
2310
2311One or more cached incoming ACK frames.  Used for `ACK merging`_.
2312
2313Managing SCIDs
2314==============
2315
2316Source Connection IDs -- or SCIDs for short -- are stored in the `ifc_cces`_
2317array.
2318
2319Each of ``struct conn_cid_elem`` contains the CID itself, the CID's port or
2320sequence number, and flags:
2321
2322-   ``CCE_USED`` means that this Connection ID has been used by the peer.  This
2323    information is used to check whether the peer's incoming packet is using
2324    a new DCID or reusing an old one when the packet's DCID does not match
2325    this path's current DCID.
2326
2327-   ``CCE_REG`` signifies that the CID has been registered with the user-defined
2328    ``ea_new_scids()`` callback.
2329
2330-   ``CCE_SEQNO`` means that the connection has been issued by this endpoint
2331    and ``cce_seqno`` contains a valid value.  Most of SCIDs are issued by
2332    either endpoint, with one exception:  The DCID included in the first few
2333    packets sent by the client becomes an interim SCID for the server and it
2334    does not have a sequence number.  This "original" SCID gets retired 2
2335    seconds after the handshake succeeds, see the ``AL_RET_CIDS`` alarm.
2336
2337-   ``CCE_PORT`` is used to mark the special case of hashing connections by
2338    port number.  In client mode, the lsquic engine may, under some circumstances,
2339    hash the connections by local port number instead of connection ID.
2340    In that case, ``cce_port`` contains the port number used to hash the
2341    connection.
2342
2343Each CIDs is hashed in the of the "CID-to-connection" mapping that the engine
2344maintains.  If it is not in the hash, incoming packets that use this CID as
2345DCID will not be dispatched to the connection (because the connection will not
2346be found).
2347
2348Path Migration
2349==============
2350
2351Stream Priority Iterators
2352=========================
2353
2354Creating Streams on the Server
2355==============================
2356
2357Calculating Packet Tolerance
2358============================
2359
2360When the Delayed ACKs extension is used, we advertise our ``Packet Tolerance``
2361to peer.  This is the number of packets the peer can receive before having to
2362send an acknowledgement.  By default -- without the extension -- the packet
2363tolerance is 2.
2364
2365Because we `merge ACKs <#ack-merging>`__, receiving more than one ACK between
2366ticks is wasteful.  Another consideration is that a packet is not declared
2367lost until at least one RTT passes -- the time to send a packet and receive
2368the acknowledgement from peer.
2369
2370To calculate the packet tolerance, we use a feedback mechanism: when number
2371of ACKs per RTT is too high, we increase packet tolerance; when number of
2372ACKs per RTT is too low, we decrease packet tolerance.  The feedback is
2373implemented with a `PID Controller <https://en.wikipedia.org/wiki/PID_controller>`__:
2374the target is the number of ACKs per RTT, normalized to 1.0.
2375
2376See the function ``packet_tolerance_alarm_expired()`` as well as comments
2377in ``lsquic.h`` that explain the normalization as well as the knobs available
2378for tuning.
2379
2380The pre-normalized target is a function of RTT.  It was obtained
2381empirically using netem.  This function together with the default
2382PID controller parameters give good performance in the lab and in
2383some limited interop testing.
2384
2385Anatomy of Outgoing Packet
2386**************************
2387
2388Evanescent Connection
2389*********************
2390
2391Send Controller
2392***************
2393
2394*Files: lsquic_send_ctl.h, lsquic_send_ctl.c*
2395
2396.. _overview-4:
2397
2398Overview
2399========
2400
2401The Send Controller manages outgoing packets and the sending rate:
2402
2403-  It decides whether packets can be sent
2404
2405-  It figures out what the congestion window is
2406
2407-  It processes acknowledgements and detects packet losses
2408
2409-  It allocates packets
2410
2411-  It maintains sent packet history
2412
2413The controller allocates, manages, splits, coalesces, and destroys
2414outgoing packets. It owns these packets.
2415
2416The send controller services two modules:
2417
2418-  Full connection. gQUIC and IETF full connections use the send
2419   controller to allocate packets and delegate packet-sending
2420   decisions to it.
2421
2422-  Stream. The stream uses the stream controller as the source of
2423   outgoing packets to write STREAM frames to.
2424
2425Packet Life Cycle
2426=================
2427
2428A new outgoing packet is allocated and returned to the connection or the
2429stream. Around this time (just before or just after, depending on the
2430particular function call to get the packet), the packet is placed on the
2431Scheduled Queue.
2432
2433When the engine is creating a batch of packets to send, it calls
2434``ci_next_packet_to_send()``. The connection removes the next packet from
2435its Scheduled Queue and returns it. The engine now owns the outgoing
2436packet, but only while the batch is being sent. The engine *always
2437returns the packet* after it tries to send it.
2438
2439If the packet was sent successfully, it is returned via the
2440``ci_packet_sent`` call, after which it is appended to the Unacked Queue.
2441If the packet could not be sent, ``ci_packet_not_sent()`` is called, at
2442which point it is prepended back to the Schedule Queue to be tried
2443later.
2444
2445There are two ways to get off the Unacked Queue: being acknowledged or
2446being lost. When a packet is acknowledged, it is destroyed. On the other
2447hand, if it is deemed lost, it is placed onto the Lost Queue, where it
2448will await being rescheduled.
2449
2450Packet Queues
2451=============
2452
2453.. image:: lsquic-packet-queues.png
2454
2455Buffered Queue
2456--------------
2457
2458The Buffered Queue is a special case. When writing to the stream occurs
2459outside of the event dispatch loop, new outgoing packets begin their
2460life in the Buffered Queue. They get scheduled during a connection tick,
2461making their way onto the Scheduled Queue.
2462
2463There are two buffered queues: one for packets holding STREAM frames
2464from the highest-priority streams and one for packets for streams with
2465lower priority.
2466
2467Scheduled Queue
2468---------------
2469
2470Packets on the Scheduled Queue have packet numbers assigned to them. In
2471rare cases, packets may be removed from this queue before being sent
2472out. (For example, a stream may be cancelled, in which case packets that
2473carry its STREAM frames may end up empty.) In that case, they are marked
2474with a special flag to generate the packet number just before they are
2475sent.
2476
2477Unacked Queue
2478-------------
2479
2480This queue holds packets that have been sent but are yet to be
2481acknowledged. When a packet on this queue is acknowledged, it is
2482destroyed.
2483
2484The loss detection code runs on this queue when ACKs are received or
2485when the retransmission timer expires.
2486
2487This queue is actually three queues: one for each of the IETF QUIC's
2488Packet Number Spaces, or PNSs. The PNS_APP queue is what is used by
2489gQUIC and IETF QUIC server code. PNS_INIT and PNS_HSK are only used by
2490the IETF QUIC client. (IETF QUIC server handles those packet number
2491spaces in its mini conn module.)
2492
2493In addition to regular packets, the Unacked Queue holds `loss
2494records <#loss-records>`__ and `poisoned packets <#poisoned-packets>`__.
2495
2496Lost Queue
2497----------
2498
2499This queue holds lost packets. These packets are removed from the
2500Unacked Queue when it is decided that they have been lost. Packets on
2501this queue get rescheduled after connection schedules a packet with
2502control frames, as those have higher priority.
2503
25040-RTT Stash Queue
2505-----------------
2506
2507This queue is used by the client to retransmit packets that carry 0-RTT
2508data.
2509
2510Handling ACKs
2511=============
2512
2513Acknowledgements are processed in the function
2514``lsquic_send_ctl_got_ack``.
2515
2516One of the first things that is done is ACK validation. We confirm that
2517the ACK does not contain any packet numbers that we did not send.
2518Because of the way we `generate packet numbers <#packet-numbers>`__,
2519this check is a simple comparison.
2520
2521The nested loops work as follows. The outer loop iterates over the
2522packets in the Unacked Queue in order -- meaning packet numbers
2523increase. In other words, older packets are examined first. The inner
2524loop processes ACK ranges in the ACK *backwards*, meaning that both
2525loops follow packets in increasing packet number order. It is done this
2526way as an optimization. The (previous) alternative design of looking
2527up a packet number in the ACK frame, even if using binary search, is
2528slower.
2529
2530The code is optimized: the inner loop has a minimum possible number of
2531branches. These optimizations predate the more-recent, higher-level
2532optimization. The latest ACK-handling optimization added to the code
2533combines incoming ACKs into a single ACK (at the connection level), thus
2534reducing the number of times this loop needs to be used by a lot,
2535sometimes by a significant factor (when lots of data is being sent).
2536This makes some of the code-level optimizations, such as the use of
2537``__builtin_prefetch``, an overkill.
2538
2539Loss Records
2540============
2541
2542A loss record is a special type of outgoing packet. It marks a place in
2543the Unacked Queue where a lost packet had been -- the lost packet itself
2544having since moved on to the Lost Queue or further. The loss record and
2545the lost packet form a circular linked list called the "loss chain."
2546This list contains one real packet and zero or more loss records. The
2547real packet can move from the Unacked Queue to the Lost Queue to the
2548Scheduled Queue and back to the Unacked Queue; its loss records live
2549only on the Unacked Queue.
2550
2551We need loss records to be able to handle late acknowledgements -- those
2552that acknowledge a packet *after* it has been deemed lost. When an
2553acknowledgment for any of the packet numbers associated with this packet
2554comes in, the packet is acknowledged and the whole loss chain is
2555destroyed.
2556
2557Poisoned Packets
2558================
2559
2560A poisoned packet is used to thwart opportunistic ACK attacks. The
2561opportunistic ACK attack works as follows:
2562
2563-  The client requests a large resource
2564
2565-  The server begins sending the response
2566
2567-  The client sends ACKs for packet number before it sees these packets,
2568   tricking the server into sending packets faster than it would
2569   otherwise
2570
2571The poisoned packet is placed onto the Unacked Queue. If the peer lies
2572about packet numbers it received, it will acknowledge the poisoned
2573packet, in which case it will be discovered during ACK processing.
2574
2575Poisoned packets cycle in and out of the Unacked Queue. A maximum of one
2576poisoned packet is outstanding at any one time for simplicity. (And we
2577don't need more).
2578
2579Packet Numbers
2580==============
2581
2582The Send Controller aims to send out packets without any gaps in the
2583packet number sequence. (The only exception to this rule is the handling
2584of poisoned packets, where the gap is what we want.) Not having gaps in
2585the packet number sequence is beneficial:
2586
2587-  ACK verification is cheap
2588
2589-  Send history updates are fast
2590
2591-  Send history uses very little memory
2592
2593The downside is code complexity and having to renumber packets when they
2594are removed from the Scheduled Queue (due to, for example, STREAM frame
2595elision or loss chain destruction) or resized (due to a path or MTU
2596change, for instance).
2597
2598Some scenarios when gaps can be produced inadvertently are difficult to
2599test or foresee. To cope with that, a special warning in the send
2600history code is added when the next packet produces a gap. This warning
2601is limited to once per connection. Having a gap does not break
2602functionality other than ACK verification, but that's minor. On the
2603other hand, we want to fix such bugs when they crop up -- that's why the
2604warning is there.
2605
2606Loss Detection and Retransmission
2607=================================
2608
2609The loss detection and retransmission logic in the Send Controller was
2610taken from the Chromium code in the fall of 2016, in the beginning of
2611the lsquic project. This logic has not changed much since then -- only
2612some bugs have been fixed here and there. The code bears no resemblance
2613to what is described in the QUIC Recovery Internet Draft. Instead, `the
2614much earlier
2615document <https://tools.ietf.org/html/draft-iyengar-quic-loss-recovery-01>`__,
2616describing gQUIC, could be looked to for reference.
2617
2618Congestions Controllers
2619=======================
2620
2621The Send Controller has a choice of two congestion controllers: Cubic
2622and BBRv1. The latter was translated from Chromium into C. BBRv1 does
2623not work well for very small RTTs.
2624
2625To cope with that, lsquic puts the Send Controller into the "adaptive CC"
2626mode by default. The CC is selected after RTT is determined: below a
2627certain threshold (configurable; 1.5 ms by default), Cubic is used.
2628Until Cubic or BBRv1 is selected, *both* CC controllers are used --
2629because we won't have the necessary state to instantiate a controller
2630when the decision is made.
2631
2632Buffered Packet Handling
2633========================
2634
2635Buffered packets require quite a bit of special handling. Because they
2636are created outside of the regular event dispatch, a lot of things are
2637unknown:
2638
2639-  Congestion window
2640
2641-  Whether more incoming packets will arrive before the next tick
2642
2643-  The optimal packet number size
2644
2645The Send Controller tries its best to accommodate the buffered packets
2646usage scenario.
2647
2648ACKs
2649----
2650
2651When buffered packets are created, we want to generate an ACK, if
2652possible. This can be seen in ``send_ctl_get_buffered_packet``, which
2653calls ``ci_write_ack()``
2654
2655This ACK should be in the first buffered packet to be scheduled. Because
2656the Send Controller does not dictate the order of buffered packet
2657creation -- high-priority versus low-priority -- it may need to move (or
2658steal) the ACK frame from a packet on the low-priority queue to a packet
2659on the high-priority queue.
2660
2661When buffered packets are finally scheduled, we have to remove ACKs from
2662them if another ACK has already been sent. This is because Chrome errors
2663out if out-of-order ACKs come in.
2664
2665Flushing QPACK Decoder
2666----------------------
2667
2668The priority-based write events dispatch is emulated when the first
2669buffered packet is allocated: the QPACK decoder is flushed. Without it,
2670QPACK updates are delayed, which may negatively affect compression
2671ratio.
2672
2673Snapshot and Rollback
2674=====================
2675
2676The Send Controller snapshot and rollback functionality was implemented
2677exclusively for the benefit of the optimized ``lsquic_stream_pwritev``
2678call.
2679
2680Complexity Woes
2681===============
2682
2683The Send Controller is complicated. Because we write stream data to
2684packets directly and packets need to be resized, a lot of complexity
2685resides in the code to resize packets, be it due to repathing, STREAM
2686frame elision, or MTU changes. This is the price to be paid for
2687efficiency in the normal case.
2688
2689
2690Alarm Set
2691*********
2692
2693*Files: lsquic_alarmset.h, lsquic_alarmset.c, test_alarmset.c*
2694
2695TODO
2696
2697Tickable Queue
2698**************
2699
2700*Files: lsquic_engine.c, lsquic_min_heap.h, lsquic_min_heap.c*
2701
2702The Tickable Queue is a min-heap used as a priority queue. Connections
2703on this queue are in line to be processed. Connections that were last
2704ticked a longer time ago have higher priority than those ticked
2705recently. (``cn_last_ticked`` is used for ordering.) This is intended to
2706prevent starvation as multiple connections vye for the ability to send
2707packets.
2708
2709The way the min-heap grows is described in `Growing
2710Min-Heaps <#growing-min-heaps>`__.
2711
2712Advisory Tick Time Queue
2713************************
2714
2715*Files: lsquic_attq.h, lsquic_attq.c*
2716
2717This data structure is a mini-heap. Connections are ordered by the value
2718of the next time they should be processed (ticked). (Because this is not
2719a hard limit, this value is advisory -- hence its name.)
2720
2721This particular min-heap implementation has two properties worth
2722highlighting:
2723
2724Removal of Arbitrary Elements
2725=============================
2726
2727When a connection's next tick time is updated (or when the connection is
2728destroyed), the connection is removed from the ATTQ. At that time, it
2729may be at any position in the min-heap. The position is recorded in the
2730heap element, ``attq_elem->ae_heap_idx`` and is updated when elements are
2731swapped. This makes it unnecessary to search for the entry in the
2732min-heap.
2733
2734Swapping Speed
2735==============
2736
2737To make swapping faster, the array that underlies the min-heap is an
2738array of *pointers* to ``attq_elem``. This makes it unnecessary to update
2739each connection's ``cn_attq_elem`` as array elements are swapped: the
2740memory that stores ``attq_elem`` stays put. This is why there are both
2741``aq_elem_malo`` and ``aq_heap``.
2742
2743CID Purgatory
2744*************
2745
2746Memory Manager
2747**************
2748
2749Malo Allocator
2750**************
2751
2752Receive History
2753***************
2754
2755*Files: lsquic_rechist.h, lsquic_rechist.c, test_rechist.c*
2756
2757.. _overview-5:
2758
2759Overview
2760========
2761
2762The reason for keeping the history of received packets is to generate
2763ACK frames. The Receive History module provides functionality to add
2764packet numbers, truncate history, and iterate over the received packet
2765number ranges.
2766
2767.. _data-structures-3:
2768
2769Data Structures
2770===============
2771
2772.. _overview-6:
2773
2774Overview
2775--------
2776
2777The receive history is a singly-linked list of packet number ranges,
2778ordered from high to low:
2779
2780.. image:: rechist-linked-list.png
2781
2782The ordering is maintained as an invariant with each addition to the
2783list and each truncation. This makes it trivial to iterate over the
2784ranges.
2785
2786To limit the amount of memory this data structure can allocate, the
2787maximum number of elements is specified when Receive History is
2788initialized. In the unlikely case that that number is reached, new
2789elements will push out the elements at the tail of the linked list.
2790
2791Memory Layout
2792-------------
2793
2794In memory, the linked list elements are stored in an array. Placing them
2795into contiguous memory achieves three goals:
2796
2797-  Receive history manipulation is fast because the elements are all
2798   close together.
2799
2800-  Memory usage is reduced because each element does not use pointers to
2801   other memory locations.
2802
2803-  Memory fragmentation is reduced.
2804
2805The array grows as necessary as the number of elements increases.
2806
2807The elements are allocated from and returned to the array with the aid
2808of an auxiliary data structure. An array of bitmasks is kept where each
2809bit corresponds to an array element. A set bit means that the element is
2810allocated; a cleared bit indicates that the corresponding element is
2811free.
2812
2813To take memory savings and speed further, the element array and the
2814array of bitmasks are allocated in a single span of memory.
2815
2816.. image:: rechist-memory-layout.png
2817
2818rechist_elem
2819------------
2820
2821``re_low`` and ``re_count`` define the packet range. To save memory, we
2822assume that the range will not contain more than 4 billion entries and
2823use a four-byte integer instead of a second ``lsquic_packno_t``.
2824
2825``re_next`` is the index of the next element. Again, we assume that there
2826will be no more than 4 billion elements. The NULL pointer is represented
2827by ``UINT_MAX``.
2828
2829This struct is just 16 bytes in size, which is a nice number.
2830
2831lsquic_rechist
2832--------------
2833
2834``rh_elems`` and ``rh_masks`` are the element array and the bitmask array,
2835respectively, as described above. The two use the same memory chunk.
2836
2837``rh_head`` is the index of the first element of the linked list.
2838
2839The iterator state, ``rh_iter``, is embedded into the main object itself,
2840as there is no expectation that more than one iterations will need to be
2841active at once.
2842
2843.. _notable-code-5:
2844
2845Notable Code
2846============
2847
2848Inserting Elements
2849------------------
2850
2851Elements may be inserted into the list when a new packet number is added
2852to history via ``lsquic_rechist_received()``. If the new packet number
2853requires a new range (e.g. it does not expand one of the existing
2854ranges), a new element is allocated and linked.
2855
2856There are four cases to consider:
2857
28581. Inserting the new element at the head of the list, with it becoming
2859   the new head. (This is the most common scenario.) The code that
2860   does it is labeled ``first_elem``.
2861
28622. Appending the new element to the list, with it becoming the new tail.
2863   This code is located right after the ``while`` loop.
2864
28653. Inserting the new element between two existing elements. This code is
2866   labeled ``insert_before``.
2867
28684. Like (3), but when the insertion is between the last and the
2869   next-to-last elements and the maximum number of elements has been
2870   reached. In this case, the last element's packet number
2871   information can simply be replaced. This code is labeled
2872   ``replace_last_el``.
2873
2874Growing the Array
2875-----------------
2876
2877When all allocated elements in ``rh_elems`` are in use
2878(``rh_n_used >= rh_n_alloced``), the element array needs to be expanded.
2879This is handled by the function ``rechist_grow``.
2880
2881Note how, after realloc, the bitmask array is moved to its new location
2882on the right side of the array.
2883
2884Handling Element Overflow
2885-------------------------
2886
2887When the array has grown to its maximum allowed size, allocating a new
2888element occurs via reusing the last element on the list, effectively
2889pushing it out. This happens in ``rechist_reuse_last_elem``.
2890
2891The first loop finds the last linked list element: that's the element
2892whose ``re_next`` is equal to ``UINT_MAX``.
2893
2894Then, the second loop finds the element that points to the last element.
2895This is the next-to-last (penultimate) element. This element's next
2896pointer will now be set to NULL, effectively dropping the last element,
2897which can now be reused.
2898
2899Iterating Over Ranges
2900---------------------
2901
2902Iteration is performed by the ``lsquic_rechist_first`` and
2903``lsquic_rechist_next`` pair of functions. The former resets the internal
2904iterator. Only one iteration at a time is possible.
2905
2906These functions have a specific signature: they and the pointer to the
2907receive history are passed to the ``pf_gen_ack_frame`` function, which
2908generates an ACK frame.
2909
2910Clone Functionality
2911-------------------
2912
2913The Receive History can be initialized from another instance of a
2914receive history. This is done by ``lsquic_rechist_copy_ranges``. This
2915functionality is used during connection promotion, when `Tiny Receive
2916History <#tiny-receive-history>`__ that is used by the `IETF mini
2917connection <#mini-ietf-connection>`__ is converted to Receive History.
2918
2919Tiny Receive History
2920********************
2921
2922*Files: lsquic_trechist.h, lsquic_trechist.c, test_trechist.c*
2923
2924.. _overview-7:
2925
2926Overview
2927========
2928
2929The Tiny Receive History is similar to `Receive
2930History <#receive-history>`__, but it is even more frugal with memory.
2931It is used in the `IETF mini connection <#mini-ietf-connection>`__ as a
2932more costly `alternative to using bare bitmasks <#imc-recvd-packnos>`__.
2933
2934Because it is so similar to Receive History, only differences are
2935covered in this section.
2936
2937Less Memory
2938===========
2939
2940No Trechist Type
2941----------------
2942
2943There is no ``lsquic_trechist``. The history is just a single 32-bit
2944bitmask and a pointer to the array of elements. The bitmask and the
2945pointer are passed to all ``lsquic_trechist_*`` functions.
2946
2947This gives the user of Tiny Receive History some flexibility and saves
2948memory.
2949
2950Element
2951-------
2952
2953The linked list element, ``trechist_elem``, is just 6 bytes in size. The
2954assumptions are:
2955
2956-  No packet number is larger than 2\ :sup:`32` - 1
2957
2958-  No packet range contains more than 255 packets
2959
2960-  Linked list is limited to 256 elements
2961
2962Head Does Not Move
2963==================
2964
2965Because of memory optimizations described above, the head element is
2966always at index 0. The NULL pointer ``te_next`` is indicated by the value
29670 (because nothing points to the first element).
2968
2969Array Does Not Grow
2970===================
2971
2972The size of the element array is limited by the 32-bit bitmask. As a
2973further optimization, the number of ranges is limited to 16 via the
2974``TRECHIST_MAX_RANGES`` macro.
2975
2976Insertion Range Check
2977=====================
2978
2979A packet range spanning more than 255 (UCHAR_MAX) packets cannot be
2980represented. This will cause a failure, as it is checked for in the
2981code.
2982
2983This many packets are unlikely to even be required to complete the
2984handshake. If this limit is hit, it is perhaps good to abort the mini
2985connection.
2986
2987Set64
2988*****
2989
2990Appendix A: List of Data Structures
2991***********************************
2992
2993The majority of data structures employed by lsquic are linked lists and,
2994to a lesser extent, arrays. This makes the code simple and fast
2995(assuming a smart memory layout).
2996
2997Nevertheless, a few places in the code called for more involved and, at
2998times, customized data structures. This appendix catalogues them.
2999
3000This is the list of non-trivial data structures implemented in lsquic:
3001
3002Ring Buffer Linked Lists
3003========================
3004
3005-  `Receive History <#receive-history>`__
3006
3007-  `Tiny Receive History <#tiny-receive-history>`__
3008
3009Hashes
3010======
3011
3012-  lsquic_hash
3013
3014-  hash_data_in
3015
3016Min-heaps
3017=========
3018
3019-  `Advisory Tick Time Queue <#advisory-tick-time-queue>`__
3020
3021-  lsquic_min_heap
3022
3023Bloom Filters
3024=============
3025
3026-  CID Purgatory
3027
3028
3029Appendix B: Obsolete and Defunct Code
3030*************************************
3031
3032Mem Used
3033========
3034
3035Engine History
3036==============
3037
3038QLOG
3039====
3040
3041
3042.. [1]
3043   This is due to the limitation of the QPACK library: the decoder can
3044   read input one byte at a time, but the encoder cannot write output
3045   one byte at a time. It could be made to do that, but the effort is
3046   not worth it.
3047
3048.. [2]
3049   Mini conn structs are allocated out of the `Malo
3050   Allocator <#malo-allocator>`__, which used to be limited to objects
3051   whose size is a power of two, so it was either fitting it into 128
3052   bytes or effectively doubling the mini conn size.
3053