internals.rst revision eea99896
1############ 2Library Guts 3############ 4 5.. highlight:: c 6 7Introduction 8************ 9 10 11lsquic inception dates back to the fall of 2016. Since that time, lsquic 12underwent several major changes. Some of those had to do with making the 13library more performant; others were needed to add important new 14functionality (for example, IETF QUIC and HTTP/3). Throughout this time, 15one of the main principles we embraced is that **performance trumps 16everything else**, including code readability and maintainability. This 17focus drove code design decisions again and again and it explains some 18of the hairiness that we will come across in this document. 19 20Code Version 21============ 22 23The code version under discussion is v2.29.6. 24 25High-Level Structure 26******************** 27 28At a high level, the lsquic library can be used to instantiate an engine 29(or several engines). An engine manages connections; each connection has 30streams. Engine, connection, and stream objects are exposed to the user 31who interacts with them using the API (see :doc:`apiref`). All other data 32structures are internal and are hanging off, in one way or another, from 33the engine, connection, or stream objects. 34 35Coding Style 36************ 37 38Spacing and Cuddling 39==================== 40 41lsquic follows the LiteSpeed spacing and cuddling conventions: 42 43- Two empty lines between function definitions 44 45- Four-space indentation 46 47- Ifs and elses are not cuddled 48 49Function Name Alignment 50======================= 51 52In function definitions, the name is always left-aligned, for example: 53 54:: 55 56 static void 57 check_flush_threshold (lsquic_stream_t *stream) 58 59 60Naming Conventions 61================== 62 63- Struct members usually have prefixes derived from the struct name. 64 For example members of ``struct qpack_dec_hdl`` begin with ``qdh_``, 65 members of ``struct cid_update_batch`` begin with ``cub_``, and so on. 66 This is done to reduce the need to memorize struct member names as 67 vim's autocomplete (Ctrl-P) functionality makes it easy to fill in 68 the needed identifier. 69 70- Non-static functions all begin with ``lsquic_``. 71 72- Functions usually begin with a module name (e.g. ``lsquic_engine_`` or 73 ``stream_``) which is then followed by a verb (e.g. 74 ``lsquic_engine_connect`` or ``stream_activate_hq_frame``). If a 75 function does not begin with a module name, it begins with a verb 76 (e.g. ``check_flush_threshold`` or ``maybe_remove_from_write_q``). 77 78- Underscores are used to separate words (as opposed to, for example, 79 theCamelCase). 80 81Typedefs 82======== 83 84Outside of user-facing API, structs, unions, and enums are not 85typedefed. On the other hand, some integral types are typedefed. 86 87List of Common Terms 88******************** 89 90- **gQUIC** Google QUIC. The original lsquic supported only Google 91 QUIC. gQUIC is going to become obsolete. (Hopefully soon). 92 93- **HQ** This stands for "HTTP-over-QUIC", the original name of HTTP/3. 94 The code predates the official renaming to HTTP/3 and thus there 95 are many types and names with some variation of ``HQ`` in them. 96 97- **iQUIC** This stands for IETF QUIC. To differentiate between gQUIC 98 and IETF QUIC, we use ``iquic`` in some names and types. 99 100Engine 101****** 102 103*Files: lsquic_engine.c, lsquic_engine_public.h, lsquic.h* 104 105Data Structures 106=============== 107 108out_batch 109--------- 110 111:: 112 113 /* The batch of outgoing packets grows and shrinks dynamically */ 114 /* Batch sizes do not have to be powers of two */ 115 #define MAX_OUT_BATCH_SIZE 1024 116 #define MIN_OUT_BATCH_SIZE 4 117 #define INITIAL_OUT_BATCH_SIZE 32 118 119 struct out_batch 120 { 121 lsquic_conn_t *conns [MAX_OUT_BATCH_SIZE]; 122 struct lsquic_out_spec outs [MAX_OUT_BATCH_SIZE]; 123 unsigned pack_off[MAX_OUT_BATCH_SIZE]; 124 lsquic_packet_out_t *packets[MAX_OUT_BATCH_SIZE * 2]; 125 struct iovec iov [MAX_OUT_BATCH_SIZE * 2]; 126 }; 127 128The array of struct lsquic_out_specs -- outs above -- is what gets 129passed to the user callback ``ea_packets_out()``. ``conns`` array corresponds 130to the spec elements one to one. 131 132``pack_off`` records which packet in ``packets`` corresponds to which 133connection in ``conns``. Because of coalescing, an element in ``outs`` can 134correspond (logically) to more than one packet. (See how the batch is 135constructed in `Batching packets`_.) On the 136other hand, ``packets`` and ``iov`` arrays have one-to-one correspondence. 137 138There is one instance of this structure per engine: the whole thing is 139allocated as part of `struct lsquic_engine <#lsquic-engine>`__. 140 141 142cid_update_batch 143---------------- 144 145:: 146 147 struct cid_update_batch 148 { 149 lsquic_cids_update_f cub_update_cids; 150 void *cub_update_ctx; 151 unsigned cub_count; 152 lsquic_cid_t cub_cids[20]; 153 void *cub_peer_ctxs[20]; 154 }; 155 156This struct is used to batch CID updates. 157 158There are three user-defined CID liveness callbacks: ``ea_new_scids``, 159``ea_live_scids``, and ``ea_old_scids``. These functions all have the same 160signature, ``lsquic_cids_update_f``. When the batch reaches the count of 16120 (kept in ``cub_count``), the callback is called. 162 163The new SCIDs batch is kept in `struct 164lsquic_engine <#lsquic-engine>`__. Other batches are allocated on the 165stack in different functions as necessary. 166 16720 is an arbitrary number. 168 169lsquic_engine_public 170-------------------- 171 172This struct, defined in lsquic_engine_public.h, is the "public" 173interface to the engine. ("Public" here means accessible by other 174modules inside lsquic, not that it's a public interface like the 175:doc:`apiref`.) Because there are many things in the engine object that 176are accessed by other modules, this struct is used to expose those 177(``public``) parts of the engine. 178 179``lsquic_engine_struct`` is the first member of 180`lsquic_engine <#lsquic-engine>`__. The functions declared in 181lsquic_engine_public.h take a pointer to lsquic_engine_public as the 182first argument, which is then case to lsquic_engine. 183 184This is somewhat ugly, but it's not too bad, as long as one remembers 185that the two pointers are interchangeable. 186 187lsquic_engine 188------------- 189 190This is the central data structure. The engine instance is the root of 191all other data structures. It contains: 192 193- Pointers to connections in several lists and hashes (see `Connection Management <#connection-management>`__) 194 195- Memory manager 196 197- Engine settings 198 199- Token generator 200 201- CID Purgatory 202 203- Server certificate cache 204 205- Transport parameter cache 206 207- Packet request queue 208 209- `Outgoing packet batch <#out-batch>`__ 210 211- And several other things 212 213Some of the members above are stored in the ``pub`` member of type 214`lsquic_engine_public <#lsquic-engine-public>`__. These are accessed 215directly from other parts of lsquic. 216 217The engine is instantiated via ``lsquic_engine_new()`` and destroyed via 218``lsquic_engine_destroy()`` 219 220Connection Management 221===================== 222 223Lifetime 224-------- 225 226There are several `connection types`_. All types of 227connections begin their life inside the engine module, where their 228constructors are called. They all also end their life here as well: this 229is where the destructors are called. 230 231The connection constructors are all different function calls: 232 233- lsquic_ietf_full_conn_client_new 234 235- lsquic_gquic_full_conn_client_new 236 237- lsquic_ietf_full_conn_server_new 238 239- lsquic_gquic_full_conn_server_new 240 241- lsquic_mini_conn_ietf_new 242 243- lsquic_mini_conn_new 244 245- lsquic_prq_new_req 246 247- lsquic_prq_new_req_ext 248 249(See `Evanescent Connection`_ for information about the last two.) 250 251After a connection is instantiated, all further interactions with it, 252including destruction, are done via the `Common Connection Interface`_. 253 254Refcounting Model 255----------------- 256 257Each connection is referenced by at least one of the following data 258structures: 259 2601. CID-to-connection hash. This hash is used to find connections in 261 order to dispatch an incoming packet. Connections can be hashed by 262 CIDs or by address. In the former case, each connection has one or 263 more mappings in the hash table. IETF QUIC connections have up to 264 eight (in our implementation) source CIDs and each of those would 265 have a mapping. In client mode, depending on QUIC versions and 266 options selected, it is may be necessary to hash connections by 267 address, in which case incoming packets are delivered to 268 connections based on the address. 269 2702. Outgoing queue. This queue holds connections that have packets to 271 send. 272 2733. `Tickable Queue`_. This queue holds connections 274 that `can be ticked now <#tickability>`__. 275 2764. `Advisory Tick Time Queue`_. 277 2785. Closing connections queue. This is a transient queue -- it only 279 exists for the duration of 280 `process_connections() <#processing-connections>`__ function call. 281 2826. Ticked connections queue. Another transient queue, similar to the 283 above. 284 285The idea is to destroy the connection when it is no longer referenced. 286For example, a connection tick may return TICK_SEND|TICK_CLOSE. In that 287case, the connection is referenced from two places: (2) and (5). After 288its packets are sent, it is only referenced in (5), and at the end of 289the function call, when it is removed from (5), reference count goes to 290zero and the connection is destroyed. (See function ``destroy_conn``.) If 291not all packets can be sent, at the end of the function call, the 292connection is referenced by (2) and will only be removed once all 293outgoing packets have been sent. 294 295.. image:: lsquic-engine-conns.png 296 297In the diagram above, you can see that the CID-to-connection hash has 298several links to the same connection. This is because an IETF QUIC 299connection has more than one Source Connection IDs (SCIDs), any of which 300can be included by the peer into the packet. See ``insert_conn_into_hash`` 301for more details. 302 303References from each of these data structures are tracked inside the 304connection object by bit flags: 305 306:: 307 308 #define CONN_REF_FLAGS (LSCONN_HASHED \ 309 |LSCONN_HAS_OUTGOING \ 310 |LSCONN_TICKABLE \ 311 |LSCONN_TICKED \ 312 |LSCONN_CLOSING \ 313 |LSCONN_ATTQ) 314 315Functions ``engine_incref_conn`` and ``engine_decref_conn`` manage setting 316and unsetting of these flags. 317 318 319Notable Code 320============ 321 322Handling incoming packets 323------------------------- 324 325Incoming UDP datagrams are handed off to the lsquic library using the 326function ``lsquic_engine_packet_in``. Depending on the engine mode -- 327client or server -- the appropriate `packet 328parsing <#parsing-packets>`__ function is selected. 329 330Because a UDP datagram can contain more than one QUIC packet, the 331parsing is done in a loop. If the first part of packet parsing is 332successful, the internal ``process_packet_in`` function is called. 333 334There, most complexity is contained in ``find_or_create_conn``, which gets 335called for the server side. Here, parsing of the packet is finished, now 336via the version-specific call to ``pf_parse_packet_in_finish``. If 337connection is not found, it may need to be created. Before that, the 338following steps are performed: 339 340- Check that engine is not in the cooldown mode 341 342- Check that the maximum number of mini connections is not exceeded 343 344- Check that the (D)CID specified in the packet is not in the `CID Purgatory`_ 345 346- Check that the packet can be used to create a mini conn: it contains 347 version information and the version is supported 348 349- Depending on QUIC version, perform token verification, if necessary 350 351Only then does the mini connection constructor is called and the 352connection is inserted into appropriate structures. 353 354Processing connections 355---------------------- 356 357Connections are processed in the internal function 358``process_connections``. There is the main connection processing loop and 359logic. 360 361All connections that the iterator passed to this function returns are 362processed in the first while loop. The ``ci_tick()`` call is what causes 363the underlying connection to do all it needs to (most importantly, 364dispatch user events and generate outgoing packets). The return value 365dictates what lists -- global and local to the function -- the 366connection will be placed upon. 367 368Note that mini connection promotion happens inside this loop. Newly 369created full connections are processed inside the same while loop. For a 370short time, a mini and a full connection object exist that are 371associated with the same logical connection. 372 373After all connections are ticked, outgoing packets, if there are any, 374`are sent out <#batching-packets>`__. 375 376Then, connections that were closed by the first while loop above are 377finally closed. 378 379Connections that were ticked (and not closed) are either: 380 381- Put back onto the ``tickable`` queue; 382 383- Added to the `Advisory Tick Time Queue`_; or 384 385- Left unqueued. This can happen when both idle and ping timer are 386 turned off. (This should not happen for the connections that we 387 expect to process, though.) 388 389And lastly, CID liveness updates are reported to the user via the 390optional SCIDs callbacks: ``ea_new_scids`` etc. 391 392Tickable Queue Cycle 393-------------------- 394 395When a connection is ticked, it is removed from the `Tickable 396Queue <#tickable-queue>`__ and placed onto the transient Ticked Queue. 397After outgoing packets are sent and some connections are closed, the 398Ticked Queue is examined: the engine queries each remaining connection 399again whether it's tickable. If it is, back onto the Tickable Queue it 400goes. This should not happen often, however. It may occur when RTT is 401low and there are many connections to process. In that case, once all 402connections have been processed, the pacer now allows to send another 403packet because some time has passed. 404 405Batching packets 406---------------- 407 408Packet-sending entry point is the function ``send_packets_out``. The main 409idea here is as follows: 410 411Iterate over connections that have packets to send (those are on the 412Outgoing queue in the engine). For each connection, ask it for the next 413outgoing packet, encrypt it, and place it into the batch. When the batch 414is full, `send the batch <#sending-a-batch>`__. 415 416The outgoing packets from all connections are interleaved. For example, 417if connections A, B, and C are on the Outgoing queue, the batch will 418contain packets A1, B1, C1, A2, B2, C2, A3, B3, C3, ��� and so on. This is 419done to ensure fairness. When a connection runs out of packets to send, 420it returns NULL and is removed from the iterator. 421 422The idea is simple, but the devil is in the details. The code may be 423difficult to read. There are several things going on: 424 425Conns Out Iterator 426^^^^^^^^^^^^^^^^^^ 427 428This iterator, ``conns_out_iter``, sends packets from connections on the 429Outgoing queue and packets on the Packet Request queue. (The latter 430masquerade as `Evanescent Connections <#evanescent-connection>`__ so that they 431are simple to use.) First, the Outgoing queue (which is a min-heap) is 432drained. Then, packets from the Packet Request queue are sent, if there 433are any. Then, remaining connections from the first pass are returned in 434the round-robin fashion. 435 436After sending is completed, the connections that still have outgoing 437packets to send are placed back onto the Outgoing queue. 438 439 440Packet Coalescing 441^^^^^^^^^^^^^^^^^ 442 443Some IETF QUIC packets can be coalesced. This reduces the number of UDP 444datagrams that need to be sent during the handshake. To support this, if 445a packet matches some parameters, the same connection is queried for 446another packet, which, if it returns, is added to the current batch 447slot's iov. 448 449:: 450 451 if ((conn->cn_flags & LSCONN_IETF) 452 && ((1 << packet_out->po_header_type) 453 & ((1 << HETY_INITIAL)|(1 << HETY_HANDSHAKE)|(1 << HETY_0RTT))) 454 && (engine->flags & ENG_COALESCE) 455 && iov < batch->iov + sizeof(batch->iov) / sizeof(batch->iov[0])) 456 { 457 const struct to_coal to_coal = { 458 .prev_packet = packet_out, 459 .prev_sz_sum = iov_size(packet_iov, iov), 460 }; 461 packet_out = conn->cn_if->ci_next_packet_to_send(conn, &to_coal); 462 if (packet_out) 463 goto next_coa; 464 } 465 batch->outs [n].iovlen = iov - packet_iov; 466 467*With some debug code removed for simplicity* 468 469Also see the description of the batch in `out_batch`_. 470 471Note that packet coalescing is only done during the handshake of an IETF 472QUIC connection. Non-handshake and gQUIC packets cannot be coalesced. 473 474Sending and Refilling the Batch 475^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 476 477When the batch is sent inside the while loop, and the whole batch was 478sent successfully, the batch pointers are reset, the batch potentially 479grows larger, and the while loop continues. 480 481Batch Resizing 482^^^^^^^^^^^^^^ 483 484When all datagrams in the batch are sent successfully, the batch may 485grow -- up to the hardcoded maximum value of ``MAX_OUT_BATCH_SIZE``. When 486not all datagrams are sent, the batch shrinks. The batch size survives 487the call into the library: when packets are sent again, the same batch 488size is used to begin the sending. 489 490Deadline Checking 491^^^^^^^^^^^^^^^^^ 492 493This is a rather old safety check dating back to the summer of 2017, 494when we first shipped QUIC support. The way we send packets has changed 495since then -- there is high possibility that this code can be removed 496with no ill effect. 497 498Sending a batch 499--------------- 500 501When the batch is filled, it is handed off to the function ``send_batch``, 502which calls the user-supplied callback to send packets out. The 503high-level logic is as follows: 504 505- Update each packet's ``sent`` time 506 507- Call the "send packets out" callback 508 509- For packets that were sent successfully, call ``ci_packet_sent`` 510 511- For packets that were not sent, call ``ci_packet_not_sent``. This is 512 important: all packets returned by ``ci_next_packet_to_send`` must 513 be returned to the connection via either these two calls above or 514 via ``ci_packet_too_large`` (see below). 515 516- Return the number of packets sent 517 518Because of support for coalescing, we have to map from outgoing spec to 519packets via ``batch->pack_off``. This is done in several places in this 520function. 521 522To handle the case when a PMTU probe is too large (stuff happens!), the 523code checks for EMSGSIZE and returns the packet back to the connection 524via ``ci_packet_too_large``. Because this error is of our own making, this 525does not count as inability to send. The too-large packet is skipped and 526sending of the datagrams in the batch continues. 527 528Growing min-heaps 529----------------- 530 531The Outgoing and Tickable connection queues are actually min-heaps. The 532number of elements in these min-heaps never exceeds the number of 533connections. As optimization, allocation of the underlying arrays is 534done not in the min-heap module itself but in the engine module in the 535function ``maybe_grow_conn_heaps``. The engine knows how many connections 536there are and it grows the arrays as necessary. 537 538As an additional optimization, the two arrays use a single memory region 539which is allocated once. 540 541The min-heap arrays are never shrunk. 542 543Connection 544********** 545 546*Files: lsquic_conn.h, lsquic_conn.c -- others are covered in dedicated 547chapters* 548 549The connection represents the QUIC connection. Connections are `managed 550by the engine <#connection-management>`__. A connection, in turn, 551manages `streams <#stream>`__. 552 553Connection Types 554================ 555 556lsquic supports two different QUIC protocols: Google QUIC and IETF QUIC. 557Each of these has a separate implementation, which includes connection 558logic, parsing/generating mechanism, and encryption. 559 560Each of the QUIC connection types on the server begin their life as a 561``mini`` connection. This connection type is used while handshake is 562proceeding. Once the handshake has completed, the mini connection is 563``promoted`` to a ``full`` connection. (See `Mini vs Full 564Connection <#mini-vs-full-connections>`__ for more.) 565 566In addition to the above, an "evanescent" connection type is used to 567manage replies to incoming packets that do not result in connection 568creation. These include version negotiation, stateless retry, and 569stateless reset packets. 570 571Each of the five connection types above are covered in their own 572dedicated chapters elsewhere in this document: 573 574- `Mini gQUIC Connection <#mini-gquic-connection>`__ 575 576- `Full gQUIC Connection <#connection-public-interface>`__ 577 578- `Mini IETF QUIC Connection <#mini-ietf-connection>`__ 579 580- `Full IETF QUIC Connection <#mini-ietf-connection>`__ 581 582- `Evanescent Connection <#evanescent-connection>`__ 583 584lsquic_conn 585=========== 586 587All connection types expose the same connection interface via a pointer 588to ``struct lsquic_conn``. (This is the same type pointer to which is 589exposed to the user, but the user can only treat the connection as an 590opaque pointer.) 591 592This structure contains the following elements: 593 594Pointers to Crypto Implementation 595--------------------------------- 596 597The crypto session pointer, ``cn_enc_session``, points to a type-specific 598(gQUIC or iQUIC) instance of the encryption session. This session 599survives `connection promotion <#connection-promotion>`__. 600 601The two types of crypto session have a set of common functionality; it 602is pointed to by ``cn_esf_c`` (where ``c`` stands for ``common``). Each of 603them also has its own, type-specific functionality, which is pointed to 604by ``cn_esf.g`` and ``cn_esf.i`` 605 606Pointer to Common Connection Interface 607-------------------------------------- 608 609``cn_if`` points to the set of functions that implement the Common 610Connection Interface (`see below <#common-connection-interface>`__). 611 612Pointer to Parsing Interface 613---------------------------- 614 615The parsing interface is version-specific. It is pointed to by ``cn_pf``. 616 617Various list and heap connectors 618-------------------------------- 619 620A connection may be pointed to by one or several queues and heaps (see 621"\ `Connection Management <#connection-management>`__\ "). There are 622several struct members that make it possible: all the \*TAILQ_ENTRYs, 623``cn_attq_elem``, and ``cn_cert_susp_head``. 624 625Version 626------- 627 628``cn_version`` is used to make some decisions in several parts of the 629code. 630 631Flags 632----- 633 634The flags in ``cn_flags`` specify which lists the connection is on and 635some other properties of the connection which need to be accessible by 636other modules. 637 638Stats 639----- 640 641``cn_last_sent`` and ``cn_last_ticked`` are used to determine the 642connection's place on the outgoing queue (see `Batching 643Packets <#batching-packets>`__) and on the `Advisory Tick Time 644Queue <#alarm-set>`__. 645 646List of SCIDs 647------------- 648 649IETF QUIC connections have one or more SCIDs (Source Connection IDs), 650any one of which can be used by the peer as the DCID (Destination CID) 651in the packets it sends. Each of the SCIDs is used to hash the 652connection so it can be found. ``cn_cces`` points to an array of size 653``cn_n_cces`` which is allocated internally inside each connection type. 654 655Google QUIC connections use only one CID (same for source and 656destination). In order not to modify old code, the macro ``cn_cid`` is 657used. 658 659Common Connection Interface 660=========================== 661 662The struct ``conn_iface`` defines the common connection interface. All 663connection types implement all or some of these functions. 664 665Some of these functions are used by the engine; others by other modules 666(for example, to abort a connection); yet others are for use by the 667user, e.g. ``lsquic_conn_close`` and others in lsquic.h. In that case, 668these calls are wrapped in lsquic_conn.c. 669 670Tickability 671=========== 672 673A connection is processed when it is tickable. More precisely, the 674connection is placed onto the `Tickable Queue <#tickable-queue>`__, 675which is iterated over when `connections are 676processed <#processing-connections>`__. A connection reports its own 677tickability via the ``ci_is_tickable`` method. 678 679In general, a connection is tickable if it has productive user callbacks 680to dispatch (that is, user wants to read and there is data to read or 681user wants to write and writing is possible), if there are packets to 682send or generate, or if its advisory tick time is in the past. (The 683latter is handled in ``lsquic_engine_process_conns()`` when expired 684connections from the `Advisory Tick Time Queue`_ are added 685to the Tickable Queue.) 686 687Stream 688****** 689 690*Files: lsquic_stream.h, lsquic_stream.c* 691 692Overview 693======== 694 695The lsquic stream is the conduit for data. This object is accessible by 696the user via any of the ``lsquic_stream_*`` functions declared in 697lsquic.h. The stream is bidirectional; in our user code, it represents 698the HTTP request and response. The client writes its request to the 699stream and the server reads the request in its corresponding instance of 700the stream. The server sends its response using the same stream, which 701the client reads from the stream. 702 703Besides streams exposed to the application, connections use streams 704internally: 705 706- gQUIC has the HANDSHAKE and HEADERS streams 707 708- IETF QUIC has up to four HANDSHAKE streams 709 710- HTTP/3 has at least three unidirectional streams: 711 712 - Settings stream 713 714 - QPACK encoder stream 715 716 - QPACK decoder stream 717 718In addition, HTTP/3 push promises use unidirectional streams. In the 719code, we make a unidirectional stream simply by closing one end in the 720constructor. 721 722All of the use cases above are handled by the single module, 723lsquic_stream. The differences in behavior -- gQUIC vs IETF QUIC, HTTP 724vs non-HTTP -- are handled either by explicit conditionals or via 725function pointers. 726 727The streams hang off full connections via stream ID-to-stream hashes and 728in various queues. This is similar to the way the connections hang off 729the engine. 730 731Streams are only used in the full connections; mini connections use 732their own, minimalistic, code to handle streams. 733 734.. _data-structures-1: 735 736Data Structures 737=============== 738 739stream_hq_frame 740--------------- 741 742This structure is used to keep information about an HTTP/3 frame that is 743being, or is about to be, written. In our implementation, frame headers 744can be two or three bytes long: one byte is HTTP/3 frame type and the 745frame length is encoded in 1 or 2 bytes, giving us the maximum payload 746size of 2\ :sup:`14` - 1 bytes. You will find literal ``2`` or ``3`` values 747in code that deals with writing HQ frames. 748 749If the HQ frame's size is known in advance (SHF_FIXED_SIZE) -- which is 750the case for HEADERS and PUSH_PROMISE frames -- then the HQ header 751contents are written immediately. Otherwise, ``shf_frame_ptr`` points to 752the bytes in the packet where the HQ header was written, to be filled in 753later. 754 755See `Writing HTTP/3 Streams`_ for more information. 756 757hq_filter 758--------- 759 760This structure is used to read HTTP/3 streams. A single instance of it 761is stored in the stream in ``sm_hq_filter``. The framing is removed 762transparently (see `Reading HTTP/3 Streams`_). 763 764Frame type and length are read into ``hqfi_vint2_state``. Due to greasing, 765the reader must be able to support arbitrary frame types and so the code 766is pretty generic: varints of any size are supported. 767 768``hqfi_flags`` and ``hqfi_state`` contain information needed to resume 769parsing the frame header, as only partial data may have arrived. 770 771``hqfi_hist_buf`` and ``hqfi_hist_idx`` are used to record the last few 772incoming headers. This information is used to check for validity, as 773some sequences of HTTP/3 frames are invalid. 774 775stream_filter_if 776---------------- 777 778This struct is used to specify functionality required to strip arbitrary 779framing when reading from the stream. At the moment (and for the 780foreseeable future) only one mechanism is used: that to strip the HTTP/3 781framing. At the time the code was written, however, the idea was to 782future-proof it in case we needed to support more than one framing format 783at a time. 784 785lsquic_stream 786------------- 787 788This struct is the stream object. It contains many members that deal 789with 790 791- Reading data 792 793- Writing data 794 795- Maintaining stream list memberships 796 797- Enforcing flow control 798 799- Dispatching read and write events 800 801- Calling various user callbacks 802 803- Interacting with HEADERS streams 804 805The stream has an ID (``id``). It is used to hash the stream. 806 807A stream can be on one or more lists: see ``next_send_stream``, 808``next_read_stream``, and so on. 809 810Incoming data is stored in ``data_in``. Outgoing data is packetized 811immediately or buffered in ``sm_buf``. 812 813HTTP/3 frames that are being actively written are on the ``sm_hq_frames`` 814list. 815 816A note on naming: newer members of the stream begin with ``sm_`` for 817simplicity. Originally, the structure members lacked a prefix. 818 819progress 820-------- 821 822This structure is used to determine whether the user callback has made 823any progress during an ``on_write`` or ``on_read`` event loop. If progress 824is not made for a number of calls, the callback is interrupted, breaking 825out of a suspected infinite loop. (See ``es_progress_check`` setting.) 826 827 828frame_gen_ctx 829------------- 830 831This structure holds function pointers to get user data and write it to 832packets. ``fgc_size``, ``fgc_fin``, and ``fgc_read`` are set based on framing 833requirements. This is a nice abstraction that gets passed to several 834packetization functions and allows them not to care about how or whether 835framing is performed. 836 837pwritev_ctx 838----------- 839 840Used to aid ``lsquic_stream_pwritev``. ``hq_arr`` is used to roll back 841HTTP/3 framing if necessary. (The rollback is the most complicated part 842of the ``pwritev`` functionality). 843 844Event Dispatch 845============== 846 847The "on stream read" and "on stream write" callbacks are part of the 848lsquic API. These callbacks are called when the user has registered 849interest in reading from or writing to the stream and reading or writing 850is possible. 851 852Calling ``lsquic_stream_wantwrite`` and ``lsquic_stream_wantread`` places 853the stream on the corresponding "want to write" and "want to read" list. 854These lists are processed by a connection when it's ticked. For each 855stream on the list, the internal function 856``lsquic_stream_dispatch_read_events`` or 857``lsquic_stream_dispatch_write_events``, whichever may be the case. 858 859Dispatching read events is simple. When ``es_rw_once`` is set, the "on 860stream read" callback is called once -- if the stream is readable. 861Otherwise, the callback is called in a loop as long as: 862 863- The stream is readable; 864 865- The user wants to read from it; and 866 867- Progress is being made 868 869Dispatching write events is more complicated due to the following 870factors: 871 872- In addition to calling the "on stream write" callback, the flushing 873 mechanism also works by using the "want to write" list. 874 875- When writing occurs, the stream's position on the list may change 876 877STREAM frames in 878================ 879 880The data gets in from the transport into the stream via 881``lsquic_stream_frame_in`` function. The connection calls this function 882after parsing a STREAM frame. 883 884The data from the STREAM frame is stored in one of the two "data in" 885modules: ``di_nocopy`` and ``di_hash``. The two are abstracted out behind 886``stream->data_in``. 887 888The "data in" module is used to store incoming stream data. The data is 889read from this module using the ``di_get_frame`` function. See the next 890section. 891 892Reading Data 893============ 894 895There are three user-facing stream-reading functions; two of them are 896just wrappers around ``"lsquic_stream_readf``. This function performs some 897checks (we will cover HTTP mode separately) and calls 898``lsquic_stream_readf``, which also performs some checks and calls 899``read_data_frames``. This is the only function in the stream module where 900data is actually read from the "data in" module. 901 902Writing Data 903============ 904 905There are four user-facing functions to write to stream, and all of them 906are wrappers around ``stream_write``. (``lsquic_stream_pwritev`` is a bit 907more involved than the other three, but it's pretty well-commented -- 908and the complexity is in the rollback, not writing itself.) 909 910Small writes get buffered. If the write size plus whatever is buffered 911already exceeds the threshold -- which is the size of the largest STREAM 912frame that could be fit into a single outgoing packet -- the data is 913packetized instead by calling ``stream_write_to_packets``. See the next 914section. 915 916Packetization 917============= 918 919``stream_write_to_packets`` is the only function through which user data 920makes it into outgoing packets. There are three ways to write STREAM 921frames: 922 9231. ``stream_write_to_packet_hsk`` 924 9252. ``stream_write_to_packet_std`` 926 9273. ``stream_write_to_packet_crypto`` 928 929The particular function is selected based on connection and stream type 930when the stream is first created. 931 932stream_write_to_packets 933----------------------- 934 935Depending on the need to frame data, a reader is selected. The job of 936the reader is to copy user data into the outgoing STREAM frame. In 937HTTP/3 mode, HTTP/3 framing is added transparently -- see `Writing 938HTTP/3 Streams`_ for more information. 939 940The while loop is entered if there is user data to be copied or if the 941end of the stream has been reached and FIN needs to be written. Note the 942threshold check: when writing data from a user call, the threshold is 943set and frames smaller than the full packet are not generated. This is 944to allow for usage like "write 8KB", "write 8KB", "write 8KB" not to 945produce jagged STREAM frames. This way, we utilize the bandwidth most 946effectively. When flushing data, the threshold is not set, so even a 9471-byte data gets packetized. 948 949The call ``stream->sm_write_to_packet`` writes data to a single packet. 950This packet is allocated by the `Send Controller <#send-controller>`__. 951(Depending on when writing is performed, the returned packet may be 952placed onto the scheduled queue immediately or it may be a "buffered" 953packet. The stream code is oblivious to that.) If the send controller 954does not give us a packet, STOP is returned and the while loop exits. An 955ERROR should never happen -- this indicates a bug or maybe failure to 956allocate memory -- and so the connection is aborted in that case. If 957everything is OK, the while loop goes on. 958 959The ``seen_ok`` check is used to place the connection on the tickable list 960on the first successfully packetized STREAM frame. This is so that if 961the packet is buffered (meaning that the writing is occurring outside of 962the callback mechanism), the connection will be processed (ticked) and 963the packets will be scheduled and sent out. 964 965After the while loop, we conditionally close an outstanding HTTP/3 966frame, save any leftover data, schedule STREAM_BLOCKED or BLOCKED frames 967to be sent out if needed, and return the number of user-provided bytes 968that were copied into the outgoing packets and into the internal stream 969buffer (leftovers). 970 971Write a single STREAM frame 972--------------------------- 973 974We will examine ``stream_write_to_packet_std`` as it is the most 975complicated of these three functions. 976 977First, we flush the headers stream if necessary -- this is because we 978want the HTTP (gQUIC or HTTP/3) headers to be sent before the payload. 979 980Then, the number of bytes needed to generate a STREAM frame is 981calculated. This value depends on the QUIC version, whether we need to 982generate HTTP/3 framing, and whether the data to write exists (or we 983just need to write an empty STREAM frame with the FIN bit set). 984 985(Note that the framing check is made to overshoot the estimate for 986simplicity. For one, we might not need 3 bytes for the DATA frame, but 987only 2. Secondly, there may already be an open HTTP/3 frame in one of 988the previous packets and so we don't need to write it at all.) 989 990Then, a packet is allocated and ``write_stream_frame`` is called. It is in 991this function that we finally make the call to generate the STREAM frame 992and to copy the data from the user. The function ``pf_gen_stream_frame`` 993returns the number of bytes actually written to the packet: this 994includes both the STREAM frame header and the payload (which may also 995include HTTP/3 frame). 996 997The fact that this frame type has been written is added to 998``po_frame_types`` and the STREAM frame location, type, and size are 999recorded. This information is necessary to be able to elide the frame 1000from the packet in case the stream is reset. 1001 1002``PO_STREAM_END`` is set if the STREAM frame extends to the end of the 1003packet. This is done to prevent this packet from being used again to 1004append frames to it (after, for example, some preceding frames are 1005elided from it). This is because both in gQUIC and IETF QUIC the STREAM 1006frame header is likely to omit the ``length`` field and instead use the 1007"extends to the end of the packet" field. If frames are shifted, the 1008packet cannot be appended to because it will lead to data loss and 1009corruption. 1010 1011Writing HTTP/3 Streams 1012====================== 1013 1014HTTP/3 streams use framing. In most cases, a single HEADERS frame is 1015followed by zero or more DATA frames. The user code does not know this: 1016both gQUIC and IETF QUIC streams appear to behave in exactly the same 1017manner. This makes lsquic simple to use. 1018 1019The drawback is internal complexity. To make the code both easy to use 1020and performant, HTTP/3 framing is generated on-the-fly, as data is being 1021written to packets (as opposed to being buffered and then written). (OK, 1022*mostly* on-the-fly: the HEADERS frame payload is generated and then 1023copied.) 1024 1025On the high level, the way it works is as follows: 1026 1027- When a write call is made, a variable-size (that is, unknown size; 1028 it's called variable-size because the size of the DATA header may 1029 be 2 or 3 bytes; it's not the best name in the world) frame is 1030 opened/activated. 1031 1032- When data is written to stream, the DATA header placeholder bytes are 1033 written to the stream transparently and a pointer is saved to this 1034 location. 1035 1036- The active frame header is closed when 1037 1038 - It reaches its maximum size; or 1039 1040 - The data we are writing runs out. 1041 1042- When the header is closed, the number of bytes that follows is now 1043 written to the location we saved when the header was activated. 1044 1045This mechanism allows us to create a DATA frame that spans several 1046packets before we know how many packets there will be in a single write. 1047(As outgoing packet allocation is governed by the `Send Controller`_.) 1048This is done to minimize the goodput overhead incurred by the DATA frame header. 1049 1050.. image:: stream-http3-framing.png 1051 1052There are a couple of things that do not fit into this model: 1053 10541. The HEADERS frame is fixed size [1]_. It is generated separately 1055 (written by QPACK encoder into a buffer on the stack) and later 1056 copied into the stream. (See the ``send_headers_ietf`` function.) It 1057 can happen that the whole buffer cannot be written. In that case, 1058 a rather complicated dance of buffering the unwritten HEADERS 1059 frame bytes is performed. Here, the "on stream write" callback is 1060 replaced with an internal callback (see the ``select_on_write`` 1061 function) and user interaction is prohibited until the whole of 1062 the HEADERS frame is written to the stream. 1063 10642. Push promise streams are even weirder. In addition to the HEADERS 1065 handling above, the push promise stream must begin with a 1066 variable-integer Push ID. To make this fit into the framed stream 1067 model, the code makes up the concept of a "phantom" HTTP/3 frame. 1068 This type of frame's header is not written. This allows us to 1069 treat the Push ID as the payload of a regular HTTP/3 frame. 1070 1071The framing code has had its share of bugs. Because of that, there is a 1072dedicated unit test program just for the framing code, 1073*tests/test_h3_framing.c*. In addition to manually-written tests, the 1074program has a "fuzzer driver" mode, in which the American Fuzzy Lop 1075fuzzer drives the testing of the HTTP/3 framing mechanism. The advantage 1076of this approach is that AFL tries to explore all the code paths. 1077 1078 1079Debates regarding DATA framing raged in 2018 on the QUIC mailing list. 1080Some of the discussion is quite interesting: for example, the debate about 1081"optimizing" DATA frames and `calculations of the header 1082cost <https://lists.w3.org/Archives/Public/ietf-http-wg/2018OctDec/0236.html>`__. 1083 1084Reading HTTP/3 Streams 1085====================== 1086 1087HTTP/3 frame headers are stripped out transparently -- they are never 1088seen by the user. From the user's perspective, the lsquic stream 1089represents the payload of HTTP message; a dedicated call is made first 1090to get at the HTTP headers. 1091 1092To accomplish this, the stream implements a generic deframing mechanism. 1093The `stream_filter_if`_ interface allows one to 1094specify functions to a) check whether the stream is readable, b) strip 1095header bytes from a data frame fetched from "data in" module; and c) 1096update byte count in the filter once bytes have been read: 1097 1098hq_filter_readable 1099------------------ 1100 1101This function tests for availability of non-frame-header data, stripping 1102frame headers from the stream transparently. Note how it calls 1103``read_data_frames`` with its own callback, ``hq_read``. It is inside this 1104callback that the HEADERS frame is fed to the QPACK decoder. 1105 1106hq_filter_df 1107------------ 1108 1109This function's job is to strip framing from data frames returned by the 1110"data in" module inside the ``read_data_frames`` function. It, too, calls 1111the ``hq_read`` function. This allows the two functions that read from 1112stream (this one) and the readability-checking function 1113(``hq_filter_readable``) to share the same state. This is crucial: 1114Otherwise this approach is not likely to work well. 1115 1116hq_decr_left 1117------------ 1118 1119This function is needed to update the filter state. Once all payload 1120bytes from the frame have been consumed, the filter is readied to strip 1121the next frame header again. 1122 1123.. _notable-code-1: 1124 1125Notable Code 1126============ 1127 1128frame_hq_gen_read 1129----------------- 1130 1131This is where HTTP/3 frame headers are generated. Note the use of 1132``shf_frame_ptr`` to record the memory location to which the correct frame 1133size will be written by a different function. 1134 1135Parsing 1136******* 1137 1138Parsing Packets 1139=============== 1140 1141Parsing Frames 1142============== 1143 1144Mini vs Full Connections 1145************************ 1146 1147Mini Purpose 1148============ 1149 1150The reason for having a mini connection is to conserve resources: a mini 1151connection allocates a much smaller amount of memory. This protects the 1152server from a potential DoS attack. The mini connection's job is to get 1153the handshake to succeed, after which the connection is 1154`promoted <#connection-promotion>`__. 1155 1156Mini/Full Differences 1157===================== 1158 1159Besides their size, the two connection types differ in the following 1160ways: 1161 1162Mini connections' lifespan is limited. If the handshake does not succeed 1163within 10 seconds (configurable), the mini connection is destroyed. 1164 1165A mini connection is only `tickable <#tickability>`__ if it has unsent 1166packets. 1167 1168Mini connections do not process packets that carry application (as 1169opposed to handshake) data. The 0-RTT packet processing is deferred; 1170these packets are stashed and handed over to the full connection during 1171promotion. 1172 1173Connection Promotion 1174==================== 1175 1176A mini connection is promoted when the handshake succeeds. The mini 1177connection reports this via the return status of ``ci_tick`` by setting 1178the ``TICK_PROMOTE`` bit. The engine creates a new connection object and 1179calls the corresponding server constructor. The latter copies all the 1180relevant state information from mini to full connection. 1181 1182For a time, two connection objects -- one mini and one full -- exist at 1183the same state. Most of the time, the mini connection is destroyed 1184within the same function call to ``process_connections()``. If, however, 1185the mini connection has unsent packets, it will remain live until those 1186packets are sent successfully. Because the mini connection is by then 1187removed from the CID-to-connection hash (``engine->conns_hash``), it will 1188not receive any more incoming packets. 1189 1190Also see `Connection Processing <#processing-connections>`__. 1191 1192Mini gQUIC Connection 1193********************* 1194 1195*Files: lsquic_mini_conn.h, lsquic_mini_conn.c* 1196 1197.. _overview-1: 1198 1199Overview 1200======== 1201 1202The original version of ``struct mini_conn`` fit into paltry 128 bytes. 1203The desire to fit into 128 bytes [2]_ led to, for example, 1204``mc_largest_recv`` -- in effect, a 3-byte integer! Since that time, 1205the mini conn has grown to over 512 bytes. 1206 1207Looking at the struct, we can see that a lot of other data structures 1208are squeezed into small fields: 1209 1210Received and sent packet history is each packed into a 64-bit integer, 1211``mc_received_packnos`` and ``mc_sent_packnos``, respectively. The HEADERS 1212stream offsets are handled by the two two-byte integers ``mc_read_off`` 1213and ``mc_write_off``. 1214 1215.. _notable-code-2: 1216 1217Notable Code 1218============ 1219 1220continue_handshake 1221------------------ 1222 1223This function constructs a contiguous buffer with all the HANDSHAKE 1224stream chunks in order and passes it to ``esf_handle_chlo()``. This is 1225done because the gQUIC crypto module does not buffer anything: it's all 1226or nothing. 1227 1228The code has been written in a generic way, so that even 1229many small packets can be reconstructed into a CHLO. The lsquic client 1230can be made to split the CHLO by setting the max packet size 1231sufficiently low. 1232 1233sent/unsent packets 1234------------------- 1235 1236To conserve space, only a single outgoing packet header exists in the 1237mini connection struct, ``mc_packets_out``. To differentiate between 1238packets that are to be sent and those that have already been sent, the 1239``PO_SENT`` flag is used. 1240 1241Mini IETF Connection 1242******************** 1243 1244*Files: lsquic_mini_conn_ietf.h, lsquic_mini_conn_ietf.c* 1245 1246.. _overview-2: 1247 1248Overview 1249======== 1250 1251The IETF QUIC mini connection has the same idea as the gQUIC mini 1252connection: use as little memory as possible. This is more difficult to 1253do with the IETF QUIC, however, as there are more moving parts in this 1254version of the protocol. 1255 1256.. _data-structures-2: 1257 1258Data Structures 1259=============== 1260 1261mini_crypto_stream 1262------------------ 1263 1264This structure is a minimal representation of a stream. The IETF QUIC 1265protocol uses up to four HANDSHAKE streams (one for each encryption 1266level) during the handshake and we need to keep track of them. Even a 1267basic event dispatch mechanism is supported. 1268 1269packno_set_t 1270------------ 1271 1272This bitmask is used to keep track of sent, received, and acknowledged 1273packet numbers. It can support up to 64 packet numbers: 0 through 63. We 1274assume that the server will not need to send more than 64 packets to 1275complete the handshake. 1276 1277imc_recvd_packnos 1278----------------- 1279 1280Because the client is allowed to start its packet number sequence with 1281any number in the [0, 2\ :sup:`32`-1] range, the received packet history 1282must be able to accommodate numbers larger than 63. To do that, the 1283receive history is a union. If all received packet numbers are 63 or 1284smaller, the packno_set_t bitmask is used. Otherwise, the receive 1285history is kept in `Tiny Receive History <#tiny-receive-history>`__ 1286(trechist). The flag ``IMC_TRECHIST`` indicates which data structure is 1287used. 1288 1289ietf_mini_conn 1290-------------- 1291 1292This structure is similar to the gQUIC mini conn. It is larger, though, 1293as it needs to keep track of several instances of things based on 1294encryption level or packet number space. 1295 1296``imc_cces`` can hold up to three SCIDs: one for the original DCID from 1297the client, one for SCID generated by the server, and one for when 1298preferred address transport parameter is used. (The preferred address 1299functionality is not compiled by default.) 1300 1301ietf_mini_rechist 1302----------------- 1303 1304The receive history is in the header file because, in addition to 1305generating the ACK frames in the IETF mini conn, it is used to migrate 1306the receive history during promotion. 1307 1308.. _notable-code-3: 1309 1310Notable Code 1311============ 1312 1313Switching to trechist 1314--------------------- 1315 1316The switch to the Tiny Receive History happens when the incoming packet 1317number does not fit into the bitmask anymore -- see 1318``imico_switch_to_trechist()``. To keep the trechist code exercised, about 1319one in every 16 mini connection uses trechist unconditionally -- see 1320``lsquic_mini_conn_ietf_new()``. 1321 1322crypto_stream_if 1323---------------- 1324 1325A set of functions to drive reading and writing CRYPTO frames to move 1326the handshake along is specified. It is passed to the crypto session. 1327After promotion, the full connection installs its own function pointers. 1328 1329imico_read_chlo_size 1330-------------------- 1331 1332This function reads the first few bytes of the first CRYPTO frame on the 1333first HANDSHAKE stream to figure out the size of ClientHello. The 1334transport parameters will not be read until the full ClientHello is 1335available. 1336 1337 1338Duplicated Code 1339--------------- 1340 1341Some code has been copied from gQUIC mini connection. This was done on 1342purpose, with the expectation that gQUIC is going away. 1343 1344ECN Blackhole Detection 1345----------------------- 1346 1347ECN blackhole at the beginning of connection is guessed at when none of 1348packets sent in the initial batch were acknowledged. This is done by 1349``imico_get_ecn()``. ``lsquic_mini_conn_ietf_ecn_ok()`` is also used during 1350promotion to check whether to use ECN. 1351 1352Connection Public Interface 1353*************************** 1354 1355*Files: lsquic_conn_public.h* 1356 1357TODO 1358 1359Full gQUIC Connection 1360********************* 1361 1362*Files: lsquic_full_conn.h, lsquic_full_conn.c* 1363 1364.. _overview-3: 1365 1366Overview 1367======== 1368 1369The full gQUIC connection implements the Google QUIC protocol, both 1370server and client side. This is where a large part of the gQUIC protocol 1371logic is contained and where everything -- engine, streams, sending, 1372event dispatch -- is tied together. 1373 1374Components 1375========== 1376 1377In this section, each member of the ``full_conn`` structure is documented. 1378 1379fc_conn 1380------- 1381 1382The first member of the struct is the common connection object, 1383`lsquic_conn`_. 1384 1385It must be first in the struct because the two pointer are cast to each 1386other, depending on circumstances. 1387 1388fc_cces 1389------- 1390 1391This array holds two connection CID elements. 1392 1393The reason for having two elements in this array instead of one (even 1394though gQUIC only uses one CID) is for the benefit of the client: In 1395some circumstances, the client connections are hashed by the port 1396number, in which case the second element is used to hash the port value. 1397The relevant code is in lsquic_engine.c 1398 1399fc_rechist 1400---------- 1401 1402This member holds the `packet receive history <#receive-history>`__. It 1403is used to generate ACK frames. 1404 1405fc_stream_ifs 1406------------- 1407 1408This three-element array holds pointers to stream callbacks and the 1409stream callback contexts. 1410 1411From the perspective of lsquic, Google QUIC has three stream types: 1412 14131. HANDSHAKE stream; 1414 14152. HEADERS stream; and 1416 14173. Regular (message, or request/response) streams. 1418 1419The user provides stream callbacks and the context for the regular 1420streams (3) in ``ea_stream_if`` and ``ea_stream_if_ctx``. 1421 1422The other two stream types are internal. The full connection specifies 1423internal callbacks for those streams. One set handles the handshake and 1424the other handles reading and writing of HTTP/2 frames: SETTINGS, 1425HEADERS, and so on. 1426 1427fc_send_ctl 1428----------- 1429 1430This is the `Send Controller <#send-controller>`__. It is used to 1431allocate outgoing packets, control sending rate, and process 1432acknowledgements. 1433 1434fc_pub 1435------ 1436 1437This member holds the `Connection Public 1438Interface <#connection-public-interface>`__. 1439 1440fc_alset 1441-------- 1442 1443This is the `Alarm Set <#alarm-set>`__. It is used to set various timers 1444in the connection and the send controller. 1445 1446fc_closed_stream_ids 1447-------------------- 1448 1449The two sets in this array hold the IDs of closed streams. 1450 1451There are two of them because of the uneven distribution of stream IDs. 1452It is more efficient to hold even and odd stream IDs in separate 1453structures. 1454 1455fc_settings 1456----------- 1457 1458Pointer to the engine settings. 1459 1460This member is superfluous -- the settings can be fetched from 1461``fc_enpub->enp_settings``. 1462 1463fc_enpub 1464-------- 1465 1466This points to the `engine's public interface <#lsquic-engine-public>`__. 1467 1468fc_max_ack_packno 1469----------------- 1470 1471Recording the maximum packet number that contained an ACK allows us to 1472ignore old ACKs. 1473 1474fc_max_swf_packno 1475----------------- 1476 1477This is the maximum packet number that contained a STOP_WAITING frame. 1478It is used to ignore old STOP_WAITING frames. 1479 1480fc_mem_logged_last 1481------------------ 1482 1483This timestamp is used to limit logging the amount of memory used to 1484most once per second. 1485 1486fc_cfg 1487------ 1488 1489This structure holds a few important configuration parameters. (Looks 1490like ``max_conn_send`` is no longer used���) 1491 1492fc_flags 1493-------- 1494 1495The flags hold various boolean indicators associated with the full 1496connections. Some of them, such as ``FC_SERVER``, never change, while 1497others change all the time. 1498 1499fc_n_slack_akbl 1500--------------- 1501 1502This is the number of ackable (or, in the new parlance, *ack-eliciting*) 1503packets received since the last ACK was sent. 1504 1505This counter is used to decide whether an ACK should be sent (or, more 1506precisely, queued to be sent) immediately or whether to wait. 1507 1508fc_n_delayed_streams 1509-------------------- 1510 1511Count how many streams have been delayed. 1512 1513When ``lsquic_conn_make_stream()`` is called, a stream may not be created 1514immediately. It is delayed if creating a stream would go over the 1515maximum number of stream allowed by peer. 1516 1517fc_n_cons_unretx 1518---------------- 1519 1520Counts how many consecutive unretransmittable packets have been sent. 1521 1522 1523fc_last_stream_id 1524----------------- 1525 1526ID of the last created stream. 1527 1528Used to assign ID to streams created by this side of the connection. 1529Clients create odd-numbered streams, while servers initiate 1530even-numbered streams (push promises). 1531 1532fc_max_peer_stream_id 1533--------------------- 1534 1535Maximum value of stream ID created by peer. 1536 1537fc_goaway_stream_id 1538------------------- 1539 1540Stream ID received in the GOAWAY frame. 1541 1542This ID is used to reset locally-initiated streams with ID larger than 1543this. 1544 1545fc_ver_neg 1546---------- 1547 1548This structure holds the version negotiation state. 1549 1550This is used by the client to negotiate with the server. 1551 1552 1553With gQUIC going away, it is probably not very important anymore. 1554 1555fc_hsk_ctx 1556---------- 1557 1558Handshake context for the HANDSHAKE stream. 1559 1560Client and server have different HANDSHAKE stream handlers -- and 1561therefore different contexts. 1562 1563fc_stats 1564-------- 1565 1566Connection stats 1567 1568fc_last_stats 1569------------- 1570 1571Snapshot of connection stats 1572 1573This is used to log the changes in counters between calls to 1574``ci_log_stats()``. The calculation is straightforward in 1575``lsquic_conn_stats_diff()``. 1576 1577fc_stream_histories and fc_stream_hist_idx 1578------------------------------------------ 1579 1580Rolling log of histories of closed streams 1581 1582 1583fc_errmsg 1584--------- 1585 1586Error message associated with connection termination 1587 1588This is set when the connection is aborted for some reason. This error 1589message is only set once. It is used only to set the error message in 1590the call to ``ci_status()`` 1591 1592fc_recent_packets 1593----------------- 1594 1595Dual ring-buffer log of packet history 1596 1597The first element is for incoming packets, the second is for outgoing 1598packets. Each entry holds received or sent time and frame information. 1599 1600This can be used for debugging. It is only compiled into debug builds. 1601 1602fc_stream_ids_to_reset 1603---------------------- 1604 1605List of stream ID to send STREAM_RESET for 1606 1607These STREAM_RESET frames are associated with streams that are not 1608allowed to be created because we sent a GOAWAY frame. (There is a period 1609when GOAWAY is in transit, but the peer keeps on creating streams). To 1610queue the reset frames for such a stream, an element is added to this 1611list. 1612 1613fc_saved_ack_received 1614--------------------- 1615 1616Timestamp of the last received ACK. 1617 1618This is used for `ACK merging <#ack-merging>`__. 1619 1620fc_path 1621------- 1622 1623The network path -- Google QUIC only has one network path. 1624 1625fc_orig_versions 1626---------------- 1627 1628List (as bitmask) of original versions supplied to the client 1629constructor. 1630 1631Used for version negotiation. See `fc_ver_neg`_ for more 1632coverage of this topic. 1633 1634fc_crypto_enc_level 1635------------------- 1636 1637Latest crypto level 1638 1639This is for Q050 only, which does away with the HANDSHAKE stream and 1640uses CRYPTO frames instead. (This was part of Google's plan to move 1641Google QUIC protocol closer to IETF QUIC.) 1642 1643fc_ack 1644------ 1645 1646Saved ACK -- latest or merged 1647 1648This ACK structure is used in `ACK merging <#ack-merging>`__. 1649 1650Instantiation 1651============= 1652 1653The largest difference between the server and client mode of the full 1654connection is in the way it is created. The client creates a brand-new 1655connection, performs version negotiation, and runs the handshake before 1656dispatching user events. The server connection, on the other hand, gets 1657created from a mini connection during `connection 1658promotion <#connection-promotion>`__. By that time, both version 1659negotiation and handshake have already completed. 1660 1661Common Initialization 1662--------------------- 1663 1664The ``new_conn_common()`` function contains initialization common to both 1665server and client. Most full connection's internal data structures are 1666initialized or allocated here, among them `Send 1667Controller <#send-controller>`__, `Receive 1668History <#receive-history>`__, and `Alarm Set <#alarm-set>`__. 1669 1670The HEADERS stream is created here, if necessary. (Throughout the code, 1671you can see checks whether the connection is in HTTP mode or not. Even 1672though gQUIC means that HTTP is used, our library supports a non-HTTP 1673mode, in which there is no HEADERS stream. This was done for testing 1674purposes and made possible the echo and md5 client and server programs.) 1675 1676Server 1677------ 1678 1679After initializing the common structures in ``new_conn_common()``, 1680server-specific initialization continues in 1681``lsquic_gquic_full_conn_server_new()``. 1682 1683The HANDSHAKE stream is created. The handler (see 1684``lsquic_server_hsk_stream_if``) simply throws out data that it reads from 1685the client. 1686 1687Outgoing packets are inherited -- they will be sent during the next tick 1688-- and deferred incoming packets are processed. 1689 1690Client 1691------ 1692 1693The client's initialization takes place in 1694``lsquic_gquic_full_conn_client_new()``. Crypto session is created and the 1695HANDSHAKE stream is initialized. The handlers in 1696``lsquic_client_hsk_stream_if`` drive the handshake process. 1697 1698Incoming Packets 1699================ 1700 1701The entry point for incoming packets is ``ci_packet_in()``, which is 1702implemented by ``full_conn_ci_packet_in``. Receiving a packet restarts the 1703idle timer. 1704 1705The function ``process_incoming_packet`` contains some client-only logic 1706for processing version negotiation and stateless retry packets. In the 1707normal case, ``process_regular_packet()`` is called. This is where the 1708incoming process is decrypted, the `Receive 1709History <#receive-history>`__ is updated, ``parse_regular_packet()`` is 1710called, and some post-processing takes place (most importantly, 1711scheduling an ACK to be sent). 1712 1713The function ``parse_regular_packet`` is simple: It iterates over the 1714whole decrypted payload of the incoming packet and parses out frames one 1715by one. An error aborts the connection. 1716 1717ACK Merging 1718=========== 1719 1720Processing ACKs is `expensive <#handling-acks>`__. When sending data, a 1721batch of incoming packets is likely to contain an ACK frame each. The 1722ACK frame handler, ``process_ack_frame()``, merges consecutive ACK frames 1723and stores the result in `fc_ack`_. The ACK is processed 1724during the `next tick <#ticking>`__. If the two ACK cannot be merged 1725(which is unlikely), the cached ACK is processed immediately and the new 1726ACK is cached. 1727 1728Caching an ACK has a non-trivial memory cost: the 4KB-plus data 1729structure ``ack_info`` accounts for more than half of the size of the 1730``full_conn`` struct. Nevertheless, the tradeoff is well worth it. ACK 1731merging reduces the number of calls to ``lsquic_send_ctl_got_ack()`` by a 1732factor of 10 or 20 in some high-throughput scenarios. 1733 1734Ticking 1735======= 1736 1737When a `connection is processed by the 1738engine <#processing-connections>`__, the engine calls the connection's 1739``ci_tick()`` method. This is where most of the connection logic is 1740exercised. In the full gQUIC connection, this method is implemented by 1741``full_conn_ci_tick()``. 1742 1743The following steps are performed: 1744 1745- A cached ACK, if it exists, is processed 1746 1747- Expired alarms are rung 1748 1749- Stream read events are dispatched 1750 1751- An ACK frame is generated if necessary 1752 1753- Other control frames are generated if necessary 1754 1755- Lost packets are rescheduled 1756 1757- More control frames and stream resets are generated if necessary 1758 1759- HEADERS stream is flushed 1760 1761- Outgoing packets that carry stream data are scheduled in four steps: 1762 1763 a. High-priority `buffered packets <#buffered-queue>`__ are scheduled 1764 1765 b. Write events are dispatched for high-priority streams 1766 1767 c. Non-high-priority buffered packets are scheduled 1768 1769 d. Write events are dispatched for non-high-priority streams 1770 1771- Connection close or PING frames are generated if necessary 1772 1773- Streams are serviced (closed, freed, created) 1774 1775.. _notable-code-4: 1776 1777Notable Code 1778============ 1779 1780TODO 1781 1782Full IETF Connection 1783******************** 1784 1785*Files: lsquic_full_conn_ietf.h, lsquic_full_conn_ietf.c* 1786 1787Overview 1788======== 1789 1790This module implements IETF QUIC 1791`Transport <https://tools.ietf.org/html/draft-ietf-quic-transport-34>`_ 1792and 1793`HTTP/3 <https://tools.ietf.org/html/draft-ietf-quic-http-34>`_ logic, 1794plus several QUIC extensions. To attain an overall grasp of the code, 1795at least some familiarity with these protocols is required. To understand 1796the code in detail, especially *why* some things are done, a closer reading 1797of the specification may be in order. 1798 1799In some places, the code contains comments with references to the 1800specification, e.g. 1801 1802:: 1803 1804 if (conn->ifc_flags & IFC_SERVER) 1805 { /* [draft-ietf-quic-transport-34] Section 19.7 */ 1806 ABORT_QUIETLY(0, TEC_PROTOCOL_VIOLATION, 1807 "received unexpected NEW_TOKEN frame"); 1808 return 0; 1809 } 1810 1811(A search for "[draft-ietf" will reveal over one hundred places in the 1812code thus commented.) 1813 1814The Full IETF Connection module is similar in structure to the `Full gQUIC 1815Connection`_ module, from which it originated. Some code is quite similar 1816as well, including logic for `ACK Merging`_ and `Ticking`_. 1817 1818Components 1819========== 1820 1821In this section, each member of ``ietf_full_conn`` is documented. 1822 1823ifc_conn 1824-------- 1825 1826The first member of the struct is the common connection object, 1827`lsquic_conn`_. 1828 1829It must be first in the struct because the two pointer are cast to each 1830other, depending on circumstances. 1831 1832ifc_cces 1833-------- 1834 1835This array holds eight connection CID elements. 1836See `Managing SCIDs`_. 1837 1838ifc_rechist 1839----------- 1840 1841This member holds the `packet receive history <#receive-history>`__. 1842The receive history is used to generate ACK frames. 1843 1844ifc_max_ackable_packno_in 1845------------------------- 1846 1847This value is used to detect holes in incoming packet number sequence. 1848This information is used to queue ACK frames. 1849 1850ifc_send_ctl 1851------------ 1852 1853This is the `Send Controller`_. It is used to 1854allocate outgoing packets, control sending rate, and process 1855acknowledgements. 1856 1857ifc_pub 1858------- 1859 1860This member holds the `Connection Public Interface`_ 1861 1862ifc_alset 1863--------- 1864 1865This is the `Alarm Set`_. It is used to set various timers 1866in the connection and the send controller. 1867 1868ifc_closed_stream_ids 1869--------------------- 1870 1871The two sets in this array hold the IDs of closed streams. 1872 1873There are two of them because of the uneven distribution of stream IDs. 1874The set data structure is meant to hold sequences without gaps. 1875It is more efficient to hold stream IDs for each stream type in 1876separate structures. 1877 1878ifc_n_created_streams 1879--------------------- 1880 1881Counters for locally initiated streams. Used to generate next 1882stream ID. 1883 1884ifc_max_allowed_stream_id 1885------------------------- 1886 1887Maximum allowed stream ID for each of the four (``N_SITS``) stream types. 1888This is used all over the place. 1889 1890ifc_closed_peer_streams 1891----------------------- 1892 1893Counts how many remotely-initiated streams have been closed. Because the 1894protocol mandates that the stream IDs be assigned in order, this allows us 1895to make some logical inferences in the code. 1896 1897ifc_max_streams_in 1898------------------ 1899 1900Maximum number of open streams the peer is allowed to initiate. 1901 1902ifc_max_stream_data_uni 1903----------------------- 1904 1905Initial value of the maximum amount of data locally-initiated unidirectional 1906stream is allowed to send. 1907 1908ifc_flags 1909--------- 1910 1911All kinds of flags. 1912 1913ifc_mflags 1914---------- 1915 1916More flags! 1917 1918ifc_send_flags 1919-------------- 1920 1921The send flags keep track of which control frames are queued to be sent. 1922 1923ifc_delayed_send 1924---------------- 1925 1926Some send flags are delayed. 1927 1928We stop issuing streams credits if peer stops opening QPACK decoder window. 1929This addresses a potential attack whereby client can cause the server to keep 1930allocating memory. See `Security Considerations in the QPACK Internet-Draft 1931<https://tools.ietf.org/html/draft-ietf-quic-qpack-21#section-7.3>`__. 1932 1933ifc_send 1934-------- 1935 1936This is the `Send Controller`_. It is used to allocate outgoing packets, 1937control sending rate, and process acknowledgements. 1938 1939ifc_error 1940--------- 1941 1942This struct records which type of error has occurred (transport or application)' 1943and the error code. 1944 1945ifc_n_delayed_streams 1946--------------------- 1947 1948Count how many streams have been delayed. 1949 1950When ``lsquic_conn_make_stream()`` is called, a stream may not be created 1951immediately. It is delayed if creating a stream would go over the 1952maximum number of stream allowed by peer. 1953 1954ifc_n_cons_unretx 1955----------------- 1956 1957Counts how many consecutive unretransmittable packets have been sent. 1958 1959Enough unretransittable sent packets in a row causes a PING frame to 1960be sent. This forces the peer to send an ACK. 1961 1962ifc_pii 1963------- 1964 1965Points to the selected priority iterator. 1966 1967The IETF Full Connection supports two priority mechanisms: the original 1968Google QUIC priority mechanism and the `HTTP/3 Extensible Priorities 1969<https://tools.ietf.org/html/draft-ietf-httpbis-priority-03>`__. 1970 1971ifc_errmsg 1972---------- 1973 1974Holds dynamically generated error message string. 1975 1976Once set, the error string does not change until the connection is 1977destroyed. 1978 1979ifc_enpub 1980--------- 1981 1982This points to the `engine's public interface <#lsquic-engine-public>`__. 1983 1984ifc_settings 1985------------ 1986 1987Pointer to the engine settings. 1988 1989This member is superfluous -- the settings can be fetched from 1990``ifc_enpub->enp_settings``. 1991 1992ifc_stream_ids_to_ss 1993-------------------- 1994 1995Holds a queue of STOP_SENDING frames to send as response to remotely 1996initiated streams that came in after we sent a GOAWAY frame. 1997 1998ifc_created 1999----------- 2000 2001Time when the connection was created. This is used for the Timestamp 2002and Delayed ACKs extensions. 2003 2004ifc_saved_ack_received 2005---------------------- 2006 2007Time when cached ACK frame was received. See `ACK Merging`_. 2008 2009ifc_max_ack_packno 2010------------------ 2011 2012Holding the maximum packet number containing an ACK frame allows us 2013to ignore old ACK frames. One value per Packet Number Space is kept. 2014 2015ifc_max_non_probing 2016------------------- 2017 2018Maximum packet number of a received non-probing packets. This is used 2019for path migration. 2020 2021ifc_cfg 2022------- 2023 2024Local copy of a couple of transport parameters. We could get at them 2025with a function call, but these are used often enough to optimize 2026fetching them. 2027 2028ifc_process_incoming_packet 2029--------------------------- 2030 2031The client goes through version negotiation and the switches to the 2032fast function. The server begins to use the fast function immediately. 2033 2034ifc_n_slack_akbl 2035---------------- 2036 2037Number ackable packets received since last ACK was sent. A count is 2038kept for each Packet Number Space. 2039 2040ifc_n_slack_all 2041--------------- 2042 2043Count of all packets received since last ACK was sent. This is only 2044used in the Application PNS (Packet Number Space). (This is regular 2045PNS after the handshake completes). 2046 2047ifc_max_retx_since_last_ack 2048--------------------------- 2049 2050This number is the maximum number of ack-eliciting packets to receive 2051before an ACK must be sent. 2052 2053The default value is 2. When the Delayed ACKs extension is used, this 2054value gets modified by peer's ACK_FREQUENCY frames. 2055 2056ifc_max_ack_delay 2057----------------- 2058 2059Maximum amount of allowed after before an ACK is sent if the threshold 2060defined by ifc_max_retx_since_last_ack_ has not yet been reached. 2061 2062The default value is 25 ms. When the Delayed ACKs extension is used, this 2063value gets modified by peer's ACK_FREQUENCY frames. 2064 2065ifc_ecn_counts_in 2066----------------- 2067 2068Incoming ECN counts in each of the Packet Number Spaces. These counts 2069are used to generate ACK frames. 2070 2071ifc_max_req_id 2072-------------- 2073 2074Keeps track of the maximum ID of bidirectional stream ID initiated by the 2075peers. It is used to construct the GOAWAY frame. 2076 2077ifc_hcso 2078-------- 2079 2080State for outgoing HTTP/3 control stream. 2081 2082ifc_hcsi 2083-------- 2084 2085State for incoming HTTP/3 control stream. 2086 2087ifc_qeh 2088------- 2089 2090QPACK encoder streams handler. 2091 2092The handler owns two unidirectional streams: a) locally-initiated QPACK 2093encoder stream, to which it writes; and b) peer-initiated QPACK decoder 2094stream, from which it reads. 2095 2096ifc_qdh 2097------- 2098 2099QPACK decoder streams handler. 2100 2101The handler owns two unidirectional streams: a) peer-initiated QPACK 2102encoder stream, from which it reads; and b) locally-initiated QPACK 2103decoder stream, to which it writes. 2104 2105ifc_peer_hq_settings 2106-------------------- 2107 2108Peer's HTTP/3 settings. 2109 2110ifc_dces 2111-------- 2112 2113List of destination connection ID elements (DCEs). Each holds a DCID 2114and the associated stateless reset token. When lsquic uses a DCID, it 2115inserts the stateless reset token into a hash so that stateless resets 2116can be found. 2117 2118Outside of the initial migration, the lsquic client code does not switch 2119DCIDs. One idea (suggested in the drafts somewhere) is to switch DCIDs 2120after a period of inactivity. 2121 2122ifc_to_retire 2123------------- 2124 2125List of DCIDs to retire. 2126 2127ifc_scid_seqno 2128-------------- 2129 2130Sequence generator for SCIDs generated by the endpoint. 2131 2132ifc_scid_timestamp 2133------------------ 2134 2135List of timestamps for the generated SCIDs. 2136 2137This list is used in the SCID rate-limiting mechanism. 2138 2139 2140ifc_incoming_ecn 2141---------------- 2142 2143History indicating presence of ECN markings on most recent incoming packets. 2144 2145ifc_cur_path_id 2146--------------- 2147 2148Current path ID -- indexes `ifc_paths`_. 2149 2150ifc_used_paths 2151-------------- 2152 2153Bitmask of which paths in `ifc_paths`_ are being used. 2154 2155ifc_mig_path_id 2156--------------- 2157 2158Path ID of the path being migrated to. 2159 2160ifc_active_cids_limit 2161--------------------- 2162 2163This is the maximum number of CIDs at any one time this 2164endpoint is allowed to issue to peer. If the TP value exceeds ``cn_n_cces``, 2165it is reduced to it. 2166 2167ifc_active_cids_count 2168--------------------- 2169 2170This value tracks how many CIDs have been issued. It is decremented 2171each time a CID is retired. 2172 2173ifc_first_active_cid_seqno 2174-------------------------- 2175 2176Another piece of the SCID rate-limiting mechanism. 2177 2178ifc_ping_unretx_thresh 2179---------------------- 2180 2181Once the number consecutively sent non-ack-elicing packets 2182(`ifc_n_cons_unretx`_) exceeds this value, this endpoint will send 2183a PING frame to force the peer to respond with an ACK. 2184 2185The threshold begins at 20 and then made to fluctuate randomly between 218612 and 27. 2187 2188ifc_last_retire_prior_to 2189------------------------ 2190 2191Records the maximum value of ``Retire Prior To`` value of the 2192`NEW_CONNECTION_ID frame 2193<https://tools.ietf.org/html/draft-ietf-quic-transport-34#section-19.15>`_. 2194 2195ifc_ack_freq_seqno 2196------------------ 2197 2198Sequence number generator for ACK_FREQUENCY frames generated by this 2199endpoint. 2200 2201ifc_last_pack_tol 2202----------------- 2203 2204Last value of the ``Packet Tolerance`` field sent in the last 2205``ACK_FREQUENCY`` frame generated by this endpoint. 2206 2207ifc_last_calc_pack_tol 2208---------------------- 2209 2210Last *calculated* value of the ``Packet Tolerance`` field. 2211 2212ifc_min_pack_tol_sent 2213--------------------- 2214 2215Minimum value of the ``Packet Tolerance`` field sent. Only used for 2216statistics display. 2217 2218ifc_max_pack_tol_sent 2219--------------------- 2220 2221Maximum value of the ``Packet Tolerance`` field sent. Only used for 2222statistics display. 2223 2224ifc_max_ack_freq_seqno 2225---------------------- 2226 2227Maximum seen sequence number of incoming ``ACK_FREQUENCY`` frame. Used 2228to discard old frames. 2229 2230ifc_max_udp_payload 2231------------------- 2232 2233Maximum UDP payload. This is the cached value of the transport parameter. 2234 2235ifc_last_live_update 2236-------------------- 2237 2238Last time ``ea_live_scids()`` was called. 2239 2240ifc_paths 2241--------- 2242 2243Array of network paths. Most of the time, only one path is used when the 2244peer migrates. The array has four elements as a safe upper limit. 2245 2246ifc_u.cli 2247--------- 2248 2249Client-specific state. This is where pointers to "crypto streams" are 2250stored; they are not in the ``ifc_pub.all_streams`` hash. 2251 2252ifc_u.ser 2253--------- 2254 2255The server-specific state is only about push promises. 2256 2257ifc_idle_to 2258----------- 2259 2260Idle timeout. 2261 2262ifc_ping_period 2263--------------- 2264 2265Ping period. 2266 2267ifc_bpus 2268-------- 2269 2270A hash of buffered priority updates. It is used when a priority update 2271(part of the Extensible HTTP Priorities extension) arrives before the 2272stream it is prioritizing. 2273 2274ifc_last_max_data_off_sent 2275-------------------------- 2276 2277Value of the last MAX_DATA frame sent. This is used to limit the number 2278of times we send the MAX_DATA frame in response to a DATA_BLOCKED frame. 2279 2280ifc_min_dg_sz 2281------------- 2282 2283Minimum size of the DATAGRAM frame. Used by the eponymous extension. 2284 2285ifc_max_dg_sz 2286------------- 2287 2288Maximum size of the DATAGRAM frame. Used by the eponymous extension. 2289 2290ifc_pts 2291------- 2292 2293PTS stands for "Packet Tolerance Stats". Information collected here 2294is used to calculate updates to the packet tolerance advertised to the 2295peer via ACK_FREQUENCY frames. Part of the Delayed ACKs extension. 2296 2297ifc_stats 2298--------- 2299 2300Cumulative connection stats. 2301 2302ifc_last_stats 2303-------------- 2304 2305Copy of `ifc_stats`_ last time ``ci_log_stats()`` was called. Used 2306to calculate the difference. 2307 2308ifc_ack 2309------- 2310 2311One or more cached incoming ACK frames. Used for `ACK merging`_. 2312 2313Managing SCIDs 2314============== 2315 2316Source Connection IDs -- or SCIDs for short -- are stored in the `ifc_cces`_ 2317array. 2318 2319Each of ``struct conn_cid_elem`` contains the CID itself, the CID's port or 2320sequence number, and flags: 2321 2322- ``CCE_USED`` means that this Connection ID has been used by the peer. This 2323 information is used to check whether the peer's incoming packet is using 2324 a new DCID or reusing an old one when the packet's DCID does not match 2325 this path's current DCID. 2326 2327- ``CCE_REG`` signifies that the CID has been registered with the user-defined 2328 ``ea_new_scids()`` callback. 2329 2330- ``CCE_SEQNO`` means that the connection has been issued by this endpoint 2331 and ``cce_seqno`` contains a valid value. Most of SCIDs are issued by 2332 either endpoint, with one exception: The DCID included in the first few 2333 packets sent by the client becomes an interim SCID for the server and it 2334 does not have a sequence number. This "original" SCID gets retired 2 2335 seconds after the handshake succeeds, see the ``AL_RET_CIDS`` alarm. 2336 2337- ``CCE_PORT`` is used to mark the special case of hashing connections by 2338 port number. In client mode, the lsquic engine may, under some circumstances, 2339 hash the connections by local port number instead of connection ID. 2340 In that case, ``cce_port`` contains the port number used to hash the 2341 connection. 2342 2343Each CIDs is hashed in the of the "CID-to-connection" mapping that the engine 2344maintains. If it is not in the hash, incoming packets that use this CID as 2345DCID will not be dispatched to the connection (because the connection will not 2346be found). 2347 2348Path Migration 2349============== 2350 2351Stream Priority Iterators 2352========================= 2353 2354Creating Streams on the Server 2355============================== 2356 2357Calculating Packet Tolerance 2358============================ 2359 2360When the Delayed ACKs extension is used, we advertise our ``Packet Tolerance`` 2361to peer. This is the number of packets the peer can receive before having to 2362send an acknowledgement. By default -- without the extension -- the packet 2363tolerance is 2. 2364 2365Because we `merge ACKs <#ack-merging>`__, receiving more than one ACK between 2366ticks is wasteful. Another consideration is that a packet is not declared 2367lost until at least one RTT passes -- the time to send a packet and receive 2368the acknowledgement from peer. 2369 2370To calculate the packet tolerance, we use a feedback mechanism: when number 2371of ACKs per RTT is too high, we increase packet tolerance; when number of 2372ACKs per RTT is too low, we decrease packet tolerance. The feedback is 2373implemented with a `PID Controller <https://en.wikipedia.org/wiki/PID_controller>`__: 2374the target is the number of ACKs per RTT, normalized to 1.0. 2375 2376See the function ``packet_tolerance_alarm_expired()`` as well as comments 2377in ``lsquic.h`` that explain the normalization as well as the knobs available 2378for tuning. 2379 2380The pre-normalized target is a function of RTT. It was obtained 2381empirically using netem. This function together with the default 2382PID controller parameters give good performance in the lab and in 2383some limited interop testing. 2384 2385Anatomy of Outgoing Packet 2386************************** 2387 2388Evanescent Connection 2389********************* 2390 2391Send Controller 2392*************** 2393 2394*Files: lsquic_send_ctl.h, lsquic_send_ctl.c* 2395 2396.. _overview-4: 2397 2398Overview 2399======== 2400 2401The Send Controller manages outgoing packets and the sending rate: 2402 2403- It decides whether packets can be sent 2404 2405- It figures out what the congestion window is 2406 2407- It processes acknowledgements and detects packet losses 2408 2409- It allocates packets 2410 2411- It maintains sent packet history 2412 2413The controller allocates, manages, splits, coalesces, and destroys 2414outgoing packets. It owns these packets. 2415 2416The send controller services two modules: 2417 2418- Full connection. gQUIC and IETF full connections use the send 2419 controller to allocate packets and delegate packet-sending 2420 decisions to it. 2421 2422- Stream. The stream uses the stream controller as the source of 2423 outgoing packets to write STREAM frames to. 2424 2425Packet Life Cycle 2426================= 2427 2428A new outgoing packet is allocated and returned to the connection or the 2429stream. Around this time (just before or just after, depending on the 2430particular function call to get the packet), the packet is placed on the 2431Scheduled Queue. 2432 2433When the engine is creating a batch of packets to send, it calls 2434``ci_next_packet_to_send()``. The connection removes the next packet from 2435its Scheduled Queue and returns it. The engine now owns the outgoing 2436packet, but only while the batch is being sent. The engine *always 2437returns the packet* after it tries to send it. 2438 2439If the packet was sent successfully, it is returned via the 2440``ci_packet_sent`` call, after which it is appended to the Unacked Queue. 2441If the packet could not be sent, ``ci_packet_not_sent()`` is called, at 2442which point it is prepended back to the Schedule Queue to be tried 2443later. 2444 2445There are two ways to get off the Unacked Queue: being acknowledged or 2446being lost. When a packet is acknowledged, it is destroyed. On the other 2447hand, if it is deemed lost, it is placed onto the Lost Queue, where it 2448will await being rescheduled. 2449 2450Packet Queues 2451============= 2452 2453.. image:: lsquic-packet-queues.png 2454 2455Buffered Queue 2456-------------- 2457 2458The Buffered Queue is a special case. When writing to the stream occurs 2459outside of the event dispatch loop, new outgoing packets begin their 2460life in the Buffered Queue. They get scheduled during a connection tick, 2461making their way onto the Scheduled Queue. 2462 2463There are two buffered queues: one for packets holding STREAM frames 2464from the highest-priority streams and one for packets for streams with 2465lower priority. 2466 2467Scheduled Queue 2468--------------- 2469 2470Packets on the Scheduled Queue have packet numbers assigned to them. In 2471rare cases, packets may be removed from this queue before being sent 2472out. (For example, a stream may be cancelled, in which case packets that 2473carry its STREAM frames may end up empty.) In that case, they are marked 2474with a special flag to generate the packet number just before they are 2475sent. 2476 2477Unacked Queue 2478------------- 2479 2480This queue holds packets that have been sent but are yet to be 2481acknowledged. When a packet on this queue is acknowledged, it is 2482destroyed. 2483 2484The loss detection code runs on this queue when ACKs are received or 2485when the retransmission timer expires. 2486 2487This queue is actually three queues: one for each of the IETF QUIC's 2488Packet Number Spaces, or PNSs. The PNS_APP queue is what is used by 2489gQUIC and IETF QUIC server code. PNS_INIT and PNS_HSK are only used by 2490the IETF QUIC client. (IETF QUIC server handles those packet number 2491spaces in its mini conn module.) 2492 2493In addition to regular packets, the Unacked Queue holds `loss 2494records <#loss-records>`__ and `poisoned packets <#poisoned-packets>`__. 2495 2496Lost Queue 2497---------- 2498 2499This queue holds lost packets. These packets are removed from the 2500Unacked Queue when it is decided that they have been lost. Packets on 2501this queue get rescheduled after connection schedules a packet with 2502control frames, as those have higher priority. 2503 25040-RTT Stash Queue 2505----------------- 2506 2507This queue is used by the client to retransmit packets that carry 0-RTT 2508data. 2509 2510Handling ACKs 2511============= 2512 2513Acknowledgements are processed in the function 2514``lsquic_send_ctl_got_ack``. 2515 2516One of the first things that is done is ACK validation. We confirm that 2517the ACK does not contain any packet numbers that we did not send. 2518Because of the way we `generate packet numbers <#packet-numbers>`__, 2519this check is a simple comparison. 2520 2521The nested loops work as follows. The outer loop iterates over the 2522packets in the Unacked Queue in order -- meaning packet numbers 2523increase. In other words, older packets are examined first. The inner 2524loop processes ACK ranges in the ACK *backwards*, meaning that both 2525loops follow packets in increasing packet number order. It is done this 2526way as an optimization. The (previous) alternative design of looking 2527up a packet number in the ACK frame, even if using binary search, is 2528slower. 2529 2530The code is optimized: the inner loop has a minimum possible number of 2531branches. These optimizations predate the more-recent, higher-level 2532optimization. The latest ACK-handling optimization added to the code 2533combines incoming ACKs into a single ACK (at the connection level), thus 2534reducing the number of times this loop needs to be used by a lot, 2535sometimes by a significant factor (when lots of data is being sent). 2536This makes some of the code-level optimizations, such as the use of 2537``__builtin_prefetch``, an overkill. 2538 2539Loss Records 2540============ 2541 2542A loss record is a special type of outgoing packet. It marks a place in 2543the Unacked Queue where a lost packet had been -- the lost packet itself 2544having since moved on to the Lost Queue or further. The loss record and 2545the lost packet form a circular linked list called the "loss chain." 2546This list contains one real packet and zero or more loss records. The 2547real packet can move from the Unacked Queue to the Lost Queue to the 2548Scheduled Queue and back to the Unacked Queue; its loss records live 2549only on the Unacked Queue. 2550 2551We need loss records to be able to handle late acknowledgements -- those 2552that acknowledge a packet *after* it has been deemed lost. When an 2553acknowledgment for any of the packet numbers associated with this packet 2554comes in, the packet is acknowledged and the whole loss chain is 2555destroyed. 2556 2557Poisoned Packets 2558================ 2559 2560A poisoned packet is used to thwart opportunistic ACK attacks. The 2561opportunistic ACK attack works as follows: 2562 2563- The client requests a large resource 2564 2565- The server begins sending the response 2566 2567- The client sends ACKs for packet number before it sees these packets, 2568 tricking the server into sending packets faster than it would 2569 otherwise 2570 2571The poisoned packet is placed onto the Unacked Queue. If the peer lies 2572about packet numbers it received, it will acknowledge the poisoned 2573packet, in which case it will be discovered during ACK processing. 2574 2575Poisoned packets cycle in and out of the Unacked Queue. A maximum of one 2576poisoned packet is outstanding at any one time for simplicity. (And we 2577don't need more). 2578 2579Packet Numbers 2580============== 2581 2582The Send Controller aims to send out packets without any gaps in the 2583packet number sequence. (The only exception to this rule is the handling 2584of poisoned packets, where the gap is what we want.) Not having gaps in 2585the packet number sequence is beneficial: 2586 2587- ACK verification is cheap 2588 2589- Send history updates are fast 2590 2591- Send history uses very little memory 2592 2593The downside is code complexity and having to renumber packets when they 2594are removed from the Scheduled Queue (due to, for example, STREAM frame 2595elision or loss chain destruction) or resized (due to a path or MTU 2596change, for instance). 2597 2598Some scenarios when gaps can be produced inadvertently are difficult to 2599test or foresee. To cope with that, a special warning in the send 2600history code is added when the next packet produces a gap. This warning 2601is limited to once per connection. Having a gap does not break 2602functionality other than ACK verification, but that's minor. On the 2603other hand, we want to fix such bugs when they crop up -- that's why the 2604warning is there. 2605 2606Loss Detection and Retransmission 2607================================= 2608 2609The loss detection and retransmission logic in the Send Controller was 2610taken from the Chromium code in the fall of 2016, in the beginning of 2611the lsquic project. This logic has not changed much since then -- only 2612some bugs have been fixed here and there. The code bears no resemblance 2613to what is described in the QUIC Recovery Internet Draft. Instead, `the 2614much earlier 2615document <https://tools.ietf.org/html/draft-iyengar-quic-loss-recovery-01>`__, 2616describing gQUIC, could be looked to for reference. 2617 2618Congestions Controllers 2619======================= 2620 2621The Send Controller has a choice of two congestion controllers: Cubic 2622and BBRv1. The latter was translated from Chromium into C. BBRv1 does 2623not work well for very small RTTs. 2624 2625To cope with that, lsquic puts the Send Controller into the "adaptive CC" 2626mode by default. The CC is selected after RTT is determined: below a 2627certain threshold (configurable; 1.5 ms by default), Cubic is used. 2628Until Cubic or BBRv1 is selected, *both* CC controllers are used -- 2629because we won't have the necessary state to instantiate a controller 2630when the decision is made. 2631 2632Buffered Packet Handling 2633======================== 2634 2635Buffered packets require quite a bit of special handling. Because they 2636are created outside of the regular event dispatch, a lot of things are 2637unknown: 2638 2639- Congestion window 2640 2641- Whether more incoming packets will arrive before the next tick 2642 2643- The optimal packet number size 2644 2645The Send Controller tries its best to accommodate the buffered packets 2646usage scenario. 2647 2648ACKs 2649---- 2650 2651When buffered packets are created, we want to generate an ACK, if 2652possible. This can be seen in ``send_ctl_get_buffered_packet``, which 2653calls ``ci_write_ack()`` 2654 2655This ACK should be in the first buffered packet to be scheduled. Because 2656the Send Controller does not dictate the order of buffered packet 2657creation -- high-priority versus low-priority -- it may need to move (or 2658steal) the ACK frame from a packet on the low-priority queue to a packet 2659on the high-priority queue. 2660 2661When buffered packets are finally scheduled, we have to remove ACKs from 2662them if another ACK has already been sent. This is because Chrome errors 2663out if out-of-order ACKs come in. 2664 2665Flushing QPACK Decoder 2666---------------------- 2667 2668The priority-based write events dispatch is emulated when the first 2669buffered packet is allocated: the QPACK decoder is flushed. Without it, 2670QPACK updates are delayed, which may negatively affect compression 2671ratio. 2672 2673Snapshot and Rollback 2674===================== 2675 2676The Send Controller snapshot and rollback functionality was implemented 2677exclusively for the benefit of the optimized ``lsquic_stream_pwritev`` 2678call. 2679 2680Complexity Woes 2681=============== 2682 2683The Send Controller is complicated. Because we write stream data to 2684packets directly and packets need to be resized, a lot of complexity 2685resides in the code to resize packets, be it due to repathing, STREAM 2686frame elision, or MTU changes. This is the price to be paid for 2687efficiency in the normal case. 2688 2689 2690Alarm Set 2691********* 2692 2693*Files: lsquic_alarmset.h, lsquic_alarmset.c, test_alarmset.c* 2694 2695TODO 2696 2697Tickable Queue 2698************** 2699 2700*Files: lsquic_engine.c, lsquic_min_heap.h, lsquic_min_heap.c* 2701 2702The Tickable Queue is a min-heap used as a priority queue. Connections 2703on this queue are in line to be processed. Connections that were last 2704ticked a longer time ago have higher priority than those ticked 2705recently. (``cn_last_ticked`` is used for ordering.) This is intended to 2706prevent starvation as multiple connections vye for the ability to send 2707packets. 2708 2709The way the min-heap grows is described in `Growing 2710Min-Heaps <#growing-min-heaps>`__. 2711 2712Advisory Tick Time Queue 2713************************ 2714 2715*Files: lsquic_attq.h, lsquic_attq.c* 2716 2717This data structure is a mini-heap. Connections are ordered by the value 2718of the next time they should be processed (ticked). (Because this is not 2719a hard limit, this value is advisory -- hence its name.) 2720 2721This particular min-heap implementation has two properties worth 2722highlighting: 2723 2724Removal of Arbitrary Elements 2725============================= 2726 2727When a connection's next tick time is updated (or when the connection is 2728destroyed), the connection is removed from the ATTQ. At that time, it 2729may be at any position in the min-heap. The position is recorded in the 2730heap element, ``attq_elem->ae_heap_idx`` and is updated when elements are 2731swapped. This makes it unnecessary to search for the entry in the 2732min-heap. 2733 2734Swapping Speed 2735============== 2736 2737To make swapping faster, the array that underlies the min-heap is an 2738array of *pointers* to ``attq_elem``. This makes it unnecessary to update 2739each connection's ``cn_attq_elem`` as array elements are swapped: the 2740memory that stores ``attq_elem`` stays put. This is why there are both 2741``aq_elem_malo`` and ``aq_heap``. 2742 2743CID Purgatory 2744************* 2745 2746Memory Manager 2747************** 2748 2749Malo Allocator 2750************** 2751 2752Receive History 2753*************** 2754 2755*Files: lsquic_rechist.h, lsquic_rechist.c, test_rechist.c* 2756 2757.. _overview-5: 2758 2759Overview 2760======== 2761 2762The reason for keeping the history of received packets is to generate 2763ACK frames. The Receive History module provides functionality to add 2764packet numbers, truncate history, and iterate over the received packet 2765number ranges. 2766 2767.. _data-structures-3: 2768 2769Data Structures 2770=============== 2771 2772.. _overview-6: 2773 2774Overview 2775-------- 2776 2777The receive history is a singly-linked list of packet number ranges, 2778ordered from high to low: 2779 2780.. image:: rechist-linked-list.png 2781 2782The ordering is maintained as an invariant with each addition to the 2783list and each truncation. This makes it trivial to iterate over the 2784ranges. 2785 2786To limit the amount of memory this data structure can allocate, the 2787maximum number of elements is specified when Receive History is 2788initialized. In the unlikely case that that number is reached, new 2789elements will push out the elements at the tail of the linked list. 2790 2791Memory Layout 2792------------- 2793 2794In memory, the linked list elements are stored in an array. Placing them 2795into contiguous memory achieves three goals: 2796 2797- Receive history manipulation is fast because the elements are all 2798 close together. 2799 2800- Memory usage is reduced because each element does not use pointers to 2801 other memory locations. 2802 2803- Memory fragmentation is reduced. 2804 2805The array grows as necessary as the number of elements increases. 2806 2807The elements are allocated from and returned to the array with the aid 2808of an auxiliary data structure. An array of bitmasks is kept where each 2809bit corresponds to an array element. A set bit means that the element is 2810allocated; a cleared bit indicates that the corresponding element is 2811free. 2812 2813To take memory savings and speed further, the element array and the 2814array of bitmasks are allocated in a single span of memory. 2815 2816.. image:: rechist-memory-layout.png 2817 2818rechist_elem 2819------------ 2820 2821``re_low`` and ``re_count`` define the packet range. To save memory, we 2822assume that the range will not contain more than 4 billion entries and 2823use a four-byte integer instead of a second ``lsquic_packno_t``. 2824 2825``re_next`` is the index of the next element. Again, we assume that there 2826will be no more than 4 billion elements. The NULL pointer is represented 2827by ``UINT_MAX``. 2828 2829This struct is just 16 bytes in size, which is a nice number. 2830 2831lsquic_rechist 2832-------------- 2833 2834``rh_elems`` and ``rh_masks`` are the element array and the bitmask array, 2835respectively, as described above. The two use the same memory chunk. 2836 2837``rh_head`` is the index of the first element of the linked list. 2838 2839The iterator state, ``rh_iter``, is embedded into the main object itself, 2840as there is no expectation that more than one iterations will need to be 2841active at once. 2842 2843.. _notable-code-5: 2844 2845Notable Code 2846============ 2847 2848Inserting Elements 2849------------------ 2850 2851Elements may be inserted into the list when a new packet number is added 2852to history via ``lsquic_rechist_received()``. If the new packet number 2853requires a new range (e.g. it does not expand one of the existing 2854ranges), a new element is allocated and linked. 2855 2856There are four cases to consider: 2857 28581. Inserting the new element at the head of the list, with it becoming 2859 the new head. (This is the most common scenario.) The code that 2860 does it is labeled ``first_elem``. 2861 28622. Appending the new element to the list, with it becoming the new tail. 2863 This code is located right after the ``while`` loop. 2864 28653. Inserting the new element between two existing elements. This code is 2866 labeled ``insert_before``. 2867 28684. Like (3), but when the insertion is between the last and the 2869 next-to-last elements and the maximum number of elements has been 2870 reached. In this case, the last element's packet number 2871 information can simply be replaced. This code is labeled 2872 ``replace_last_el``. 2873 2874Growing the Array 2875----------------- 2876 2877When all allocated elements in ``rh_elems`` are in use 2878(``rh_n_used >= rh_n_alloced``), the element array needs to be expanded. 2879This is handled by the function ``rechist_grow``. 2880 2881Note how, after realloc, the bitmask array is moved to its new location 2882on the right side of the array. 2883 2884Handling Element Overflow 2885------------------------- 2886 2887When the array has grown to its maximum allowed size, allocating a new 2888element occurs via reusing the last element on the list, effectively 2889pushing it out. This happens in ``rechist_reuse_last_elem``. 2890 2891The first loop finds the last linked list element: that's the element 2892whose ``re_next`` is equal to ``UINT_MAX``. 2893 2894Then, the second loop finds the element that points to the last element. 2895This is the next-to-last (penultimate) element. This element's next 2896pointer will now be set to NULL, effectively dropping the last element, 2897which can now be reused. 2898 2899Iterating Over Ranges 2900--------------------- 2901 2902Iteration is performed by the ``lsquic_rechist_first`` and 2903``lsquic_rechist_next`` pair of functions. The former resets the internal 2904iterator. Only one iteration at a time is possible. 2905 2906These functions have a specific signature: they and the pointer to the 2907receive history are passed to the ``pf_gen_ack_frame`` function, which 2908generates an ACK frame. 2909 2910Clone Functionality 2911------------------- 2912 2913The Receive History can be initialized from another instance of a 2914receive history. This is done by ``lsquic_rechist_copy_ranges``. This 2915functionality is used during connection promotion, when `Tiny Receive 2916History <#tiny-receive-history>`__ that is used by the `IETF mini 2917connection <#mini-ietf-connection>`__ is converted to Receive History. 2918 2919Tiny Receive History 2920******************** 2921 2922*Files: lsquic_trechist.h, lsquic_trechist.c, test_trechist.c* 2923 2924.. _overview-7: 2925 2926Overview 2927======== 2928 2929The Tiny Receive History is similar to `Receive 2930History <#receive-history>`__, but it is even more frugal with memory. 2931It is used in the `IETF mini connection <#mini-ietf-connection>`__ as a 2932more costly `alternative to using bare bitmasks <#imc-recvd-packnos>`__. 2933 2934Because it is so similar to Receive History, only differences are 2935covered in this section. 2936 2937Less Memory 2938=========== 2939 2940No Trechist Type 2941---------------- 2942 2943There is no ``lsquic_trechist``. The history is just a single 32-bit 2944bitmask and a pointer to the array of elements. The bitmask and the 2945pointer are passed to all ``lsquic_trechist_*`` functions. 2946 2947This gives the user of Tiny Receive History some flexibility and saves 2948memory. 2949 2950Element 2951------- 2952 2953The linked list element, ``trechist_elem``, is just 6 bytes in size. The 2954assumptions are: 2955 2956- No packet number is larger than 2\ :sup:`32` - 1 2957 2958- No packet range contains more than 255 packets 2959 2960- Linked list is limited to 256 elements 2961 2962Head Does Not Move 2963================== 2964 2965Because of memory optimizations described above, the head element is 2966always at index 0. The NULL pointer ``te_next`` is indicated by the value 29670 (because nothing points to the first element). 2968 2969Array Does Not Grow 2970=================== 2971 2972The size of the element array is limited by the 32-bit bitmask. As a 2973further optimization, the number of ranges is limited to 16 via the 2974``TRECHIST_MAX_RANGES`` macro. 2975 2976Insertion Range Check 2977===================== 2978 2979A packet range spanning more than 255 (UCHAR_MAX) packets cannot be 2980represented. This will cause a failure, as it is checked for in the 2981code. 2982 2983This many packets are unlikely to even be required to complete the 2984handshake. If this limit is hit, it is perhaps good to abort the mini 2985connection. 2986 2987Set64 2988***** 2989 2990Appendix A: List of Data Structures 2991*********************************** 2992 2993The majority of data structures employed by lsquic are linked lists and, 2994to a lesser extent, arrays. This makes the code simple and fast 2995(assuming a smart memory layout). 2996 2997Nevertheless, a few places in the code called for more involved and, at 2998times, customized data structures. This appendix catalogues them. 2999 3000This is the list of non-trivial data structures implemented in lsquic: 3001 3002Ring Buffer Linked Lists 3003======================== 3004 3005- `Receive History <#receive-history>`__ 3006 3007- `Tiny Receive History <#tiny-receive-history>`__ 3008 3009Hashes 3010====== 3011 3012- lsquic_hash 3013 3014- hash_data_in 3015 3016Min-heaps 3017========= 3018 3019- `Advisory Tick Time Queue <#advisory-tick-time-queue>`__ 3020 3021- lsquic_min_heap 3022 3023Bloom Filters 3024============= 3025 3026- CID Purgatory 3027 3028 3029Appendix B: Obsolete and Defunct Code 3030************************************* 3031 3032Mem Used 3033======== 3034 3035Engine History 3036============== 3037 3038QLOG 3039==== 3040 3041 3042.. [1] 3043 This is due to the limitation of the QPACK library: the decoder can 3044 read input one byte at a time, but the encoder cannot write output 3045 one byte at a time. It could be made to do that, but the effort is 3046 not worth it. 3047 3048.. [2] 3049 Mini conn structs are allocated out of the `Malo 3050 Allocator <#malo-allocator>`__, which used to be limited to objects 3051 whose size is a power of two, so it was either fitting it into 128 3052 bytes or effectively doubling the mini conn size. 3053