
Connection abort by the error "DCID not found" #176

Closed
initlifeinc opened this issue Oct 9, 2020 · 6 comments

@initlifeinc
Contributor

Hi @dtikhonov, I have a problem when using lsquic to connect to a server.

Sometimes the client receives a Retry packet and then raises this error:
"conn: Abort connection: DCID not found"
This breaks the QUIC connection.
I wonder whether something is wrong with the function "iquic_esfi_reset_dcid".
I'm trying to reproduce it.
Do you have any ideas about it?

lsquic version is 2.7.0

@dtikhonov
Contributor

The "DCID not found" error message comes from this code:

    if (cce >= END_OF_CCES(lconn))
    {   
        ABORT_WARN("DCID not found");
        return -1;
    }

This is in function on_dcid_change(), called here:

        parse_regular_packet(conn, packet_in);
        if (saved_path_id == conn->ifc_cur_path_id)
        {   
            if (conn->ifc_cur_path_id != packet_in->pi_path_id)
                on_new_or_unconfirmed_path(conn, packet_in);
            else if (!LSQUIC_CIDS_EQ(CN_SCID(&conn->ifc_conn),
                                                    &packet_in->pi_dcid))
            {   
                if (0 != on_dcid_change(conn, &packet_in->pi_dcid))
                    return -1;
            }
        }

This means that the incoming packet has a DCID that this endpoint has not issued! This results in a connection error. This may be a bug in the server. Do you know which server implementation you're connecting to?


Why do you think it's iquic_esfi_reset_dcid()? Are you sure it's a Retry packet that causes this?

If iquic_esfi_reset_dcid() failed it would be pretty odd. It would mean that setup_handshake_keys() failed. This, in turn, means either that there is a memory allocation failure (the likelihood of this is low) or that there is some buffer size mismatch error or an error in calling a BoringSSL function. There are only a few possible errors. You should add log messages so that we can see exactly where it fails next time it hits.

@initlifeinc
Contributor Author

Here is the detailed log; please take a look.
dcid_issue.log

This is my analysis.
(image: dcidnotfound)
Questions:

  1. If the client receives a packet whose DCID differs from the connection's SCID, is that normal QUIC behavior? Does the QUIC library need to support this case, and should it always switch the connection SCID to the packet's DCID?
  2. Here we can see that if a received packet's DCID is not found in the connection's local list, lsquic raises a "DCID not found" error. For such packets, should QUIC abort the connection or just ignore them?
  3. In the image, you will see that before the last DCID mismatch, the client received a RETIRE_CONNECTION_ID frame. After the client receives this frame, can it still receive packets with an invalid DCID in a normal QUIC flow?
  4. In your opinion, is this an error in the server QUIC (my server uses the quic-go library), an error in the client QUIC, or just a normal case?

Some other questions about lsquic:

  1. Why does it generate so many NEW_CONNECTION_ID frames? Is it just resending the NEW_CONNECTION_ID frame?
    (image)
  2. If out-of-order packets are received, how is lsquic expected to handle them?

(image: outoforder packet)

@dtikhonov
Contributor

Thank you for the logs! Your analysis is very good -- and the bug is indeed in lsquic. Here we are tripped up by something unusual that the lsquic client does behind the scenes in some circumstances. Explanation follows.

Because all non-deprecated, non-experimental gQUIC and IETF QUIC versions are enabled by default, lsquic has to cope with an odd property of the Q046 and Q050 versions. In these versions, the server never includes a CID in the packets it sends to the client. This means that a client would not be able to differentiate between two connections by CID when a packet is received. To work around this, lsquic identifies connections by the local port number instead of the connection ID. It uses the following code to decide which to pick:

    static int
    hash_conns_by_addr (const struct lsquic_engine *engine)
    {
        if (engine->flags & ENG_SERVER)
            return 0;
        if (engine->pub.enp_settings.es_versions & LSQUIC_FORCED_TCID0_VERSIONS)
            return 1;
        if ((engine->pub.enp_settings.es_versions & LSQUIC_GQUIC_HEADER_VERSIONS)
                                    && engine->pub.enp_settings.es_support_tcid0)
            return 1;
        if (engine->pub.enp_settings.es_scid_len == 0)
            return 1;
        return 0;
    }

(As you can see, Q043 has the same issue when es_support_tcid0 is enabled -- which it is by default.)

Thus, after CID 324D2EE8D711FDF2 is retired, the engine should not be able to find a matching connection for subsequent incoming packets that have DCID set to this value, and those packets should be dropped. But (and this is the bug!), since connections are identified by the local port number, the engine still knows which connection the packet is for. The connection logic does not know how to deal with this situation and aborts.

The proper way to fix it is to perform a check -- only when connections are identified by port number -- whether any SCID matches the DCID in the incoming packet. If there is no matching SCID, discard the packet immediately and do not give it to the connection. This will mimic the normal behavior when connections are looked up in a hash by DCID.
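
The check described above could be sketched roughly as follows. This is only an illustration, not the actual lsquic patch: `struct cid`, `struct conn_scids`, `dcid_matches_live_scid()` and `should_deliver()` are hypothetical stand-ins for the real connection structures, and `MAX_SCIDS` is an arbitrary capacity chosen for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_CID_LEN 20   /* QUIC version 1 limits CIDs to 20 bytes */
#define MAX_SCIDS   8    /* hypothetical capacity, for illustration only */

/* Hypothetical stand-in for a connection ID. */
struct cid
{
    uint8_t len;
    uint8_t buf[MAX_CID_LEN];
};

/* Hypothetical stand-in for the SCIDs a connection has issued. */
struct conn_scids
{
    struct cid scids[MAX_SCIDS];
    unsigned   in_use;   /* bit i set: scids[i] is issued and not retired */
};

static int
cid_eq (const struct cid *a, const struct cid *b)
{
    return a->len == b->len && 0 == memcmp(a->buf, b->buf, a->len);
}

/* Return 1 if `dcid' matches an SCID that is still live, 0 otherwise. */
static int
dcid_matches_live_scid (const struct conn_scids *cs, const struct cid *dcid)
{
    unsigned i;

    for (i = 0; i < MAX_SCIDS; ++i)
        if ((cs->in_use & (1u << i)) && cid_eq(&cs->scids[i], dcid))
            return 1;
    return 0;
}

/* Dispatch decision: 1 means hand the packet to the connection,
 * 0 means drop it silently.  The extra check is only needed when the
 * connection was found by address rather than by DCID.
 */
static int
should_deliver (int conns_hashed_by_addr, const struct conn_scids *cs,
                                                const struct cid *dcid)
{
    if (!conns_hashed_by_addr)
        return 1;   /* a DCID hash lookup already implies a match */
    return dcid_matches_live_scid(cs, dcid);
}
```

With this shape, a packet carrying a retired CID is dropped before the connection sees it, so the connection never reaches the "DCID not found" abort path.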

To answer your questions:

  1. Yes, getting a different DCID (as long as it is one of the issued CIDs) is normal. A peer can change DCID at any time.
  2. Explained above.
  3. Explained above.
  4. This is a bug in lsquic.

  1. Generation of SCIDs is just something that happens at the beginning of a connection (here, when Retry has been handled). A bunch of NEW_CONNECTION_ID frames are generated once at the beginning of the connection, and then the peer is free to use them. When the peer retires CIDs it was issued, this endpoint will issue more CIDs.
  2. lsquic copes with out-of-order packets just fine. Note that packets 7 and 5 (the first highlighted pair) are not out-of-order: they are in different Packet Number Spaces.
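
To illustrate the point about Packet Number Spaces, here is a minimal sketch of per-space packet number tracking. It is not lsquic code: the enum, `struct pn_tracker` and `record_and_check_reordered()` are hypothetical names, and the spaces used in the usage note are chosen arbitrarily, since the thread does not say which spaces packets 7 and 5 belonged to.

```c
#include <stdint.h>

/* The three QUIC packet number spaces (names here are illustrative). */
enum pn_space { SPACE_INITIAL, SPACE_HANDSHAKE, SPACE_APP, N_SPACES };

/* Largest packet number seen so far, tracked separately per space. */
struct pn_tracker
{
    uint64_t largest[N_SPACES];
    int      seen[N_SPACES];
};

/* Record a packet number.  Return 1 if it arrived after a higher packet
 * number in the SAME space (that is, it is reordered), 0 otherwise.
 * Packet numbers from different spaces are never compared.
 */
static int
record_and_check_reordered (struct pn_tracker *t, enum pn_space space,
                                                        uint64_t packno)
{
    if (t->seen[space] && packno < t->largest[space])
        return 1;
    t->largest[space] = packno;
    t->seen[space] = 1;
    return 0;
}
```

For example, starting from a zeroed tracker, recording packet 7 in one space and then packet 5 in another reports no reordering for either, whereas recording 5 and then 4 in the same space reports the second as reordered.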

@initlifeinc
Contributor Author

I now use only LSQVER_ID24 on both the client and the server side. To fix or avoid this bug, is it OK to force es_versions = 1 << LSQVER_ID24 so that lsquic does not use the address to find the connection (and uses the DCID instead)?

    lsquic_engine_init_settings(&settings_, flags);
    // The following line is newly added to avoid the issue
    settings_.es_versions = 1 << LSQVER_ID24;

@dtikhonov
Contributor

Yes, this will work.
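
A small model of the hash_conns_by_addr() logic quoted earlier shows why restricting the version set has this effect. This is a sketch under stated assumptions: the `VER_*` enum values are made up for the example (the real ones live in lsquic.h), the two masks follow the description above (Q046 and Q050 forced, Q043 conditional on es_support_tcid0), and `struct settings_model` only mirrors the three settings the function reads.

```c
/* Hypothetical version numbering; real values are defined in lsquic.h. */
enum ver { VER_Q043, VER_Q046, VER_Q050, VER_ID24 };

/* Masks modeled on the explanation in this thread. */
#define FORCED_TCID0_VERSIONS ((1u << VER_Q046) | (1u << VER_Q050))
#define GQUIC_HEADER_VERSIONS (1u << VER_Q043)

/* Mirrors only the settings that the decision depends on. */
struct settings_model
{
    unsigned versions;        /* models es_versions */
    int      support_tcid0;   /* models es_support_tcid0 */
    unsigned scid_len;        /* models es_scid_len */
};

/* Same decision order as hash_conns_by_addr(): 1 means connections are
 * identified by address, 0 means they are looked up by DCID.
 */
static int
model_hash_conns_by_addr (int is_server, const struct settings_model *s)
{
    if (is_server)
        return 0;
    if (s->versions & FORCED_TCID0_VERSIONS)
        return 1;
    if ((s->versions & GQUIC_HEADER_VERSIONS) && s->support_tcid0)
        return 1;
    if (s->scid_len == 0)
        return 1;
    return 0;
}
```

In this model, a client with only the ID24 bit set and a non-zero SCID length gets 0 (lookup by DCID). Note the last condition: if es_scid_len were set to 0, the client would still fall back to identifying connections by address, so the workaround presumably relies on leaving the SCID length non-zero.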

litespeedtech pushed a commit that referenced this issue Oct 13, 2020
- [FEATURE] IETF Client 0-RTT support.
- [BUGFIX] Do not schedule MTU probe on first tick.
- [BUGFIX] Parsing DATAGRAM frame.
- [BUGFIX] If push promise fails, do not invoke hset destructor.
- [BUGFIX] Client: When connections are IDed by port number, check DCID.
  Fixes issue #176.
- Revert the 2.22.1 lsquic_is_valid_hs_packet change.  All that was
  necessary is a change to the way we call it in lsquic_engine.  No
  change to the function itself is required.
@litespeedtech
Owner

Fixed in 2.23.1 -- closing.
