Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bot is unable to properly connect to certain Discord voice servers #714

Closed
KontraCity opened this issue Jun 18, 2023 · 39 comments · Fixed by #898 or #900
Closed

Bot is unable to properly connect to certain Discord voice servers #714

KontraCity opened this issue Jun 18, 2023 · 39 comments · Fixed by #898 or #900
Assignees
Labels
audio audio issues or feature requests bug Something isn't working

Comments

@KontraCity
Copy link

Git commit reference
cd5b9ca

Describe the bug
The problem is that on_voice_ready callback is not called when bot connects to certain voice channels.
Example of successful voice server connection:

  1. DPP logs show this:
    DEBUG: Sending op 4 to join VC, guild xxx channel xxx
    DEBUG: Connecting voice for guild xxx channel xxx
    DEBUG: Connecting new voice session...
    DEBUG: Voice websocket established; UDP endpoint: xxx.xxx.xxx.xxx:xxx [ssrc=xxx] with 7 modes
    DEBUG: External IP address: xxx.xxx.xxx.xxx
  2. Bot connects with following request:
    GET /?v=4 HTTP/1.1
    Host: russia8508.discord.media
    pragma: no-cache
    User-Agent: DPP/0.1
    Upgrade: WebSocket
    Connection: Upgrade
    Sec-WebSocket-Key: xxx
    Sec-WebSocket-Version: 13
  3. Server responds with HTTP 101 response:
    HTTP/1.1 101 Switching Protocols
    Connection: upgrade
    sec-websocket-accept: xxx
    upgrade: websocket
    CF-Cache-Status: DYNAMIC
    Report-To: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=xxx"}],"group":"cf-nel","max_age":604800}
    NEL: {"success_fraction":0.01,"report_to":"cf-nel","max_age":604800}
    Strict-Transport-Security: max-age=15552000; includeSubDomains; preload
    X-Content-Type-Options: nosniff
    Server: cloudflare
    CF-RAY: xxx-DME
  4. Server sends JSON payload:
    1{"op":8,"d":{"v":4,"heartbeat_interval":13750.0}}
    Connection: upgrade
    sec-websocket-accept: xxx
    upgrade: websocket
    CF-Cache-Status: DYNAMIC
    Report-To: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=xxx"}],"group":"cf-nel","max_age":604800}
    NEL: {"success_fraction":0.01,"report_to":"cf-nel","max_age":604800}
    Strict-Transport-Security: max-age=15552000; includeSubDomains; preload
    X-Content-Type-Options: nosniff
    Server: cloudflare
    CF-RAY: xxx-DME
  5. Connection proceeds. on_voice_ready callback is called.

Example of erroneous voice server connection in different Discord guild:

  1. DPP logs show this:
    DEBUG: Sending op 4 to join VC, guild xxx channel xxx
    DEBUG: Connecting voice for guild xxx channel xxx
    Nothing happens after that.
  2. Bot connects with following request:
    GET /?v=4 HTTP/1.1
    Host: tel-aviv10001.discord.media
    pragma: no-cache
    User-Agent: DPP/0.1
    Upgrade: WebSocket
    Connection: Upgrade
    Sec-WebSocket-Key: xxx
    Sec-WebSocket-Version: 13
  3. Server responds with HTTP 400 response:
    HTTP/1.1 400 Bad request
    Content-Type: text/html
    Transfer-Encoding: chunked
    Connection: keep-alive
    Cache-Control: no-cache
    CF-Cache-Status: DYNAMIC
    Report-To: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=xxx"}],"group":"cf-nel","max_age":604800}
    NEL: {"success_fraction":0.01,"report_to":"cf-nel","max_age":604800}
    Strict-Transport-Security: max-age=15552000; includeSubDomains; preload
    X-Content-Type-Options: nosniff
    Server: cloudflare
    CF-RAY: xxx-DME
    5a
    <html><body><h1>400 Bad request</h1>
    Your browser sent an invalid request.
    </body></html>
  4. By design of dpp::discord_voice_client::thread_run() function bot tries to reconnect. Steps 2 and 3 are repeated infinitely.
    It may be important to mention that in Discord client bot actually connects to voice channel and sits there even when HTTP 400 error occurs.

To Reproduce
Issue appears to only occur when bot connects to tel-avivXXXXX.discord.media Discord voice servers. The problematic servers do work in Discord desktop clients, so I think the problem is with the library.
Steps to reproduce the behavior:

  1. Connect to voice channel that is served by tel-avivXXXXX.discord.media voice server.
  2. Make the bot connect to it, for example call dpp::guild::connect_member_voice().
  3. Observe that on_voice_ready callback is never called although bot connects to voice channel in Discord UI.

Expected behavior
Bot should properly connect to voice server and on_voice_ready callback should be called.

System Details:

  • Tried it on both
    Linux PC 5.15.90.1-microsoft-standard-WSL2 #1 SMP x86_64 x86_64 x86_64 GNU/Linux
    and
    Linux rpi4 6.1.21-v8+ #1642 SMP PREEMPT aarch64 GNU/Linux
    OSes.
  • Desktop Discord Client was used for testing

Some of my assumptions/terms may be wrong because I tried to get how DPP source code works manually.

@KontraCity KontraCity added the bug Something isn't working label Jun 18, 2023
@braindigitalis
Copy link
Contributor

braindigitalis commented Jun 18, 2023 via email

@braindigitalis
Copy link
Contributor

braindigitalis commented Jun 18, 2023 via email

@KontraCity
Copy link
Author

sounds to me like there's a problem with the tel Aviv voice endpoint. bots and clients do not use the same api or same endpoint and we have no control over what http response code discord sends.

On 18 Jun 2023, 17:11, at 17:11, KontraCity @.> wrote: Git commit reference cd5b9ca Describe the bug The problem is that on_voice_ready callback is not called when bot connects to certain voice channels. Example of successful voice server connection: 1. DPP logs show this: DEBUG: Sending op 4 to join VC, guild xxx channel xxx DEBUG: Connecting voice for guild xxx channel xxx DEBUG: Connecting new voice session... DEBUG: Voice websocket established; UDP endpoint: xxx.xxx.xxx.xxx:xxx [ssrc=xxx] with 7 modes DEBUG: External IP address: xxx.xxx.xxx.xxx 2. Bot connects with following request: GET /?v=4 HTTP/1.1 Host: russia8508.discord.media pragma: no-cache User-Agent: DPP/0.1 Upgrade: WebSocket Connection: Upgrade Sec-WebSocket-Key: xxx Sec-WebSocket-Version: 13 3. Server responds with HTTP 101 response: HTTP/1.1 101 Switching Protocols Connection: upgrade sec-websocket-accept: xxx upgrade: websocket CF-Cache-Status: DYNAMIC Report-To: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=xxx"}],"group":"cf-nel","max_age":604800} NEL: {"success_fraction":0.01,"report_to":"cf-nel","max_age":604800} Strict-Transport-Security: max-age=15552000; includeSubDomains; preload X-Content-Type-Options: nosniff Server: cloudflare CF-RAY: xxx-DME 4. Server sends JSON payload: 1{"op":8,"d":{"v":4,"heartbeat_interval":13750.0}} Connection: upgrade sec-websocket-accept: xxx upgrade: websocket CF-Cache-Status: DYNAMIC Report-To: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=xxx"}],"group":"cf-nel","max_age":604800} NEL: {"success_fraction":0.01,"report_to":"cf-nel","max_age":604800} Strict-Transport-Security: max-age=15552000; includeSubDomains; preload X-Content-Type-Options: nosniff Server: cloudflare CF-RAY: xxx-DME 5. Connection proceeds. on_voice_ready callback is called. Example of erroneous voice server connection in different Discord guild: 1. DPP logs show this: DEBUG: Sending op 4 to join VC, guild xxx channel xxx DEBUG: Connecting voice for guild xxx channel xxx Nothing happens after that. 2. Bot connects with following request: GET /?v=4 HTTP/1.1 Host: tel-aviv10001.discord.media pragma: no-cache User-Agent: DPP/0.1 Upgrade: WebSocket Connection: Upgrade Sec-WebSocket-Key: xxx Sec-WebSocket-Version: 13 3. Server responds with HTTP 400 response: HTTP/1.1 400 Bad request Content-Type: text/html Transfer-Encoding: chunked Connection: keep-alive Cache-Control: no-cache CF-Cache-Status: DYNAMIC Report-To: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=xxx"}],"group":"cf-nel","max_age":604800} NEL: {"success_fraction":0.01,"report_to":"cf-nel","max_age":604800} Strict-Transport-Security: max-age=15552000; includeSubDomains; preload X-Content-Type-Options: nosniff Server: cloudflare CF-RAY: xxx-DME 5a <html><body><h1>400 Bad request</h1> Your browser sent an invalid request. </body></html> 4. By design of dpp::discord_voice_client::thread_run() function bot tries to reconnect. Steps 2 and 3 are repeated infinitely. It may be important to mention that in Discord client bot actually connects to voice channel and sits there even when HTTP 400 error occurs. To Reproduce Issue appears to only occur when bot connects to tel-avivXXXXX.discord.media Discord voice servers. The problematic servers do work in Discord desktop clients, so I think the problem is with the library. Steps to reproduce the behavior: 1. Connect to voice channel that is served by tel-avivXXXXX.discord.media voice server. 2. Make the bot connect to it, for example call dpp::guild::connect_member_voice(). 3. Observe that on_voice_ready callback is never called although bot connects to voice channel in Discord UI. Expected behavior Bot should properly connect to voice server and on_voice_ready callback should be called. System Details: - Tried it on both Linux PC 5.15.90.1-microsoft-standard-WSL2 #1 SMP x86_64 x86_64 x86_64 GNU/Linux and Linux rpi4 6.1.21-v8+ #1642 SMP PREEMPT aarch64 GNU/Linux OSes. - Desktop Discord Client was used for testing Some of my assumptions/terms may be wrong because I tried to get how DPP source code works manually. -- Reply to this email directly or view it on GitHub: #714 You are receiving this because you were assigned. Message ID: @.>

I tried using another bot with the endpoint and it works. Sounds like the problem is not with it or bot API to me.

@braindigitalis
Copy link
Contributor

it could be related to this:

discord/discord-api-docs#6145 (comment)

an error 400 is not something we can prevent or control and is generated by cloudflare not discord. your IP might be shadow banned from bot API perhaps.

bot voice uses different endpoints to the user client and different API versions.

@KontraCity
Copy link
Author

an error 400 is not something we can prevent or control and is generated by cloudflare not discord. your IP might be shadow banned from bot API perhaps.

Tried running the bot under VPN to another country, checked that DEBUG: External IP address: xxx is changed. The behaviour is the same.
I thought that because it is 400 Bad Request the endpoint expects some kind of different request, so the problem is the request.

it could be related to this:

discord/discord-api-docs#6145 (comment)

Not running in "Cloudflare Worker" and the mentioned issue is about authentication error rather than bad request, unlikely to help here.

@Commandserver Commandserver added the audio audio issues or feature requests label Jul 19, 2023
@Neko-Life
Copy link
Member

I think this might also be the cause of Too many files opened. error in the case where bot get stuck waiting for voice ready event, every time it tries to reconnect it opens a new file descriptor without closing the previous ones?

@braindigitalis
Copy link
Contributor

I think this might also be the cause of Too many files opened. error in the case where bot get stuck waiting for voice ready event, every time it tries to reconnect it opens a new file descriptor without closing the previous ones?

where in the code in dpp does it open the new descriptor?

@Neko-Life
Copy link
Member

I think this might also be the cause of Too many files opened. error in the case where bot get stuck waiting for voice ready event, every time it tries to reconnect it opens a new file descriptor without closing the previous ones?

where in the code in dpp does it open the new descriptor?

What i understand from here is that it caches the socket connection if keepalive is true, then it loops calling ssl_client::connect again here, and what i found in the ssl_client constructor it seems that make_new is default to true, that's when it creates new socket connection here. I might be wrong but it seems like such scenario is possible where the cached socket file descriptor is never closed and it would keeps looping creating new socket connection and keeps adding it to cache

@braindigitalis
Copy link
Contributor

no, these keepalives are never used right now in dpp (and when they were used, were not used and are not valid for a websocket connection upgrade)

@braindigitalis
Copy link
Contributor

yeah had to double check but socket reuse is disabled by default and nothing enables it

@Neko-Life
Copy link
Member

ah i see, the Too many files open. was a problem i encountered back in march tho, i don't think i ever encountered it ever since and i thought this issue might relate to it

@braindigitalis
Copy link
Contributor

can anyone actually duplicate this anywhere but the tel aviv endpoint? if its specific to one endpoint it smells of discord bug, so not much we can do.

@KontraCity
Copy link
Author

can anyone actually duplicate this anywhere but the tel aviv endpoint?

Yes, I am able to reproduce the problem. Let me show how:

Bot

We'll make a small bot for testing:

/* main.cpp */
// STL modules
#include <iostream>

// Library dpp
#include <dpp/dpp.h>

int main()
{
    dpp::cluster bot("token", dpp::i_default_intents | dpp::i_message_content);

    bot.on_message_create([&bot](const dpp::message_create_t& event) {
        if (event.msg.content[0] != '!')
            return;
        dpp::guild* guild = dpp::find_guild(event.msg.guild_id);

        if (event.msg.content == "!join") {
            std::cout << "[bot] Will connect to voice channel\n";
            if (!guild->connect_member_voice(event.msg.author.id)) {
                bot.message_create(dpp::message(event.msg.channel_id,
                    "You are not connected to voice channel"));
                return;
            }
        }
        else if (event.msg.content == "!leave") {
            std::cout << "[bot] Will leave voice channel\n";
            dpp::voiceconn* connection = event.from->get_voice(guild->id);
            if (!connection) {
                bot.message_create(dpp::message(event.msg.channel_id,
                    "Not connected to any voice channel"));
                return;
            }

            event.from->disconnect_voice(guild->id);
            bot.message_create(dpp::message(event.msg.channel_id,
                    "Disconnected from voice channel"));
        }
    });

    bot.on_voice_ready([&bot](const dpp::voice_ready_t& event) {
        std::cout << "[bot] Voice connection is ready\n";
    });

    std::cout << "[bot] Starting bot\n";
    bot.start(false);
}

A fresh bot user and guild may be created for cleanliness of testing, but I don't think that this is necessary.
g++ -std=c++17 main.cpp -o main -ldpp

Library

We'll modify library code to show us HTTP response information:

/* sslclient.cpp */
...
502 switch (e) {
503     case SSL_ERROR_NONE:
504         /* Data received, add it to the buffer */
505         if (r > 0) {
+++             std::string data(server_to_client_buffer, r);
+++             std::smatch matches;
+++             if (std::regex_search(data, matches, std::regex(R"(^(HTTP\/1\.1 .+))")))
+++                 std::cout << "[lib] " << matches.str(1) << '\n';
506             buffer.append(server_to_client_buffer, r);
507             if (!this->handle_buffer(buffer)) {
508                 return;
509             }
510             bytes_in += r;
511         }
...

An additional #include <regex> is required.
make -j install

Output

With this setup:

[bot] Starting bot
[lib] HTTP/1.1 200 OK
[lib] HTTP/1.1 101 Switching Protocols
[lib] HTTP/1.1 200 OK
// user in vc: !join //
[bot] Will connect to voice channel
[lib] HTTP/1.1 101 Switching Protocols
[bot] Voice connection is ready

is considered to be successful output;

[bot] Starting bot
[lib] HTTP/1.1 200 OK
[lib] HTTP/1.1 101 Switching Protocols
[lib] HTTP/1.1 200 OK
// user in vc: !join //
[bot] Will connect to voice channel
[lib] HTTP/1.1 400 Bad request
[lib] HTTP/1.1 400 Bad request
[lib] HTTP/1.1 400 Bad request
[lib] HTTP/1.1 400 Bad request
[lib] HTTP/1.1 400 Bad request
// repeats infinetely //

is considered to be erroneous output.

Testing

Region overriding in voice channel settings can be used to cycle bot through several voice servers.
I think that Tel-Aviv or Israel is absent in selection because Tel-Aviv servers are being tested by Discord. Anyway, it doesn't matter because the problem is not unique to Tel-Aviv servers as I found out from testing results.
My steps:

  1. Connect to voice channel, select Brazil region override.
  2. Connect bot to voice channel. Observe results.
  3. Disconnect bot from voice channel.
  4. Change to next region.
  5. Go to step 2

My results

aarch64
test 1, rpi4, aarch64, g++ (Debian 10.2.1-6) 10.2.1 20210110

Region          | Server            | Response
-------------------------------------------------------------
Brazil          | brazil3305        | 101 Switching Protocols
Hong Kong       | hongkong11031     | 400 Bad request
India           | india5924         | 101 Switching Protocols
Japan           | japan7215         | 101 Switching Protocols
Rotterdam       | rotterdam3102     | 101 Switching Protocols
Russia          | russia5000        | 101 Switching Protocols
Singapore       | singapore11038    | 400 Bad request
South Africa    | southafrica3280   | 101 Switching Protocols
Syndey          | sydney11007       | 400 Bad request
US Central      | us-central1195    | 101 Switching Protocols
US East         | us-east533        | 101 Switching Protocols
US South        | us-south301       | 101 Switching Protocols
US West         | us-west11003      | 101 Switching Protocols

test 2, rpi4, aarch64, g++ (Debian 10.2.1-6) 10.2.1 20210110

Region          | Server            | Response
-------------------------------------------------------------
Brazil          | brazil495         | 101 Switching Protocols
Hong Kong       | hongkong3101      | 101 Switching Protocols
India           | india9972         | 101 Switching Protocols
Japan           | japan3512         | 101 Switching Protocols
Rotterdam       | rotterdam2930     | 101 Switching Protocols
Russia          | russia7782        | 101 Switching Protocols
Singapore       | singapore11048    | 400 Bad request
South Africa    | southafrica9888   | 101 Switching Protocols
Syndey          | sydney11007       | 400 Bad request
US Central      | us-central2659    | 101 Switching Protocols
US East         | us-east4353       | 101 Switching Protocols
US South        | us-south2767      | 101 Switching Protocols
US West         | us-west5779       | 101 Switching Protocols

x86_64
test 1, pc, x86_64, g++ (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0

Brazil          | brazil3811        | 400 Bad request
Hong Kong       | hongkong2146      | 400 Bad request
India           | india2557         | 101 Switching Protocols
Japan           | japan6476         | 400 Bad request
Rotterdam       | rotterdam3775     | 101 Switching Protocols
Russia          | russia1860        | 101 Switching Protocols
Singapore       | singapore3419     | 101 Switching Protocols
South Africa    | southafrica9470   | 101 Switching Protocols
Syndey          | sydney11009       | 400 Bad request
US Central      | us-central2775    | 101 Switching Protocols
US East         | us-east6698       | 101 Switching Protocols
US South        | us-south10001     | 400 Bad request
US West         | us-west1026       | 101 Switching Protocols

test 2, pc, x86_64, g++ (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0

Region          | Server            | Response
-------------------------------------------------------------
Brazil          | brazil3366        | 101 Switching Protocols
Hong Kong       | hongkong776       | 101 Switching Protocols
India           | india3572         | 101 Switching Protocols
Japan           | japan11053        | 400 Bad request
Rotterdam       | rotterdam5162     | 101 Switching Protocols
Russia          | russia6178        | 101 Switching Protocols
Singapore       | singapore11079    | 400 Bad request
South Africa    | southafrica6172   | 101 Switching Protocols
Syndey          | sydney11020       | 400 Bad request
US Central      | us-central8865    | 101 Switching Protocols
US East         | us-east9427       | 101 Switching Protocols
US South        | us-south10001     | 400 Bad request
US West         | us-west6448       | 101 Switching Protocols

VPN bonus to eliminate IP problems possibility
vpn test, pc, x86_64, g++ (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0

Region          | Server            | Response
-------------------------------------------------------------
Brazil          | brazil8602        | 400 Bad request
Hong Kong       | hongkong11048     | 400 Bad request
India           | india9989         | 101 Switching Protocols
Japan           | japan11075        | 400 Bad request
Rotterdam       | rotterdam1221     | 101 Switching Protocols
Russia          | russia7679        | 101 Switching Protocols
Singapore       | singapore249      | 101 Switching Protocols
South Africa    | southafrica181    | 101 Switching Protocols
Syndey          | sydney11027       | 400 Bad request
US Central      | us-central1835    | 101 Switching Protocols
US East         | us-east3807       | 101 Switching Protocols
US South        | us-south2110      | 101 Switching Protocols
US West         | us-west9340       | 101 Switching Protocols

I can tell that the problem is not unique to Tel-Aviv servers nor to any region. It is rather a lottery and I can't come up with a solution.

@braindigitalis
Copy link
Contributor

look at the geography of the majority of these places... nearly all are Oceania. it smells of a problem with cloudflare to me.
Either way, we are sending the correct protocol and not getting the correct response. this is not something we can fix.

@KontraCity
Copy link
Author

Why other bots work then? The servers dislike the request if it is 400 Bad request. Maybe there is something that some servers accept and the problematic reject, like old HTTP protocol or some missing headers?

@braindigitalis
Copy link
Contributor

if this was the case they would all be broken

@github-actions
Copy link
Contributor

This issue has had no activity and is being marked as stale. If you still wish to continue with this issue please comment to reopen it.

@github-actions github-actions bot added the Stale label Sep 23, 2023
@Mishura4
Copy link
Member

bump

@raxyte raxyte removed the Stale label Sep 23, 2023
@Jaskowicz1
Copy link
Contributor

Jaskowicz1 commented Sep 23, 2023

So, we can rule out tel-aviv because, as OP mentioned, it's experimental. It's likely discord doesn't want many people touching that until they get more answers. For the other regions, here are my test results (Recorded from the UK):

Region          | Server            | Response
---------------------------------------------------------------------
Brazil          | brazil11020       | 400 Bad request (219ms)
Hong Kong       | hongkong7902      | 400 Bad request (214ms)
India           | india9766         | 101 Switching Protocols (131ms)
Japan           | japan3231         | 101 Switching Protocols (247ms)
Rotterdam       | rotterdam11013    | 101 Switching Protocols (29ms)
Russia          | russia7782        | 101 Switching Protocols (63ms)
Singapore       | singapore6089     | 101 Switching Protocols (185ms, 400 on second run, 11019)
South Africa    | southafrica2917   | 101 Switching Protocols (186ms)
Syndey          | sydney1100        | 400 Bad request (280ms, 101 on second run, 3028)
US Central      | us-central4565    | 101 Switching Protocols (101ms)
US East         | us-east1635       | 101 Switching Protocols (85ms)
US South        | us-south5871      | 101 Switching Protocols (114ms)
US West         | us-west3799       | 400 Bad request (142ms)

From this, I'm assuming anything above or equal to 142ms, is random as to if you can join. I'd like to even say India with 131ms could possibly do this too. Automated testing is required to get more answers, but I'm assuming discord can't handle bots that are too far away, requiring you to get shards and deploy them in different locations.

This data, ofc, isn't accurate. It could be more or less. I'll do more research and I'll find out what's going on.

@Jaskowicz1
Copy link
Contributor

Jaskowicz1 commented Sep 23, 2023

After further discussion in Discord to try and find the issue out, it seems that it's not a library issue at all, my previous assumption is not entirely true either.

While doing even more testing, we found that this can happen on any location, at any endpoint. More specifically, I had the rotterdam endpoint rotterdam5048 constantly cause 400 Bad request and upon further research, this isn't exclusive to this library and dates all the way back to 2018.

Connection definitely plays a part, but this definitely seems like a Discord issue and not a D++ issue.

@Jaskowicz1
Copy link
Contributor

#883 now stops this from infinite looping. It's not a fix, but it improves the logging (so DPP will natively tell you about the http error code) and, like I said at the start of this post, stops the infinite loop happening that you mentioned :)

@braindigitalis
Copy link
Contributor

we have reached the limit of what we can do to fix this. I've raised it with discord and if they say there is anything we can do, i shall reopen this issue.

@KontraCity
Copy link
Author

The cause of the error was so simple that it wasn't worth this long discussion.
Apparently the Sec-WebSocket-Key HTTP header field generation is done completely wrong in the library.
According to WebSocket protocol documentation, Sec-WebSocket-Key should be a random base64 string. In the library it's just

...    /* wsclient.cpp */
 40    : ssl_client(hostname, port),
 41    key(std::to_string(time(nullptr))),
 42    state(HTTP_HEADERS),
...

Almost all Discord servers ignore this field, but the problematic ones notice the problem and respond with 400 Bad Request.
Changing the code:

...    /* wsclient.cpp */
 52    this->write(
 53            "GET " + this->path + " HTTP/1.1\r\n"
 54            "Host: " + this->hostname + "\r\n"
 55            "pragma: no-cache\r\n"
 56            "User-Agent: " + http_version + "\r\n"
 57            "Upgrade: WebSocket\r\n"
 58            "Connection: Upgrade\r\n"
---            "Sec-WebSocket-Key: " + this->key + "\r\n"
+++            "Sec-WebSocket-Key: iMfprdrquvBMoMdc562JrA==\r\n"  // Still wrong! The key should be regenerated every connection
 60            "Sec-WebSocket-Version: 13\r\n\r\n"
 61    );
...

seems to work, although I didn't test it in every region.

@Jaskowicz1
Copy link
Contributor

Jaskowicz1 commented Sep 28, 2023

The cause of the error was so simple that it wasn't worth this long discussion. Apparently the Sec-WebSocket-Key HTTP header field generation is done completely wrong in the library. According to WebSocket protocol documentation, Sec-WebSocket-Key should be a random base64 string. In the library it's just

...    /* wsclient.cpp */
 40    : ssl_client(hostname, port),
 41    key(std::to_string(time(nullptr))),
 42    state(HTTP_HEADERS),
...

Almost all Discord servers ignore this field, but the problematic ones notice the problem and respond with 400 Bad Request. Changing the code:

...    /* wsclient.cpp */
 52    this->write(
 53            "GET " + this->path + " HTTP/1.1\r\n"
 54            "Host: " + this->hostname + "\r\n"
 55            "pragma: no-cache\r\n"
 56            "User-Agent: " + http_version + "\r\n"
 57            "Upgrade: WebSocket\r\n"
 58            "Connection: Upgrade\r\n"
---            "Sec-WebSocket-Key: " + this->key + "\r\n"
+++            "Sec-WebSocket-Key: iMfprdrquvBMoMdc562JrA==\r\n"  // Still wrong! The key should be regenerated every connection
 60            "Sec-WebSocket-Version: 13\r\n\r\n"
 61    );
...

seems to work, although I didn't test it in every region.

Bit of a harsh reply, the long discussion was required. It's not an easy issue to fix and I needed to narrow down the issues. I tested adding the WebSocket key when I was debugging the issue and it didn't work, it's possible it's generated wrong but upon looking at other libraries (like I mentioned I did in my previous comment), they don't use this to communicate via audio either.

If you want to be certain, run the same test you ran before in all regions and see if it's fixed. A PR is more than welcome if it's fixed!

@braindigitalis
Copy link
Contributor

if this was the concrete problem, no websockets would ever connect.

all dpp websockets use this key as it is not checked by dpp. I have a feeling that implementing this fix will not actually solve the problem.

@KontraCity
Copy link
Author

If you want to be certain, run the same test you ran before in all regions and see if it's fixed.

Using the same test bot, test code for library and temporary fix, although I had to rename one variable to avoid name conflict with new const std::string data(server_to_client_buffer, r);:

--- std::string data(server_to_client_buffer, r);
+++ std::string data2(server_to_client_buffer, r);

Test result:
rpi4, aarch64, g++ (Debian 10.2.1-6) 10.2.1 20210110

Region          | Server            | Response
-------------------------------------------------------------
Brazil          | brazil9229        | 101 Switching Protocols
Hong Kong       | hongkong4727      | 101 Switching Protocols
India           | india9298         | 101 Switching Protocols
Japan           | japan11018        | 101 Switching Protocols
Rotterdam       | rotterdam8278     | 101 Switching Protocols
Russia          | russia3879        | 101 Switching Protocols
Singapore       | singapore11076    | 101 Switching Protocols
South Africa    | southafrica6152   | 101 Switching Protocols
Syndey          | sydney802         | 101 Switching Protocols
US Central      | us-central6496    | 101 Switching Protocols
US East         | us-east2522       | 101 Switching Protocols
US South        | us-south5290      | 101 Switching Protocols
US West         | us-west8199       | 101 Switching Protocols

I tested adding the WebSocket key when I was debugging the issue and it didn't work

Make sure it doesn't look like this: (this was my initial error)

"Sec-WebSocket-Key: iMfprdrquvBMoMdc562JrA==/*" + this->key + "*/\r\n"

because it compiles to this:

GET /?v=4 HTTP/1.1
Host: hongkong11009.discord.media
pragma: no-cache
User-Agent: DiscordBot (https://github.com/brainboxdotcc/DPP, 10.0.26)
Upgrade: WebSocket
Connection: Upgrade
Sec-WebSocket-Key: iMfprdrquvBMoMdc562JrA==/*1695922269*/
Sec-WebSocket-Version: 13

You can always double check:

...
void websocket_client::write(const std::string &data)
{
        std::cout << data << '\n';

        if (state == HTTP_HEADERS) {
                /* Simple write */
...

@Neko-Life
Copy link
Member

seems like a legit fix, will test that on Musicat sometime later

@braindigitalis
Copy link
Contributor

thanks for the testing and diagnosis; it looks to me that perhaps it is only failing when the random number we used as a key caused an error when passed through a base64 decoder. this would mean it would be time based. I don't think it needs any further thought though.

did you use a constant for the websocket key? as shown in the example? or is it a base64 version of the time? I think either would work.

@KontraCity
Copy link
Author

KontraCity commented Sep 28, 2023

I used the constant from my first reply. I don't think it really matters, as it is not very important and afaik DPP doesn't check server response to the key, but it's a dirty fix, should be done properly

@braindigitalis braindigitalis linked a pull request Sep 28, 2023 that will close this issue
5 tasks
@Jaskowicz1
Copy link
Contributor

I used the constant from my first reply. I don't think it really matters, as it is not very important and afaik DPP doesn't check server response to the key, but it's a dirty fix, should be done properly

👏 👏 👏 , much love, well done!

@braindigitalis
Copy link
Contributor

braindigitalis commented Sep 28, 2023

this can now be closed but on discussion we dont know if this should be reported to cloudflare.

There is an issue here of RFC non compliance which lies on cloudflare's doorstep. If you read chapter 6 carefully which I did, including the example below it, cloudflare or any RFC compliant websocket server endpoint should not attempt to base64 decode the websocket-sec-key.

cloudflare should take the websocket-src-key as is, like raw text,.prepend it to a constant GUID, sha1 sum the resulting string and pass that resulting string back to us in the 101 switching protocols reply.

what it seems to do instead is attempt to base64 decode the string, then convert the output to utf8. if the conversion fails and contains an invalid encoding then you get an error 400.

what's more only certain endpoints implement this jank. it seems to me to be some kind of check for bad web actors that is broken.

the only mistake we made was assuming the other end will follow the word of the RFC to the letter which clearly gives an explicit example of "leave this text alone and just put it here". people never learn.

anyhow many thanks for your efforts and patience with this issue. if you're ever on the discord server ping me and I'll be happy to give you a contributor role.

@Neko-Life
Copy link
Member

Unfortunately commit 87ec1a1eb5cc03a304d0a211dac51b7972823fd1 didn't fix Musicat failing to connect to US West and Japan region, Rotterdam was also a region which Musicat can't connect to a few week ago but it successfully connect now, US West was working until lately.

Did you also test with the old non-base64 header along with your fix test 12 hours ago @KontraCity ?

It seemingly just happen randomly, might be caused by heavy traffic or simply network issue and the header might not play a role in this since we don't know if the server actually check and validate the header, or the fix isn't really correct, who knows.

Here's some log of Musicat trying to connect to voice channel with US West region

track position: 'Oh Oh Oh Sexy Vampire ft Fright Ranger (S3RL vs Justin B remix) - Disko Warp' 0
[INFO] New thread spawned: 413484736
[INFO] Total active thread: 6
[2023-09-29 10:11:54] DEBUG: Sending op 4 to join VC, guild 1133586655811481770 channel  1133586656478371883
[THUMB HQ_MAX] Inserted thumb: 'http://i3.ytimg.com/vi/Q-2T0K3u_3E/maxresdefault.jpg'
[THUMB HQ_MAX] Inserted thumb: 'http://i3.ytimg.com/vi/Q-2T0K3u_3E/hqdefault.jpg'
[Manager::clear_connecting]: 1133586655811481770
[INFO] Thread done: 413484736
[INFO] Total active thread: 6
[2023-09-29 10:11:55] DEBUG: Connecting voice for guild 1133586655811481770 channel 1133586656478371883
[2023-09-29 10:11:56] WARN: Received unhandled code: HTTP/1.1 400 Bad request
[2023-09-29 10:11:56] DEBUG: Attempting to reconnect the websocket...
[INFO] Total thread done: 1
[INFO] Total thread done and joined: 1
[INFO] Total active thread: 5
[2023-09-29 10:11:57] WARN: Received unhandled code: HTTP/1.1 400 Bad request
[2023-09-29 10:11:57] DEBUG: Attempting to reconnect the websocket...
[2023-09-29 10:11:57] WARN: Received unhandled code: HTTP/1.1 400 Bad request
[2023-09-29 10:11:57] DEBUG: Attempting to reconnect the websocket...
[2023-09-29 10:11:58] WARN: Received unhandled code: HTTP/1.1 400 Bad request
[2023-09-29 10:11:58] DEBUG: Attempting to reconnect the websocket...
[2023-09-29 10:11:59] WARN: Received unhandled code: HTTP/1.1 400 Bad request
[2023-09-29 10:11:59] WARN: Reached max loops whilst attempting to read from the websocket. Aborting websocket.
[Manager::clear_wait_vc_ready]: 1133586655811481770
[Manager::clear_connecting]: 1133586655811481770
[Manager::set_vc_ready_timeout WARN] Connection timeout
[2023-09-29 10:12:04] DEBUG: Disconnecting voice, guild: 1133586655811481770
[INFO] Thread done: 367343296
[INFO] Total active thread: 5
[EVENT] on_voice_state_leave: 1133586655811481770
[INFO] Total thread done: 1
[INFO] Total thread done and joined: 1
[INFO] Total active thread: 4

@Neko-Life
Copy link
Member

I just tried Japan region again and now it connects, US West still failing

@KontraCity
Copy link
Author

Unfortunately commit 87ec1a1eb5cc03a304d0a211dac51b7972823fd1 didn't fix Musicat failing to connect to US West and Japan region, Rotterdam was also a region which Musicat can't connect to a few week ago but it successfully connect now, US West was working until lately.

Because the Sec-WebSocket-Key header field generation is still incorrect.
According to RFC 6455 - 4.1. Client Requirements,

The request MUST include a header field with the name
|Sec-WebSocket-Key|.  The value of this header field MUST be a
nonce consisting of a randomly selected 16-byte value that has
been base64-encoded (see [Section 4 of [RFC4648]](https://datatracker.ietf.org/doc/html/rfc4648#section-4)).  The nonce
MUST be selected randomly for each connection.

In DPP code time(nullptr) gives

std::cout << key.length() << '\n';
10

bytes.
I would consider using std::mt19937 to generate 16 random bytes instead of using UNIX epoch.

@braindigitalis
Copy link
Contributor

so what was your test value iMfprdrquvBMoMdc562JrA==? it cannot decode to a string and gives invalid encoding on base64decode.org

Screenshot_20230929-103005

@KontraCity
Copy link
Author

I don’t know how to make a PR, but this would my C++11 solution:

#include <random>

inline uint8_t random_byte() {
    static std::random_device random_device;
    static std::mt19937 generator(random_device());
    static std::uniform_int_distribution<int> distribution(0, 255);

    return distribution(generator);
}
…
key.resize(16);
for (int byte_index = 0; byte_index < 16; ++byte_index) {
    key[byte_index] = random_byte();
}
…

@KontraCity
Copy link
Author

so what was your test value iMfprdrquvBMoMdc562JrA==? it cannot decode to a string and gives invalid encoding on base64decode.org

I took it from discord.js HTTP request with tshark.
Anyway that would be a non-ASCII friendly string.

@KontraCity
Copy link
Author

I tested, this:

//key = std::to_string(time(nullptr));
key = "1234567890123456"; // 16 bytes
key = base64_encode(reinterpret_cast<const unsigned char*>(key.c_str()), key.length());

works, but dirty because constant.

@KontraCity
Copy link
Author

Can confirm this finally works.
rpi4, aarch64, g++ (Debian 10.2.1-6) 10.2.1 20210110

Region          | Server            | Response
-------------------------------------------------------------
Brazil          | brazil11042       | 101 Switching Protocols
Hong Kong       | hongkong6888      | 101 Switching Protocols
India           | india4938         | 101 Switching Protocols
Japan           | japan11008        | 101 Switching Protocols
Rotterdam       | rotterdam2708     | 101 Switching Protocols
Russia          | russia4606        | 101 Switching Protocols
Singapore       | singapore11024    | 101 Switching Protocols
South Africa    | southafrica9888   | 101 Switching Protocols
Syndey          | sydney8936        | 101 Switching Protocols
US Central      | us-central4565    | 101 Switching Protocols
US East         | us-east7761       | 101 Switching Protocols
US South        | us-south5750      | 101 Switching Protocols
US West         | us-west827        | 101 Switching Protocols

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
audio audio issues or feature requests bug Something isn't working
Projects
None yet
7 participants