Handle reconnection for Sessions #609
Replies: 16 comments 26 replies
-
So here's my brain dump about the whole reconnection thing. Currently the session and its connection are created as one unit. The alternative is to move to a model where a session is constructed first and connected separately:

```rust
impl Session {
    fn new(config: SessionConfig, cache: Option<Cache>, handle: Handle) -> Session { ... }
    fn connect(&self, credentials: Credentials) -> impl Future<Credentials, ConnectionError> { ... }
}
```

Trying to use an unconnected session (to get metadata, audio key/data, etc.) should fail gracefully. All the APIs already use Future/Stream, so these just need to resolve into errors. The same happens if the connection is lost while a request is in flight. Player and Spirc need to behave correctly when those requests fail. For Player it just means returning to the NotPlaying state and signalling the caller that playback has completed with an error. For Spirc, when Player signals a connection problem it should probably stop playback. Reading from or sending to the mercury channel may start failing as well.

Now to get back up, I suggest the Session signals the reconnection in some way to each of its components. This can mean various things depending on the component. The components need to expose an API that looks like:

```rust
impl Component {
    fn connection_established(&self, username: String) { ... }
    fn dispatch(&self, cmd: u8, mut data: Bytes) { ... }
    fn connection_closed(&self) { ... }
}
```

There are some risks of race conditions that need to be carefully thought about. I think I have some initial implementation of this somewhere, I'll try and find it again.
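A hypothetical sketch of a component implementing that three-callback shape; the trait name, the `Subscriber` struct, and the use of `Vec<u8>` in place of `Bytes` are all illustrative assumptions, not librespot code:

```rust
/// Hypothetical per-component lifecycle API, mirroring the shape above.
trait SessionComponent {
    fn connection_established(&mut self, username: String);
    fn dispatch(&mut self, cmd: u8, data: Vec<u8>);
    fn connection_closed(&mut self);
}

/// Toy component that remembers its subscriptions so they can be
/// re-established every time a new connection comes up.
#[derive(Default)]
struct Subscriber {
    online: bool,
    subscriptions: Vec<String>,     // URIs to (re-)subscribe on connect
    received: Vec<(u8, Vec<u8>)>,   // packets dispatched to us
}

impl SessionComponent for Subscriber {
    fn connection_established(&mut self, _username: String) {
        self.online = true;
        for _uri in &self.subscriptions {
            // Real code would re-send a SUBSCRIBE request per URI here.
        }
    }

    fn dispatch(&mut self, cmd: u8, data: Vec<u8>) {
        self.received.push((cmd, data));
    }

    fn connection_closed(&mut self) {
        // In-flight requests should now resolve to errors; state is kept
        // so the next connection_established can restore subscriptions.
        self.online = false;
    }
}
```

The key design point is that the component owns enough state (here, the subscription list) to rebuild itself on reconnect without outside help.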
-
@plietar did you find that initial implementation anywhere?
-
Regardless, I think we should be aiming to make this as simple to integrate as possible. Handling reconnection logic from whatever project uses librespot sounds like plenty of headache for anyone who tries to use it, while also being pretty low-level functionality that I personally would expect to be handled in any library I used. Anyway, these are just my two cents, since I'm not sufficiently versed in Rust to be able to rewrite how sessions are handled. It would be good to hear others' thoughts on how this should be handled.
-
I am trying to wrap my head around this issue. AFAICS, the disconnect error starts inside the session.

So this works for the binary case. For the library case, @plietar suggests letting main reconnect using the last credentials and letting Spirc, Player, etc. pick up the new session and reset their state accordingly. I have two problems trying to implement this.

If you look at examples/play.rs, you can do a match on the returned error.
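As a sketch of that binary-case flow, the idea is an outer loop that re-runs the session with the last known credentials whenever it ends in an error. Every name and type below is a placeholder, not librespot's actual API:

```rust
/// Generic retry loop: run the session until it shuts down cleanly,
/// reconnecting with the credentials the failed session handed back.
/// `C` stands in for reusable credentials; `F` for "run one session".
fn run_with_reconnect<C, F>(
    mut credentials: C,
    mut run_session: F,
    max_attempts: u32,
) -> Result<(), String>
where
    C: Clone,
    F: FnMut(C) -> Result<(), (C, String)>, // Err carries credentials + error
{
    let mut attempts = 0;
    loop {
        match run_session(credentials.clone()) {
            Ok(()) => return Ok(()), // clean shutdown
            Err((reusable, e)) => {
                attempts += 1;
                if attempts >= max_attempts {
                    return Err(e); // give up after too many failures
                }
                // Reconnect with the credentials the last session returned.
                credentials = reusable;
            }
        }
    }
}
```

In a real binary this loop would live in main, with Spirc and Player being handed the fresh session each iteration.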
-
Is there any way to mitigate this problem?
-
This doesn't address the cause but helps with the symptoms: a script, run as a cronjob every minute, which checks the status and restarts the service if an error is spotted.
-
Is the cronjob patch the best solution for this issue currently? Sometimes I'm 25 songs in, sometimes 4, when I lose connection. This is running on a server connected directly to my modem/router via ethernet. I can't imagine it's actually losing internet connection this consistently.
-
Sounds like this is closely related to, and would probably help solve, #266.
-
Let's solve this as part of the new API. Mercury will not be central to the session anymore; there will be the HTTP API and the websocket. I implemented the websocket already (#753) and considered reconnection from the start.

The websocket works almost independently from the rest of the session; we only need to request a token from Mercury to establish a connection. I solved this as follows: one must pass an "async" closure that provides the token. The closure holds a weak reference to the session. What happens if Mercury is down as well, or the session object is gone? The closure shouldn't return anything, but have its own little reconnection logic and periodically try again.

This wasn't so hard, since the websocket only receives things and isn't used to send requests, so no one else has to care if the session is down. Nevertheless, I think a similar approach could work for Mercury. First of all, we should split the Mercury connection into a separate struct. Then we create a background task to handle reconnection, like in the websocket.

If a request is made, it should be possible to pass an optional timeout, and it should fail once that timeout expires. Audio channels just won't yield anything, and maybe it's possible to make it look exactly as if the connection is lagging because of a slow internet connection; in that case, no big changes would be required for the player. For every other case it has to be investigated how to handle a lost connection.

Your feedback?
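A minimal sketch of the weak-reference idea, assuming a stand-in `Session` type and a plain `String` token; the real code would go through Mercury asynchronously and retry on failure:

```rust
use std::sync::{Arc, Weak};

/// Stand-in for the real librespot Session.
struct Session {
    name: String,
}

/// Build a token-provider closure that holds only a Weak reference,
/// so the websocket's background reconnect task never keeps a dead
/// session alive.
fn make_token_provider(session: &Arc<Session>) -> impl Fn() -> Option<String> {
    let weak: Weak<Session> = Arc::downgrade(session);
    move || {
        // If the session was dropped, upgrade() fails and we stop
        // trying to reconnect instead of leaking the session.
        let session = weak.upgrade()?;
        // Real code would request a token via Mercury here.
        Some(format!("token-for-{}", session.name))
    }
}
```

Once the last `Arc<Session>` is dropped, the closure starts returning `None`, which is exactly the signal the reconnect loop needs to shut itself down.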
-
@Johannesd3 having started work on this now, I realize the same holds not only for Mercury but also for the other session components. What do you think would be the best approach to copy this behavior to Mercury and the rest?
-
I want to add a use case here which makes these silent disconnects very frequent: laptops with docks. When I undock my laptop, it will keep playing until the current song ends, then go silent. It seems like it dies at the point where it would have downloaded something, maybe? Where the buffer ends, basically. The client should be able to survive a network card going missing (while another, like wifi, is still there), imo. If it does that, we'd be golden on all possible fronts.
-
Since this feature request is still open, I'm adding another use case. I use Power over Ethernet for my Raspberry Pi. Sometimes the connection drops for a second and raspotify crashes. As the track data is cached for a few seconds, it could keep playing whatever is cached and attempt a reconnect in the background using an exponential backoff. Here is the error log I'm seeing when the connection drops:
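A capped exponential backoff of the kind suggested here could be sketched like this; the 1-second base and the cap are illustrative constants, not anything librespot defines:

```rust
use std::time::Duration;

/// Exponential backoff with a cap: 1s, 2s, 4s, 8s, ... up to `max`.
fn backoff_delay(attempt: u32, max: Duration) -> Duration {
    let base = Duration::from_secs(1);
    // Saturate the shift so huge attempt counts can't overflow.
    let factor = 1u64 << attempt.min(16);
    (base * factor as u32).min(max)
}
```

The reconnect task would sleep for `backoff_delay(n, ...)` after the n-th consecutive failure and reset `n` to zero once a connection succeeds.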
-
Until more gets done I wrote a better monitoring script to restart librespot, if anyone wants it. I need it due to using Starlink, which loses connection for a moment at least once every hour. It will actively follow librespot's journal and restart it if `core::session` throws an error (ignoring other common errors that don't break it, like 502s from `::spirc`).

`/root/bin/monitor-librespot.sh`:

```bash
#!/bin/bash
last_restart=$(</proc/uptime)
last_restart=${last_restart%%.*}
echo "Starting librespot monitoring at uptime: $last_restart"
journalctl -b -f -o cat _SYSTEMD_UNIT=librespot.service --since now | stdbuf -o0 awk '$2 == "ERROR" && $3 == "librespot_core::session]" { print }' | while read -r line; do
    now=$(</proc/uptime)
    now=${now%%.*}
    echo "$line"
    if [ "$now" -gt $((last_restart + 5)) ]; then
        last_restart="$now"
        echo "Error found! Restarting librespot service at uptime: $now"
        systemctl restart librespot.service
    fi
done
```

`/etc/systemd/system/monitor-librespot.service`:
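The unit file contents did not survive in this transcript; a minimal hypothetical unit for the script above might look like this (everything below is an assumption, not the poster's original file):

```ini
[Unit]
Description=Restart librespot when its session errors
After=librespot.service

[Service]
ExecStart=/root/bin/monitor-librespot.sh
Restart=always

[Install]
WantedBy=multi-user.target
```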
Once created, run
-
Is there any work being done to add some session recovery to 4d402e6?
-
Hi,
-
Reconnection logic has been greatly improved; closing for now, but feel free to reopen if you still have specific use cases.
-
This issue serves as a placeholder for discussion around the rewrite of the session handling logic, as it is currently one of the less stable parts of librespot. Issue #103 is related to this.