
feat: chunk stream upload endpoint #2230

Merged
merged 16 commits into from
Aug 3, 2021

Conversation

aloknerurkar
Contributor

@aloknerurkar aloknerurkar commented Jun 28, 2021

Currently, the /chunks API only supports single-chunk uploads.

Applications using the /chunks endpoint need to loop while sending multiple chunks. This can cause the server to hit issues due to too many connections (or, on the client side, we might see an EOF while re-using connections).

As we already use the gorilla/websocket library, I think it would be better to have a client-side streaming endpoint to upload multiple chunks. This would be easier on the server side and would re-use all the resources initialised for the first chunk upload.



@aloknerurkar aloknerurkar marked this pull request as draft June 28, 2021 09:34
@zelig
Member

zelig commented Jun 28, 2021

Why not support streaming for files first?
In other words, why not use a similar approach for the file endpoint?

@aloknerurkar
Contributor Author

Why not support streaming for files first?
In other words, why not use a similar approach for the file endpoint?

I actually implemented this as part of something else I was working on. I was using the /chunks endpoint, as I am using all our other packages in the code. (https://github.com/aloknerurkar/bee-fs)

If we are okay with the approach I can implement this for files as well.

@acud
Member

acud commented Jul 8, 2021

Why not support streaming for files first?
In other words, why not use a similar approach for the file endpoint?

@zelig streaming of files can be done today on the /bzz endpoint using multipart, AFAIK. This is indeed not the most up-to-date method of streaming; that would be using the Streams API in conjunction with the Fetch API.
@alok I would consider altering the implementation so that it is compatible with the Streams and Fetch APIs, since it is currently implemented in a non-standardized manner.
It would also be good to have the people from the JS team comment on which way to go, since it would be mainly leveraged and implemented by them as a primary stakeholder, at least at the beginning. cc @agazso @AuHau @nugaon

@AuHau
Contributor

AuHau commented Jul 9, 2021

Hey,

yeah, having the possibility to stream chunk uploads seems like a crucial thing the further we go in the direction of "lean Bee", so I definitely welcome this idea!

That would be using the streams API in conjunction with the Fetch API.

@acud I believe what you are mentioning are JS-native APIs that are used for streaming in JS but do not directly affect the transport protocol. Streams provide an abstraction for handling streaming data, and fetch is a tool for making HTTP requests. So here we still have to choose the transport protocol and its usage.

I see two ways here.

As @acud mentioned, there is already support for /bzz using multipart. We could utilize the same principle here. But this has disadvantages for us in browsers, because browsers do not support streaming requests.

From this point of view, @aloknerurkar's approach is actually kind of innovative, and even though it is non-standard it would bypass this browser limitation, which would allow us in the future to, for example, upload big files from the browser using only chunks (e.g. chunking, BMT construction and/or manifests all on the client side). So I am actually in favor of this approach!

@nugaon
Member

nugaon commented Jul 9, 2021

As @acud mentioned, there is already support for /bzz using multipart. We could utilize the same principle here. But this has disadvantages for us in browsers, because browsers do not support streaming requests.

I don't completely understand this statement, or it is just not true. multipart/form-data does have browser support, even for HTML forms.
Also, on the browser side with JS you can stream, but AFAIK the data will be copied into RAM before uploading via any stream solution (see FileReader usages). [Though file stream reading should somehow be possible, as YouTube also solved it somehow ;) if you know how, pls let me know.]

Therefore

From this point of view, @aloknerurkar's approach is actually kind of innovative, and even though it is non-standard it would bypass this browser limitation, which would allow us in the future to, for example, upload big files from the browser using only chunks (e.g. chunking, BMT construction and/or manifests all on the client side). So I am actually in favor of this approach!

is not connected to the transport protocol and its handshakes/headers question, like in the case of the stream and fetch APIs that you pointed out correctly.
I mean that with multipart you can also upload as many chunks as you want in one upload process, by putting the chunks as files into the FileList and sending it to the Bee client, but it still reads the whole file(s) into memory first, so I don't see how it would solve the "browser issue". Can you clarify this, @AuHau?

Nevertheless, the websocket implementation seems like a good approach too, because it may save some unnecessary header data between messages; but interpreting the response messages about successful chunk uploads may be cumbersome on the client side. Because of the latter, I don't see why it would be better than traditional multipart uploading with the extension I just described.

@nugaon
Member

nugaon commented Jul 9, 2021

[Though file stream reading should somehow be possible, as YouTube also solved it somehow ;) if you know how, pls let me know.]

Maybe the File's stream or slice method? https://developer.mozilla.org/en-US/docs/Web/API/File#instance_methods

@aloknerurkar
Contributor Author

aloknerurkar commented Jul 9, 2021

From this point of view, @aloknerurkar's approach is actually kind of innovative, and even though it is non-standard it would bypass this browser limitation, which would allow us in the future to, for example, upload big files from the browser using only chunks (e.g. chunking, BMT construction and/or manifests all on the client side). So I am actually in favor of this approach!

This resonates with why I implemented this in the first place. I am using bee manifests (in golang) outside of bee and hence when I interact with bee, I can already format the data I want to upload into a series of chunks.

Because of the latter I don't see why would it be better than a traditional multipart uploading with the extension that I just described.

Putting chunks into a FileList seems hacky and deviates from the standard.

but the interpretation of the response messages about the successful chunk uploads may be cumbersome on client side

The example JS (websockets) code I have seen has callbacks for messages coming from upstream, so I don't understand why it would be complicated. The approach I have taken is a request-response scheme, similar to how we do it in libp2p protocols: each response is read before we send the next request.

@AuHau
Contributor

AuHau commented Jul 15, 2021

Also, on the browser side with JS you can stream, but AFAIK the data will be copied into RAM before uploading via any stream solution (see FileReader usages). [Though file stream reading should somehow be possible, as YouTube also solved it somehow ;) if you know how, pls let me know.]

@nugaon yes, you are right, browsers can handle Streams, but not completely. As you have guessed, the problem is that it will load all the data into memory, which is the main problem here.

[Though file stream reading should somehow be possible, as YouTube also solved it somehow ;) if you know how, pls let me know.]

Downloading a stream is not the problem here; that is already supported, only the upload is problematic. Or do you mean that they can upload big files? My understanding is that the limitation I am talking about here only applies to XHR/AJAX/fetch requests. I think that when submitting a normal form with large files, the browser handles it internally differently and actually streams the file.

To my knowledge, only Chrome recently shipped this feature, but only as a closed experiment (you have to register somewhere for a given URL, and only then does Chrome actually enable it for your site). See for example here, here and the open issue here.

Therefore ... is not connected to the transport protocol and its handshakes/headers question, like in the case of the stream and fetch APIs that you pointed out correctly.
I mean that with multipart you can also upload as many chunks as you want in one upload process, by putting the chunks as files into the FileList and sending it to the Bee client, but it still reads the whole file(s) into memory first, so I don't see how it would solve the "browser issue". Can you clarify this, @AuHau?

Yes, with multipart you could upload multiple chunks, but you would have to load them into memory, which is the problem I am trying to raise here :-)

With websockets you can assemble chunks on the fly and send them (= stream them), as you have a persistent connection with the server. You can do a similar thing now, only by making a new request for each chunk, which is what @aloknerurkar is trying to fix. With multipart you would have to load all the data into memory before sending it.


I guess the question is what the end-game is here. As I see it, from the direction of "lean Bee", a lot of operations on data will in the future happen on the client side, and it is then crucial for the client side to be able to work with data efficiently, which requires, for example, not loading all the data into memory. To my knowledge, from the JS API point of view there is already support for everything we need (like streams, and a way to stream a file from the FS) except streaming the data to the Bee node.

I know that the focus of Swarm/Bee right now is not long-term storage of big data, and that we are not promoting uploading big files, but I would argue that if/once we make this shift the tools should be ready for it. Or in other words, the tools/libraries should not be the blocker for a task that is already possible but discouraged.

@aloknerurkar aloknerurkar marked this pull request as ready for review July 21, 2021 14:31
@aloknerurkar aloknerurkar requested a review from acud July 21, 2021 14:32
Member

@acud acud left a comment


Looks good, I have a few minor comments

Reviewed 7 of 7 files at r2.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @aloknerurkar)


pkg/api/chunk_stream.go, line 1 at r2 (raw file):

package api

copyright missing


pkg/api/chunk_stream.go, line 164 at r2 (raw file):

		if tag != nil {
			// indicate that the chunk is stored
			err = tag.Inc(tags.StateStored)

if the chunk has been seen before, no need to increment this again


pkg/api/chunk_stream.go, line 166 at r2 (raw file):

			err = tag.Inc(tags.StateStored)
			if err != nil {
				s.logger.Debugf("chunk upload: increment tag", err)

i would also add an Error here, as we do in the other endpoints. This applies also to the other error checks in this file


pkg/api/chunk_stream_test.go, line 1 at r2 (raw file):

package api_test

code copyright missing

Contributor Author

@aloknerurkar aloknerurkar left a comment


Dismissed @AuHau and @acud from 6 discussions.
Reviewable status: 5 of 8 files reviewed, all discussions resolved (waiting on @acud)


pkg/api/chunk_stream.go, line 1 at r2 (raw file):

Previously, acud (acud) wrote…

copyright missing

Done.


pkg/api/chunk_stream.go, line 164 at r2 (raw file):

Previously, acud (acud) wrote…

if the chunk has been seen before, no need to increment this again

Done.


pkg/api/chunk_stream.go, line 166 at r2 (raw file):

Previously, acud (acud) wrote…

i would also add an Error here, as we do in the other endpoints. This applies also to the other error checks in this file

Done.


pkg/api/chunk_stream_test.go, line 1 at r2 (raw file):

Previously, acud (acud) wrote…

code copyright missing

Done.


pkg/api/router.go, line 66 at r1 (raw file):

Previously, AuHau (Adam Uhlíř) wrote…

I would maybe suggest /chunks/stream URL?

Done.


pkg/api/router.go, line 67 at r1 (raw file):

Previously, AuHau (Adam Uhlíř) wrote…

Why is it supposed to be disabled on the gateway?

Done.

Member

@acud acud left a comment


Reviewed 2 of 3 files at r3.
Reviewable status: 7 of 8 files reviewed, 2 unresolved discussions (waiting on @acud and @aloknerurkar)


pkg/api/chunk_stream.go, line 185 at r3 (raw file):

			if err := s.pinning.CreatePin(ctx, chunk.Address(), false); err != nil {
				s.logger.Debugf("chunk stream handler: creation of pin for %q failed: %v", chunk.Address(), err)
				s.logger.Error("chunk stream handler: creation of pin failed")

for consistency consideration if this happens and if the pinning on localstore really did happen correctly, you would need to now unpin the chunk again from the localstore


pkg/api/chunk_stream.go, line 191 at r3 (raw file):

		}

		err = sendMsg(websocket.TextMessage, successWsMsg)

nit, but since the API is handling binary data, a string success message might not be needed. you could utilize some response code which is purely binary too, like 0x0 for clean exit (like exit codes), or 0x1 for error, etc

Contributor Author

@aloknerurkar aloknerurkar left a comment


Reviewable status: 7 of 8 files reviewed, 2 unresolved discussions (waiting on @acud)


pkg/api/chunk_stream.go, line 185 at r3 (raw file):

Previously, acud (acud) wrote…

for consistency consideration if this happens and if the pinning on localstore really did happen correctly, you would need to now unpin the chunk again from the localstore

In this case, pinning has failed, right? Is there a way we could get an error from CreatePin even if pinning succeeded? Or you meant if we fail to send the response?


pkg/api/chunk_stream.go, line 191 at r3 (raw file):

Previously, acud (acud) wrote…

nit, but since the API is handling binary data, a string success message might not be needed. you could utilize some response code which is purely binary too, like 0x0 for clean exit (like exit codes), or 0x1 for error, etc

For the error, we use sendErrorClose which sends a code and a message. So we only need this message for confirmation of chunk upload. I have used a 0 byte now instead of the string message.

Contributor Author

@aloknerurkar aloknerurkar left a comment


Dismissed @acud from a discussion.
Reviewable status: 4 of 8 files reviewed, all discussions resolved (waiting on @acud)


pkg/api/chunk_stream.go, line 185 at r3 (raw file):

Previously, aloknerurkar wrote…

In this case, pinning has failed, right? Is there a way we could get an error from CreatePin even if pinning succeeded? Or you meant if we fail to send the response?

Done.

@nugaon
Member

nugaon commented Aug 2, 2021

The approach I have taken is a Request-Response scheme. So similar to how we do in libp2p protocols. Each response is read before we send the next request.

I think a global WS endpoint with RPC-like messages would be more effective for all ws services, as we will need this streaming on the download side as well later.
That way, if we get an auth layer on HTTP requests (which we should have sooner rather than later), this authentication could happen once, at the initialization of the websocket connection (like new BeeWs('ws://localhost:1636/ws/api-key')). If every service gets its own ws endpoint later, things may get complex (understanding each endpoint's message syntax one by one, and also handling authorization in every handler on the bee side) and not unified.

Also, I don't see why requests and responses need to be blocked in a queue, instead of simply uploading chunks independently of each other and handling the responses by their IDs.

@aloknerurkar aloknerurkar requested a review from acud August 2, 2021 16:42
Member

@acud acud left a comment


Reviewed 3 of 3 files at r4, 3 of 3 files at r6.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @aloknerurkar)


pkg/api/chunk.go, line 156 at r6 (raw file):

			err = s.storer.Set(ctx, storage.ModeSetUnpin, chunk.Address())
			if err != nil {
				s.logger.Debugf("chunk upload: deletion of pin for %q failed: %v", chunk.Address(), err)

nit, not sure if %q will catch the Address to String method

Contributor

@mrekucci mrekucci left a comment


Reviewed 4 of 7 files at r2, 2 of 3 files at r4, 1 of 3 files at r6, 2 of 2 files at r7.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @aloknerurkar)


pkg/api/chunk.go, line 60 at r7 (raw file):

	putter, err = newStamperPutter(s.storer, s.post, s.signer, batch)
	if err != nil {
		s.logger.Debugf("chunk upload: putter:%v", err)

Missing space after : in the debug message.


pkg/api/chunk.go, line 68 at r7 (raw file):

			return nil, nil, nil, errors.New("batch not usable")
		}
		return nil, nil, nil, errors.New("")

The returned error errors.New("") doesn't tell much. Don't you wanna return err instead?


pkg/api/chunk_stream.go, line 106 at r7 (raw file):

	for {
		select {

You should also check for ctx.Done().


pkg/api/pss.go, line 26 at r7 (raw file):

)

var (

You can change the variables into constants.

@aloknerurkar
Contributor Author


pkg/api/chunk_stream.go, line 106 at r7 (raw file):

Previously, mrekucci wrote…

You should also check for ctx.Done().

This is not required, as the gorilla websocket library ignores the HTTP request context. The gone channel ensures we stop if the client goes away, and the quit channel ensures we stop if the application is stopped. Apart from that, we don't need any more synchronization.

The request context can be canceled prematurely. Ideally we should use a new context here, but none of the function calls which need the context actually check ctx.Done() in this code path.

Contributor Author

@aloknerurkar aloknerurkar left a comment


Dismissed @mrekucci from a discussion.
Reviewable status: 7 of 9 files reviewed, all discussions resolved (waiting on @acud and @mrekucci)


pkg/api/chunk.go, line 68 at r7 (raw file):

Previously, mrekucci wrote…

The returned error errors.New("") doesn't tell much. Don't you wanna return err instead?

Done.

@aloknerurkar aloknerurkar requested a review from mrekucci August 3, 2021 08:00
Contributor

@mrekucci mrekucci left a comment


Reviewed 2 of 2 files at r8.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @aloknerurkar)

@aloknerurkar aloknerurkar merged commit 0b244e9 into master Aug 3, 2021
@aloknerurkar aloknerurkar deleted the traversal.1 branch August 3, 2021 10:29
6 participants