
Send small blobs inline. #8318

Draft · wants to merge 2 commits into master
Conversation

hvlad (Member) commented Nov 14, 2024

The feature allows sending small blob contents in the same data stream as the main result set.
This lowers the number of round trips required to get blob data and significantly improves performance on high-latency networks.

The blob metadata and data are sent using a new packet type, op_inline_blob, and a new structure, P_INLINE_BLOB.
The op_inline_blob packet is sent before the corresponding op_sql_response (in the case of an answer to op_execute2 or op_exec_immediate2) or op_fetch_response (answer to op_fetch).
There can be as many op_inline_blob packets as there are blob fields in the output format.
NULL blobs and too-big blobs are not sent.
A blob is sent as a whole, i.e. the current implementation does not support sending part of a blob. The reasons are an attempt not to over-complicate the code and the fact that seek is not implemented for segmented blobs.
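For readers less familiar with the wire protocol, here is a minimal, self-contained C++ sketch of the idea described above: op_inline_blob packets carrying whole blobs arrive before the row data and are stored in a client-side, per-transaction cache. The type names, fields, and cache layout are illustrative assumptions, not the actual P_INLINE_BLOB definition from this PR.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model of the relevant wire operations (values are illustrative,
// not the real Firebird op codes).
enum WireOp { op_inline_blob, op_fetch_response };

struct InlineBlob
{
    uint32_t    tranId;   // transaction the cached blob is bound to
    uint64_t    blobId;   // blob id as it will appear in the fetched row
    std::string info;     // blob info items (total length, segments, type)
    std::string data;     // whole blob content, sent in one piece
};

// Client-side cache: blobs received inline, keyed by (transaction, blob id).
using BlobCache = std::map<std::pair<uint32_t, uint64_t>, InlineBlob>;

// Drain any op_inline_blob packets that precede the row data, then stop
// when the fetch response itself is seen.
void consumeStream(const std::vector<std::pair<WireOp, InlineBlob>>& packets,
                   BlobCache& cache)
{
    for (const auto& [op, payload] : packets)
    {
        if (op == op_inline_blob)
            cache[{payload.tranId, payload.blobId}] = payload;  // later open needs no round trip
        else if (op == op_fetch_response)
            break;  // row data follows; blobs for this row are already cached
    }
}
```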

The current, initial implementation sends all blobs whose total size is not greater than 16 KB.

The open question is what API changes are required to allow the user to customize this process (a hypothetical sketch follows the lists below):

  • allow enabling and disabling inline blob sending
  • allow setting the inline blob size limit
  • decide at what level the settings above should be applicable: per-attachment, per-statement, etc.
  • decide the default and maximum values for the inline blob size limit.

Also, good to have but not required:

  • allow setting the BPB in advance
  • allow enabling blob inlining on a per-field basis, if the output format contains many blob fields.
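As a discussion aid for the API questions above, here is a purely hypothetical sketch of how the attachment-level knobs might surface through the existing DPB / IXpbBuilder mechanism. The tags isc_dpb_inline_blob_enable and isc_dpb_max_inline_blob_size are invented for illustration and do not exist in the current API or in this PR.

```cpp
#include <ibase.h>
#include <firebird/Interface.h>
using namespace Firebird;

void buildDpbSketch(IMaster* master)
{
    ThrowStatusWrapper st(master->getStatus());
    IXpbBuilder* dpb = master->getUtilInterface()
        ->getXpbBuilder(&st, IXpbBuilder::DPB, nullptr, 0);

    dpb->insertString(&st, isc_dpb_user_name, "SYSDBA");

    // Assumed tags, not part of the current API:
    // dpb->insertInt(&st, isc_dpb_inline_blob_enable, 1);       // turn the feature on/off
    // dpb->insertInt(&st, isc_dpb_max_inline_blob_size, 16384); // inline size limit, bytes

    // ... attach using dpb->getBufferLength(&st) / dpb->getBuffer(&st) ...
    dpb->dispose();
}
```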

This PR is currently in draft state and is published for early testers and commenters.

hvlad self-assigned this Nov 14, 2024
hvlad marked this pull request as draft Nov 14, 2024 11:58
aafemt (Contributor) commented Nov 14, 2024

Why a new packet instead of sending them in the response message itself? IIRC response packets contain their own format, so inline BLOBs can be described individually as strings and then transformed into cached BLOBs on the client.

AlexPeshkoff (Member)

Vlad, I suppose that the content of op_inline_blob is cached by the remote provider in order to serve requests for data in those blobs without network access. If yes, how long is the data in that cache kept?

hvlad (Member, Author) commented Nov 14, 2024

> Vlad, I suppose that the content of op_inline_blob is cached by the remote provider in order to serve requests for data in those blobs without network access. If yes, how long is the data in that cache kept?

Yes, sure. The cached blob is bound to the transaction object and will be released by whichever happens first:

  • the transaction ends, or
  • the user opens the blob with a non-empty BPB, or
  • the user opens the blob with an empty BPB and then closes it.

Note that in the case when the user opens the blob with a non-empty BPB, the cached blob is discarded.
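A minimal sketch of those release rules, assuming a cache keyed by blob id inside the transaction object (names and layout are illustrative, not taken from the PR):

```cpp
#include <cstdint>
#include <map>
#include <string>

struct CachedBlob { std::string info, data; };

struct TransactionBlobCache
{
    std::map<uint64_t, CachedBlob> blobs;

    // Rule 1: everything goes away at transaction end.
    void onTransactionEnd() { blobs.clear(); }

    // Rule 2: opening with a non-empty BPB discards the cached copy,
    //         and the blob is opened remotely as before.
    // Rule 3: opening with an empty BPB is served from the cache;
    //         the entry is released when the application closes that blob.
    const CachedBlob* onOpenBlob(uint64_t blobId, bool bpbIsEmpty)
    {
        auto it = blobs.find(blobId);
        if (it == blobs.end())
            return nullptr;              // not cached: go over the wire as before

        if (!bpbIsEmpty)
        {
            blobs.erase(it);             // rule 2: discard, open remotely
            return nullptr;
        }
        return &it->second;              // rule 3: serve locally
    }

    void onCloseBlob(uint64_t blobId) { blobs.erase(blobId); }
};
```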

AlexPeshkoff (Member)

Imagine a RO-RC transaction which lasts VERY long (nothing prevents keeping it open for the client application's lifetime). Would not such a long life of the cache be an overhead?

hvlad (Member, Author) commented Nov 14, 2024

> Imagine a RO-RC transaction which lasts VERY long (nothing prevents keeping it open for the client application's lifetime). Would not such a long life of the cache be an overhead?

It is supposed that cached blobs will be read by the application.
Anyway, it would be good to have a way to set a limit on the blob cache size - is that your point?

AlexPeshkoff (Member)

To tell the truth, my first thought was that the cache would be very tiny - just the blobs from the last fetched row - but this appears inefficient when we try to support various grids.

First of all, let's think about binding the cache not to the transaction but to the request/statement. It is hardly typical to close a statement and read blobs from it after the close. Moreover, in the worst case that will still work anyway - in the old way, over the wire.

With a limit on cache size arrives one more tunable parameter, and I'm afraid there are already too many of them: blob size limit per-attachment or per-statement, maybe on a per-field basis (at least on/off), default BPB, maybe on a per-field basis too. (Hmm - are there many cases when more than one blob per row is returned?)

Last but not least - is blob inlining enabled by default? To my mind yes, but very reasonable (i.e. not too big) defaults should be used.

sim1984 commented Nov 14, 2024

There should be cache size limits in any case. If you loaded 1,000,000 records (1 blob per record) at 16 KB each, that's already 16 GB. But if I understand correctly, this happens only if the user does not read these cached blobs as the records are fetched. Maybe it is worth limiting the blob cache to some number, for example 1000 (configurable), and when the number of blobs becomes greater than this value, the oldest of them are removed from the cache.
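A small self-contained sketch of the bounded cache suggested here: keep at most N blobs and drop the oldest one when the limit is reached. The class and its layout are assumptions for illustration, not code from the PR.

```cpp
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>

class BoundedBlobCache
{
public:
    explicit BoundedBlobCache(size_t maxBlobs) : m_max(maxBlobs) {}

    void put(uint64_t blobId, std::string data)
    {
        if (m_index.count(blobId))
            return;                           // already cached
        if (m_order.size() >= m_max)
        {
            m_index.erase(m_order.front());   // evict the oldest entry
            m_order.pop_front();
        }
        m_order.push_back(blobId);
        m_index.emplace(blobId, std::move(data));
    }

    const std::string* get(uint64_t blobId) const
    {
        auto it = m_index.find(blobId);
        return it == m_index.end() ? nullptr : &it->second;
    }

private:
    size_t m_max;
    std::list<uint64_t> m_order;                       // insertion order, oldest first
    std::unordered_map<uint64_t, std::string> m_index; // blob id -> content
};
```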

And of course, this should be disabled/enabled at the statement level. And perhaps some DPB item to set the default parameter.

hvlad (Member, Author) commented Nov 14, 2024

> To tell the truth, my first thought was that the cache would be very tiny - just the blobs from the last fetched row - but this appears inefficient when we try to support various grids.

Yes, those were my thoughts too. Also, consider batch fetching, when a whole batch of rows is read from the wire - it will cache all corresponding blobs anyway.

> First of all, let's think about binding the cache not to the transaction but to the request/statement. It is hardly typical to close a statement and read blobs from it after the close. Moreover, in the worst case that will still work anyway - in the old way, over the wire.

It was like that in my very first version of the code - until I started to handle op_exec_immediate2, which has no statement :)

It is possible to mark blobs by statement id (when possible) and remove such blobs from the transaction cache on statement close.
But I prefer to avoid it, so far. It gives a chance for the "not typical" apps to access cached blobs after statement close - and I guess it is not so non-typical when there is no cursor, i.e. for 'EXECUTE PROCEDURE', etc.

> With a limit on cache size arrives one more tunable parameter, and I'm afraid there are already too many of them: blob size limit per-attachment or per-statement, maybe on a per-field basis (at least on/off), default BPB, maybe on a per-field basis too. (Hmm - are there many cases when more than one blob per row is returned?)

If there turn out to be too many parameters, we can put them into a separate dedicated interface, say IClientBlobCache, that would be implemented by the Remote provider only.
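A purely hypothetical sketch of what such a dedicated interface could look like - the name IClientBlobCache comes from the comment above; the methods and their semantics are assumptions for discussion only:

```cpp
#include <cstdint>

class IClientBlobCache
{
public:
    // upper bound, in bytes, for a blob to be sent inline by the server
    virtual void setMaxInlineBlobSize(unsigned bytes) = 0;
    virtual unsigned getMaxInlineBlobSize() const = 0;

    // client-side cap on the total amount of cached inline blob data
    virtual void setMaxCacheSize(uint64_t bytes) = 0;
    virtual uint64_t getMaxCacheSize() const = 0;

    // switch the feature off entirely for this attachment
    virtual void setEnabled(bool enabled) = 0;

protected:
    ~IClientBlobCache() = default;
};
```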

And I'm sure there are applications that have many blobs in their result sets. Look at the monitoring tables, for example: MON$STATEMENTS has two blobs, and there are others.

> Last but not least - is blob inlining enabled by default? To my mind yes, but very reasonable (i.e. not too big) defaults should be used.

Currently it is enabled - otherwise nobody would be able to test the feature ;)

One of the goals of this PR is to discuss and then implement the necessary set of parameters and the corresponding API to customize the blob cache.

So far, I see two really required parameters: 'maximum blob size for inline sending' (per-statement or per-attachment - to be decided; it must be known to the server) and 'size of the blob cache' (per-attachment, client-only). Others are 'good to have' but not strictly required: BPB, per-field inlining.
