BlobUploader utilities to enable handling of large data in instrumentation #3122
michaelsafyan wants to merge 21 commits into open-telemetry:main
Conversation
...telemetry-instrumentation/src/opentelemetry/instrumentation/_blobupload/api/blob_uploader.py
opentelemetry-instrumentation/src/opentelemetry/instrumentation/_blobupload/api/content_type.py
Looks like I'm already getting some review comments, so will convert from DRAFT to READY.
I've reviewed most of the code and suggested type hints, along with a few simple improvements like f-strings.
But more generally, I suggest this approach should be fundamentally reconsidered:
The obvious and most scalable way to implement blob uploads today is using pre-signed URLs.
The idea would be:
- the client makes a GET or POST request to an endpoint provided by the backend, with the file type (optional), name (optional), and size (support for more specific attributes like width/height could be added)
- the backend returns a URL (most likely an S3-style pre-signed URL) and a reference
- the client posts the data to that URL, raising an error if the response is not 2XX
- the client stores the reference in the OTel data
There are lots of advantages to this approach IMHO:
- it means the client needs to implement ZERO logic related to different providers and object stores; it just gets a URL and posts data to it
- this pre-signed URL approach is already implemented by S3, GCS, and every other S3-compatible object store, so it should be pretty easy for backends to implement
- if backends want to implement things differently, they can; the client logic is completely independent of the signing method, destination URL, etc.
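To make the suggestion concrete, here is a minimal sketch of the client side of that flow. Everything here is hypothetical, not code from the PR: `UploadGrant`, `issue_upload_url`, and `put_bytes` are illustrative names, and the backend endpoint and transport are injected as callables so the sketch stays independent of any provider.

```python
# Hypothetical sketch of the pre-signed-URL flow proposed in this review.
# None of these names come from the PR; the backend call and the HTTP PUT
# are injected so the client has zero provider-specific logic.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class UploadGrant:
    url: str        # pre-signed URL (e.g. S3/GCS style) returned by the backend
    reference: str  # opaque reference the client stores in the OTel data


def upload_blob(
    issue_upload_url: Callable[[dict], UploadGrant],
    put_bytes: Callable[[str, bytes], int],
    raw_bytes: bytes,
    content_type: Optional[str] = None,
    name: Optional[str] = None,
) -> str:
    """Client half of the flow: ask the backend for an upload URL (sending
    size/type/name), push the data to that URL, fail on non-2XX, and return
    the reference to record on the span/event."""
    grant = issue_upload_url(
        {"size": len(raw_bytes), "content_type": content_type, "name": name}
    )
    status = put_bytes(grant.url, raw_bytes)
    if not 200 <= status < 300:
        raise RuntimeError(f"upload to {grant.url} failed with HTTP {status}")
    return grant.reference
```

Because the two callables capture all provider-specific behavior, the same client works unchanged whether the backend signs S3, GCS, or something custom.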
    self._labels[k] = labels[k]

    @staticmethod
    def from_data_uri(uri: str, labels: Optional[dict] = None) -> "Blob":
this would be easier to extend if this was a classmethod that returned cls(raw_bytes, content_type=content_type, labels=labels).
Alternatively, if this class shouldn't be subclassed, it should be marked as final.
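A sketch of the classmethod variant being suggested. The simplified `Blob` class here only mirrors the fields described in this PR (raw_bytes, content_type, labels) and the data-URI parsing is illustrative, not the PR's implementation:

```python
# Illustrative sketch of from_data_uri as a classmethod, per the review
# suggestion; this simplified Blob is not the PR's actual class.
import base64
from typing import Optional


class Blob:
    def __init__(self, raw_bytes: bytes, content_type: Optional[str] = None,
                 labels: Optional[dict] = None):
        self.raw_bytes = raw_bytes
        self.content_type = content_type
        self.labels = dict(labels or {})

    @classmethod
    def from_data_uri(cls, uri: str, labels: Optional[dict] = None) -> "Blob":
        # data:[<mediatype>][;base64],<data>
        if not uri.startswith("data:"):
            raise ValueError("not a data URI")
        header, _, payload = uri.partition(",")
        meta = header[len("data:"):]
        if meta.endswith(";base64"):
            content_type = meta[: -len(";base64")] or None
            raw = base64.b64decode(payload)
        else:
            content_type = meta or None
            raw = payload.encode("utf-8")
        # returning cls(...) rather than Blob(...) means a subclass calling
        # SubBlob.from_data_uri(...) gets back a SubBlob instance
        return cls(raw_bytes=raw, content_type=content_type, labels=labels)
```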
opentelemetry-instrumentation/src/opentelemetry/instrumentation/_blobupload/api/blob.py
    This object conceptually has the following properties:

    - raw_bytes: the actual data (payload) of the Blob
    - content_type: metadata about the content type (e.g. "image/jpeg")
    - labels: key/value data that can be used to identify and contextualize
      the object, such as {"trace_id": "...", "span_id": "...", "filename": ...}
this duplicates the docs on the properties.
    'traces/12345/spans/56789'
    'traces/12345/spans/56789/events/0'
    'traces/12345/spans/56789/events/some.event.name'
what happens if we want to include some kind of customer or project reference in the path?
...nstrumentation/src/opentelemetry/instrumentation/_blobupload/backend/google/gcs/_gcs_impl.py
    """Returns a variant of the Blob with the content type auto-detected if needed."""
    if blob.content_type is not None:
        return blob
    content_type = detect_content_type(blob.raw_bytes)
can't we infer the content type from the labels, instead of inspecting the bytes?
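A minimal sketch of the label-based inference the comment is asking about. The `content_type` and `filename` label keys and the stdlib `mimetypes` fallback are assumptions for illustration; the PR does not define this helper:

```python
# Hypothetical helper inferring content type from blob labels before
# falling back to byte inspection; label keys here are assumptions.
import mimetypes
from typing import Optional


def content_type_from_labels(labels: dict) -> Optional[str]:
    """Return a content type derived from labels, or None if the labels
    carry no usable hint (in which case byte-sniffing would still run)."""
    explicit = labels.get("content_type")
    if explicit:
        return explicit
    filename = labels.get("filename")
    if filename:
        guessed, _encoding = mimetypes.guess_type(filename)
        return guessed
    return None
```

A caller could try this first and only fall back to `detect_content_type(blob.raw_bytes)` when it returns None, avoiding byte inspection for labeled blobs.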
...entation/src/opentelemetry/instrumentation/_blobupload/utils/simple_blob_uploader_adaptor.py
Co-authored-by: Samuel Colvin <s@muelcolvin.com>
Apologies for the delay. Full transparency: this is being deprioritized behind work to add instrumentation for the GenAI SDK (github.com/googleapis/python-genai). I will probably not be able to pick this work up again until that is completed.
Description
Provides an experimental library for uploading signal data to blob storage, as a proof of concept to help inform the direction of instrumentation that handles request/response data, with a focus on GenAI multimodal data.
Related discussion to this PR:
Type of change
How Has This Been Tested?
Wrote unit tests for the relevant files added.
Does This PR Require a Core Repo Change?
Unsure.
Checklist:
See contributing.md for styleguide, changelog guidelines, and more.