[Task]: Data Sampling #25064

Open
3 of 15 tasks
rohdesamuel opened this issue Jan 18, 2023 · 0 comments

What needs to happen?

This issue is to track adding a data sampling feature to the SDKs as discussed in the associated slide deck. This adds:

  • The Sample instruction (SampleRequest and SampleResponse) to query the samples from an SDK
  • The "beam:protocol:data_sampling:v1" capability to track which SDKs can sample
  • The "enable_data_sampling" experiment to toggle the feature
  • Automatic sampling in the Python, Java, and Go SDKs for running PTransforms
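To make the pieces above concrete, here is a minimal Python sketch of how a harness might gate sampling on the experiment, advertise the capability, and answer a sample request. Only the capability URN and the experiment name come from this issue. The `OutputSampler` and `DataSampler` classes, their methods, and the time-based sampling interval are hypothetical names chosen for illustration, not the SDK's actual API.

```python
import collections
import time

# Both strings are taken from this issue.
DATA_SAMPLING_URN = "beam:protocol:data_sampling:v1"
DATA_SAMPLING_EXPERIMENT = "enable_data_sampling"


class OutputSampler:
    """Hypothetical per-PCollection sampler.

    Keeps at most `max_samples` elements and takes a new sample only when
    at least `sample_every_sec` seconds have passed since the last one.
    """

    def __init__(self, max_samples=10, sample_every_sec=1.0,
                 clock=time.monotonic):
        self._samples = collections.deque(maxlen=max_samples)
        self._sample_every_sec = sample_every_sec
        self._clock = clock
        self._last_sample_time = None

    def maybe_sample(self, element):
        now = self._clock()
        due = (self._last_sample_time is None or
               now - self._last_sample_time >= self._sample_every_sec)
        if due:
            self._samples.append(element)
            self._last_sample_time = now

    def flush(self):
        """Returns the buffered samples and clears the buffer."""
        samples = list(self._samples)
        self._samples.clear()
        return samples


class DataSampler:
    """Hypothetical registry mapping PCollection ids to samplers."""

    def __init__(self, experiments):
        self.enabled = DATA_SAMPLING_EXPERIMENT in experiments
        self._samplers = {}

    def capabilities(self):
        # Advertise the capability only when the experiment is on.
        return [DATA_SAMPLING_URN] if self.enabled else []

    def sampler_for(self, pcollection_id, **kwargs):
        if not self.enabled:
            return None
        if pcollection_id not in self._samplers:
            self._samplers[pcollection_id] = OutputSampler(**kwargs)
        return self._samplers[pcollection_id]

    def handle_sample_request(self, pcollection_ids=None):
        """Returns {pcollection_id: [samples]} for the requested ids (all if None)."""
        ids = pcollection_ids or list(self._samplers)
        return {
            pid: self._samplers[pid].flush()
            for pid in ids if pid in self._samplers
        }
```

In this sketch the clock is injectable so that tests can advance time deterministically, and elements are stored as-is; a real harness would presumably encode them with the PCollection's coder before returning them.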

Issue Priority

Priority: 3 (nice-to-have improvement)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
rohdesamuel mentioned this issue Feb 10, 2023
3 tasks
pabloem pushed a commit that referenced this issue Feb 16, 2023
* Implement the functionality to sample/encode/decode elements

* Implement Python data sampling

* Final cleanup for Python SDK data sampling implementation

* lint

* lint

* add data sampler test

* finish data sampler tests

* Add comments, encode in the nested context, sample based on time instead

* run tests

* fix tests

* run tests

* rebase based on new proto names

* trigger tests

* linter

* replace mockito clocks with simple FakeClock

---------

Co-authored-by: Sam Rohde <srohde@google.com>
lukecwik pushed a commit that referenced this issue Feb 22, 2023
PCollection data sampling for Java SDK harness #25064

This adds the capability for the Java SDK harness to sample in-flight elements. The implementation modifies the ProcessBundleDescriptor when received by the BundleProcessor to create additional DataSampling PTransforms on PCollections. The samples are then returned when the SDK receives a SampleDataRequest.

Task #25064
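The descriptor rewrite described in this commit message can be pictured with a small language-neutral sketch, written here in Python over plain dictionaries rather than the real protobuf types. The dictionary layout, the `add_sampling_transforms` function, the `synthetic-data-sampling-` id prefix, and the `example:data_sampling:v1` URN are all assumptions made for illustration; the Java harness's actual classes and identifiers are not shown in this thread.

```python
import copy

# Hypothetical placeholder URN for the synthetic sampling transform.
SAMPLING_TRANSFORM_URN = "example:data_sampling:v1"


def add_sampling_transforms(descriptor):
    """Returns a copy of `descriptor` with one sampling transform per PCollection.

    `descriptor` is assumed to look like:
      {"transforms": {id: {"urn": str, "inputs": {...}, "outputs": {...}}},
       "pcollections": {id: {...}}}
    Each added transform consumes exactly one PCollection and has no outputs,
    so the original data flow is unchanged.
    """
    rewritten = copy.deepcopy(descriptor)
    for pcoll_id in descriptor["pcollections"]:
        transform_id = f"synthetic-data-sampling-{pcoll_id}"
        rewritten["transforms"][transform_id] = {
            "urn": SAMPLING_TRANSFORM_URN,
            "inputs": {"main": pcoll_id},
            "outputs": {},
        }
    return rewritten
```

Returning a modified copy, instead of mutating the received descriptor, keeps the original available for comparison; whether the Java implementation does the same is not stated here.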
lostluck pushed a commit to lostluck/beam that referenced this issue Feb 22, 2023
ruslan-ikhsan pushed a commit to akvelon/beam that referenced this issue Mar 10, 2023