Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fields parameter in join to fetch a subset of fields from rhs #535

Merged
merged 9 commits into from
Sep 7, 2024

Conversation

satrana42
Copy link
Contributor

Add right_fields parameter in join to fetch only a subset of fields from rhs

Copy link

ellipsis-dev bot commented Aug 21, 2024

Your free trial has expired. To keep using Ellipsis, sign up at https://app.ellipsis.dev for $20/seat/month or reach us at help@ellipsis.dev

]
]

c = rating.join(revenue, how="left", on=[str(cls.movie)], right_fields=["revenue"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/right_fields/rhs_fields

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used right_fields because the param for join on is called right_on and not rhs_on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should just use fields here since that's quite disambiguated on its own (there is no analog of left_fields)? cc @aditya-nambiar

c = a.join(b, how="inner", on=["user_id"], right_fields=["rank"]) # type: ignore
return c

assert "doesn't exist in allowed fields of right schema" in str(e.value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add assert that we print the pipeline and dataset name in the error message so that user can figure where the issue is

c = a.join(b, how="inner", on=["user_id"], right_fields=["user_id"]) # type: ignore
return c

assert "doesn't exist in allowed fields of right schema" in str(e.value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

user doesnt know what the allowed fields are? can we tell them eg value fields and or timestamp of the rhs dataset

c = a.join(b, how="inner", on=["user_id"], right_fields=["timestamp"]) # type: ignore
return c

assert "already exists in left schema" in str(e.value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please assert on teh full message rather than using "in"

Comment on lines 1274 to 1283
"on":{
"a1":"b1"
},
"withinLow":"259200s",
"withinHigh":"31536000s"
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the how missing ? We should not modify what the protos used to generate since that is stored state on disk

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the test throw an error if I copied wrong json?

Comment on lines 1040 to 1042
# TODO(sat): Are the same checks here and in SchemaValidator redundant?
# If right_fields is set, check that it contains elements from right schema values and timestamp only
if self.right_fields is not None and len(self.right_fields) > 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefer to raise error in SchemaValidator as we can give a lot more context about the error there

@satrana42 satrana42 changed the title [WIP] Add fields parameter in join to fetch a subset of fields from rhs Add fields parameter in join to fetch a subset of fields from rhs Sep 4, 2024
@aditya-nambiar aditya-nambiar merged commit 8427e32 into main Sep 7, 2024
8 checks passed
@aditya-nambiar aditya-nambiar deleted the sat/fields_join_rhs branch September 7, 2024 04:00
satrana42 added a commit that referenced this pull request Oct 18, 2024
In this change, we update or add documentation for operator
changes done in #553, #556, #535, #560 and #578.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants