feat: implement __repr__ on user-facing classes #399

Closed
wants to merge 11 commits into from

@axiomofjoy axiomofjoy commented Mar 19, 2023

Implements __repr__ on user-facing classes so users can inspect and understand the classes they are interacting with in their notebooks.

The __repr__ for Schema and EmbeddingColumnNames is implemented so that a user can copy and paste in order to instantiate an identical dataclass. I made an effort to handle invalid values nicely so the user can still inspect their dataclasses to see where they messed up.

Example Output for Schema

Schema(
    prediction_id_column_name='prediction_id',
    timestamp_column_name='timestamp',
    feature_column_names=[
        'feature_1',
        'feature_2',
    ],
    embedding_feature_column_names={
        'embedding_feature': EmbeddingColumnNames(
            vector_column_name='embedding_vector',
            raw_data_column_name='raw_data',
        ),
    },
)
Schema(
    feature_column_names=   A  B
                         0  1  7
                         1  5  2
                         2  3  8,
)

Example Output for EmbeddingColumnNames

EmbeddingColumnNames(
    vector_column_name='embedding_vector',
)
EmbeddingColumnNames(
    vector_column_name='embedding_vector',
    raw_data_column_name='raw_data',
)
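The copy-and-paste property described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the two field names come from the example output, while the field types, the `None` defaults, and the choice to skip unset fields are assumptions.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True, repr=False)
class EmbeddingColumnNames:
    # Field names follow the example output above; types and defaults are assumed.
    vector_column_name: str
    raw_data_column_name: Optional[str] = None

    def __repr__(self) -> str:
        # Emit one indented "name=value," line per field that is set (not None).
        lines = [
            f"    {field.name}={getattr(self, field.name)!r},"
            for field in fields(self)
            if getattr(self, field.name) is not None
        ]
        return "\n".join([f"{type(self).__name__}(", *lines, ")"])


names = EmbeddingColumnNames(vector_column_name="embedding_vector")
print(repr(names))
```

Because each value is rendered with `!r`, the printed text is itself a valid constructor call, so pasting it back into a notebook cell rebuilds an equal instance.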

Example Output for DatasetDict

DatasetDict({
    'primary': Dataset(
        dataframe=...,
        schema=...,
        name='primary',
    ),
    'reference': Dataset(
        dataframe=...,
        schema=...,
        name='reference',
    ),
})

Example Output for Dataset

Phoenix Dataset
===============

name: 'example'

dataframe:
    columns: ['A', 'B', 'C', 'D', 'E', 'timestamp', 'prediction_id']
    shape: (10, 7)

schema: Schema(
    prediction_id_column_name='prediction_id',
    timestamp_column_name='timestamp',
    feature_column_names=[
        'A',
        'B',
        'C',
        'D',
        'E',
    ],
)

@@ -46,6 +46,7 @@ dev = [
"pytest-lazy-fixture",
"strawberry-graphql[debug-server]==0.155.3",
"pre-commit",
"mypy==0.991",

Contributor Author

A newer version of mypy was causing the mypy daemon in VS Code to crash.

Comment on lines +335 to +340
def __getitem__(self, key: str) -> Dataset:
    try:
        return cast(Dataset, getattr(self, key))
    except AttributeError:
        raise KeyError(f"Invalid key: {key}")

Contributor Author

Sub-classing off of Dict does not make this a dictionary. Adding this dunder so users can actually use indexing as you would expect.
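A self-contained sketch of the same idea, assuming (as the `getattr` call in the diff suggests) that the values are stored as attributes on the object. `AttrBackedMapping` is a hypothetical stand-in, not the actual `DatasetDict`, and plain strings stand in for `Dataset` instances.

```python
from typing import Any


class AttrBackedMapping:
    """Hypothetical stand-in for DatasetDict: values live in attributes."""

    def __init__(self, **items: Any) -> None:
        for name, value in items.items():
            setattr(self, name, value)

    def __getitem__(self, key: str) -> Any:
        try:
            return getattr(self, key)
        except AttributeError:
            # Indexing is expected to fail with KeyError, not AttributeError.
            raise KeyError(f"Invalid key: {key}") from None


datasets = AttrBackedMapping(primary="primary dataset", reference="reference dataset")
print(datasets["primary"])
```

One caveat of delegating to `getattr` is that any attribute name resolves, including method names, so `datasets["__init__"]` would return a bound method rather than raise `KeyError`.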

Contributor

Nice!

@axiomofjoy axiomofjoy marked this pull request as ready for review March 19, 2023 00:11
@axiomofjoy axiomofjoy changed the title from "feat: make Schema and EmbeddingColumnNames inspectable" to "feat: make user-facing classes inspectable" Mar 19, 2023
@axiomofjoy axiomofjoy changed the title from "feat: make user-facing classes inspectable" to "feat: implement __repr__ on user-facing classes" Mar 19, 2023
@axiomofjoy
Contributor Author

@fjcasti1 I like your idea for displaying Dataset instances long-term: https://arize-ai.slack.com/archives/C04QMRADE1L/p1679023435224509

@axiomofjoy axiomofjoy removed the request for review from fjcasti1 March 20, 2023 05:12
@RogerHYang
Contributor

RogerHYang commented Mar 20, 2023

Not blocking, but this doesn't seem like the idiomatic use of __repr__. See discussion here.

From the docs:

[...] For many types, this function makes an attempt to return a string that would yield an object with the same value when passed to eval() [...]
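The convention quoted from the docs can be checked directly for built-in values and for a default dataclass repr. The sketch below is only an illustration; `Point` is a hypothetical class, not part of Phoenix.

```python
from dataclasses import dataclass


@dataclass
class Point:  # hypothetical example class
    x: int
    y: int


# For these values, evaluating the repr yields an equal object.
for value in [42, "text", [1, 2.5, None], {"a": (1, 2)}, Point(1, 2)]:
    assert eval(repr(value)) == value

# The fallback repr of a plain object is descriptive rather than evaluable:
# it is wrapped in angle brackets and includes a memory address.
default = repr(object())
print(default)
```

The docs wording "for many types" matters here: the round trip is a convention that types follow when it is practical, not a requirement.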

Contributor

@fjcasti1 fjcasti1 left a comment

A little unsure about the Viewable class. Maybe we can sync offline about that.

Comment on lines 256 to 270
def __repr__(self) -> str:
    """
    Return a string to display the dataset's name, dataframe, and schema.
    """
    repr_string = (
        """Phoenix Dataset
===============\n\n"""
        + f"name: '{self.name}'\n\n"
        + f"""dataframe:
    columns: {list(self.dataframe.columns)}
    shape: {self.dataframe.shape}\n\n"""
        + f"schema: {self.schema}"
    )
    return repr_string

Contributor

I don't think the dunder methods need docstrings in general since their intentions are self-explanatory. I also think it's common practice to place these methods at the top. Could be wrong here.

Contributor Author

According to this StackOverflow post, there's no official recommendation on this topic. Let's have a conversation about what convention we want to use.

Comment on lines 348 to 353
def _format_dataset(dataset: Dataset) -> str:
    return f"""Dataset(
    dataframe=...,
    schema=...,
    name='{dataset.name}',
)"""

Contributor

Not in love with the ... but I think for now we can pass that until we get more feedback. Any other solution that comes to mind?

Contributor Author

I was unsure about this as well. I haven't thought of anything better, but open to suggestions.

@axiomofjoy
Contributor Author

axiomofjoy commented Mar 20, 2023

> Not blocking, but this doesn't seem like the idiomatic use of __repr__. See discussion here.
>
> From the docs:
>
> [...] For many types, this function makes an attempt to return a string that would yield an object with the same value when passed to eval() [...]

I'm not sure what that would look like for Dataset and DatasetDict. Do you have something in mind? If you call repr on a Pandas DataFrame, for example, it looks something like this:

   A  B
0  1  7
1  5  2
2  3  8

For Schema and EmbeddingColumnNames, you can copy and paste the __repr__ output to instantiate an identical dataclass.

@axiomofjoy axiomofjoy closed this Mar 20, 2023
@axiomofjoy axiomofjoy reopened this Mar 20, 2023
@RogerHYang
Contributor

> Do you have something in mind? If you call repr on a Pandas DataFrame, for example, it looks something like this:

I don't have anything better, so it's not blocking... just pointing out that this approach is unorthodox.

rule of thumb: __repr__ is for developers, __str__ is for customers.

Output from pandas:

>>> repr(pd.DataFrame({"x":[1,2,3]}))
'   x\n0  1\n1  2\n2  3'
>>> print(pd.DataFrame({"x":[1,2,3]}))
   x
0  1
1  2
2  3
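The rule of thumb above can be illustrated with a small class that defines both methods. This is a generic sketch; `Temperature` is a hypothetical class and is not related to the Phoenix code under review.

```python
class Temperature:  # hypothetical example class
    def __init__(self, celsius: float) -> None:
        self.celsius = celsius

    def __repr__(self) -> str:
        # Developer-facing: unambiguous, and here also a valid constructor call.
        return f"Temperature(celsius={self.celsius!r})"

    def __str__(self) -> str:
        # User-facing: a readable rendering, rounded for display.
        return f"{self.celsius:.1f} C"


t = Temperature(21.456)
print(repr(t))  # uses __repr__
print(t)        # print() and str() use __str__
print([t])      # containers render their items with __repr__
```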

@axiomofjoy
Contributor Author

axiomofjoy commented Mar 20, 2023

> > Do you have something in mind? If you call repr on a Pandas DataFrame, for example, it looks something like this:
>
> I don't have anything better, so it's not blocking... just pointing out that this approach is unorthodox.
>
> rule of thumb: __repr__ is for developers, __str__ is for customers.
>
> Output from pandas:
>
> >>> repr(pd.DataFrame({"x":[1,2,3]}))
> '   x\n0  1\n1  2\n2  3'
> >>> print(pd.DataFrame({"x":[1,2,3]}))
>    x
> 0  1
> 1  2
> 2  3

>>> pd.DataFrame({"x":[1,2,3]})
   x
0  1
1  2
2  3

__repr__ also called here.
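The point about the bare expression can be reproduced without pandas. The sketch below relies on two standard Python behaviours: `str()` falls back to `__repr__` when a class defines no `__str__`, and the default interactive display hook writes `repr(value)`. `OnlyRepr` is a hypothetical class used only for illustration.

```python
import contextlib
import io
import sys


class OnlyRepr:  # hypothetical example class
    def __repr__(self) -> str:
        return "OnlyRepr()"


obj = OnlyRepr()

# With no __str__ defined, str() and print() fall back to __repr__.
assert str(obj) == repr(obj) == "OnlyRepr()"

# sys.__displayhook__ is the default hook the plain interactive interpreter
# uses to echo a bare expression; it writes repr(value) to stdout.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    sys.__displayhook__(obj)
echoed = buffer.getvalue()
print(echoed, end="")
```

Notebook front ends such as IPython install their own display hook, so what a notebook shows for a bare expression can differ from this default.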

@axiomofjoy axiomofjoy self-assigned this Mar 22, 2023
@axiomofjoy axiomofjoy mentioned this pull request Mar 22, 2023
25 tasks
@axiomofjoy
Contributor Author

Overly complicated; closing in favor of #425.

@axiomofjoy axiomofjoy closed this Mar 23, 2023
3 participants