Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor!: revise point cloud plumbing and add unit tests #205

Merged
merged 14 commits into from
Feb 6, 2023

Conversation

RogerHYang
Copy link
Contributor

@RogerHYang RogerHYang commented Feb 1, 2023

resolves #169

This PR refactors PointCloud into a stand-alone process by inverting its dependencies, removing non-essential responsibilities and adding unit tests.

Data filters and sampling will be addressed more comprehensively in a future PR

@RogerHYang RogerHYang changed the title refactor!: revise point cloud process flow and add unit tests refactor!: revise point cloud plumbing and add unit tests Feb 1, 2023
Copy link
Contributor

@fjcasti1 fjcasti1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I like how it is structured. My only nit would be that I would subclass UMAP from Projectors. But I understand what you're going for and this might be the design decision.

I see you making changes so let me know when it's ready so I give a final look and approve

src/phoenix/pointcloud/clustering.py Outdated Show resolved Hide resolved
src/phoenix/pointcloud/clustering.py Outdated Show resolved Hide resolved
src/phoenix/pointcloud/pointcloud.py Show resolved Hide resolved
@RogerHYang RogerHYang marked this pull request as ready for review February 1, 2023 19:05
src/phoenix/pointcloud/clustering.py Outdated Show resolved Hide resolved
src/phoenix/pointcloud/clustering.py Show resolved Hide resolved
src/phoenix/pointcloud/clustering.py Outdated Show resolved Hide resolved

Identifier = TypeVar("Identifier", bound=Hashable)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a need for this generic? trying to understand what you mean by an identifier here. A doc-string could help here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's mostly a placeholder. I see that we have used uuid (which is a str) as an identifier for the prediction_id, so it could be just a str in the dict key, but it's a dynamic language, so it could work if the prediction_id is a number (e.g. row number). Making it generic here is just to keep the typing flexible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it - I think we should probably consider not over abstracting but this is neat.

src/phoenix/pointcloud/pointcloud.py Outdated Show resolved Hide resolved
src/phoenix/pointcloud/pointcloud.py Outdated Show resolved Hide resolved
src/phoenix/pointcloud/pointcloud.py Outdated Show resolved Hide resolved
src/phoenix/pointcloud/projectors.py Outdated Show resolved Hide resolved

Identifier = TypeVar("Identifier", bound=Hashable)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it - I think we should probably consider not over abstracting but this is neat.


Parameters
----------
data : mapping
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
data : mapping
data : Mapping[Identifier, Vector]

Does this not corralate with the type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. i was looking for examples from other code bases but haven't found any, so kind of just gave up. i wasn't sure if these types variables belong in a docstring since they're local constructs relevant only to the type checker



@dataclass
class MockClustersFinder:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice dependency injection.

n_components,
)
assert projections.shape == (samp_size, n_components)
assert abs(projections.mean()) < 1e-5 # is centered
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is is deterministic? Kinda scary if not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i opted for a randomized input here so we don't need to create fixture, so it's all random for now

@pytest.mark.parametrize("samp_size,n_features,n_components", [(10, 8, 5), (15, 7, 3)])
def test_umap(samp_size: int, n_features: int, n_components: int) -> None:
a = np.random.rand(n_features, n_features)
projections = Umap().project(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering if this is testing too much of UMAP itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, it probably is, and we can see that from the runtime (about 8 seconds). But the other intent here is to have a piece of live code available for type checking and to run debugger on Umap itself so we can understand it better. So I kept it mostly for the that purpose, instead of having it be a real test per se. (The random matrix generator is also handy for other developments.) Eventually I don't see this test being useful in the long term.

@RogerHYang RogerHYang merged commit 8b120b3 into main Feb 6, 2023
@RogerHYang RogerHYang deleted the point-cloud-process-flow branch February 6, 2023 18:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Run HDBSccan in series w/ UMAP
3 participants