Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore (base, export): Soft code text_featurize method; ensure unique feature names #94

Merged
merged 38 commits into from
Jan 23, 2024

Conversation

Arkhymadhe
Copy link
Collaborator

@Arkhymadhe Arkhymadhe commented Oct 20, 2023

This PR should:

  • soft code the text_featurize featurizer methods. This will allow flexibility with respect to the parts of speech for molecular text.
  • Ensure feature names are unique.

Should close issues #74 and #90.

@Arkhymadhe Arkhymadhe added the refactor Changes (mostly refactoring) might be needed label Oct 20, 2023
@Arkhymadhe Arkhymadhe linked an issue Oct 20, 2023 that may be closed by this pull request
@Arkhymadhe Arkhymadhe linked an issue Oct 20, 2023 that may be closed by this pull request
@Arkhymadhe Arkhymadhe changed the title chore (base): Soft code text_featurize method chore (base, export): Soft code text_featurize method; ensure unique feature names Oct 21, 2023
@Arkhymadhe
Copy link
Collaborator Author

@kjappelbaum Are there any changes you'd like to see here?

Comment on lines 45 to 66
if __name__ == "__main__":
repetitive_labels, all_labels = get_repetitive_labels(FEATURIZER)

print("Diagnostics:")
print("=" * 50)
print("Number of featurizers implemented:", len(FEATURIZER.featurizers))
print("=" * 50)
print("Number of labels:", len(all_labels))
print("Number of repeated labels:", len(repetitive_labels))
print("Number of unique labels:", len(set(all_labels)))

if len(repetitive_labels) == 0:
print("=" * 50)
print("All detected labels:\n")

for i, label in enumerate(all_labels, 1):
print(f"{i:.3g}.", label)
exit()

for k, v in repetitive_labels.items():
if v["count"] > 1:
print(f"{k}", v)
Copy link
Collaborator

@kjappelbaum kjappelbaum Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you run this now, are there still repetitive labels? Could we run this in the GitHub actions?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. All is good. You can run on your end and confirm. When I first ran it, there were, so I actually had fix things to ensure there weren't.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! What do you think of adding it as something we run in the GitHub actions?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that would make sense. I have never setup Github Actions before, so this should be educative for me too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! What do you think of adding it as something we run in the GitHub actions?

This is done 👍. Check it out. Also, as you might observe, I had to refactor some of the tests.

@@ -2,8 +2,9 @@ name: Tests

on:
push:
branches: [ main ]
branches: [ main, text_featurize ]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put this should not be there in general? Why not use workflow_dispatch instead for manual tests?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I'll check out this workflow_dispatch.

@@ -40,6 +41,8 @@ jobs:
- name: Install dependencies
run: |
pip install -e ".[symmetry,tests]"
- name: Export pre-tests
run: python src/chemcaption/export/pre_test_export.py
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this "only" prints stuff, right?

That is, it would not raise an error in the actions if we found repetitive labels? For this we would need to raise an exception in the python code

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I didn't think of that. I'll do that now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fixed now.

@@ -40,6 +42,8 @@ jobs:
- name: Install dependencies
run: |
pip install -e ".[symmetry,tests]"
- name: Export pre-tests
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the name "Export pre-tests" ideal? I understand that the main purpose is to check for uniqueness?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fixed now. Renamed as Test for label uniqueness.

Copy link
Collaborator

@kjappelbaum kjappelbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks all good to me here. I didn't run the unittests but I assume that they pass

@Arkhymadhe
Copy link
Collaborator Author

looks all good to me here. I didn't run the unittests but I assume that they pass

The only test failing here is the SVGFeaturizer test for the generated text. Maybe I should look into that in the unit_tests PR to prevent this PR from being unnecessarily bloated?

@kjappelbaum
Copy link
Collaborator

The only test failing here is the SVGFeaturizer test for the generated text. Maybe I should look into that in the unit_tests PR to prevent this PR from being unnecessarily bloated?

ok, we can open a new issue for that and merge that soon.

@@ -2,8 +2,10 @@ name: Tests

on:
push:
branches: [ main ]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in principle ok, but was there any particular reason for that change?

@kjappelbaum kjappelbaum merged commit 70e32af into lamalab-org:main Jan 23, 2024
@kjappelbaum kjappelbaum deleted the text_featurize branch January 23, 2024 13:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
refactor Changes (mostly refactoring) might be needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ensure that all feature names are unique do not hard-code text_featurize to rely on noun
2 participants