Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature frequencies #37

Open
wants to merge 21 commits into
base: dev
Choose a base branch
from
Open

Feature frequencies #37

wants to merge 21 commits into from

Conversation

JPapir
Copy link
Collaborator

@JPapir JPapir commented Sep 22, 2024

This PR closes #31 and helps for #32 .

It is a reworked version of the frequencies class that I used in the overabundance branch. Features:

  • One object to create
  • Reads the tables at the first opening and try to catch all frequency information that are available.
  • In theory (but it is hard to test everything), it is able to handle multiple sources of frequencies (which can happen in frequency tables)
  • Tables are accessible as attributes : f.forms, f.cells, f.lexemes
  • But it is recommended to access them using the two following methods:
    • get_absolute_freq()
    • get_relative_freq()
      Which are able to do some filtering/grouping/sum/mean/cookies/cheesecakes depending on the arguments.

In particular, the get_relative_freq uses a C implementation to do the sums. This wasn't straightforward, because skipna arguments are missing from the groupby.sum() C implementation in pandas.... So I had to cheat a bit.

This class is maybe a bit overkill. But in theory, it can handle very different kinds of data sources, and recover a lot of information, which makes it quite multipurpose.

There are tests included and it is documented.

Uses

There is no rush to merge it by the way !

@JPapir JPapir changed the base branch from master to dev September 22, 2024 20:03
@JPapir
Copy link
Collaborator Author

JPapir commented Sep 22, 2024

I forgot to push the test package and to update the links in the docstrings 🫤
It will be for tomorrow.

@JPapir JPapir added the enhancement New feature or request label Sep 22, 2024
@JPapir
Copy link
Collaborator Author

JPapir commented Sep 23, 2024

Hum, just noticed that I used the paralex.read_table() and forgot to add Paralex to the setup.cfg

But this raises another question: Paralex expects frictionless=5.16, which in turn expects jsonschema<=4.17.3.
This constraint is a bit annoying, because it creates conflicts with libraries that expect higher versions of jsonschema (for instance, jupyter). Do you think we could bump Paralex dependency to frictionless 5.17 ? It doesn't seem to introduce any critical change, but it has better support for more recent jsonschema versions.

@JPapir
Copy link
Collaborator Author

JPapir commented Sep 23, 2024

So as you can see, I still find some inconsistencies and problems, so that I will test additional edge cases and confirm when it seems better. Unfortunately, I can't turn the PR into a draft PR, but this would be the idea.

Returning results in the same structure format
Fixed issues with different col_names
@JPapir JPapir force-pushed the feature-frequencies branch from 520d071 to 1e6c39e Compare September 23, 2024 12:46
@JPapir
Copy link
Collaborator Author

JPapir commented Sep 23, 2024

Things are better now !

@JPapir JPapir force-pushed the feature-frequencies branch from d66be3f to 1e6c39e Compare September 23, 2024 18:59
@XachaB
Copy link
Owner

XachaB commented Dec 17, 2024

Hum, just noticed that I used the paralex.read_table() and forgot to add Paralex to the setup.cfg

But this raises another question: Paralex expects frictionless=5.16, which in turn expects jsonschema<=4.17.3. This constraint is a bit annoying, because it creates conflicts with libraries that expect higher versions of jsonschema (for instance, jupyter). Do you think we could bump Paralex dependency to frictionless 5.17 ? It doesn't seem to introduce any critical change, but it has better support for more recent jsonschema versions.

Are these frictionless issues solved with paralex ?

@XachaB
Copy link
Owner

XachaB commented Dec 17, 2024

Oh, I see, this Frequencies class behaves like my ugly singleton static Inventory class. Was this necessary ? Could it be a more simple instance ?

My problem with Inventory was that I had many different classes which all needed to depend on a single and identical definition of the sounds inventory, and I did not want to spend my life always passing around an object so that everything relied on the same phonology. But I don't see why we couldn't have a frequency object and pass it where needed ?

Eg: We could manipulate it like so:

p = fl.Package('tests/data/TestPackage/test.package.json')
freqs = Frequencies(p)
print(freqs.info().to_markdown())

@JPapir
Copy link
Collaborator Author

JPapir commented Dec 18, 2024

Are these frictionless issues solved with paralex ?

I think yes, (at some point Paralex was updated to latest frictionless I believe). I will check

@JPapir
Copy link
Collaborator Author

JPapir commented Dec 18, 2024

Oh, I see, this Frequencies class behaves like my ugly singleton static Inventory class. Was this necessary ? Could it be a more simple instance ?

Initially it was a standard class, and then I felt like 'oh, he had a really nice idea with this static Segment inventory', let's replicate that. But I think that you are right, using objects is a better solution because in some cases, we might need two Frequency instances (eg two datasets, bipartite, etc). I will change that.

@XachaB
Copy link
Owner

XachaB commented Dec 18, 2024

Lol, ok, I take responsibility for your aesthetic sense being skewed towards my ugly solutions. In this case I do think normal Frequency instances would be simpler :)

@JPapir
Copy link
Collaborator Author

JPapir commented Dec 18, 2024

I'm leaving for the lab, once I'm there I will fix all those points with the frequencies.

@JPapir
Copy link
Collaborator Author

JPapir commented Dec 18, 2024

Are these frictionless issues solved with paralex ?

I think yes, (at some point Paralex was updated to latest frictionless I believe). I will check

I confirm.

@XachaB
Copy link
Owner

XachaB commented Dec 18, 2024

Current pytest does not run doctests. When running doctests, it turns out a few tests elsewhere don't pass: I'm cleaning that up in this MR and will commit.

@JPapir
Copy link
Collaborator Author

JPapir commented Dec 18, 2024

Current pytest does not run doctests. When running doctests, it turns out a few tests elsewhere don't pass: I'm cleaning that up in this MR and will commit.

We could also turn to proper pytest. My docstring examples are meant to roughly test the behaviour, but this is not a proper testing strategy.

@XachaB
Copy link
Owner

XachaB commented Dec 18, 2024

For some things, real unit tests are important. For ensuring basic functionality, doctests are nice because they're also documentation and they're easy to adjust when we change the implementation. So I think both have a role !

@XachaB
Copy link
Owner

XachaB commented Dec 18, 2024

I have broken things further, right now I can't see why it doesn't work when it does work locally. I'll leave it a bit and come back to it, perhaps with fresher eyes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a Frequencies class
2 participants