Feature frequencies #37
base: dev
I forgot to push the test package and to update the links in the docstrings 🫤
Hum, just noticed that I used the But this raises another question: Paralex expects frictionless=5.16, which in turn expects jsonschema<=4.17.3.
As you can see, I am still finding some inconsistencies and problems, so I will test additional edge cases and confirm once things look better. Unfortunately, I can't turn the PR into a draft PR, but that would be the idea.
Returning results in the same structure format
Fixed issues with different col_names
Things are better now!
For compatibility with filtering
Are these frictionless issues solved with Paralex?
Oh, I see, this Frequencies class behaves like my ugly singleton static Inventory class. Was this necessary? Could it be a simpler instance? My problem with Inventory was that I had many different classes which all needed to depend on a single, identical definition of the sound inventory, and I did not want to spend my life passing an object around so that everything relied on the same phonology. But I don't see why we couldn't have a frequency object and pass it where needed. Eg: We could manipulate it like so:
I think so (at some point Paralex was updated to the latest frictionless, I believe). I will check.
Initially it was a standard class, and then I felt like 'oh, he had a really nice idea with this static Segment inventory, let's replicate that'. But I think that you are right: using objects is a better solution, because in some cases we might need two Frequency instances (e.g. two datasets, bipartite, etc.). I will change that.
Lol, ok, I take responsibility for your aesthetic sense being skewed towards my ugly solutions. In this case I do think normal Frequency instances would be simpler :)
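To make the instance-based design discussed above concrete, here is a minimal sketch. It is not the code from this PR: `ToyFrequencies`, its dict-based storage, the one-argument `get_absolute_freq(form)` signature and the `rank_forms` helper are all hypothetical, chosen only to show two independent frequency objects being passed where needed.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFrequencies:
    """Hypothetical stand-in for the Frequencies class (not the PR's API)."""

    # Assumed storage: a plain mapping from form to absolute frequency.
    counts: dict = field(default_factory=dict)

    def get_absolute_freq(self, form):
        # Unknown forms are given a frequency of 0 in this sketch.
        return self.counts.get(form, 0)


def rank_forms(freqs, forms):
    """Sort forms by decreasing frequency, using whichever instance is passed."""
    return sorted(forms, key=freqs.get_absolute_freq, reverse=True)


# Two datasets, two independent instances: no shared static state.
dataset_a = ToyFrequencies({"x": 10, "y": 2})
dataset_b = ToyFrequencies({"x": 1, "y": 7})

ranking_a = rank_forms(dataset_a, ["x", "y"])
ranking_b = rank_forms(dataset_b, ["x", "y"])
```

With a static singleton, `ranking_a` and `ranking_b` could not be computed side by side without reloading the data; with plain instances each call simply receives the object it should rely on.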
I'm leaving for the lab; once I'm there I will fix all those points with the frequencies.
I confirm.
The current pytest setup does not run doctests. When running the doctests, it turns out a few tests elsewhere don't pass: I'm cleaning that up in this MR and will commit.
We could also turn to proper pytest. My docstring examples are meant to roughly test the behaviour, but this is not a proper testing strategy.
For some things, real unit tests are important. For ensuring basic functionality, doctests are nice because they're also documentation and they're easy to adjust when we change the implementation. So I think both have a role!
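As a rough illustration of how the two approaches can coexist, the sketch below keeps a docstring example as documentation and runs it with the standard `doctest` module. The `relative_freq` function and the `run_doctest` helper are hypothetical and not part of this PR. Separately, pytest can collect docstring examples itself when invoked with its `--doctest-modules` option.

```python
import doctest


def relative_freq(counts):
    """Return each count divided by the total (hypothetical example function).

    >>> relative_freq({"a": 1, "b": 3})
    {'a': 0.25, 'b': 0.75}
    """
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def run_doctest(func):
    """Run the examples found in func's docstring and return the summary."""
    parser = doctest.DocTestParser()
    # Build a DocTest from the docstring; the function itself is the only global.
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    runner.run(test)
    return runner.summarize(verbose=False)


results = run_doctest(relative_freq)
```

A dedicated unit test for the same function would then cover the edge cases that do not belong in the documentation.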
I have broken things further. Right now I can't see why it doesn't work when it does work locally. I'll leave it a bit and come back to it, perhaps with fresher eyes.
This PR closes #31 and helps with #32.
It is a reworked version of the frequencies class that I used in the overabundance branch. Features:
get_absolute_freq()
get_relative_freq()
Both are able to do some filtering/grouping/sum/mean/cookies/cheesecakes depending on the arguments.
In particular, get_relative_freq uses a C implementation to do the sums. This wasn't straightforward, because the skipna argument is missing from the groupby.sum() C implementation in pandas... So I had to cheat a bit.
This class is maybe a bit overkill. But in theory, it can handle very different kinds of data sources, and recover a lot of information, which makes it quite multipurpose.
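For readers unfamiliar with the skipna limitation mentioned above, here is one possible workaround, sketched under assumptions. It is not necessarily the "cheat" used in this PR, and `grouped_sum_strict` is a hypothetical name. The idea is to keep the fast built-in grouped sum, which skips missing values, and then blank out every group that contained at least one NaN.

```python
import numpy as np
import pandas as pd


def grouped_sum_strict(values, keys):
    """Grouped sum that is NaN for any group containing a missing value."""
    # Built-in grouped sum: fast, but NaN values are skipped.
    sums = values.groupby(keys).sum()
    # Flag the groups in which at least one value is missing.
    has_nan = values.isna().groupby(keys).any()
    # Keep the sum only where the group had no missing value.
    return sums.where(~has_nan)


values = pd.Series([1.0, 2.0, np.nan, 4.0])
keys = pd.Series(["a", "a", "b", "b"])
result = grouped_sum_strict(values, keys)
```

Here group `a` keeps its sum of 3.0, while group `b` becomes NaN instead of 4.0, which mimics what a `skipna=False` option would do.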
There are tests included and it is documented.
Uses
There is no rush to merge it, by the way!