Feature frequencies #37

JPapir · 2024-09-22T20:02:52Z

This PR closes #31 and helps for #32 .

It is a reworked version of the frequencies class that I used in the overabundance branch. Features:

One object to create
Reads the tables at the first opening and try to catch all frequency information that are available.
In theory (but it is hard to test everything), it is able to handle multiple sources of frequencies (which can happen in frequency tables)
Tables are accessible as attributes : f.forms, f.cells, f.lexemes
But it is recommended to access them using the two following methods:
- get_absolute_freq()
- get_relative_freq()
  Which are able to do some filtering/grouping/sum/mean/cookies/cheesecakes depending on the arguments.

In particular, the get_relative_freq uses a C implementation to do the sums. This wasn't straightforward, because skipna arguments are missing from the groupby.sum() C implementation in pandas.... So I had to cheat a bit.

This class is maybe a bit overkill. But in theory, it can handle very different kinds of data sources, and recover a lot of information, which makes it quite multipurpose.

There are tests included and it is documented.

Uses

Weighting of cells ([Entropy] Add weighting by the token frequencies of the cell #32)
Weighting of overabundant cellmates

There is no rush to merge it by the way !

JPapir · 2024-09-22T20:06:13Z

I forgot to push the test package and to update the links in the docstrings 🫤
It will be for tomorrow.

JPapir · 2024-09-23T10:40:11Z

Hum, just noticed that I used the paralex.read_table() and forgot to add Paralex to the setup.cfg

But this raises another question: Paralex expects frictionless=5.16, which in turn expects jsonschema<=4.17.3.
This constraint is a bit annoying, because it creates conflicts with libraries that expect higher versions of jsonschema (for instance, jupyter). Do you think we could bump Paralex dependency to frictionless 5.17 ? It doesn't seem to introduce any critical change, but it has better support for more recent jsonschema versions.

JPapir · 2024-09-23T11:13:11Z

So as you can see, I still find some inconsistencies and problems, so that I will test additional edge cases and confirm when it seems better. Unfortunately, I can't turn the PR into a draft PR, but this would be the idea.

Returning results in the same structure format Fixed issues with different col_names

JPapir · 2024-09-23T12:53:55Z

Things are better now !

For compatibility with filtering

XachaB · 2024-12-17T21:28:17Z

Hum, just noticed that I used the paralex.read_table() and forgot to add Paralex to the setup.cfg

But this raises another question: Paralex expects frictionless=5.16, which in turn expects jsonschema<=4.17.3. This constraint is a bit annoying, because it creates conflicts with libraries that expect higher versions of jsonschema (for instance, jupyter). Do you think we could bump Paralex dependency to frictionless 5.17 ? It doesn't seem to introduce any critical change, but it has better support for more recent jsonschema versions.

Are these frictionless issues solved with paralex ?

XachaB · 2024-12-17T21:35:26Z

Oh, I see, this Frequencies class behaves like my ugly singleton static Inventory class. Was this necessary ? Could it be a more simple instance ?

My problem with Inventory was that I had many different classes which all needed to depend on a single and identical definition of the sounds inventory, and I did not want to spend my life always passing around an object so that everything relied on the same phonology. But I don't see why we couldn't have a frequency object and pass it where needed ?

Eg: We could manipulate it like so:

p = fl.Package('tests/data/TestPackage/test.package.json')
freqs = Frequencies(p)
print(freqs.info().to_markdown())

src/qumin/representations/frequencies.py

JPapir · 2024-12-18T08:31:38Z

Are these frictionless issues solved with paralex ?

I think yes, (at some point Paralex was updated to latest frictionless I believe). I will check

JPapir · 2024-12-18T08:33:24Z

Oh, I see, this Frequencies class behaves like my ugly singleton static Inventory class. Was this necessary ? Could it be a more simple instance ?

Initially it was a standard class, and then I felt like 'oh, he had a really nice idea with this static Segment inventory', let's replicate that. But I think that you are right, using objects is a better solution because in some cases, we might need two Frequency instances (eg two datasets, bipartite, etc). I will change that.

XachaB · 2024-12-18T08:35:37Z

Lol, ok, I take responsibility for your aesthetic sense being skewed towards my ugly solutions. In this case I do think normal Frequency instances would be simpler :)

JPapir · 2024-12-18T08:38:36Z

I'm leaving for the lab, once I'm there I will fix all those points with the frequencies.

JPapir · 2024-12-18T10:42:36Z

Are these frictionless issues solved with paralex ?

I think yes, (at some point Paralex was updated to latest frictionless I believe). I will check

I confirm.

XachaB · 2024-12-18T11:18:14Z

Current pytest does not run doctests. When running doctests, it turns out a few tests elsewhere don't pass: I'm cleaning that up in this MR and will commit.

JPapir · 2024-12-18T11:20:03Z

Current pytest does not run doctests. When running doctests, it turns out a few tests elsewhere don't pass: I'm cleaning that up in this MR and will commit.

We could also turn to proper pytest. My docstring examples are meant to roughly test the behaviour, but this is not a proper testing strategy.

…ature-frequencies

XachaB · 2024-12-18T12:14:13Z

For some things, real unit tests are important. For ensuring basic functionality, doctests are nice because they're also documentation and they're easy to adjust when we change the implementation. So I think both have a role !

XachaB · 2024-12-18T12:27:03Z

I have broken things further, right now I can't see why it doesn't work when it does work locally. I'll leave it a bit and come back to it, perhaps with fresher eyes.

add frequency management

4821e8e

JPapir changed the base branch from master to dev September 22, 2024 20:03

JPapir added this to the introduce overabundance & frequencies milestone Sep 22, 2024

JPapir added the enhancement New feature or request label Sep 22, 2024

JPapir added 2 commits September 22, 2024 23:54

add a small test package

0118161

make the Frequencies class static

a47f27f

add paralex to dependencies

9cd9d35

JPapir added 2 commits September 23, 2024 14:14

better handling of missing frequency columns

c79905d

refactor get_relative/absolute_freq for consistency

1e6c39e

Returning results in the same structure format Fixed issues with different col_names

JPapir force-pushed the feature-frequencies branch from 520d071 to 1e6c39e Compare September 23, 2024 12:46

JPapir force-pushed the feature-frequencies branch from d66be3f to 1e6c39e Compare September 23, 2024 18:59

JPapir added 3 commits September 23, 2024 22:00

normalized index names, new info function

c081b4f

For compatibility with filtering

constraint on Paralex version

b094c1e

simpler attribute name

f887291

JPapir force-pushed the feature-frequencies branch from c8c15d0 to f887291 Compare November 11, 2024 10:06

This was referenced Nov 11, 2024

Add weighting of the results by the cell frequencies #42

Draft

Long format patterns #44

Draft