-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add core population #1279
Conversation
0da8886
to
0e10d12
Compare
1d53eeb
to
20887aa
Compare
entity: t.CoreEntity | ||
|
||
#: ??? | ||
ids: Sequence[str] = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@benjello What are the ids here? household
, or Juan
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be the ids of the different households o the different people for the atomic/person entity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So a string, like household_1, person_2, and so on. Attributes that only exist in the context of a simulation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say so. Bit O am not the person who refactored that part. @Morendil is your manI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense, in a way, although a bit confusing, that there is no data class to represent an invidivual member or entry of a population, as these are finally just the value at index n on the corresponding variable vectors.
So, say we have tax_1
and tax_2
, on a SinglePopulation
with count = 3
of entity Person
, behind the scenes, we should have two Holder
for a given period:
holder_tax_1 = numpy.array([100, 200, 300])
holder_tax_2 = numpy.array([300, 200, 100])
ids = ["Mauko", "Mahdi", "Sandra"]
How much of tax_1
have we got for "Mauko"?
tax_1_mauko = holder_tax_1[[ids.index("Mauko")]] # [100]
Now we group "Mauko" and "Mahdi" in a GroupEntity
, Foundation
.
How much of tax_1
have we got for "OpenFisca Foundation"? Fancy-indexing:
tax_1_foundation = sum(holder_tax_1[[ids.index("Mauko"), ids.index("Mahdi")]]) # [300]
So, this attribute id
of each entity entry in a simulation, is sort of ephemeral
. There is no class modeling it, and they exist internally only as reference records to get values out of holders.
Is that how it works, @Morendil ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@benjello Why can't we model a population as a structured array or a dataframe? I mean, it looks like it is more or less the way it works now. As long as you have such indices, then you can link columns the way you like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that pandas
and polars
are great for this kind of use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A challenge there are the periods. It would be like a huge 3D sparse matrix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A dataframe per period would be ok.
Or adding the period to the index: every value would be indentified by an index (period, id) for a column variable in the corresponding entity dataframe.
I am concerned about performance, but we may experiment with different backend.
Historically, we stuck to numpy
because it was the more stable package at the time. Pandas
was in its infancy. Polars
didn't exist yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An example:
import numpy as np
dtype = [
("people", "U10"),
("variables", [("tax", np.int32), ("pension", bool), ("benefit", np.uint8)]),
]
array = np.array(
[("juan", (100, True, 1)), ("jean", (200, False, 2)), ("john", (300, True, 3))],
dtype=dtype,
)
print(array["people"])
# ['juan' 'jean' 'john']
print(array["variables"])
# [(100, True, 1) (200, False, 2) (300, True, 3)]
print(array["variables"]["tax"])
# [100 200 300]
print(array[array['people'] == 'juan']['variables']['tax'])
# [100]
# operation only in sub-vector
array["variables"]["tax"] *= 2
# it didn't copy data
assert array["variables"]["tax"].base is not None
20887aa
to
6b12b80
Compare
Depends on #1274
Depended upon by #1238
New features
populations.CorePopulation