Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add at_reg_id_event #642
add at_reg_id_event #642
Changes from 14 commits
7b01fed
f0096d7
a93474f
09254d1
4236025
af28fef
cff3318
f96ae47
bdcf0d4
c99ce83
8cc4769
17d6d5e
f4ea128
afb5f3b
f262bb1
5fcf72b
6cce7ce
cf9c79c
c84deb0
1188c45
4a9b991
57053be
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why make it a pandas series instead of a numpy array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for two reasons:
pd.Series.unique()
method because it retains the order. Doing the same withnumpy
is more cumbersome, to my knowledge.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would argue that this is a misuse of a pandas series.
I find it cleaner as you know what impact belongs to what region simply by looking at the dataframe
That is independent on using numpy arrays or pandas series. But using a pandas series when it is not needed is imho actually less clean.
I like the pd.Series.unique() method because it retains the order. Doing the same with numpy is more cumbersome, to my knowledge.
I would argue that this is suboptimal as an important point is now hidden in the subtle internal working of the
pandas.series.unique()
vs thenp.unique()
methods.@peanutfun : what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on the first point you are right: having a
pd.Series
is not strictly needed to generate apd.DataFrame
on the second: I am transforming
agg_regions
intopd.Series
only when the user supplies it as annp.array
orlist
. But the user can supply apd.Series
in the first place. So I don't think it's a misuse, I just wrote the code in a way that it works with apd.Series
.So I think your doubt is rather: should we use
pandas
for things that can be done withnumpy
? Not sure I have a clear answer to that, as this would probably apply to any code that make use ofpandas
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point is that
pd.series
are objects that are made to be used in pandas dataframes, but they are not made to be used a standalone objects. They are essentially numpy arrays, with some more. Thus, instead of making an unclear use of thepd.series
, I suggest to use numpy arrays, and to make the ordering requirement clear (as currently this is a hidden feature).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really see how the order is important because the order in the
agg_regions
parameter is determined by the "order" of the exposure points. Also, the order of columns in the final data frame does not appear relevant to me 😅I would actually argue that the unordered result of
pandas.Series.unique
does not need further explanation, whereas I feel we should add a note to the docstring that the unique items will be ordered when usingnp.unique
. So, I think the easiest solution is to just stick to the current implementation.@chahank: Following our discussion on the "clean" indexing, I just realized that
pandas.Series.unique
returns a numpy array, so everything is fine from that perspective.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, the order does not count. I am now using only
numpy
arrays.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessarily. According to the docs a
Series
is a "one-dimensionalndarray
with axis labels (including time series)". Sounds useful also outside of aDataFrame
.🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I very much agree with @emanuel-schmid here 😃
I'm fine with the current implementation. However, please make sure to only call
np.unique
once, and store the result for later use. It is currently executed twice and will be costly for large arrays. I can also take care of that when merging. Are we ready to go ahead?