Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add at_reg_id_event #642

Merged
merged 22 commits into from
Mar 1, 2023
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
40 changes: 38 additions & 2 deletions climada/engine/impact.py
Original file line number Diff line number Diff line change
Expand Up @@ -191,8 +191,6 @@ def __init__(self,
else:
self.imp_mat = sparse.csr_matrix(np.empty((0, 0)))



def calc(self, exposures, impact_funcs, hazard, save_mat=False, assign_centroids=True):
"""This function is deprecated, use ``ImpactCalc.impact`` instead.
"""
Expand Down Expand Up @@ -390,6 +388,44 @@ def impact_per_year(self, all_years=True, year_range=None):
year_set[year] = sum(self.at_event[orig_year == year])
return year_set

def impact_at_reg(self, agg_regions=None):
"""Aggregate impact on given aggregation regions. This method works
only if Impact.imp_mat was stored during the impact calculation.

Parameters
----------
agg_regions : np.array, list, pd.Series
aleeciu marked this conversation as resolved.
Show resolved Hide resolved
It has the same lenght as the number of coordinates points in exposure
and reports what macro-regions these points belong to. If no aggregation
regions are passed, the method aggregates impact at the admin_0 level.
aleeciu marked this conversation as resolved.
Show resolved Hide resolved
Default is None.

Returns
-------
pd.DataFrame
aleeciu marked this conversation as resolved.
Show resolved Hide resolved
"""
aleeciu marked this conversation as resolved.
Show resolved Hide resolved
if self.imp_mat.nnz == 0:
LOGGER.warning("No Impact.imp_mat was stored during the impact calculation,"
"thus the impact per region cannot be computed")
return None
aleeciu marked this conversation as resolved.
Show resolved Hide resolved

if agg_regions is None:
LOGGER.warning("Aggregation regions were not specified,"
"impact is aggregated at the admin_0 level")
aleeciu marked this conversation as resolved.
Show resolved Hide resolved
agg_regions = pd.Series(u_coord.get_country_code(self.coord_exp[:,0],
self.coord_exp[:,1]))
peanutfun marked this conversation as resolved.
Show resolved Hide resolved
elif not isinstance(agg_regions, pd.Series):
agg_regions = pd.Series(agg_regions)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make it a pandas series instead of a numpy array?

Copy link
Collaborator Author

@aleeciu aleeciu Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for two reasons:

  • I find it cleaner as you know what impact belongs to what region simply by looking at the dataframe
  • I like the pd.Series.unique() method because it retains the order. Doing the same with numpy is more cumbersome, to my knowledge.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue that this is a misuse of a pandas series.

I find it cleaner as you know what impact belongs to what region simply by looking at the dataframe
That is independent on using numpy arrays or pandas series. But using a pandas series when it is not needed is imho actually less clean.
I like the pd.Series.unique() method because it retains the order. Doing the same with numpy is more cumbersome, to my knowledge.
I would argue that this is suboptimal as an important point is now hidden in the subtle internal working of the pandas.series.unique() vs the np.unique() methods.

@peanutfun : what do you think?

Copy link
Collaborator Author

@aleeciu aleeciu Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • on the first point you are right: having a pd.Series is not strictly needed to generate a pd.DataFrame

  • on the second: I am transforming agg_regions into pd.Series only when the user supplies it as an np.array or list. But the user can supply a pd.Series in the first place. So I don't think it's a misuse, I just wrote the code in a way that it works with a pd.Series.

So I think your doubt is rather: should we use pandas for things that can be done with numpy? Not sure I have a clear answer to that, as this would probably apply to any code that make use of pandas.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point is that pd.series are objects that are made to be used in pandas dataframes, but they are not made to be used a standalone objects. They are essentially numpy arrays, with some more. Thus, instead of making an unclear use of the pd.series, I suggest to use numpy arrays, and to make the ordering requirement clear (as currently this is a hidden feature).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really see how the order is important because the order in the agg_regions parameter is determined by the "order" of the exposure points. Also, the order of columns in the final data frame does not appear relevant to me 😅

I would actually argue that the unordered result of pandas.Series.unique does not need further explanation, whereas I feel we should add a note to the docstring that the unique items will be ordered when using np.unique. So, I think the easiest solution is to just stick to the current implementation.

@chahank: Following our discussion on the "clean" indexing, I just realized that pandas.Series.unique returns a numpy array, so everything is fine from that perspective.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, the order does not count. I am now using only numpy arrays.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue that this is a misuse of a pandas series.

Not necessarily. According to the docs a Series is a "one-dimensional ndarray with axis labels (including time series)". Sounds useful also outside of a DataFrame.
🤷

Copy link
Member

@peanutfun peanutfun Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I very much agree with @emanuel-schmid here 😃

I'm fine with the current implementation. However, please make sure to only call np.unique once, and store the result for later use. It is currently executed twice and will be costly for large arrays. I can also take care of that when merging. Are we ready to go ahead?


at_reg_event = np.hstack([
self.imp_mat[:, np.where(agg_regions == reg)[0]].sum(1)
for reg in agg_regions.unique()
])

at_reg_event = pd.DataFrame(at_reg_event, columns=agg_regions.unique())
aleeciu marked this conversation as resolved.
Show resolved Hide resolved

return at_reg_event

def calc_impact_year_set(self,all_years=True, year_range=None):
"""This function is deprecated, use Impact.impact_per_year instead."""
LOGGER.warning("The use of Impact.calc_impact_year_set is deprecated."
Expand Down
43 changes: 43 additions & 0 deletions climada/engine/test/test_impact.py
Original file line number Diff line number Diff line change
Expand Up @@ -467,6 +467,48 @@ def test_local_exceedance_imp_pass(self):
self.assertAlmostEqual(np.max(impact_rp), 2916964966.388219, places=5)
self.assertAlmostEqual(np.min(impact_rp), 444457580.131494, places=5)

class TestImpactReg(unittest.TestCase):
"""Test impact aggregation per exposure region id or admin0"""
def test_impact_at_reg(self):
"""Test calc local impacts per region"""

# Read default hazard file
hazard = Hazard.from_hdf5(HAZ_TEST_TC)

# Read an exposure
ent = Entity.from_excel(ENT_DEMO_TODAY)
ent.check()

# Calculate impact
impact = ImpactCalc(ent.exposures, ent.impact_funcs, hazard).impact(save_mat=True)
peanutfun marked this conversation as resolved.
Show resolved Hide resolved

# Aggregate impact at the admin 0 level
at_reg_event = impact.impact_at_reg()

self.assertEqual(at_reg_event.sum().sum(), impact.at_event.sum())
self.assertEqual(at_reg_event.shape[0], impact.at_event.shape[0])

# Aggregate impact at user-defined aggregation regions
region_ids = np.hstack([np.repeat(i, 10) for i in range(5)])
at_reg_event = impact.impact_at_reg(region_ids)

self.assertAlmostEqual(at_reg_event.sum().sum(), impact.at_event.sum(), places=2)
self.assertEqual(at_reg_event.shape[0], impact.at_event.shape[0])
self.assertListEqual(at_reg_event.columns.tolist(), np.unique(region_ids).tolist())

self.assertEqual(at_reg_event[0].sum(), 2071193014030.06)
self.assertEqual(at_reg_event[1].sum(), 2163005244090.206)
self.assertEqual(at_reg_event[2].sum(), 2219969197097.4907)
self.assertEqual(at_reg_event[3].sum(), 2203363516026.8047)
self.assertEqual(at_reg_event[4].sum(), 1827112892434.1558)

# Do not save Impact.imp_mat in the impact calculation and does do not aggregate
impact = ImpactCalc(ent.exposures, ent.impact_funcs, hazard).impact(save_mat=False)
at_reg_event = impact.impact_at_reg()

self.assertIsNone(at_reg_event)


class TestRiskTrans(unittest.TestCase):
"""Test risk transfer methods"""
def test_risk_trans_pass(self):
Expand Down Expand Up @@ -816,6 +858,7 @@ def test__exp_build_event(self):
TESTS.addTests(unittest.TestLoader().loadTestsFromTestCase(TestImpactPerYear))
TESTS.addTests(unittest.TestLoader().loadTestsFromTestCase(TestIO))
TESTS.addTests(unittest.TestLoader().loadTestsFromTestCase(TestRPmatrix))
TESTS.addTests(unittest.TestLoader().loadTestsFromTestCase(TestImpactReg))
TESTS.addTests(unittest.TestLoader().loadTestsFromTestCase(TestRiskTrans))
TESTS.addTests(unittest.TestLoader().loadTestsFromTestCase(TestSelect))
TESTS.addTests(unittest.TestLoader().loadTestsFromTestCase(TestConvertExp))
Expand Down