Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add at_reg_id_event #642

Merged
merged 22 commits into from
Mar 1, 2023
Merged
Show file tree
Hide file tree
Changes from 14 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
43 changes: 41 additions & 2 deletions climada/engine/impact.py
Original file line number Diff line number Diff line change
Expand Up @@ -191,8 +191,6 @@ def __init__(self,
else:
self.imp_mat = sparse.csr_matrix(np.empty((0, 0)))



def calc(self, exposures, impact_funcs, hazard, save_mat=False, assign_centroids=True):
"""This function is deprecated, use ``ImpactCalc.impact`` instead.
"""
Expand Down Expand Up @@ -390,6 +388,47 @@ def impact_per_year(self, all_years=True, year_range=None):
year_set[year] = sum(self.at_event[orig_year == year])
return year_set

def impact_at_reg(self, agg_regions=None):
"""Aggregate impact on given aggregation regions. This method works
only if Impact.imp_mat was stored during the impact calculation.

Parameters
----------
agg_regions : np.array, list, pd.Series (optional)
It has a lenght equal to the number of coordinates points in exposure
and it reports what macro-regions these points belong to. For example,
peanutfun marked this conversation as resolved.
Show resolved Hide resolved
asuming there are three centroids and agg_regions = ['A', 'A', 'B']
then impact of the first and second centroids will be assigned to
region A, whereas impact from the second centroid will be assigned
peanutfun marked this conversation as resolved.
Show resolved Hide resolved
to area B. If no aggregation regions are passed, the method aggregates
impact at the country (admin_0) level.
Default is None.

Returns
-------
pd.DataFrame
aleeciu marked this conversation as resolved.
Show resolved Hide resolved
Contains the aggregated data per event.
Rows: Hazard events. Columns: Aggregation regions.
"""
aleeciu marked this conversation as resolved.
Show resolved Hide resolved
if self.imp_mat.nnz == 0:
raise ValueError("The aggregated impact cannot be computed as no"
"Impact.imp_mat was stored during the impact calculation")

if agg_regions is None:
agg_regions = pd.Series(u_coord.get_country_code(self.coord_exp[:,0],
self.coord_exp[:,1]))
peanutfun marked this conversation as resolved.
Show resolved Hide resolved
elif not isinstance(agg_regions, pd.Series):
agg_regions = pd.Series(agg_regions)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make it a pandas series instead of a numpy array?

Copy link
Collaborator Author

@aleeciu aleeciu Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for two reasons:

  • I find it cleaner as you know what impact belongs to what region simply by looking at the dataframe
  • I like the pd.Series.unique() method because it retains the order. Doing the same with numpy is more cumbersome, to my knowledge.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue that this is a misuse of a pandas series.

I find it cleaner as you know what impact belongs to what region simply by looking at the dataframe
That is independent on using numpy arrays or pandas series. But using a pandas series when it is not needed is imho actually less clean.
I like the pd.Series.unique() method because it retains the order. Doing the same with numpy is more cumbersome, to my knowledge.
I would argue that this is suboptimal as an important point is now hidden in the subtle internal working of the pandas.series.unique() vs the np.unique() methods.

@peanutfun : what do you think?

Copy link
Collaborator Author

@aleeciu aleeciu Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • on the first point you are right: having a pd.Series is not strictly needed to generate a pd.DataFrame

  • on the second: I am transforming agg_regions into pd.Series only when the user supplies it as an np.array or list. But the user can supply a pd.Series in the first place. So I don't think it's a misuse, I just wrote the code in a way that it works with a pd.Series.

So I think your doubt is rather: should we use pandas for things that can be done with numpy? Not sure I have a clear answer to that, as this would probably apply to any code that make use of pandas.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point is that pd.series are objects that are made to be used in pandas dataframes, but they are not made to be used a standalone objects. They are essentially numpy arrays, with some more. Thus, instead of making an unclear use of the pd.series, I suggest to use numpy arrays, and to make the ordering requirement clear (as currently this is a hidden feature).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really see how the order is important because the order in the agg_regions parameter is determined by the "order" of the exposure points. Also, the order of columns in the final data frame does not appear relevant to me 😅

I would actually argue that the unordered result of pandas.Series.unique does not need further explanation, whereas I feel we should add a note to the docstring that the unique items will be ordered when using np.unique. So, I think the easiest solution is to just stick to the current implementation.

@chahank: Following our discussion on the "clean" indexing, I just realized that pandas.Series.unique returns a numpy array, so everything is fine from that perspective.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, the order does not count. I am now using only numpy arrays.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue that this is a misuse of a pandas series.

Not necessarily. According to the docs a Series is a "one-dimensional ndarray with axis labels (including time series)". Sounds useful also outside of a DataFrame.
🤷

Copy link
Member

@peanutfun peanutfun Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I very much agree with @emanuel-schmid here 😃

I'm fine with the current implementation. However, please make sure to only call np.unique once, and store the result for later use. It is currently executed twice and will be costly for large arrays. I can also take care of that when merging. Are we ready to go ahead?


at_reg_event = np.hstack([
self.imp_mat[:, np.where(agg_regions == reg)[0]].sum(1)
for reg in agg_regions.unique()
])

at_reg_event = pd.DataFrame(at_reg_event, columns=agg_regions.unique(), index=self.event_id)

return at_reg_event

def calc_impact_year_set(self,all_years=True, year_range=None):
"""This function is deprecated, use Impact.impact_per_year instead."""
LOGGER.warning("The use of Impact.calc_impact_year_set is deprecated."
Expand Down
68 changes: 68 additions & 0 deletions climada/engine/test/test_impact.py
Original file line number Diff line number Diff line change
Expand Up @@ -467,6 +467,73 @@ def test_local_exceedance_imp_pass(self):
self.assertAlmostEqual(np.max(impact_rp), 2916964966.388219, places=5)
self.assertAlmostEqual(np.min(impact_rp), 444457580.131494, places=5)

class TestImpactReg(unittest.TestCase):
"""Test impact aggregation per aggregation region or admin 0"""
def test_impact_at_reg(self):
"""Test calc local impacts per region"""

imp = dummy_impact()

# Aggregate over a single region
region_ids = ['A', 'A']
at_reg_event = imp.impact_at_reg(region_ids)

self.assertEqual(at_reg_event.sum().sum(), imp.at_event.sum())
self.assertEqual(at_reg_event.shape[0], imp.at_event.shape[0])
self.assertEqual(at_reg_event.shape[1], np.unique(region_ids).shape[0])

# Aggregate over two different regions
region_ids = ['A', 'B']

at_reg_event = imp.impact_at_reg(region_ids)

self.assertEqual(at_reg_event['A'].sum(), imp.imp_mat[:,0].sum())
self.assertEqual(at_reg_event['B'].sum(), imp.imp_mat[:,1].sum())

self.assertEqual(at_reg_event.sum().sum(), imp.at_event.sum())
self.assertEqual(at_reg_event.shape[0], imp.at_event.shape[0])
self.assertEqual(at_reg_event.shape[1], np.unique(region_ids).shape[0])

# Let's specify sample cities' coords and countries' code
CHE_num_code = 756
ITA_num_code = 380
zurich_lat, zurich_lon = 47.37, 8.55
bern_lat, bern_lon = 46.94, 7.44
rome_lat, rome_lon = 41.89, 12.51

# Test admin 0 with one country
imp.coord_exp = np.array([[zurich_lat, zurich_lon], [bern_lat, bern_lon]])

at_reg_event = imp.impact_at_reg()

self.assertEqual(len(at_reg_event.columns), 1)
self.assertEqual(at_reg_event.columns[0], CHE_num_code)

self.assertEqual(at_reg_event.shape[0], imp.at_event.shape[0])
self.assertEqual(at_reg_event[CHE_num_code].sum(),
at_reg_event.sum().sum(),
imp.at_event.sum())

# Test admin 0 with two countries
imp.coord_exp = np.array([[rome_lat, rome_lon], [bern_lat, bern_lon]])

at_reg_event = imp.impact_at_reg()

self.assertEqual(len(at_reg_event.columns), 2)
self.assertEqual(at_reg_event.columns[0], ITA_num_code)
self.assertEqual(at_reg_event.columns[1], CHE_num_code)

self.assertEqual(at_reg_event.shape[0], imp.at_event.shape[0])
self.assertEqual(at_reg_event[ITA_num_code].sum(), imp.imp_mat[:,0].sum())
self.assertEqual(at_reg_event[CHE_num_code].sum(), imp.imp_mat[:,1].sum())
self.assertEqual(at_reg_event.sum().sum(), imp.at_event.sum())

# Test error when no imp_mat is stored
imp.imp_mat = sparse.csr_matrix((0, 0))

with self.assertRaises(ValueError):
imp.impact_at_reg()

class TestRiskTrans(unittest.TestCase):
"""Test risk transfer methods"""
def test_risk_trans_pass(self):
Expand Down Expand Up @@ -816,6 +883,7 @@ def test__exp_build_event(self):
TESTS.addTests(unittest.TestLoader().loadTestsFromTestCase(TestImpactPerYear))
TESTS.addTests(unittest.TestLoader().loadTestsFromTestCase(TestIO))
TESTS.addTests(unittest.TestLoader().loadTestsFromTestCase(TestRPmatrix))
TESTS.addTests(unittest.TestLoader().loadTestsFromTestCase(TestImpactReg))
TESTS.addTests(unittest.TestLoader().loadTestsFromTestCase(TestRiskTrans))
TESTS.addTests(unittest.TestLoader().loadTestsFromTestCase(TestSelect))
TESTS.addTests(unittest.TestLoader().loadTestsFromTestCase(TestConvertExp))
Expand Down