slow performance when assigning to empty DataFrame in Location.get_airmass and ModelChain.prepare_inputs #502

wholmgren · 2018-07-08T02:11:35Z

Creating an empty DataFrame (no data, no index) and then filling it key by key can cause performance issues in some situations. I believe the issue is due to the way that pandas computes joins on the index. This happens in two places in pvlib: Location.get_airmass and ModelChain.prepare_inputs. (determined with grep -r 'pd.DataFrame()' pvlib).

Location.get_airmass: this is most relevant with shorter input lengths, especially if solar_position is not supplied. I discovered this bottleneck when profiling a loop that called ModelChain.run_model on daily weather data.

ModelChain.prepare_inputs: this only an issue there if the user does not supply any weather data, in which case clear sky calculations will be run and the results assigned to the empty DataFrame. Less likely that anyone is running into a significant performance issue here due to the additional calculations, including a linke turbidity lookup.

Here's the key part of Location.get_airmass using an input of 1440 times, followed by two alternative implementations:

%%timeit
airmass_relative = pvlib.atmosphere.relativeairmass(solar_position['zenith'])
pressure = pvlib.atmosphere.alt2pres(altitude)
airmass_absolute = pvlib.atmosphere.absoluteairmass(airmass_relative, pressure)

airmass = pd.DataFrame()
airmass['airmass_relative'] = airmass_relative
airmass['airmass_absolute'] = airmass_absolute
23.9 ms ± 780 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Alternative 1:

%%timeit
airmass_relative = pvlib.atmosphere.relativeairmass(solar_position['zenith'])
pressure = pvlib.atmosphere.alt2pres(altitude)
airmass_absolute = pvlib.atmosphere.absoluteairmass(airmass_relative, pressure)

airmass = pd.DataFrame(index=solar_position.index)
airmass['airmass_relative'] = airmass_relative
airmass['airmass_absolute'] = airmass_absolute
1.69 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Alternative 2:

%%timeit
airmass_relative = pvlib.atmosphere.relativeairmass(solar_position['zenith'])
pressure = pvlib.atmosphere.alt2pres(altitude)
airmass_absolute = pvlib.atmosphere.absoluteairmass(airmass_relative, pressure)

airmass = pd.DataFrame({'airmass_relative': airmass_relative, 'airmass_absolute': airmass_absolute})
airmass = airmass[['airmass_relative', 'airmass_absolute']]  # adds 0.4 ms, but guarantees same output 
1.43 ms ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Either 1 or 2 could work for Location.get_airmass. Only 1 would easily work for ModelChain.run_model.

Versions:

pvlib.__version__: '0.5.2+16.g58f95e0'
pandas.__version__: '0.23.1'
python: 3.6.5

Approximately reproduced on python 3.5 and pandas 0.17.

line profiler:

   258         1          2.0      2.0      0.0          if solar_position is None:
   259                                                       solar_position = self.get_solarposition(times)
   260                                           
   261         1          3.0      3.0      0.0          if model in atmosphere.APPARENT_ZENITH_MODELS:
   262         1         44.0     44.0      0.1              zenith = solar_position['apparent_zenith']
   263                                                   elif model in atmosphere.TRUE_ZENITH_MODELS:
   264                                                       zenith = solar_position['zenith']
   265                                                   else:
   266                                                       raise ValueError('{} is not a valid airmass model'.format(model))
   267                                           
   268         1       1466.0   1466.0      3.6          airmass_relative = atmosphere.relativeairmass(zenith, model)
   269                                           
   270         1          9.0      9.0      0.0          pressure = atmosphere.alt2pres(self.altitude)
   271         1          1.0      1.0      0.0          airmass_absolute = atmosphere.absoluteairmass(airmass_relative,
   272         1        639.0    639.0      1.6                                                        pressure)
   273                                           
   274         1        638.0    638.0      1.6          airmass = pd.DataFrame()
   275         1      36061.0  36061.0     88.2          airmass['airmass_relative'] = airmass_relative
   276         1       2019.0   2019.0      4.9          airmass['airmass_absolute'] = airmass_absolute
   277                                           
   278         1          0.0      0.0      0.0          return airmass

The text was updated successfully, but these errors were encountered:

AndyMender · 2018-07-09T10:50:41Z

Solution #1 uses an explicit index, while solution #2 doesn't. Why? Also, solution #2 selects columns as a reflection in python airmass = airmass[['airmass_relative', 'airmass_absolute']]. Is that mandatory?

I would recommend solution #2. I always use dicts as an intermediate step in DataFrame creation and they work reliably.

wholmgren · 2018-07-09T14:55:06Z

Solution 1 uses an explicit index, while solution 2 doesn't. Why?

The explicit index in solution 1 speeds up the index join for the existing DataFrame due to something about pandas internal workings. No explicit index is necessary for solution 2 because pandas already needs to compute the join between the dict values.

Also, solution 2 selects columns as a reflection in python airmass = airmass[['airmass_relative', 'airmass_absolute']]. Is that mandatory?

The assignment pattern in the existing code guarantees order. This line in solution 2 is necessary to guarantee that the column order is the same on all python/pandas version combinations. I hesitate to say it's mandatory, but we try to keep order consistent.

cwhanse · 2018-07-09T18:54:18Z

I prefer #2 for style reasons.

adriesse · 2018-07-09T21:54:43Z

I prefer #1 for style reasons.

wholmgren added easy help wanted good first issue performance labels Jul 8, 2018

wholmgren added this to the 0.6.0 milestone Jul 8, 2018

wholmgren changed the title ~~slow performance when creating empty DataFrame in Location.get_airmass~~ slow performance when creating empty DataFrame in Location.get_airmass and ModelChain.prepare_inputs Jul 8, 2018

wholmgren changed the title ~~slow performance when creating empty DataFrame in Location.get_airmass and ModelChain.prepare_inputs~~ slow performance when assigning to empty DataFrame in Location.get_airmass and ModelChain.prepare_inputs Jul 8, 2018

wholmgren mentioned this issue Jul 27, 2018

Improve performance of solarposition.ephemeris #512

Closed

wholmgren mentioned this issue Aug 13, 2018

improve speed of Location.get_airmass #528

Merged

8 tasks

wholmgren closed this as completed in #528 Aug 14, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

slow performance when assigning to empty DataFrame in Location.get_airmass and ModelChain.prepare_inputs #502

slow performance when assigning to empty DataFrame in Location.get_airmass and ModelChain.prepare_inputs #502

wholmgren commented Jul 8, 2018

AndyMender commented Jul 9, 2018

wholmgren commented Jul 9, 2018

cwhanse commented Jul 9, 2018

adriesse commented Jul 9, 2018

slow performance when assigning to empty DataFrame in Location.get_airmass and ModelChain.prepare_inputs #502

slow performance when assigning to empty DataFrame in Location.get_airmass and ModelChain.prepare_inputs #502

Comments

wholmgren commented Jul 8, 2018

AndyMender commented Jul 9, 2018

wholmgren commented Jul 9, 2018

cwhanse commented Jul 9, 2018

adriesse commented Jul 9, 2018