Description
Creating an empty DataFrame (no data, no index) and then filling it key by key can cause performance issues in some situations. I believe the issue is due to the way that pandas computes joins on the index. This happens in two places in pvlib: Location.get_airmass and ModelChain.prepare_inputs (determined with grep -r 'pd.DataFrame()' pvlib).
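The pattern can be reproduced without pvlib. The following is a minimal sketch using synthetic data: the `relative` and `absolute` Series and the two helper functions are hypothetical stand-ins, not pvlib code, and the size of the gap will likely depend on the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-ins for the pvlib outputs: one day at 1-minute resolution.
times = pd.date_range("2018-06-01", periods=1440, freq="min")
rng = np.random.default_rng(0)
relative = pd.Series(rng.uniform(1.0, 10.0, len(times)), index=times)
absolute = relative * 0.95  # arbitrary scaling, for illustration only


def fill_empty_frame():
    """Start from a frame with no index and assign columns one by one."""
    frame = pd.DataFrame()
    frame["airmass_relative"] = relative
    frame["airmass_absolute"] = absolute
    return frame


def fill_indexed_frame():
    """Start from a frame that already carries the target index."""
    frame = pd.DataFrame(index=times)
    frame["airmass_relative"] = relative
    frame["airmass_absolute"] = absolute
    return frame


n = 20
t_empty = timeit.timeit(fill_empty_frame, number=n) / n
t_indexed = timeit.timeit(fill_indexed_frame, number=n) / n
print(f"empty frame:   {t_empty * 1e3:.2f} ms per call")
print(f"indexed frame: {t_indexed * 1e3:.2f} ms per call")
```

Both helpers return the same data, so any difference in timing comes from how the frame is constructed rather than from the values being stored.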
Location.get_airmass: this is most relevant with shorter input lengths, especially if solar_position is not supplied. I discovered this bottleneck when profiling a loop that called ModelChain.run_model on daily weather data.
ModelChain.prepare_inputs: this is only an issue if the user does not supply any weather data, in which case clear sky calculations are run and the results are assigned to the empty DataFrame. It is less likely that anyone is running into a significant performance issue here, because of the additional calculations involved, including a Linke turbidity lookup.
Here's the key part of Location.get_airmass using an input of 1440 times, followed by two alternative implementations:
%%timeit
airmass_relative = pvlib.atmosphere.relativeairmass(solar_position['zenith'])
pressure = pvlib.atmosphere.alt2pres(altitude)
airmass_absolute = pvlib.atmosphere.absoluteairmass(airmass_relative, pressure)
airmass = pd.DataFrame()
airmass['airmass_relative'] = airmass_relative
airmass['airmass_absolute'] = airmass_absolute
23.9 ms ± 780 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Alternative 1:
%%timeit
airmass_relative = pvlib.atmosphere.relativeairmass(solar_position['zenith'])
pressure = pvlib.atmosphere.alt2pres(altitude)
airmass_absolute = pvlib.atmosphere.absoluteairmass(airmass_relative, pressure)
airmass = pd.DataFrame(index=solar_position.index)
airmass['airmass_relative'] = airmass_relative
airmass['airmass_absolute'] = airmass_absolute
1.69 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Alternative 2:
%%timeit
airmass_relative = pvlib.atmosphere.relativeairmass(solar_position['zenith'])
pressure = pvlib.atmosphere.alt2pres(altitude)
airmass_absolute = pvlib.atmosphere.absoluteairmass(airmass_relative, pressure)
airmass = pd.DataFrame({'airmass_relative': airmass_relative, 'airmass_absolute': airmass_absolute})
airmass = airmass[['airmass_relative', 'airmass_absolute']] # adds 0.4 ms, but guarantees same output
1.43 ms ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Either 1 or 2 could work for Location.get_airmass. Only 1 would easily work for ModelChain.run_model.
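Before swapping implementations, it may be worth confirming that the two alternatives produce the same frame, including column order. A small sketch with synthetic Series (the data and variable names are illustrative assumptions, not pvlib output):

```python
import numpy as np
import pandas as pd

# Tiny synthetic inputs sharing one DatetimeIndex.
times = pd.date_range("2018-06-01", periods=5, freq="min")
relative = pd.Series(np.linspace(1.0, 2.0, len(times)), index=times)
absolute = relative * 0.95  # arbitrary scaling, for illustration only

columns = ["airmass_relative", "airmass_absolute"]

# Alternative 1: pre-set the index, then assign key by key.
alt1 = pd.DataFrame(index=times)
alt1["airmass_relative"] = relative
alt1["airmass_absolute"] = absolute

# Alternative 2: build from a dict, then select columns explicitly so the
# order does not depend on how pandas orders dict keys.
alt2 = pd.DataFrame({"airmass_relative": relative,
                     "airmass_absolute": absolute})
alt2 = alt2[columns]

same = alt1.equals(alt2)
print("identical frames:", same)
```

The explicit column selection in Alternative 2 is what pins the column order; without it, the order follows whatever the installed pandas does with dict keys.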
Versions:
- pvlib.__version__: '0.5.2+16.g58f95e0'
- pandas.__version__: '0.23.1'
- python: 3.6.5
Approximately reproduced on python 3.5 and pandas 0.17.
line profiler:
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   258         1          2.0      2.0      0.0          if solar_position is None:
   259                                                       solar_position = self.get_solarposition(times)
   260
   261         1          3.0      3.0      0.0          if model in atmosphere.APPARENT_ZENITH_MODELS:
   262         1         44.0     44.0      0.1              zenith = solar_position['apparent_zenith']
   263                                                   elif model in atmosphere.TRUE_ZENITH_MODELS:
   264                                                       zenith = solar_position['zenith']
   265                                                   else:
   266                                                       raise ValueError('{} is not a valid airmass model'.format(model))
   267
   268         1       1466.0   1466.0      3.6          airmass_relative = atmosphere.relativeairmass(zenith, model)
   269
   270         1          9.0      9.0      0.0          pressure = atmosphere.alt2pres(self.altitude)
   271         1          1.0      1.0      0.0          airmass_absolute = atmosphere.absoluteairmass(airmass_relative,
   272         1        639.0    639.0      1.6                                                        pressure)
   273
   274         1        638.0    638.0      1.6          airmass = pd.DataFrame()
   275         1      36061.0  36061.0     88.2          airmass['airmass_relative'] = airmass_relative
   276         1       2019.0   2019.0      4.9          airmass['airmass_absolute'] = airmass_absolute
   277
   278         1          0.0      0.0      0.0          return airmass