
slow performance when assigning to empty DataFrame in Location.get_airmass and ModelChain.prepare_inputs #502

Closed
@wholmgren

Creating an empty DataFrame (no data, no index) and then filling it key by key can cause performance issues in some situations. I believe the issue is due to the way that pandas computes joins on the index. This happens in two places in pvlib: Location.get_airmass and ModelChain.prepare_inputs (found with grep -r 'pd.DataFrame()' pvlib).

Location.get_airmass: this is most relevant with shorter input lengths, especially if solar_position is not supplied. I discovered this bottleneck when profiling a loop that called ModelChain.run_model on daily weather data.

ModelChain.prepare_inputs: this is only an issue if the user does not supply any weather data, in which case clear sky calculations will be run and the results assigned to the empty DataFrame. It is less likely that anyone is running into a significant performance issue here, because the additional calculations, including a Linke turbidity lookup, likely dominate the run time.
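The prepare_inputs situation can be sketched roughly as follows. This is only an illustration, not pvlib's implementation: the helper name fill_missing_irradiance, the column names, and the constant clear sky values are assumptions made for the example. The point is that the index is fixed before any column is assigned.

```python
import numpy as np
import pandas as pd


def fill_missing_irradiance(times, clearsky, weather=None):
    """Hypothetical helper (not part of pvlib): fill absent irradiance
    columns from precomputed clear sky data."""
    if weather is None:
        # Fix the index up front instead of starting from pd.DataFrame().
        weather = pd.DataFrame(index=times)
    else:
        # Work on a copy so the caller's DataFrame is not modified.
        weather = weather.copy()
    for column in ('ghi', 'dni', 'dhi'):
        if column not in weather.columns:
            weather[column] = clearsky[column]
    return weather


# Synthetic inputs: one day of 1-minute timestamps and made-up constant values.
times = pd.date_range('2018-06-01', periods=1440, freq='1min', tz='UTC')
clearsky = pd.DataFrame(
    {'ghi': np.full(len(times), 500.0),
     'dni': np.full(len(times), 700.0),
     'dhi': np.full(len(times), 100.0)},
    index=times)

# No weather supplied: every column comes from the clear sky data.
weather = fill_missing_irradiance(times, clearsky)

# Partial weather supplied: only the missing columns are filled.
measured = pd.DataFrame({'ghi': np.zeros(len(times))}, index=times)
partial = fill_missing_irradiance(times, clearsky, weather=measured)
```

Because columns are added conditionally, one at a time, preallocating the index fits this pattern more naturally than building the whole DataFrame from a single dict.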

Here's the key part of Location.get_airmass using an input of 1440 times, followed by two alternative implementations:

%%timeit
airmass_relative = pvlib.atmosphere.relativeairmass(solar_position['zenith'])
pressure = pvlib.atmosphere.alt2pres(altitude)
airmass_absolute = pvlib.atmosphere.absoluteairmass(airmass_relative, pressure)

airmass = pd.DataFrame()
airmass['airmass_relative'] = airmass_relative
airmass['airmass_absolute'] = airmass_absolute
23.9 ms ± 780 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Alternative 1:

%%timeit
airmass_relative = pvlib.atmosphere.relativeairmass(solar_position['zenith'])
pressure = pvlib.atmosphere.alt2pres(altitude)
airmass_absolute = pvlib.atmosphere.absoluteairmass(airmass_relative, pressure)

airmass = pd.DataFrame(index=solar_position.index)
airmass['airmass_relative'] = airmass_relative
airmass['airmass_absolute'] = airmass_absolute
1.69 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Alternative 2:

%%timeit
airmass_relative = pvlib.atmosphere.relativeairmass(solar_position['zenith'])
pressure = pvlib.atmosphere.alt2pres(altitude)
airmass_absolute = pvlib.atmosphere.absoluteairmass(airmass_relative, pressure)

airmass = pd.DataFrame({'airmass_relative': airmass_relative, 'airmass_absolute': airmass_absolute})
airmass = airmass[['airmass_relative', 'airmass_absolute']]  # adds 0.4 ms, but guarantees same output 
1.43 ms ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Either alternative 1 or 2 could work for Location.get_airmass. Only alternative 1 would easily work for ModelChain.prepare_inputs.
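A possible variant of alternative 2, sketched here under assumptions: passing columns= to the DataFrame constructor selects and orders the dict keys, so the column order is fixed without a separate indexing step. The helper name build_airmass_frame and the synthetic airmass values are hypothetical, and this variant was not timed in this issue.

```python
import numpy as np
import pandas as pd


def build_airmass_frame(airmass_relative, airmass_absolute):
    """Hypothetical helper (not part of pvlib): build the result in one
    constructor call with an explicit column order."""
    data = {'airmass_relative': airmass_relative,
            'airmass_absolute': airmass_absolute}
    # columns= selects and orders the dict keys.
    return pd.DataFrame(
        data, columns=['airmass_relative', 'airmass_absolute'])


# Synthetic stand-ins for the pvlib.atmosphere results; values are arbitrary.
times = pd.date_range('2018-06-01', periods=1440, freq='1min', tz='UTC')
am_rel = pd.Series(np.linspace(1.0, 5.0, len(times)), index=times)
am_abs = am_rel * 0.9  # arbitrary scaling, not a physical pressure correction

airmass = build_airmass_frame(am_rel, am_abs)

# Alternative 1 from the issue, for comparison of the resulting frames.
reference = pd.DataFrame(index=times)
reference['airmass_relative'] = am_rel
reference['airmass_absolute'] = am_abs
```

Both constructions should give the same index, columns, and values, so either could be dropped into get_airmass without changing what callers see.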

Versions:

  • pvlib.__version__: '0.5.2+16.g58f95e0'
  • pandas.__version__: '0.23.1'
  • python: 3.6.5

Approximately reproduced on python 3.5 and pandas 0.17.
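To check other pandas versions without installing pvlib, a minimal sketch along these lines might help. The Series is synthetic and the column names are placeholders. The size of the gap, if any, depends on the pandas version, so no expected timings are given.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for 1440 one-minute zenith values; numbers are arbitrary.
times = pd.date_range('2018-06-01', periods=1440, freq='1min', tz='UTC')
zenith = pd.Series(np.linspace(10.0, 80.0, len(times)), index=times)


def build_from_empty():
    # Pattern from the issue: no data, no index, then assign key by key.
    frame = pd.DataFrame()
    frame['first'] = zenith
    frame['second'] = zenith * 2
    return frame


def build_with_index():
    # Alternative 1: supply the index when the DataFrame is created.
    frame = pd.DataFrame(index=times)
    frame['first'] = zenith
    frame['second'] = zenith * 2
    return frame


repeats = 50
for build in (build_from_empty, build_with_index):
    seconds = timeit.timeit(build, number=repeats)
    print('{}: {:.3f} ms per call'.format(
        build.__name__, 1000 * seconds / repeats))
```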

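line profiler output for Location.get_airmass (columns: line number, hits, time, time per hit, % time, line contents):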

   258         1          2.0      2.0      0.0          if solar_position is None:
   259                                                       solar_position = self.get_solarposition(times)
   260                                           
   261         1          3.0      3.0      0.0          if model in atmosphere.APPARENT_ZENITH_MODELS:
   262         1         44.0     44.0      0.1              zenith = solar_position['apparent_zenith']
   263                                                   elif model in atmosphere.TRUE_ZENITH_MODELS:
   264                                                       zenith = solar_position['zenith']
   265                                                   else:
   266                                                       raise ValueError('{} is not a valid airmass model'.format(model))
   267                                           
   268         1       1466.0   1466.0      3.6          airmass_relative = atmosphere.relativeairmass(zenith, model)
   269                                           
   270         1          9.0      9.0      0.0          pressure = atmosphere.alt2pres(self.altitude)
   271         1          1.0      1.0      0.0          airmass_absolute = atmosphere.absoluteairmass(airmass_relative,
   272         1        639.0    639.0      1.6                                                        pressure)
   273                                           
   274         1        638.0    638.0      1.6          airmass = pd.DataFrame()
   275         1      36061.0  36061.0     88.2          airmass['airmass_relative'] = airmass_relative
   276         1       2019.0   2019.0      4.9          airmass['airmass_absolute'] = airmass_absolute
   277                                           
   278         1          0.0      0.0      0.0          return airmass
