Skip to content

slow performance when assigning to empty DataFrame in Location.get_airmass and ModelChain.prepare_inputs #502

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wholmgren opened this issue Jul 8, 2018 · 4 comments

Comments

@wholmgren
Copy link
Member

Creating an empty DataFrame (no data, no index) and then filling it key by key can cause performance issues in some situations. I believe the issue is due to the way that pandas computes joins on the index. This happens in two places in pvlib: Location.get_airmass and ModelChain.prepare_inputs. (determined with grep -r 'pd.DataFrame()' pvlib).

Location.get_airmass: this is most relevant with shorter input lengths, especially if solar_position is not supplied. I discovered this bottleneck when profiling a loop that called ModelChain.run_model on daily weather data.

ModelChain.prepare_inputs: this only an issue there if the user does not supply any weather data, in which case clear sky calculations will be run and the results assigned to the empty DataFrame. Less likely that anyone is running into a significant performance issue here due to the additional calculations, including a linke turbidity lookup.

Here's the key part of Location.get_airmass using an input of 1440 times, followed by two alternative implementations:

%%timeit
airmass_relative = pvlib.atmosphere.relativeairmass(solar_position['zenith'])
pressure = pvlib.atmosphere.alt2pres(altitude)
airmass_absolute = pvlib.atmosphere.absoluteairmass(airmass_relative, pressure)

airmass = pd.DataFrame()
airmass['airmass_relative'] = airmass_relative
airmass['airmass_absolute'] = airmass_absolute
23.9 ms ± 780 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Alternative 1:

%%timeit
airmass_relative = pvlib.atmosphere.relativeairmass(solar_position['zenith'])
pressure = pvlib.atmosphere.alt2pres(altitude)
airmass_absolute = pvlib.atmosphere.absoluteairmass(airmass_relative, pressure)

airmass = pd.DataFrame(index=solar_position.index)
airmass['airmass_relative'] = airmass_relative
airmass['airmass_absolute'] = airmass_absolute
1.69 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Alternative 2:

%%timeit
airmass_relative = pvlib.atmosphere.relativeairmass(solar_position['zenith'])
pressure = pvlib.atmosphere.alt2pres(altitude)
airmass_absolute = pvlib.atmosphere.absoluteairmass(airmass_relative, pressure)

airmass = pd.DataFrame({'airmass_relative': airmass_relative, 'airmass_absolute': airmass_absolute})
airmass = airmass[['airmass_relative', 'airmass_absolute']]  # adds 0.4 ms, but guarantees same output 
1.43 ms ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Either 1 or 2 could work for Location.get_airmass. Only 1 would easily work for ModelChain.run_model.

Versions:

  • pvlib.__version__: '0.5.2+16.g58f95e0'
  • pandas.__version__: '0.23.1'
  • python: 3.6.5

Approximately reproduced on python 3.5 and pandas 0.17.

line profiler:

   258         1          2.0      2.0      0.0          if solar_position is None:
   259                                                       solar_position = self.get_solarposition(times)
   260                                           
   261         1          3.0      3.0      0.0          if model in atmosphere.APPARENT_ZENITH_MODELS:
   262         1         44.0     44.0      0.1              zenith = solar_position['apparent_zenith']
   263                                                   elif model in atmosphere.TRUE_ZENITH_MODELS:
   264                                                       zenith = solar_position['zenith']
   265                                                   else:
   266                                                       raise ValueError('{} is not a valid airmass model'.format(model))
   267                                           
   268         1       1466.0   1466.0      3.6          airmass_relative = atmosphere.relativeairmass(zenith, model)
   269                                           
   270         1          9.0      9.0      0.0          pressure = atmosphere.alt2pres(self.altitude)
   271         1          1.0      1.0      0.0          airmass_absolute = atmosphere.absoluteairmass(airmass_relative,
   272         1        639.0    639.0      1.6                                                        pressure)
   273                                           
   274         1        638.0    638.0      1.6          airmass = pd.DataFrame()
   275         1      36061.0  36061.0     88.2          airmass['airmass_relative'] = airmass_relative
   276         1       2019.0   2019.0      4.9          airmass['airmass_absolute'] = airmass_absolute
   277                                           
   278         1          0.0      0.0      0.0          return airmass
@wholmgren wholmgren added this to the 0.6.0 milestone Jul 8, 2018
@wholmgren wholmgren changed the title slow performance when creating empty DataFrame in Location.get_airmass slow performance when creating empty DataFrame in Location.get_airmass and ModelChain.prepare_inputs Jul 8, 2018
@wholmgren wholmgren changed the title slow performance when creating empty DataFrame in Location.get_airmass and ModelChain.prepare_inputs slow performance when assigning to empty DataFrame in Location.get_airmass and ModelChain.prepare_inputs Jul 8, 2018
@AndyMender
Copy link

Solution #1 uses an explicit index, while solution #2 doesn't. Why? Also, solution #2 selects columns as a reflection in python airmass = airmass[['airmass_relative', 'airmass_absolute']]. Is that mandatory?

I would recommend solution #2. I always use dicts as an intermediate step in DataFrame creation and they work reliably.

@wholmgren
Copy link
Member Author

Solution 1 uses an explicit index, while solution 2 doesn't. Why?

The explicit index in solution 1 speeds up the index join for the existing DataFrame due to something about pandas internal workings. No explicit index is necessary for solution 2 because pandas already needs to compute the join between the dict values.

Also, solution 2 selects columns as a reflection in python airmass = airmass[['airmass_relative', 'airmass_absolute']]. Is that mandatory?

The assignment pattern in the existing code guarantees order. This line in solution 2 is necessary to guarantee that the column order is the same on all python/pandas version combinations. I hesitate to say it's mandatory, but we try to keep order consistent.

@cwhanse
Copy link
Member

cwhanse commented Jul 9, 2018

I prefer #2 for style reasons.

@adriesse
Copy link
Member

adriesse commented Jul 9, 2018

I prefer #1 for style reasons.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants