Skip to content

BLD: tslib dependencies not fully specified #18089

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jreback opened this issue Nov 3, 2017 · 36 comments · Fixed by #18117 or #21879
Closed

BLD: tslib dependencies not fully specified #18089

jreback opened this issue Nov 3, 2017 · 36 comments · Fixed by #18117 or #21879
Labels
Build Library building on various platforms Datetime Datetime data dtype Timedelta Timedelta data type Timezones Timezone data dtype
Milestone

Comments

@jreback
Copy link
Contributor

jreback commented Nov 3, 2017

    '_libs.tslib': {'pyxfile': '_libs/tslib',
                    'pxdfiles': ['_libs/src/util'],
                    'depends': tseries_depends,
                    'sources': np_datetime_sources},

is not complete, in fact this directly depends on

(pandas) bash-3.2$ ls pandas/_libs/tslibs/*.pxd
pandas/_libs/tslibs/conversion.pxd      pandas/_libs/tslibs/nattype.pxd         pandas/_libs/tslibs/timedeltas.pxd
pandas/_libs/tslibs/frequencies.pxd     pandas/_libs/tslibs/np_datetime.pxd     pandas/_libs/tslibs/timezones.pxd
(pandas) bash-3.2$ ls pandas/_libs/tslib*.pxd
pandas/_libs/tslib.pxd

not really sure of the exact rule, but I think anything you cimport from a .pxd must be listed as a dep.
Now rebuilding from clean just works, but rebuilding from a change in one part of the structure will only sometimes work.

@jreback jreback added Build Library building on various platforms Timedelta Timedelta data type Datetime Datetime data dtype Timezones Timezone data dtype labels Nov 3, 2017
@jreback jreback added this to the 0.22.0 milestone Nov 3, 2017
@jreback
Copy link
Contributor Author

jreback commented Nov 3, 2017

@jbrockmendel need to fix this before any other changes.

@jbrockmendel
Copy link
Member

but rebuilding from a change in one part of the structure will only sometimes work.

Do you have an example for how this fails? My understanding is that cython checks the timestamps on other cython files for determining if dependencies have changed.

@jreback
Copy link
Contributor Author

jreback commented Nov 3, 2017

try making a change on a .pxd (e.g moving something), them try to make, it will fail because it can't load the object.

@jreback
Copy link
Contributor Author

jreback commented Nov 3, 2017

nothing to do with time stamps, rather the build process doesn't know about deps. that it is depending, IOW tslib depends on tslibs/converstion.pxd, but this is not indicated anywhere (to rebuild tslib).

@gfyoung
Copy link
Member

gfyoung commented Nov 3, 2017

try making a change on a .pxd (e.g moving something), them try to make, it will fail because it can't load the object

I noticed this yesterday when I tried to build pandas from source in an existing environment. Some Cython files didn't get compiled again as needed (because dependencies were not specified), causing a statement like import pandas to fail.

@jbrockmendel
Copy link
Member

nothing to do with time stamps, rather the build process doesn't know about deps

OK I'll try to grok this.

Cython clearly knows something about up-to-dateness:

$python setup.py build_ext --inplace
running build_ext
skipping 'pandas/io/sas/sas.c' Cython extension (up-to-date)
skipping 'pandas/_libs/properties.c' Cython extension (up-to-date)
skipping 'pandas/_libs/groupby.c' Cython extension (up-to-date)
[...]

@gfyoung
Copy link
Member

gfyoung commented Nov 3, 2017

@jbrockmendel : That's actually us 😄 . We compare timestamps on all files and dependencies to determine whether they need updating.

@jbrockmendel
Copy link
Member

try making a change on a .pxd (e.g moving something), them try to make, it will fail because it can't load the object.

OK, I just took an existing branch, went to conversion.pxd and made a whitespace edit, then ran setup.py build_ext:

$python setup.py build_ext --inplace
[...]
skipping 'pandas/_libs/tslibs/frequencies.c' Cython extension (up-to-date)
skipping 'pandas/_libs/hashtable.c' Cython extension (up-to-date)
cythoning pandas/_libs/tslibs/conversion.pyx to pandas/_libs/tslibs/conversion.c
building 'pandas._libs.tslibs.conversion' extension
clang -fno-strict-aliasing -fno-common -dynamic -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -Ipandas/_libs/src/klib -Ipandas/_libs/src -I/usr/local/lib/python2.7/site-packages/numpy/core/include -I/usr/local/include -I/usr/local/opt/openssl/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python/2.7.14/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c pandas/_libs/tslibs/conversion.c -o build/temp.macosx-10.12-x86_64-2.7/pandas/_libs/tslibs/conversion.o -Wno-unused-function
[...]
skipping 'pandas/_libs/interval.c' Cython extension (up-to-date)
[...]
$python
>>> import pandas as pd
>>>

So far so good. So I guess you're referring to something less benign than a whitespace change? If I comment out convert_to_tsobject in conversion.pxd then I'll get an error at import-time. Ideally this would raise at build-time, which I guess is what you're getting at here.

The cython folks are at least aware of this; lack of discussion suggests it isn't a high priority.

There is a cython option that would allow us to do without pxd files entirely in some cases, but I'm a bit wary of it: cython/cython#1866

Pretty sure I've seen this on the cython mailing list, will try to track that down.

@jbrockmendel
Copy link
Member

That's actually us 😄 . We compare timestamps on all files to determine whether they need updating.

Hah. Good to know.

@gfyoung
Copy link
Member

gfyoung commented Nov 3, 2017

If I comment out convert_to_tsobject in conversion.pxd then I'll get an error at import-time. Ideally this would raise at build-time, which I guess is what you're getting at here.

Yes, that's exactly what happened to me yesterday. I wouldn't worry too much about anything Cython-issue-related, as that's beyond our control from a pandas perspective. The best thing would be to try to correct the dependencies on our end so that we don't get import errors.

@jbrockmendel
Copy link
Member

Yes, that's exactly what happened to me yesterday.

OK, so IIUC you're making a change that doesn't actually break conversion internally, but gives a mangled error message on import pandas as pd:

ImportError: C extension: pandas._libs.tslibs.conversion does not export expected C function convert_to_tsobject not built. If you want to import pandas from the source directory, you may need to run 'python setup.py build_ext --inplace --force' to build the C extensions first.

That error is re-raised in pandas/__init__.py, which seems to incorrectly assume that the build state is binary. If the failed import is taken outside of the try/except block, the exception message is:

ImportError: pandas._libs.tslibs.conversion does not export expected C function convert_to_tsobject

In this particular failure mode, this error message is pretty helpful. In fact, it is more helpful than the error message we would get if we added conversion.pxd to tslib's 'pxdfile' list in setup.py:

Error compiling Cython file:
------------------------------------------------------------
...
from numpy cimport ndarray, int64_t

from tslibs.conversion cimport convert_to_tsobject
^
------------------------------------------------------------

pandas/_libs/tslib.pxd:3:0: 'pandas/_libs/tslibs/conversion/convert_to_tsobject.pxd' not found
[...]
[26 lines of less helpful information to scroll back up past to find the real error]

So I can put together a PR that will exhaustively list in setup.py all of the pxdfiles relevant to every cython module. Then we can vigilantly keep that up to date. OR I can fix the error message in __init__. The former has the benefit of raising at build-time, but the latter is less vulnerable to future leakage. (Longer-term I'm hoping cython catches this at build-time).

@jbrockmendel
Copy link
Member

Incidentally, the str-replacement in the error message in __init__ probably needs to be updated. Right now after python setup.py clean importing will raise with the message

ImportError: C extension: No module named tslib not built.[...]

which I think is intended to be

Import Error: C extension: tslib not built.[...]

@jbrockmendel
Copy link
Member

Related:
Could we generate .pxd files from .pyx files https://groups.google.com/forum/#!topic/cython-users/jqWnMwETW_s
cythonize should recompile on dependency changes https://groups.google.com/forum/#!topic/cython-users/yNCMjqTijGU

jbrockmendel added a commit to jbrockmendel/pandas that referenced this issue Nov 3, 2017
@gfyoung
Copy link
Member

gfyoung commented Nov 3, 2017

So I can put together a PR that will exhaustively list in setup.py all of the pxdfiles relevant to every cython module. Then we can vigilantly keep that up to date. OR I can fix the error message in init. The former has the benefit of raising at build-time, but the latter is less vulnerable to future leakage. (Longer-term I'm hoping cython catches this at build-time).

Given where things are right now, it should be our responsibility to keep track of dependencies. The latter can be used to provide a better error message. However, to close the issue, the former should be done.

@jbrockmendel
Copy link
Member

One last-ditch argument for the "fix the import error message" solution.

Listing these all in setup.py is asking for trouble. There is plenty of evidence that the deps listed in the ext_data dict do not get updated when the actual modules are changed. Until recently the tslib entry had '_libs/lib' in its 'pxdfiles' entry, while the lib entry did not (and does not) have tslib in its 'pxdfiles' (or 'depends') entries; i.e. the exact opposite of what the dependencies actually are.

Quick: which entries get '_libs/src/foo' and which get 'pandas/_libs/src/foo'? Does the entry for lib need to explicitly list out its recursive dependencies (as it kinda-sorta does by including tseries_depends in its 'depends' entry)? Does '_libs/src/util' belong under 'depends' or 'pxdfiles' (under the status quo it is inconsistent)? How does the inclusion of common_include affect any of this?

My point being: the ext_data dictionary is Frankensteined together and we should be working to get stuff out of it rather than dump more stuff in (e.g. #18059).

For modules in src we do need to specify in setup where to look for things. For regular cython modules we can let imports fail in the exact same way as python imports fail.

@jreback
Copy link
Contributor Author

jreback commented Nov 3, 2017

@jbrockmendel put it this way. if I have to make clean every once in a while, when I didn't have to before, no big deal. But this highlites a problem. I would like to fix this before doing anything else.

I don't care about the error message, in fact that is a red herring. The build process is messed up.

@jbrockmendel
Copy link
Member

I can offer moral support, but it seems like I don't understand the problem.

"When I change something that should cause the code to break, the code breaks" doesn't sound like a problem. If the problem is about the error being at import-time instead of compile-time, that I could understand. If the problem is about the error reported at import-time being unhelpful, that I could understand.

The build process is messed up.

Per my previous comment, I more or less agree. I just don't see how throwing more stuff at the wall makes it better. So for the time being I'll stick to moral support.

@gfyoung
Copy link
Member

gfyoung commented Nov 3, 2017

@jbrockmendel : The whole idea is that you need to add more dependencies to specific Cython files. You can figure out which ones are dependencies given that you know where the imports fail.

@jbrockmendel
Copy link
Member

given that you know where the imports fail.

Until something gets broken, imports don't fail. Are you suggesting a fuzzing operation to trace these out? I'm honestly not trying to be obstinate here, just feel like I'm missing some major part of the picture.

@gfyoung
Copy link
Member

gfyoung commented Nov 4, 2017

Are you suggesting a fuzzing operation to trace these out?

No, no, don't do that 😄

You did a lot of refactoring of the Cython modules in previous PR's, right? So you're going to have to retrace your steps and see where imports are coming from with your new refactorings. There are a couple of ways to achieve this AFAICT:

  1. Build pandas from source pre-PR, build again from source post-PR (but don't do a clean build), and then seeing where Python complains when you try to import pandas.

  2. Look over your changes visually and make those modifications yourself based on whether Cython files import variables from different files.

@jbrockmendel
Copy link
Member

Build pandas from source pre-PR, build again from source post-PR (but don't do a clean build), and then seeing where Python complains when you try to import pandas.

Sounds like having a not-screwed-up exception at import would help with this.

Look over your changes visually and make those modifications yourself based on whether Cython files import variables from different files.

OK, so a hypothetical contributor can go through all these files and find which modules import from which other modules... just like python does at import time. So what is this accomplishing?

Supposing you convince/browbeat this contributor into doing this,

  • Does the 'depends' entry for tslib get 'pandas/_libs/tslibs/conversion.pyx' or 'pandas/_libs/tslibs/conversion.pxd'? Both for good measure?
  • Does the pxd file go into the entry for pxdfiles?
  • Do we care about the non-tslibs entries? e.g. _libs.hashing doesn't list anything other than its own pyx file, but cimports util.

@jbrockmendel
Copy link
Member

Does it matter that khash isn't listed in ext_data['tslib']?

@gfyoung
Copy link
Member

gfyoung commented Nov 4, 2017

@jbrockmendel : I understand that this work is not necessarily appealing, but given that you were the one who initiated a lot of these refactorings, you would be most familiar with the new files that you created. AFAICT, the changes you need to make would be to add your newly created files as dependencies.

Don't worry about other files for the moment, just figure out where your newly created files should be as dependencies for the time being.

@jbrockmendel
Copy link
Member

See recent PR that does this, ignores internal consistency, makes black boxes larger, as requested.

Anticipating that part of your response, the PR ignores non-tslib(s). Good thing you didn’t address the other bullet points; wouldn’t want to accidentally give the impression getting this right matters.

@gfyoung
Copy link
Member

gfyoung commented Nov 4, 2017

wouldn’t want to accidentally give the impression getting this right matters.

Oh, getting it completely right matters. We just don't need to rush it too quickly 😄

Don't worry about the black box for now. Dependencies are a bit clunky, but let's at least make sure that we can do builds without having do a clean one every time.

@jreback
Copy link
Contributor Author

jreback commented Nov 4, 2017

@jbrockmendel

Does the 'depends' entry for tslib get 'pandas/_libs/tslibs/conversion.pyx' or 'pandas/_libs/tslibs/conversion.pxd'? Both for good measure?
Does the pxd file go into the entry for pxdfiles?
Do we care about the non-tslibs entries? e.g. _libs.hashing doesn't list anything other than its own pyx file, but cimports util.

I don't really know what should be listed; the point is the dependency tree is broken. listing lots of things is not the answer, but I suspect we do need to list .pxd files.

jreback added a commit to jreback/pandas that referenced this issue Nov 4, 2017
jreback added a commit to jreback/pandas that referenced this issue Nov 4, 2017
jreback added a commit to jreback/pandas that referenced this issue Nov 4, 2017
@jbrockmendel
Copy link
Member

jbrockmendel commented Nov 6, 2017

For testing purposes I've implemented tslibs as effectively a standalone packages. The setup.py for this package is relatively simple, the relevant bits being

import Cython
from Cython.Build import cythonize

exts = [
    [...]
    Extension("tslibs.conversion",
              sources=["tslibs/conversion.pyx"],
              include_dirs=[np.get_include()],
              define_macros=macros),
    [...]
]

setup(name='tslibs',
           [...],
           cmdclass={'build_ext': Cython.Distutils.build_ext},
           ext_modules=cythonize(exts, compiler_directives=directives))

The reason I bring this up is because when I run setup.py build_ext --inplace I see a bunch of messages:

Compiling tslibs/npy_dtime.pyx because it depends on tslibs/npy_dtime.pxd.
Compiling tslibs/period.pyx because it depends on tslibs/npy_dtime.pxd.
Compiling tslibs/api.pyx because it changed.
Compiling tslibs/timestamps.pyx because it depends on tslibs/npy_dtime.pxd.

i.e. it looks very much like cython is doing some kind of caching and dependency recognition internally.

@jbrockmendel
Copy link
Member

I asked about this on the cython mailing list. Two key points from the answer:

  • Conditional on using Cython's build_ext, this is apparently the appropriate way to specify dependencies.

  • Doing so is outdated. Using cythonize will handle most of this on its own.

https://groups.google.com/forum/#!topic/cython-users/YdGpGqi6Qw4

@jdemeyer
Copy link

Indeed. The whole point of cythonize() is that it does dependency checking for you.

@jreback
Copy link
Contributor Author

jreback commented Nov 13, 2017

and we don’t actually use cythonize; this setup.py was built with early versions of cython

would take a PR that uses cythonize to correctly infer our deps

@jreback jreback reopened this Nov 13, 2017
@jbrockmendel
Copy link
Member

OK, we're very nearly ready to use cythonize. The remaining warnings/errors that I get when doing this locally are

Warning: Extension name 'pandas.io.sas._sas' does not match fully qualified name 'pandas.io.sas.sas' of 'pandas/io/sas/sas.pyx'
[...]
$ python -m pytest pandas
[...]
_____________________________________ TestSAS7BDAT.test_from_file ______________________________________

self = <pandas.tests.io.sas.test_sas7bdat.TestSAS7BDAT object at 0x115816a50>

    def test_from_file(self):
        for j in 0, 1:
            df0 = self.data[j]
            for k in self.test_ix[j]:
                fname = os.path.join(
                    self.dirpath, "test{k}.sas7bdat".format(k=k))
>               df = pd.read_sas(fname, encoding='utf-8')

pandas/tests/io/sas/test_sas7bdat.py:43: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/io/sas/sasreader.py:58: in read_sas
    from pandas.io.sas.sas7bdat import SAS7BDATReader
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    """
    
    import pandas as pd
    from pandas import compat
    from pandas.io.common import get_filepath_or_buffer, BaseIterator
    from pandas.errors import EmptyDataError
    import numpy as np
    import struct
    import pandas.io.sas.sas_constants as const
>   from pandas.io.sas._sas import Parser
E   ImportError: dynamic module does not define init function (init_sas)

pandas/io/sas/sas7bdat.py:24: ImportError

@jdemeyer
Copy link

jdemeyer commented Dec 8, 2017

Renaming sas.pyx to _sas.pyx should fix that...

@jbrockmendel
Copy link
Member

I'll give it a shot over the weekend and see how it goes. There's a decent chance we might be able to get rid of all of my PRs in the next day or two.

@jorisvandenbossche
Copy link
Member

@jbrockmendel anything left to do for this issue? (didn't read the full thing :)) Is it a blocker for a 0.23 release?

@jbrockmendel
Copy link
Member

Nothing blocking here. It's basically on hold until the next bump in the cythonize version requirements (I forget the exact cutoff, just remember we're one below it) at which point some nice cleanup can be done.

@jorisvandenbossche
Copy link
Member

OK, removing the 0.23 milestone then

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.23.0, Next Major Release Mar 29, 2018
@jreback jreback modified the milestones: Contributions Welcome, 0.24.0 Jul 20, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Build Library building on various platforms Datetime Datetime data dtype Timedelta Timedelta data type Timezones Timezone data dtype
Projects
None yet
5 participants