Change load_dataset cache dir default #1530

Closed
mxsrc opened this issue Aug 3, 2018 · 13 comments · Fixed by #2773

Comments

@mxsrc

mxsrc commented Aug 3, 2018

The current default for the data_home directory of seaborn.load_dataset, which is used for caching, is $HOME/seaborn-data. This leads to unwanted pollution of the user's home directory. The calls in many seaborn examples pass only the dataset's name, so it is not obvious that a cache directory will be created by default in such a prominent location.

I believe a good alternative location would be $XDG_CACHE_HOME, which according to the freedesktop specification is supposed to be used for user-specific non-essential (cached) data (see specification). If the environment variable is unset, a default of $HOME/.cache is recommended.
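
A minimal sketch of the lookup rule described in this comment, assuming the XDG Base Directory rules that an unset or empty variable falls back to $HOME/.cache and that relative values are ignored. The helper name `xdg_cache_home` is hypothetical and not part of seaborn.

```python
import os
from pathlib import Path


def xdg_cache_home(environ=None):
    """Return the XDG cache base directory as a Path.

    Hypothetical helper for illustration; not part of seaborn.
    """
    environ = os.environ if environ is None else environ
    value = environ.get("XDG_CACHE_HOME", "")
    # Unset, empty, or relative values fall back to ~/.cache.
    if value and os.path.isabs(value):
        return Path(value)
    return Path.home() / ".cache"


# Example: the seaborn cache would then live in a subdirectory.
print(xdg_cache_home() / "seaborn")
```

Passing the environment as an argument keeps the function easy to test without touching the real process environment.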

@mxsrc mxsrc changed the title Change load_dataset cache dir default Change load_dataset cache dir default Aug 3, 2018
@mwaskom
Owner

mwaskom commented Aug 3, 2018

I think @jakevdp just copied what scikit-learn does. I basically agree that it's not the best option. But I don't really have a sense for how widespread or intuitive the "freedesktop specification" would be (I've never heard of it).

@mxsrc
Author

mxsrc commented Aug 3, 2018

On Linux I believe it is widely used; at least the major DEs follow it. On my machine the $HOME/.cache directory contains about fifty directories of individual applications without me explicitly configuring any of them to use it.

Other platforms probably have similar directories. A quick search for Windows brought me to a ticket of the Electron project where a similar problem is discussed. It indicates that on Windows LOCALAPPDATA is a preferred location.

@mwaskom
Owner

mwaskom commented Aug 3, 2018

The example datasets are not very large so I would like to avoid too much complexity. I'm open to changing where the data goes, but I would like to pick a single alternative.

@mxsrc
Author

mxsrc commented Aug 3, 2018

I don't think that there is a one-size-fits-all solution, since seaborn is used in Windows, Mac OSX and Linux, where all of these have different concepts. The only one I can think of is to use a hidden folder.

Of course, the datasets are not large, but imho that doesn't really affect the problem, since it's more about the intransparent creation of a directory.

Also, I believe that the added complexity isn't really too much. I had a look at how Qt handles this, since they specifically aim to ease multiplatform development. They provide a utility which returns a proper path for all platforms. It could be implemented like this:

def _get_cache_path():
    """Return a path to a directory to be used for caching application data."""
    import os
    import sys

    home = os.path.expanduser('~')
    if sys.platform.startswith(('linux', 'freebsd')):
        # Fall back to ~/.cache if XDG_CACHE_HOME is unset or empty.
        path = os.getenv('XDG_CACHE_HOME') or os.path.join(home, '.cache')
    elif sys.platform.startswith(('win32', 'cygwin')):
        path = os.getenv('LOCALAPPDATA') or home
    elif sys.platform.startswith('darwin'):
        path = os.path.join(home, 'Library', 'Caches')
    else:
        # Unknown platform: use the home directory.
        path = home

    # Expand '~' before the check; os.path.isdir does not expand it.
    path = os.path.expanduser(path)
    if not os.path.isdir(path):
        path = home

    return os.path.join(path, 'seaborn-data')

@mwaskom
Owner

mwaskom commented Aug 3, 2018

> I don't think that there is a one-size-fits-all solution, since seaborn is used in Windows, Mac OSX and Linux, where all of these have different concepts. The only one I can think of is to use a hidden folder.

If there isn't, it would weigh in favor of keeping the status quo.

> Of course, the datasets are not large, but imho that doesn't really affect the problem, since it's more about the intransparent creation of a directory.

The point is, if the datasets are not large, it's less important that a user who is looking to clear space on their computer would need to find them and remove them. So it's ok if they end up in a somewhat obscure place.

@mxsrc
Author

mxsrc commented Aug 3, 2018

> The point is, if the datasets are not large, it's less important that a user who is looking to clear space on their computer would need to find them and remove them. So it's ok if they end up in a somewhat obscure place.

For me, the argument is less about space that needs to be cleared but rather that it is very easy to inadvertently create these directories at a very visible location. This is how I noticed the behavior in the first place and why I reported it. There are also official guidelines for the major platforms on where to put such data (MSDN, [Apple Developer](https://developer.apple.com/icloud/documentation/data-storage/index.html), Linux).

I get that for such a minor feature one does not want to add too much code which might fail in the future; however, I feel that when targeting a platform one should stick to the recommended best practices.

> If there isn't, it would weigh in favor of keeping the status quo.

If that isn't an option however, might using hidden files be an alternative, or disabling caching by default?

EDIT: Added alternative.

@mwaskom
Owner

mwaskom commented Aug 3, 2018

> For me, the argument is less about space that needs to be cleared but rather that it is very easy to inadvertently create these directories at a very visible location. This is how I noticed the behavior in the first place and why I reported it.

Yes, we're talking past each other. I agree that defaulting to $HOME is suboptimal. I'm giving an argument for using a consistent, possibly obscure (even hidden) location across file systems rather than trying to pick the "correct" place for every file system. I think $HOME/.cache/seaborn is reasonable, although I don't know the details of dealing with hidden files on Windows (or other non-unix systems if they are being used).
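
A minimal sketch of the single-location alternative mentioned in this comment, assuming the same hidden directory under the home directory is used on every platform. The helper name `default_data_home` is hypothetical and this is not seaborn's actual code.

```python
from pathlib import Path


def default_data_home(app_name="seaborn"):
    """Return one cache location that is the same on every platform.

    Hypothetical helper for illustration; not seaborn's actual code.
    """
    # The leading dot hides ".cache" on Unix-like systems only.
    return Path.home() / ".cache" / app_name


print(default_data_home())
```

The trade-off is simplicity over platform conventions: there is no per-OS branching, but the path is only conventional on Linux.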

@mxsrc
Author

mxsrc commented Aug 7, 2018

Afaik using hidden files with Windows involves some magic with setting file attributes.
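
A rough sketch of the attribute handling referred to here, assuming the Win32 `SetFileAttributesW` call is reached through `ctypes`. The helper name `hide_on_windows` is hypothetical, and the function is a no-op on other platforms.

```python
import ctypes
import sys

FILE_ATTRIBUTE_HIDDEN = 0x02  # Win32 file attribute constant


def hide_on_windows(path):
    """Mark `path` as hidden on Windows; do nothing elsewhere.

    Hypothetical helper for illustration. Returns True if the
    attribute was set, False otherwise.
    """
    if sys.platform != "win32":
        # On Unix-like systems a leading dot in the name hides the entry.
        return False
    # Note: this replaces the other settable attributes of the entry.
    result = ctypes.windll.kernel32.SetFileAttributesW(
        str(path), FILE_ATTRIBUTE_HIDDEN
    )
    return bool(result)
```

This illustrates why a dot-prefixed directory is not hidden on Windows by name alone: hiding is a file attribute there, not a naming convention.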

Which other non-unix systems are there? I currently can't really think of any.

$HOME/.cache/seaborn would be the intended location at least for Linux (so that would scratch my itch :-)). It might be a bit unorthodox on Windows, but for OSX I assume that other applications also use that directory.

@QuLogic
Contributor

QuLogic commented Mar 16, 2019

If you're okay with another dependency, there are things like appdirs which take care of following all the cross-platform standards for you.
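
A minimal sketch of what using appdirs could look like, assuming it is treated as an optional dependency. `appdirs.user_cache_dir` is the package's function for the per-user cache location; the wrapper `seaborn_cache_dir` and the fallback path are hypothetical and not seaborn's actual code.

```python
import os

try:
    # Third-party package; user_cache_dir(appname) returns a str path.
    from appdirs import user_cache_dir
except ImportError:
    user_cache_dir = None


def seaborn_cache_dir():
    """Return a cache directory for seaborn data as a string.

    Hypothetical helper for illustration; not seaborn's actual code.
    """
    if user_cache_dir is not None:
        return user_cache_dir("seaborn")
    # Fallback when appdirs is not installed: the home-directory default.
    return os.path.join(os.path.expanduser("~"), "seaborn-data")


print(seaborn_cache_dir())
```

On Linux with appdirs installed, this typically resolves under the XDG cache directory; the exact path on Windows and macOS follows each platform's conventions.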

@flying-sheep

flying-sheep commented Apr 2, 2019

The cache locations appdirs provides have the advantage that all OSs know about them and will pop up a little “do you want to clear out cache locations” if your disk space gets low, in addition to not backing them up and therefore saving you backup disk space.

So I’d argue those are the only correct locations for cache data on the respective OSs.

@mwaskom
Owner

mwaskom commented Oct 2, 2021

It looks like appdirs (thanks @QuLogic) is a single Python module and is MIT licensed, so I would be open to vendoring it and installing the datasets in its preferred cache location.

@flying-sheep

You should use platformdirs:

> This repository is a friendly fork of the wonderful work started by ActiveState who created appdirs, this package's ancestor.
>
> Maintaining an open source project is no easy task, particularly from within an organization, and the Python community is indebted to appdirs (and to Trent Mick and Jeff Rouse in particular) for creating an incredibly useful simple module, as evidenced by the wide number of users it has attracted over the years.
>
> Nonetheless, given the number of long-standing open issues and pull requests, and no clear path towards ensuring that maintenance of the package would continue or grow, this fork was created.

Repository owner deleted a comment from flying-sheep Oct 5, 2021
@mwaskom
Owner

mwaskom commented Oct 5, 2021

It seems like appdirs solves the very specific problem here so it’s not obvious that stalled development is a problem.

Also, flying-sheep is now blocked for being rude.
