FolderBase Dataset automatically resolves under current directory when data_dir is not specified #6152

Open
npuichigo opened this issue Aug 16, 2023 · 15 comments

@npuichigo
Contributor

Describe the bug

FolderBase Dataset automatically resolves under current directory when data_dir is not specified.

For example:

load_dataset("audiofolder")

takes a long time to resolve and collect data_files from the current directory. I think it should instead reach this line for error handling:

if not self.config.data_files:
    raise ValueError(f"At least one data file must be specified, but got data_files={self.config.data_files}")

Steps to reproduce the bug

load_dataset("audiofolder")

Expected behavior

An error is raised.

Environment info

  • datasets version: 2.14.4
  • Platform: Linux-5.15.0-78-generic-x86_64-with-glibc2.17
  • Python version: 3.8.15
  • Huggingface_hub version: 0.16.4
  • PyArrow version: 12.0.1
  • Pandas version: 1.5.3
@npuichigo
Contributor Author

@lhoestq

@lhoestq
Member

lhoestq commented Aug 16, 2023

Makes sense, I guess this can be fixed in the load_dataset_builder method.
It concerns every packaged builder I think (see values in _PACKAGED_DATASETS_MODULES)

@npuichigo
Contributor Author

I think the behavior is related to these lines, which short-circuit the error handling.

base_path = Path(self.data_dir or "").expanduser().resolve().as_posix()
patterns = sanitize_patterns(self.data_files) if self.data_files is not None else get_data_patterns(base_path)
data_files = DataFilesDict.from_patterns(
    patterns,
    download_config=self.download_config,
    base_path=base_path,
)

So should data_dir be checked here, or should the check still be delegated to the actual DatasetModule? In that case, how should data_files be set properly here?
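
For reference, the fallback to the current directory can be reproduced with the standard library alone. This is a minimal sketch, assuming data_dir is None as in the report; resolve_base_path is a hypothetical helper that mirrors the first quoted line and is not a function in datasets.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_base_path(data_dir: Optional[str]) -> str:
    """Mirror the quoted expression: an unset data_dir is replaced by ""."""
    return Path(data_dir or "").expanduser().resolve().as_posix()


# Path("") is the relative path ".", so resolving it yields the current
# working directory instead of signalling that nothing was specified.
current_dir = Path(os.getcwd()).resolve().as_posix()
print(resolve_base_path(None) == current_dir)
```

Because the resolved base path is always a valid directory, pattern discovery proceeds from there, and the later "no data files" check is only reached if the scan finds nothing.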

@lhoestq
Member

lhoestq commented Aug 16, 2023

This location in PackagedDatasetModuleFactory.get_module seems to be the right place to check that at least one of data_dir or data_files is passed.
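
One way such a check could look, as a rough sketch only: require_data_location and its message are hypothetical and are not taken from the datasets code base. The assumption here is that the guard runs before any path resolution, so an empty call fails fast instead of scanning the current directory.

```python
from typing import Dict, List, Optional, Union

# Assumed shapes for data_files: a single path, a list of paths,
# or a mapping from split name to path(s).
DataFiles = Union[str, List[str], Dict[str, Union[str, List[str]]]]


def require_data_location(data_dir: Optional[str], data_files: Optional[DataFiles]) -> None:
    """Hypothetical guard: raise unless data_dir or data_files is given."""
    if not data_dir and not data_files:
        raise ValueError(
            "At least one of data_dir or data_files must be specified, "
            f"but got data_dir={data_dir!r} and data_files={data_files!r}"
        )


# Usage: the empty call is rejected, while either argument alone is accepted.
try:
    require_data_location(None, None)
except ValueError as err:
    print(f"rejected: {err}")

require_data_location("path/to/audio", None)
require_data_location(None, ["clip.wav"])
```

Treating an empty string like None is a design choice in this sketch: both would otherwise resolve to the current directory.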

@mariosasko mariosasko added the good first issue Good for newcomers label Aug 17, 2023
@debrupf2946

@mariosasko can you please assign this issue to me? I want to work on this.

@debrupf2946

#self-assign

@zutarich

zutarich commented Oct 9, 2023

@mariosasko is this issue still open? I would love to kickstart my journey to open source with this issue!
Regards
zutarich

@mariosasko
Collaborator

mariosasko commented Oct 10, 2023

@zutarich It is, unless @debrupf2946 is working on it.

@Etelis

Etelis commented Jan 22, 2024

#self-assign

@debrupf2946

I am working on it and will open a pull request soon, @Etelis.

@JINO-ROHIT
Contributor

@mariosasko can I take this up?

@JINO-ROHIT
Contributor

JINO-ROHIT commented Apr 3, 2024

#self-assign

@mariosasko
Collaborator

Yes, feel free to work on this :)

@JINO-ROHIT
Contributor

I think it's working as expected. Here's the log I get for the same line:

image

@sahillihas

#self-assign
