urlopen error when using an SFTP path #514

Open
TristanFauvel opened this issue Sep 1, 2023 · 3 comments

Comments

@TristanFauvel

TristanFauvel commented Sep 1, 2023

Description

When adding my_data to the DataCatalog with an SFTP path:

my_data:
  type: pandas.CSVDataSet
  filepath: "sftp://<host>/<path>/<filename>.csv"
  credentials: my_credentials

I get:

URLError: <urlopen error unknown url type: sftp>

Context

I am trying to load a .csv file from a server over SFTP. Creating the following custom SFTPDataSet class solved the issue:

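from typing import Any, Dict, Optional

import pandas as pd
from kedro.io.core import PROTOCOL_DELIMITER, Version

# Assumption: CSVDataSet comes from kedro-datasets; adjust this import if your
# project uses kedro.extras.datasets.pandas instead.
from kedro_datasets.pandas import CSVDataSet


class SFTPDataSet(CSVDataSet):
    """CSVDataSet that reads remote files through the fsspec file handle."""

    def __init__(
        self,
        filepath: str,
        load_args: Optional[Dict[str, Any]] = None,
        save_args: Optional[Dict[str, Any]] = None,
        version: Optional[Version] = None,
        credentials: Optional[Dict[str, Any]] = None,
        fs_args: Optional[Dict[str, Any]] = None,
        metadata: Optional[Dict[str, Any]] = None,
    ) -> None:
        super().__init__(
            filepath, load_args, save_args, version, credentials, fs_args, metadata
        )

    def _load(self) -> pd.DataFrame:
        load_path = str(self._get_load_path())
        if self._protocol == "file":
            # Local files can be read by pandas directly.
            return pd.read_csv(load_path, **self._load_args)

        # Prefix the path with its protocol, e.g. "sftp://...".
        load_path = f"{self._protocol}{PROTOCOL_DELIMITER}{load_path}"

        # Open the remote file through fsspec and give pandas the open handle
        # instead of the URL string.
        with self._fs.open(load_path) as f:
            return pd.read_csv(f, **self._load_args)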

Steps to Reproduce

  1. Add a dataset to the DataCatalog with an SFTP path, and add the credentials in conf/local
  2. Create a node that loads the data in a pipeline

Expected Result

The .csv should be loaded into a pandas dataframe.

Actual Result

Instead I get:

URLError: <urlopen error unknown url type: sftp>
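
The message itself can be reproduced with the standard library alone, which suggests the request ends up in urllib rather than in fsspec. A possible explanation, not verified against the pandas source for these versions, is that pandas classifies a string as a plain URL when its scheme appears in urllib's scheme lists, and "sftp" is one of them even though urllib has no handler for it. The host below is a placeholder, and no network connection is attempted because urllib rejects the scheme first:

```python
import urllib.parse
import urllib.request
from urllib.error import URLError

# Placeholder URL; the host does not need to exist for this check.
url = "sftp://example.invalid/data.csv"

# urllib lists "sftp" among the schemes that use a network location...
scheme = urllib.parse.urlparse(url).scheme
known_to_urllib = scheme in urllib.parse.uses_netloc

# ...but it ships no handler for that scheme, so urlopen() fails immediately.
try:
    urllib.request.urlopen(url)
    message = None
except URLError as exc:
    message = str(exc)

print(known_to_urllib)
print(message)
```

If this is indeed the cause, any code path that passes the raw "sftp://" string to pandas would hit the same error, whatever credentials are configured.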

Your Environment

  • Kedro version used (pip show kedro or kedro -V): kedro, version 0.18.13
  • Python version used (python -V): Python 3.10.12
  • Operating system and version: Windows 10 Pro

@datajoely
Contributor

Hi @TristanFauvel, Kedro already supports SFTP through all the datasets implemented with fsspec (with paramiko underneath). See an example here:

https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html#load-a-csv-file-stored-in-a-remote-location-through-ssh

@TristanFauvel
Author

Hi @datajoely,
Thanks for the quick reply. Actually, I did follow the example you linked (this is not a feature request).

I noticed that the bug occurs in CSVDataSet's _load(). Replacing:

pd.read_csv(load_path, storage_options=self._storage_options, **self._load_args)

with:

with self._fs.open(load_path) as f:
    data = pd.read_csv(f, **self._load_args)

solved the bug (as I did in the SFTPDataSet class above).

pandas version: I got the bug with both 2.0.3 and 2.1.0.
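
The replacement amounts to opening the file through the filesystem object and handing pandas the open handle rather than the URL string. Here is a minimal sketch of that pattern, assuming only that the filesystem exposes an open() method returning a file-like object. InMemoryFS and load_csv_via_fs are hypothetical names used for illustration; InMemoryFS stands in for the fsspec filesystem that the dataset keeps in self._fs, so no SFTP server is needed to run it:

```python
import io

import pandas as pd


class InMemoryFS:
    """Hypothetical stand-in for an fsspec filesystem; only open() is implemented."""

    def __init__(self, files):
        self._files = files  # maps a path string to the file's bytes

    def open(self, path, mode="rb"):
        # io.BytesIO works as a context manager, like a real file handle.
        return io.BytesIO(self._files[path])


def load_csv_via_fs(fs, load_path, **load_args):
    """Open the file through the filesystem, then let pandas parse the handle."""
    with fs.open(load_path) as f:
        return pd.read_csv(f, **load_args)


# The path is only a dictionary key here; pandas never sees the "sftp://" string.
fs = InMemoryFS({"sftp://host/data.csv": b"a,b\n1,2\n3,4\n"})
df = load_csv_via_fs(fs, "sftp://host/data.csv")
print(df)
```

Because pandas receives a file object, it does not have to interpret the URL scheme at all, which is why this route avoids the urlopen error.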

@merelcht
Member

Hi @TristanFauvel, do you need more help with this issue or can it be closed?

@merelcht merelcht transferred this issue from kedro-org/kedro Jan 12, 2024