
GoogleSheetsDataset #4106

Closed
lvijnck opened this issue Aug 19, 2024 · 3 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@lvijnck
Contributor

lvijnck commented Aug 19, 2024

Description

The regular ExcelDataset makes direct translator-to-engineer interactions cumbersome, since collaborators have to download files and manage versions by hand. Using Google Sheets instead provides a live, editable document to which a Kedro pipeline can contribute as a "user", writing specific sheets or columns.

Context

We're using this in our pipeline at the moment and it's proven to be quite effective.

Possible Implementation

# shared anchor for Google Sheets connection settings
_sheets_dataset: &_sheets_dataset
  type: datasets.gcp.GoogleSheetsDataset
  key: <sheet_id_here>
  service_file: conf/local/service-account.json

# catalog examples (the *_layer_int anchor is defined elsewhere in our catalog)
preprocessing.int.resolved_nodes:
  <<: [*_layer_int, *_sheets_dataset]
  save_args:
    sheet_name: Nodes
    write_columns: ["curie"] # saves only the "curie" column to the sheet
  load_args:
    sheet_name: Nodes

preprocessing.int.normalized_nodes:
  <<: [*_layer_int, *_sheets_dataset]
  save_args:
    sheet_name: Nodes
    write_columns: ["normalized_curie"]
  load_args:
    sheet_name: Nodes
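
And the dataset implementation itself:
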
from copy import deepcopy
from typing import Any, Optional

import pandas as pd
import pygsheets
from pygsheets import Spreadsheet, Worksheet

from kedro.io.core import (
    AbstractVersionedDataset,
    DatasetError,
    Version,
)

class GoogleSheetsDataset(AbstractVersionedDataset[pd.DataFrame, pd.DataFrame]):
    """Dataset to load data from Google sheets."""

    DEFAULT_LOAD_ARGS: dict[str, Any] = {}
    DEFAULT_SAVE_ARGS: dict[str, Any] = {}

    def __init__(  # noqa: PLR0913
        self,
        *,
        key: str,
        service_file: str,
        load_args: dict[str, Any] | None = None,
        save_args: dict[str, Any] | None = None,
        version: Version | None = None,
        credentials: dict[str, Any] | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> None:
        """Creates a new instance of ``GoogleSheetsDataset``.

        Args:
            key: Google sheets key
            service_file: path to service accunt file.
            load_args: Arguments to pass to the load method.
            save_args: Arguments to pass to the save
            version: Version of the dataset.
            credentials: Credentials to connect to the Neo4J instance.
            metadata: Metadata to pass to neo4j connector.
            kwargs: Keyword Args passed to parent.
        """
        self._key = key
        self._service_file = service_file
        self._sheet = None  # opened lazily on first load/save
        self.metadata = metadata

        super().__init__(
            filepath=None,
            version=version,
            exists_function=self._exists,
            glob_function=None,
        )

        # Handle default load and save arguments
        self._load_args = deepcopy(self.DEFAULT_LOAD_ARGS)
        if load_args is not None:
            self._load_args.update(load_args)
        self._save_args = deepcopy(self.DEFAULT_SAVE_ARGS)
        if save_args is not None:
            self._save_args.update(save_args)

    def _init_sheet(self):
        """Initialize the spreadsheet handle.

        This is executed lazily to avoid loading credentials at interpreter
        startup, which causes issues in unit tests.
        """
        if self._sheet is None:
            gc = pygsheets.authorize(service_file=self._service_file)
            self._sheet = gc.open_by_key(self._key)

    def _load(self) -> pd.DataFrame:
        self._init_sheet()

        sheet_name = self._load_args["sheet_name"]
        wks = self._get_wks_by_name(self._sheet, sheet_name)
        if wks is None:
            raise DatasetError(f"Sheet with name {sheet_name} not found!")

        df = wks.get_as_df()
        if (cols := self._load_args.get("columns", None)) is not None:
            df = df[cols]

        return df

    def _save(self, data: pd.DataFrame) -> None:
        self._init_sheet()

        sheet_name = self._save_args["sheet_name"]
        wks = self._get_wks_by_name(self._sheet, sheet_name)

        # Create the worksheet if it does not exist
        if wks is None:
            wks = self._sheet.add_worksheet(sheet_name)

        # Write each requested column into its existing position in the sheet
        for column in self._save_args["write_columns"]:
            col_idx = self._get_col_index(wks, column)

            if col_idx is None:
                raise DatasetError(
                    f"Sheet {sheet_name} does not contain column {column}!"
                )

            # pygsheets uses 1-based coordinates; write from the header row down
            wks.set_dataframe(data[[column]], (1, col_idx + 1))

    @staticmethod
    def _get_wks_by_name(
        spreadsheet: Spreadsheet, sheet_name: str
    ) -> Optional[Worksheet]:
        for wks in spreadsheet.worksheets():
            if wks.title == sheet_name:
                return wks

        return None

    @staticmethod
    def _get_col_index(sheet: Worksheet, col_name: str) -> Optional[int]:
        for idx, col in enumerate(sheet.get_row(1)):
            if col == col_name:
                return idx

        return None

    def _describe(self) -> dict[str, Any]:
        return {
            "key": self._key,
        }

    def _exists(self) -> bool:
        # Versioned existence checks are not meaningful for a live spreadsheet
        return False
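
For anyone trying this outside a catalog, here is a minimal standalone sketch. The sheet key is a placeholder, and it assumes the "Nodes" worksheet already has a "curie" header in row 1, since _save only writes into existing columns:

import pandas as pd

# Placeholder key -- substitute your own spreadsheet key and the
# service-account file generated in the Google Cloud console.
dataset = GoogleSheetsDataset(
    key="<sheet_id_here>",
    service_file="conf/local/service-account.json",
    load_args={"sheet_name": "Nodes"},
    save_args={"sheet_name": "Nodes", "write_columns": ["curie"]},
)

# save() writes only the configured columns into the worksheet...
dataset.save(pd.DataFrame({"curie": ["CHEBI:15377", "CHEBI:16236"]}))

# ...and load() reads the worksheet back as a DataFrame.
df = dataset.load()
print(df.head())

Note that the spreadsheet handle is only opened on the first load/save, so constructing the dataset (e.g. in unit tests) does not require working credentials.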

Possible Alternatives

@lvijnck added the Issue: Feature Request label Aug 19, 2024
@Galileo-Galilei
Member

Galileo-Galilei commented Aug 19, 2024

Hi @lvijnck, I definitely see the value of this.

I think there is a documentation issue because I can't find it in the official doc, but we relaxed the rules to enable community datasets to be included in kedro-datasets. You can follow the contributing guidelines to merge your dataset in the "experimental" folder of kedro-datasets.

@DimedS
Member

DimedS commented Aug 20, 2024

That sounds really great, thank you @lvijnck! As @Galileo-Galilei mentioned, experimental datasets are a good place for this addition. Since you already have the code ready, would you like to add the dataset to the "experimental datasets" section yourself, or would you prefer to have the Kedro maintainers do it? Let us know how you'd like to proceed!

@lvijnck
Contributor Author

lvijnck commented Aug 20, 2024

Submitted a PR to plugins, will close this.

https://github.com/kedro-org/kedro-plugins/pull/810/files
