
GoogleSheetsDataset #4106

Closed
lvijnck opened this issue Aug 19, 2024 · 3 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@lvijnck
Contributor

lvijnck commented Aug 19, 2024

Description

The regular ExcelDataset makes direct translator-to-engineer interactions cumbersome, since collaborators have to download files and manage versions by hand. Using Google Sheets instead provides a live, editable document to which a Kedro pipeline can contribute as a "user", writing specific sheets or columns.

Context

We're using this in our pipeline at the moment and it's proven to be quite effective.

Possible Implementation

# shared anchor for Google Sheets connection settings
_sheets_dataset: &_sheets_dataset
  type: datasets.gcp.GoogleSheetsDataset
  key: <sheet_id_here>
  service_file: conf/local/service-account.json

# catalog examples (the *_layer_int anchor is defined elsewhere in our catalog)
preprocessing.int.resolved_nodes:
  <<: [*_layer_int, *_sheets_dataset]
  save_args:
    sheet_name: Nodes
    write_columns: ["curie"] # saves only the "curie" column to the sheet
  load_args:
    sheet_name: Nodes

preprocessing.int.normalized_nodes:
  <<: [*_layer_int, *_sheets_dataset]
  save_args:
    sheet_name: Nodes
    write_columns: ["normalized_curie"]
  load_args:
    sheet_name: Nodes
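
And the dataset implementation itself:
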
from copy import deepcopy
from typing import Any, Optional

import pandas as pd
import pygsheets
from pygsheets import Spreadsheet, Worksheet

from kedro.io.core import (
    AbstractVersionedDataset,
    DatasetError,
    Version,
)

class GoogleSheetsDataset(AbstractVersionedDataset[pd.DataFrame, pd.DataFrame]):
    """Dataset to load data from Google sheets."""

    DEFAULT_LOAD_ARGS: dict[str, Any] = {}
    DEFAULT_SAVE_ARGS: dict[str, Any] = {}

    def __init__(  # noqa: PLR0913
        self,
        *,
        key: str,
        service_file: str,
        load_args: dict[str, Any] | None = None,
        save_args: dict[str, Any] | None = None,
        version: Version | None = None,
        credentials: dict[str, Any] | None = None,
        metadata: dict[str, Any] | None = None,
    ) -> None:
        """Creates a new instance of ``GoogleSheetsDataset``.

        Args:
            key: Google sheets key
            service_file: path to service accunt file.
            load_args: Arguments to pass to the load method.
            save_args: Arguments to pass to the save
            version: Version of the dataset.
            credentials: Credentials to connect to the Neo4J instance.
            metadata: Metadata to pass to neo4j connector.
            kwargs: Keyword Args passed to parent.
        """
        self._key = key
        self._service_file = service_file
        self._sheet = None  # opened lazily on first load/save
        self.metadata = metadata

        super().__init__(
            filepath=None,
            version=version,
            exists_function=self._exists,
            glob_function=None,
        )

        # Handle default load and save arguments
        self._load_args = deepcopy(self.DEFAULT_LOAD_ARGS)
        if load_args is not None:
            self._load_args.update(load_args)
        self._save_args = deepcopy(self.DEFAULT_SAVE_ARGS)
        if save_args is not None:
            self._save_args.update(save_args)

    def _init_sheet(self):
        """Initialize the spreadsheet handle.

        This is executed lazily to avoid loading credentials at interpreter
        startup, which causes issues in unit tests.
        """
        if self._sheet is None:
            gc = pygsheets.authorize(service_file=self._service_file)
            self._sheet = gc.open_by_key(self._key)

    def _load(self) -> pd.DataFrame:
        self._init_sheet()

        sheet_name = self._load_args["sheet_name"]
        wks = self._get_wks_by_name(self._sheet, sheet_name)
        if wks is None:
            raise DatasetError(f"Sheet with name {sheet_name} not found!")

        df = wks.get_as_df()
        if (cols := self._load_args.get("columns", None)) is not None:
            df = df[cols]

        return df

    def _save(self, data: pd.DataFrame) -> None:
        self._init_sheet()

        sheet_name = self._save_args["sheet_name"]
        wks = self._get_wks_by_name(self._sheet, sheet_name)

        # Create the worksheet if it does not exist
        if wks is None:
            wks = self._sheet.add_worksheet(sheet_name)

        # Write each requested column into its existing position in the sheet
        for column in self._save_args["write_columns"]:
            col_idx = self._get_col_index(wks, column)

            if col_idx is None:
                raise DatasetError(
                    f"Sheet {sheet_name} does not contain column {column}!"
                )

            # pygsheets uses 1-based coordinates; write from the header row down
            wks.set_dataframe(data[[column]], (1, col_idx + 1))

    @staticmethod
    def _get_wks_by_name(
        spreadsheet: Spreadsheet, sheet_name: str
    ) -> Optional[Worksheet]:
        for wks in spreadsheet.worksheets():
            if wks.title == sheet_name:
                return wks

        return None

    @staticmethod
    def _get_col_index(sheet: Worksheet, col_name: str) -> Optional[int]:
        for idx, col in enumerate(sheet.get_row(1)):
            if col == col_name:
                return idx

        return None

    def _describe(self) -> dict[str, Any]:
        return {
            "key": self._key,
        }

    def _exists(self) -> bool:
        # Versioned existence checks are not meaningful for a live spreadsheet
        return False
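
For anyone trying this outside a catalog, here is a minimal standalone sketch. The sheet key is a placeholder, and it assumes the "Nodes" worksheet already has a "curie" header in row 1, since _save only writes into existing columns:

import pandas as pd

# Placeholder key -- substitute your own spreadsheet key and the
# service-account file generated in the Google Cloud console.
dataset = GoogleSheetsDataset(
    key="<sheet_id_here>",
    service_file="conf/local/service-account.json",
    load_args={"sheet_name": "Nodes"},
    save_args={"sheet_name": "Nodes", "write_columns": ["curie"]},
)

# save() writes only the configured columns into the worksheet...
dataset.save(pd.DataFrame({"curie": ["CHEBI:15377", "CHEBI:16236"]}))

# ...and load() reads the worksheet back as a DataFrame.
df = dataset.load()
print(df.head())

Note that the spreadsheet handle is only opened on the first load/save, so constructing the dataset (e.g. in unit tests) does not require working credentials.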

Possible Alternatives

@lvijnck added the Issue: Feature Request label Aug 19, 2024
@Galileo-Galilei
Member

Galileo-Galilei commented Aug 19, 2024

Hi @lvijnck, I definitely see the value of this.

I think there is a documentation issue because I can't find it in the official doc, but we relaxed the rules to enable community datasets to be included in kedro-datasets. You can follow the contributing guidelines to merge your dataset in the "experimental" folder of kedro-datasets.

@DimedS
Member

DimedS commented Aug 20, 2024

That sounds really great, thank you @lvijnck! As @Galileo-Galilei mentioned, experimental datasets are a good place for this addition. Since you already have the code ready, would you like to add the dataset to the "experimental datasets" section yourself, or would you prefer to have the Kedro maintainers do it? Let us know how you'd like to proceed!

@lvijnck
Contributor Author

lvijnck commented Aug 20, 2024

Submitted a PR to plugins, will close this.

https://github.com/kedro-org/kedro-plugins/pull/810/files
