Make the SQLQueryDataSet compatible with mssql. (#101)
* [kedro-docker] Layers size optimization (#92)

* [kedro-docker] Layers size optimization

Signed-off-by: Mariusz Strzelecki <mariusz.strzelecki@getindata.com>

* Adjust test requirements

Signed-off-by: Mariusz Strzelecki <mariusz.strzelecki@getindata.com>

* Skip coverage check on tests dir (some do not execute on Windows)

Signed-off-by: Mariusz Strzelecki <mariusz.strzelecki@getindata.com>

* Update .coveragerc with the setup

Signed-off-by: Mariusz Strzelecki <mariusz.strzelecki@getindata.com>

* Fix bandit so it does not scan kedro-datasets

Signed-off-by: Mariusz Strzelecki <mariusz.strzelecki@getindata.com>

* Fixed existence test

Signed-off-by: Mariusz Strzelecki <mariusz.strzelecki@getindata.com>

* Check why dir is not created

Signed-off-by: Mariusz Strzelecki <mariusz.strzelecki@getindata.com>

* Kedro starters are fixed now

Signed-off-by: Mariusz Strzelecki <mariusz.strzelecki@getindata.com>

* Increased no-output-timeout for long spark image build

Signed-off-by: Mariusz Strzelecki <mariusz.strzelecki@getindata.com>

* Spark image optimized

Signed-off-by: Mariusz Strzelecki <szczeles@gmail.com>

* Linting

Signed-off-by: Mariusz Strzelecki <szczeles@gmail.com>

* Switch to slim image always

Signed-off-by: Mariusz Strzelecki <szczeles@gmail.com>

* Trigger build

Signed-off-by: Mariusz Strzelecki <szczeles@gmail.com>

* Use textwrap.dedent for nicer indentation

Signed-off-by: Mariusz Strzelecki <szczeles@gmail.com>

* Revert "Use textwrap.dedent for nicer indentation"

This reverts commit 3a1e3f8.

Signed-off-by: Mariusz Strzelecki <szczeles@gmail.com>

* Revert "Revert "Use textwrap.dedent for nicer indentation""

This reverts commit d322d35.

Signed-off-by: Mariusz Strzelecki <szczeles@gmail.com>

* Make tests read more lines (to skip all deprecation warnings)

Signed-off-by: Mariusz Strzelecki <szczeles@gmail.com>

Signed-off-by: Mariusz Strzelecki <mariusz.strzelecki@getindata.com>
Signed-off-by: Mariusz Strzelecki <szczeles@gmail.com>
Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* Release Kedro-Docker 0.3.1 (#94)

* Add release notes for kedro-docker 0.3.1

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update version in kedro_docker module

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* Bump version and update release notes (#96)

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* Make the SQLQueryDataSet compatible with mssql.

Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* Add one test + update RELEASE.md.

Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* Add missing pyodbc for tests.

Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* Mock connection as well.

Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* Add more dates parsing for mssql backend (thanks to fgaudindelrieu@idmog.com)

Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* Fix an error in docstring of MetricsDataSet (#98)

Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* Bump relax pyarrow version to work the same way as Pandas (#100)

* Bump relax pyarrow version to work the same way as Pandas

We only use PyArrow for `pandas.ParquetDataSet`, so I suggest we keep our versions pinned to the same range as [Pandas does](https://github.com/pandas-dev/pandas/blob/96fc51f5ec678394373e2c779ccff37ddb966e75/pyproject.toml#L100).

I also suggest we remove the upper bound, as we have users requesting later versions in [support channels](https://kedro-org.slack.com/archives/C03RKP2LW64/p1674040509133529).

* Updated release notes

Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* Add missing type in catalog example.

Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* Add one more unit tests for adapt_mssql.

Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* [FIX] Add missing mocker from date test.

Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* [TEST] Add a wrong input test.

Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* Add pyodbc dependency.

Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* [FIX] Remove dict() in tests.

Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* Change check to check on plugin name (#103)

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* Set coverage in pyproject.toml (#105)

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* Move coverage settings to pyproject.toml (#106)

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* Replace kedro.pipeline with modular_pipeline.pipeline factory (#99)

* Add non-spark related test changes
Replace kedro.pipeline.Pipeline with
kedro.pipeline.modular_pipeline.pipeline factory.
This is for symmetry with changes made to the main kedro library.

Signed-off-by: Adam Farley <adamfrly@gmail.com>

Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* Fix outdated links in Kedro Datasets (#111)

* fix links

* fix dill links

Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* Fix docs formatting and phrasing for some datasets (#107)

* Fix docs formatting and phrasing for some datasets

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* Manually fix files not resolved with patch command

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* Apply fix from #98

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

---------

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* Release `kedro-datasets` `version 1.0.2` (#112)

* bump version and update release notes

* fix pylint errors

Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* Bump pytest to 7.2 (#113)

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* Prefix Docker plugin name with "Kedro-" in usage message (#57)

* Prefix Docker plugin name with "Kedro-" in usage message

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* Keep Kedro-Docker plugin docstring from appearing in `kedro -h` (#56)

* Keep Kedro-Docker plugin docstring from appearing in `kedro -h`

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* [kedro-datasets] Add `Polars.CSVDataSet` (#95)

Signed-off-by: wmoreiraa <walber3@gmail.com>

Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* Remove deprecated `test_requires` from `setup.py` in Kedro-Docker (#54)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Yassine Alouini <yalouini@idmog.com>

* [FIX] Fix ds to data_set.

Signed-off-by: Yassine Alouini <yalouini@idmog.com>

---------

Signed-off-by: Mariusz Strzelecki <mariusz.strzelecki@getindata.com>
Signed-off-by: Mariusz Strzelecki <szczeles@gmail.com>
Signed-off-by: Yassine Alouini <yalouini@idmog.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Co-authored-by: Mariusz Strzelecki <szczeles@gmail.com>
Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Co-authored-by: OKA Naoya <pn11@users.noreply.github.com>
Co-authored-by: Joel <35801847+datajoely@users.noreply.github.com>
Co-authored-by: adamfrly <45516720+adamfrly@users.noreply.github.com>
Co-authored-by: Sajid Alam <90610031+SajidAlamQB@users.noreply.github.com>
Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Co-authored-by: Walber Moreira <58264877+wmoreiraa@users.noreply.github.com>
10 people authored Feb 27, 2023
1 parent d31425a commit 4450ce6
Showing 5 changed files with 126 additions and 2 deletions.
2 changes: 1 addition & 1 deletion kedro-datasets/RELEASE.md
@@ -11,7 +11,7 @@
| `polars.CSVDataSet` | A `CSVDataSet` backed by [polars](https://www.pola.rs/), a lightning fast dataframe package built entirely using Rust. | `kedro_datasets.polars` |

## Bug fixes and other changes

* Add `mssql` backend to the `SQLQueryDataSet` DataSet using `pyodbc` library.

# Release 1.0.2:

69 changes: 69 additions & 0 deletions kedro-datasets/kedro_datasets/pandas/sql_dataset.py
@@ -1,6 +1,7 @@
"""``SQLDataSet`` to load and save data to a SQL backend."""

import copy
import datetime as dt
import re
from pathlib import PurePosixPath
from typing import Any, Dict, NoReturn, Optional
@@ -22,6 +23,7 @@
    "psycopg2": "psycopg2",
    "mysqldb": "mysqlclient",
    "cx_Oracle": "cx_Oracle",
    "mssql": "pyodbc",
}

DRIVER_ERROR_MESSAGE = """
@@ -321,7 +323,49 @@ class SQLQueryDataSet(AbstractDataSet[None, pd.DataFrame]):
        >>>                           credentials=credentials)
        >>>
        >>> sql_data = data_set.load()
        >>>
    Example of usage for mssql:
    ::
        >>> credentials = {"server": "localhost", "port": "1433",
        >>>                "database": "TestDB", "user": "SA",
        >>>                "password": "StrongPassword"}
        >>> def _make_mssql_connection_str(
        >>>     server: str, port: str, database: str, user: str, password: str
        >>> ) -> str:
        >>>     import pyodbc  # noqa
        >>>     from sqlalchemy.engine import URL  # noqa
        >>>
        >>>     driver = pyodbc.drivers()[-1]
        >>>     connection_str = (f"DRIVER={driver};SERVER={server},{port};DATABASE={database};"
        >>>                       f"ENCRYPT=yes;UID={user};PWD={password};"
        >>>                       "TrustServerCertificate=yes;")
        >>>     return URL.create("mssql+pyodbc", query={"odbc_connect": connection_str})
        >>> connection_str = _make_mssql_connection_str(**credentials)
        >>> data_set = SQLQueryDataSet(credentials={"con": connection_str},
        >>>                            sql="SELECT TOP 5 * FROM TestTable;")
        >>> df = data_set.load()
    In addition, here is an example of a catalog with dates parsing:
    ::
        >>> mssql_dataset:
        >>>   type: kedro_datasets.pandas.SQLQueryDataSet
        >>>   credentials: mssql_credentials
        >>>   sql: >
        >>>     SELECT *
        >>>     FROM DateTable
        >>>     WHERE date >= ? AND date <= ?
        >>>     ORDER BY date
        >>>   load_args:
        >>>     params:
        >>>       - ${begin}
        >>>       - ${end}
        >>>     index_col: date
        >>>     parse_dates:
        >>>       date: "%Y-%m-%d %H:%M:%S.%f0 %z"
"""

# using Any because of Sphinx but it should be
@@ -413,6 +457,8 @@ def __init__( # pylint: disable=too-many-arguments
        self._connection_str = credentials["con"]
        self._execution_options = execution_options or {}
        self.create_connection(self._connection_str)
        if "mssql" in self._connection_str:
            self.adapt_mssql_date_params()

    @classmethod
    def create_connection(cls, connection_str: str) -> None:
@@ -456,3 +502,26 @@ def _load(self) -> pd.DataFrame:

    def _save(self, data: None) -> NoReturn:
        raise DataSetError("'save' is not supported on SQLQueryDataSet")

    # For mssql only
    def adapt_mssql_date_params(self) -> None:
        """We need to change the format of datetime parameters.
        MSSQL expects datetime in the exact format %Y-%m-%dT%H:%M:%S.
        Here, we also accept plain dates.
        `pyodbc` does not accept named parameters, they must be provided as a list."""
        params = self._load_args.get("params", [])
        if not isinstance(params, list):
            raise DataSetError(
                "Unrecognized `params` format. It can be only a `list`, "
                f"got {type(params)!r}"
            )
        new_load_args = []
        for value in params:
            try:
                as_date = dt.date.fromisoformat(value)
                new_val = dt.datetime.combine(as_date, dt.time.min)
                new_load_args.append(new_val.strftime("%Y-%m-%dT%H:%M:%S"))
            except (TypeError, ValueError):
                new_load_args.append(value)
        if new_load_args:
            self._load_args["params"] = new_load_args
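
The parameter rewriting done by `adapt_mssql_date_params` can be tried outside the dataset. What follows is a minimal sketch, not the code from this commit: `normalize_mssql_params` is a hypothetical standalone helper, and it raises a plain `TypeError` where the dataset raises `DataSetError`. It shows that only strings accepted by `datetime.date.fromisoformat` are rewritten to midnight datetimes, while every other value passes through unchanged.

```python
import datetime as dt
from typing import Any, List


def normalize_mssql_params(params: Any) -> List[Any]:
    """Hypothetical helper mirroring the loop in adapt_mssql_date_params."""
    if not isinstance(params, list):
        # Assumption: TypeError stands in for kedro's DataSetError here.
        raise TypeError(f"`params` must be a list, got {type(params)!r}")
    normalized = []
    for value in params:
        try:
            # Raises ValueError for non-date strings, TypeError for non-strings.
            as_date = dt.date.fromisoformat(value)
        except (TypeError, ValueError):
            normalized.append(value)  # leave anything that is not a plain date
            continue
        midnight = dt.datetime.combine(as_date, dt.time.min)
        normalized.append(midnight.strftime("%Y-%m-%dT%H:%M:%S"))
    return normalized


print(normalize_mssql_params(["2023-01-01", "2023-01-01T20:26", "test", 100]))
```

Note that a full timestamp such as `"2023-01-01T20:26"` is not a plain date, so it is passed to the driver untouched rather than reformatted.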
2 changes: 1 addition & 1 deletion kedro-datasets/setup.py
@@ -58,7 +58,7 @@ def _collect_requirements(requires):
     "pandas.JSONDataSet": [PANDAS],
     "pandas.ParquetDataSet": [PANDAS, "pyarrow>=6.0"],
     "pandas.SQLTableDataSet": [PANDAS, "SQLAlchemy~=1.2"],
-    "pandas.SQLQueryDataSet": [PANDAS, "SQLAlchemy~=1.2"],
+    "pandas.SQLQueryDataSet": [PANDAS, "SQLAlchemy~=1.2", "pyodbc~=4.0"],
     "pandas.XMLDataSet": [PANDAS, "lxml~=4.6"],
     "pandas.GenericDataSet": [PANDAS],
 }
1 change: 1 addition & 0 deletions kedro-datasets/test_requirements.txt
@@ -38,6 +38,7 @@ pre-commit>=2.9.2, <3.0 # The hook `mypy` requires pre-commit version 2.9.2.
psutil==5.8.0
pyarrow>=1.0, <7.0
pylint>=2.5.2, <3.0
pyodbc~=4.0.35
pyproj~=3.0
pyspark>=2.2, <4.0
pytest-cov~=3.0
54 changes: 54 additions & 0 deletions kedro-datasets/tests/pandas/test_sql_dataset.py
@@ -11,6 +11,7 @@

TABLE_NAME = "table_a"
CONNECTION = "sqlite:///kedro.db"
MSSQL_CONNECTION = "mssql+pyodbc://?odbc_connect=DRIVER%3DODBC+Driver+for+SQL"
SQL_QUERY = "SELECT * FROM table_a"
EXECUTION_OPTIONS = {"stream_results": True}
FAKE_CONN_STR = "some_sql://scott:tiger@localhost/foo"
@@ -417,3 +418,56 @@ def test_create_connection_only_once(self, mocker):
        assert mock_engine.call_count == 2
        assert fourth.engines == first.engines
        assert len(first.engines) == 2

    def test_adapt_mssql_date_params_called(self, mocker):
        """Test that the adapt_mssql_date_params
        function is called when mssql backend is used.
        """
        mock_adapt_mssql_date_params = mocker.patch(
            "kedro_datasets.pandas.sql_dataset.SQLQueryDataSet.adapt_mssql_date_params"
        )
        mock_engine = mocker.patch("kedro_datasets.pandas.sql_dataset.create_engine")
        ds = SQLQueryDataSet(sql=SQL_QUERY, credentials={"con": MSSQL_CONNECTION})
        mock_engine.assert_called_once_with(MSSQL_CONNECTION)
        assert mock_adapt_mssql_date_params.call_count == 1
        assert len(ds.engines) == 1

    def test_adapt_mssql_date_params(self, mocker):
        """Test that the adapt_mssql_date_params
        function transforms the params as expected, i.e.
        making datetime date into the format %Y-%m-%dT%H:%M:%S
        and ignoring the other values.
        """
        mocker.patch("kedro_datasets.pandas.sql_dataset.create_engine")
        load_args = {
            "params": ["2023-01-01", "2023-01-01T20:26", "2023", "test", 1.0, 100]
        }
        ds = SQLQueryDataSet(
            sql=SQL_QUERY, credentials={"con": MSSQL_CONNECTION}, load_args=load_args
        )
        assert ds._load_args["params"] == [
            "2023-01-01T00:00:00",
            "2023-01-01T20:26",
            "2023",
            "test",
            1.0,
            100,
        ]

    def test_adapt_mssql_date_params_wrong_input(self, mocker):
        """Test that the adapt_mssql_date_params
        function fails with the correct error message
        when given a wrong input
        """
        mocker.patch("kedro_datasets.pandas.sql_dataset.create_engine")
        load_args = {"params": {"value": 1000}}
        pattern = (
            "Unrecognized `params` format. It can be only a `list`, "
            "got <class 'dict'>"
        )
        with pytest.raises(DataSetError, match=pattern):
            SQLQueryDataSet(
                sql=SQL_QUERY,
                credentials={"con": MSSQL_CONNECTION},
                load_args=load_args,
            )
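
The `MSSQL_CONNECTION` constant used in these tests has the shape of a URL-encoded ODBC string passed through the `odbc_connect` query parameter. The sketch below, using only the standard library, shows how a string of that shape can be produced; `make_odbc_connect_url` is a hypothetical helper for illustration and is not part of kedro-datasets or SQLAlchemy.

```python
from urllib.parse import quote_plus


def make_odbc_connect_url(odbc_connection_str: str) -> str:
    """Hypothetical helper: wrap a raw ODBC string in an mssql+pyodbc URL."""
    # quote_plus percent-encodes "=" and ";" and turns spaces into "+".
    return "mssql+pyodbc://?odbc_connect=" + quote_plus(odbc_connection_str)


url = make_odbc_connect_url("DRIVER=ODBC Driver for SQL")
print(url)
# The dataset only checks for the substring "mssql" to enable date adaptation.
print("mssql" in url)
```

Because the backend check is a substring match on the connection string, any connection string containing `mssql` triggers `adapt_mssql_date_params`, which is what the first test above relies on.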
