Add new JSON Schema to Support v0.19 #3621

Merged · 8 commits into kedro-org:main · Feb 20, 2024

Conversation

@lordsoffallen (Contributor) commented on Feb 14, 2024:

Description

To fix #3590

Development notes

I replaced DataSet with Dataset throughout, and I added the huggingface datasets, which were missing. With kedro-datasets moved out of the main repository, I wonder whether these JSON schema files should also be moved into the plugin repo. A worked example of the rename's effect is sketched below.
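
For illustration, a hypothetical catalog entry that the updated 0.19 schema accepts. It is shown as JSON for consistency with the schema file; Kedro catalogs are usually written in YAML, which parses to the same structure. The entry name and filepath are made up, and under the 0.18 schema the type would have been pandas.CSVDataSet:

    {
      "companies": {
        "type": "pandas.CSVDataset",
        "filepath": "data/01_raw/companies.csv"
      }
    }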

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message, for example by committing with git commit -s. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: ftopal <fazilbtopal@gmail.com>
@astrojuanlu (Member) commented:

Thanks for this PR @lordsoffallen! Approved the CI

@astrojuanlu (Member) commented:

Diff between 0.18 and 0.19 schemas:

--- static/jsonschema/kedro-catalog-0.18.json	2023-05-17 13:21:56
+++ static/jsonschema/kedro-catalog-0.19.json	2024-02-14 19:25:40
@@ -9,42 +9,44 @@
         "type": {
           "type": "string",
           "enum": [
-            "CachedDataSet",
-            "IncrementalDataSet",
-            "MemoryDataSet",
-            "LambdaDataSet",
-            "PartitionedDataSet",
-            "api.APIDataSet",
-            "biosequence.BioSequenceDataSet",
-            "dask.ParquetDataSet",
-            "email.EmailMessageDataSet",
-            "geopandas.GeoJSONDataSet",
+            "CachedDataset",
+            "IncrementalDataset",
+            "MemoryDataset",
+            "LambdaDataset",
+            "PartitionedDataset",
+            "api.APIDataset",
+            "biosequence.BioSequenceDataset",
+            "dask.ParquetDataset",
+            "email.EmailMessageDataset",
+            "geopandas.GeoJSONDataset",
             "holoviews.HoloviewsWriter",
-            "json.JSONDataSet",
+            "huggingface.HFDataset",
+            "huggingface.HFTransformerPipelineDataset",
+            "json.JSONDataset",
             "matplotlib.MatplotlibWriter",
-            "networkx.NetworkXDataSet",
-            "pandas.CSVDataSet",
-            "pandas.ExcelDataSet",
-            "pandas.FeatherDataSet",
-            "pandas.GBQTableDataSet",
-            "pandas.HDFDataSet",
-            "pandas.JSONDataSet",
-            "pandas.ParquetDataSet",
-            "pandas.SQLTableDataSet",
-            "pandas.SQLQueryDataSet",
-            "pandas.XMLDataSet",
-            "pillow.ImageDataSet",
-            "pickle.PickleDataSet",
-            "plotly.PlotlyDataSet",
-            "redis.PickleDataSet",
-            "spark.SparkDataSet",
-            "spark.SparkHiveDataSet",
-            "spark.SparkJDBCDataSet",
+            "networkx.NetworkXDataset",
+            "pandas.CSVDataset",
+            "pandas.ExcelDataset",
+            "pandas.FeatherDataset",
+            "pandas.GBQTableDataset",
+            "pandas.HDFDataset",
+            "pandas.JSONDataset",
+            "pandas.ParquetDataset",
+            "pandas.SQLTableDataset",
+            "pandas.SQLQueryDataset",
+            "pandas.XMLDataset",
+            "pillow.ImageDataset",
+            "pickle.PickleDataset",
+            "plotly.PlotlyDataset",
+            "redis.PickleDataset",
+            "spark.SparkDataset",
+            "spark.SparkHiveDataset",
+            "spark.SparkJDBCDataset",
             "tensorflow.TensorFlowModelDataset",
-            "text.TextDataSet",
-            "tracking.JSONDataSet",
-            "tracking.MetricsDataSet",
-            "yaml.YAMLDataSet"
+            "text.TextDataset",
+            "tracking.JSONDataset",
+            "tracking.MetricsDataset",
+            "yaml.YAMLDataset"
           ]
         }
       },
@@ -53,7 +55,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "CachedDataSet"
+                "const": "CachedDataset"
               }
             }
           },
@@ -64,7 +66,7 @@
             "properties": {
               "dataset": {
                 "pattern": ".*",
-                "description": "A Kedro DataSet object or a dictionary to cache."
+                "description": "A Kedro Dataset object or a dictionary to cache."
               },
               "copy_mode": {
                 "type": "string",
@@ -77,7 +79,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "IncrementalDataSet"
+                "const": "IncrementalDataset"
               }
             }
           },
@@ -89,11 +91,11 @@
             "properties": {
               "path": {
                 "type": "string",
-                "description": "Path to the folder containing partitioned data.\nIf path starts with the protocol (e.g., ``s3://``) then the\ncorresponding ``fsspec`` concrete filesystem implementation will\nbe used. If protocol is not specified,\n``fsspec.implementations.local.LocalFileSystem`` will be used.\n**Note:** Some concrete implementations are bundled with ``fsspec``,\nwhile others (like ``s3`` or ``gcs``) must be installed separately\nprior to usage of the ``PartitionedDataSet``."
+                "description": "Path to the folder containing partitioned data.\nIf path starts with the protocol (e.g., ``s3://``) then the\ncorresponding ``fsspec`` concrete filesystem implementation will\nbe used. If protocol is not specified,\n``fsspec.implementations.local.LocalFileSystem`` will be used.\n**Note:** Some concrete implementations are bundled with ``fsspec``,\nwhile others (like ``s3`` or ``gcs``) must be installed separately\nprior to usage of the ``PartitionedDataset``."
               },
               "dataset": {
                 "pattern": ".*",
-                "description": "Underlying dataset definition. This is used to instantiate\nthe dataset for each file located inside the ``path``.\nAccepted formats are:\na) object of a class that inherits from ``AbstractDataSet``\nb) a string representing a fully qualified class name to such class\nc) a dictionary with ``type`` key pointing to a string from b),\nother keys are passed to the Dataset initializer.\nCredentials for the dataset can be explicitly specified in\nthis configuration."
+                "description": "Underlying dataset definition. This is used to instantiate\nthe dataset for each file located inside the ``path``.\nAccepted formats are:\na) object of a class that inherits from ``AbstractDataset``\nb) a string representing a fully qualified class name to such class\nc) a dictionary with ``type`` key pointing to a string from b),\nother keys are passed to the Dataset initializer.\nCredentials for the dataset can be explicitly specified in\nthis configuration."
               },
               "checkpoint": {
                 "pattern": "object",
@@ -129,7 +131,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "MemoryDataSet"
+                "const": "MemoryDataset"
               }
             }
           },
@@ -151,7 +153,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "LambdaDataSet"
+                "const": "LambdaDataset"
               }
             }
           },
@@ -184,7 +186,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "PartitionedDataSet"
+                "const": "PartitionedDataset"
               }
             }
           },
@@ -196,11 +198,11 @@
             "properties": {
               "path": {
                 "type": "string",
-                "description": "Path to the folder containing partitioned data.\nIf path starts with the protocol (e.g., ``s3://``) then the\ncorresponding ``fsspec`` concrete filesystem implementation will\nbe used. If protocol is not specified,\n``fsspec.implementations.local.LocalFileSystem`` will be used.\n**Note:** Some concrete implementations are bundled with ``fsspec``,\nwhile others (like ``s3`` or ``gcs``) must be installed separately\nprior to usage of the ``PartitionedDataSet``."
+                "description": "Path to the folder containing partitioned data.\nIf path starts with the protocol (e.g., ``s3://``) then the\ncorresponding ``fsspec`` concrete filesystem implementation will\nbe used. If protocol is not specified,\n``fsspec.implementations.local.LocalFileSystem`` will be used.\n**Note:** Some concrete implementations are bundled with ``fsspec``,\nwhile others (like ``s3`` or ``gcs``) must be installed separately\nprior to usage of the ``PartitionedDataset``."
               },
               "dataset": {
                 "pattern": ".*",
-                "description": "Underlying dataset definition. This is used to instantiate\nthe dataset for each file located inside the ``path``.\nAccepted formats are:\na) object of a class that inherits from ``AbstractDataSet``\nb) a string representing a fully qualified class name to such class\nc) a dictionary with ``type`` key pointing to a string from b),\nother keys are passed to the Dataset initializer.\nCredentials for the dataset can be explicitly specified in\nthis configuration."
+                "description": "Underlying dataset definition. This is used to instantiate\nthe dataset for each file located inside the ``path``.\nAccepted formats are:\na) object of a class that inherits from ``AbstractDataset``\nb) a string representing a fully qualified class name to such class\nc) a dictionary with ``type`` key pointing to a string from b),\nother keys are passed to the Dataset initializer.\nCredentials for the dataset can be explicitly specified in\nthis configuration."
               },
               "filepath_arg": {
                 "type": "string",
@@ -232,7 +234,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "api.APIDataSet"
+                "const": "api.APIDataset"
               }
             }
           },
@@ -280,7 +282,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "biosequence.BioSequenceDataSet"
+                "const": "biosequence.BioSequenceDataset"
               }
             }
           },
@@ -319,7 +321,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "dask.ParquetDataSet"
+                "const": "dask.ParquetDataset"
               }
             }
           },
@@ -358,7 +360,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "email.EmailMessageDataSet"
+                "const": "email.EmailMessageDataset"
               }
             }
           },
@@ -397,7 +399,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "geopandas.GeoJSONDataSet"
+                "const": "geopandas.GeoJSONDataset"
               }
             }
           },
@@ -471,12 +473,57 @@
           "if": {
             "properties": {
               "type": {
-                "const": "json.JSONDataSet"
+                "const": "huggingface.HFDataset"
               }
             }
           },
           "then": {
             "required": [
+              "dataset_name"
+            ],
+            "properties": {
+              "dataset_name": {
+                "type": "string",
+                "description": "Huggingface dataset name"
+              }
+            }
+          }
+        },
+        {
+          "if": {
+            "properties": {
+              "type": {
+                "const": "huggingface.HFTransformerPipelineDataset"
+              }
+            }
+          },
+          "then": {
+            "properties": {
+              "task": {
+                "type": "string",
+                "description": "Huggingface pipeline task name"
+              },
+              "model_name": {
+                "type": "string",
+                "description": "Huggingface model name"
+              },
+              "pipeline_kwargs": {
+                "type": "object",
+                "description": "Additional kwargs to be passed into the pipeline"
+              }
+            }
+          }
+        },
+        {
+          "if": {
+            "properties": {
+              "type": {
+                "const": "json.JSONDataset"
+              }
+            }
+          },
+          "then": {
+            "required": [
               "filepath"
             ],
             "properties": {
@@ -541,7 +588,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "networkx.NetworkXDataSet"
+                "const": "networkx.NetworkXDataset"
               }
             }
           },
@@ -580,7 +627,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "pandas.CSVDataSet"
+                "const": "pandas.CSVDataset"
               }
             }
           },
@@ -619,7 +666,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "pandas.ExcelDataSet"
+                "const": "pandas.ExcelDataset"
               }
             }
           },
@@ -662,7 +709,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "pandas.FeatherDataSet"
+                "const": "pandas.FeatherDataset"
               }
             }
           },
@@ -697,7 +744,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "pandas.GBQTableDataSet"
+                "const": "pandas.GBQTableDataset"
               }
             }
           },
@@ -738,7 +785,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "pandas.HDFDataSet"
+                "const": "pandas.HDFDataset"
               }
             }
           },
@@ -782,7 +829,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "pandas.JSONDataSet"
+                "const": "pandas.JSONDataset"
               }
             }
           },
@@ -821,7 +868,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "pandas.ParquetDataSet"
+                "const": "pandas.ParquetDataset"
               }
             }
           },
@@ -860,7 +907,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "pandas.SQLTableDataSet"
+                "const": "pandas.SQLTableDataset"
               }
             }
           },
@@ -896,7 +943,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "pandas.SQLQueryDataSet"
+                "const": "pandas.SQLQueryDataset"
               }
             }
           },
@@ -932,7 +979,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "pandas.XMLDataSet"
+                "const": "pandas.XMLDataset"
               }
             }
           },
@@ -971,7 +1018,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "pickle.PickleDataSet"
+                "const": "pickle.PickleDataset"
               }
             }
           },
@@ -1014,7 +1061,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "pillow.ImageDataSet"
+                "const": "pillow.ImageDataset"
               }
             }
           },
@@ -1049,7 +1096,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "plotly.PlotlyDataSet"
+                "const": "plotly.PlotlyDataset"
               }
             }
           },
@@ -1093,7 +1140,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "redis.PickleDataSet"
+                "const": "redis.PickleDataset"
               }
             }
           },
@@ -1133,7 +1180,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "spark.SparkDataSet"
+                "const": "spark.SparkDataset"
               }
             }
           },
@@ -1144,7 +1191,7 @@
             "properties": {
               "filepath": {
                 "type": "string",
-                "description": "Filepath in POSIX format to a Spark dataframe. When using Databricks\nand working with data written to mount path points,\nspecify ``filepath``s for (versioned) ``SparkDataSet``s\nstarting with ``/dbfs/mnt``."
+                "description": "Filepath in POSIX format to a Spark dataframe. When using Databricks\nand working with data written to mount path points,\nspecify ``filepath``s for (versioned) ``SparkDataset``s\nstarting with ``/dbfs/mnt``."
               },
               "file_format": {
                 "type": "string",
@@ -1172,7 +1219,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "spark.SparkHiveDataSet"
+                "const": "spark.SparkHiveDataset"
               }
             }
           },
@@ -1206,7 +1253,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "spark.SparkJDBCDataSet"
+                "const": "spark.SparkJDBCDataset"
               }
             }
           },
@@ -1285,7 +1332,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "text.TextDataSet"
+                "const": "text.TextDataset"
               }
             }
           },
@@ -1316,7 +1363,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "tracking.JSONDataSet"
+                "const": "tracking.JSONDataset"
               }
             }
           },
@@ -1351,7 +1398,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "tracking.MetricsDataSet"
+                "const": "tracking.MetricsDataset"
               }
             }
           },
@@ -1386,7 +1433,7 @@
           "if": {
             "properties": {
               "type": {
-                "const": "yaml.YAMLDataSet"
+                "const": "yaml.YAMLDataset"
               }
             }
           },
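
To make the new branches concrete, here are hypothetical catalog entries (again in JSON form) that exercise the two huggingface conditionals added above. The entry names, dataset name, and model are illustrative, not taken from this PR:

    {
      "imdb_reviews": {
        "type": "huggingface.HFDataset",
        "dataset_name": "imdb"
      },
      "summariser": {
        "type": "huggingface.HFTransformerPipelineDataset",
        "task": "summarization",
        "model_name": "facebook/bart-large-cnn",
        "pipeline_kwargs": {"device": -1}
      }
    }

The first entry satisfies the required "dataset_name" string; the second uses only optional properties, matching the new branch above, which declares no "required" block.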

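As a sketch of how the schema file is typically consumed — assuming the Red Hat YAML extension for VS Code, with an illustrative glob pattern and the raw GitHub URL as one plausible way to point at the file added in this PR:

    {
      "yaml.schemas": {
        "https://raw.githubusercontent.com/kedro-org/kedro/main/static/jsonschema/kedro-catalog-0.19.json": "conf/**/*catalog*"
      }
    }

With a mapping like this, the editor validates catalog YAML files against the 0.19 schema and flags the old DataSet spellings.
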
[Review comment on RELEASE.md — outdated, resolved]
@stichbury (Contributor) left a comment:


Approved with one minor change. Thanks!

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
Signed-off-by: Fazil <fazilbtopal@gmail.com>
@astrojuanlu (Member) commented:

Doc failures seem unrelated

@lordsoffallen (Contributor, Author) commented:

> Doc failures seem unrelated

How do we fix it? I am not sure where the problem lies 😅

@merelcht (Member) left a comment:


Thanks so much for this contribution @lordsoffallen! ⭐

I left some suggestions, but otherwise all good to merge!

[Three review comments on static/jsonschema/kedro-catalog-0.19.json — outdated, resolved]
lordsoffallen and others added 3 commits February 20, 2024 18:05
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Fazil <fazilbtopal@gmail.com>
(the same trailers appear on each of the three commits)
@merelcht enabled auto-merge (squash) on February 20, 2024 17:13
@merelcht merged commit 30ae2c7 into kedro-org:main on Feb 20, 2024
33 checks passed
@astrojuanlu mentioned this pull request on Feb 27, 2024.
Development

Successfully merging this pull request may close this issue: JSON Schema needs update (#3590).
4 participants