Dolt // Kedro #746
Hey @limdauto! I'm interested in demoing/getting feedback on this sample code. Should I email the group that met last week? Do you guys have a separate public forum we could communicate on async?
Hey @max-hoffman, thank you very much for this. I was trying out your dataset last night and I promise I will write a longer response by the end of the week. In general, I think the dataset is great. But I'm trying to make sure I understand a few points, such as "Metadata that helps users and/or Kedro track lineage", more clearly. I'm not sure how lineage is being tracked here. Re comm channel, we can continue the discussion here for sure.
Hey Lim, that's a good question! Database and workflow metadata systems can complement one another. Here's a version we did with Metaflow. I apologize in advance for the long explanation and if I incorrectly generalized Kedro's journaling system!

Lineage in Dolt means the commit graph -- the atomic unit of change is a commit. A commit has data changes, one or more parents (more if merging two branches), and metadata associated with the commit. Dolt's data storage format and SQL components are unique, but otherwise this should sound like Git's model. Lineage in workflow managers usually means something else -- connecting runtime/execution/session variables with inputs and outputs. Kedro's journaling system tracks lineage metadata as an append-only log, and Kedro's CSV DataSet copies files by timestamp to maintain a change-set history.

When a user's data source is Dolt, they don't have to download two CSVs and pull them into Jupyter to compare side by side. Dolt tracks schema changes between the two, row diffs, row-level blames, and the entire history of data and commits. Everything is "under one roof," so to speak. A different use case would be if, for example, Kedro's DataCatalog were implemented as a Dolt table that tracked versions of DataSet inputs/outputs between executions.
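To make the "lineage lives in the database" point concrete: Dolt exposes the commit graph and diffs through SQL system tables (`dolt_log` for commits, `dolt_diff_<table>` for row-level diffs between refs). Here is a small sketch of how a workflow tool might build lineage queries against those tables. The system-table names are real Dolt features; the helper functions themselves are hypothetical and only build query strings.

```python
# Hypothetical helpers that build lineage queries against Dolt's SQL
# system tables. dolt_log and dolt_diff_<table> are real Dolt features;
# the helper functions are illustrative only.

def commit_log_query(limit: int = 10) -> str:
    """Query the commit graph (Dolt's notion of lineage) via dolt_log."""
    return f"SELECT commit_hash, committer, date, message FROM dolt_log LIMIT {limit}"

def row_diff_query(table: str, from_ref: str, to_ref: str) -> str:
    """Row-level diff between two refs via the dolt_diff_<table> system table."""
    return (
        f"SELECT * FROM dolt_diff_{table} "
        f"WHERE from_commit = '{from_ref}' AND to_commit = '{to_ref}'"
    )

print(commit_log_query(5))
print(row_diff_query("example_test_x", "HEAD~1", "HEAD"))
```

A tool like Kedro could run these against Dolt's MySQL-compatible server to surface per-run data history without copying files around.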
Hi @max-hoffman, thank you very much for taking the time to write the issue and for making the demo. Apologies for the delay in response, partially because I have been in training all week and partially because I really wanted to wrap my head around what exactly we are trying to accomplish here.

First things first: what an awesome piece of technology you and the Dolt team have built. I can't express enough how excited I am about Dolt. It feels like having a superpower I don't yet know what to do with.

Regarding an integration with Kedro, you have touched on many great ideas in your issue and in the demo. However, please allow me to take a step back and look at this from a Kedro user's perspective first. As a Kedro user, I believe I can already use Dolt right now as a data source in Kedro without any extra dataset, thanks to your SQL interface. I would use it wherever I want to track different versions of my tabular datasets, as an alternative to Kedro's path-based dataset versioning. The workflow is:
```python
# hooks.py -- the doltpy import paths below are assumptions based on
# doltpy 2.x; adjust to your installed version.
from pathlib import Path
from typing import Any, Dict

from doltpy.cli import Dolt, DoltException
from doltpy.sql import DoltSQLServerContext, ServerConfig
from kedro.framework.hooks import hook_impl


class ProjectHooks:
    def __init__(self):
        project_path = Path(__file__).parent.parent.parent
        self.dolt = Dolt(project_path)
        self.dolt_sql_server = DoltSQLServerContext(self.dolt, ServerConfig())

    @hook_impl
    def before_pipeline_run(self):
        self.dolt_sql_server.start_server()

    @hook_impl
    def after_pipeline_run(self, run_params: Dict[str, Any]):
        self.dolt.add(".")
        try:
            self.dolt.commit(
                message=f"Update data from Kedro run {run_params['run_id']} "
                f"with params {run_params['extra_params']}"
            )
        except DoltException as e:
            if "no changes added to commit" not in str(e):
                raise
        finally:
            self.dolt_sql_server.stop_server()
```
And the catalog entry:

```yaml
example_test_x:
  type: pandas.SQLTableDataSet
  table_name: example_test_x
  credentials:
    con: mysql://root@localhost:3306/kedro_dolt_demo
  save_args:
    if_exists: replace
```

And voilà! If your data change between runs, e.g. with a modified input, there are corresponding commits in Dolt. We can now use all of Dolt's tools to interact with the data.

I believe this workflow is more familiar and idiomatic to Kedro users while still showcasing the value that Dolt would bring. If you are happy with this approach, we could definitely write it up in our documentation in the Tools integration section, next to Spark. Some further ideas to improve upon this would be to allow users to check out different data branches by passing in an extra param from the CLI.

(*) I lie a little bit here. Even though I recommend we start and stop the Dolt SQL server programmatically, I actually had to do it manually in my demo project.
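The "check out a branch via an extra param from the CLI" idea could be sketched as a small pure helper that a `before_pipeline_run` hook might call before running `dolt checkout`. The parameter name `branch` and the default `main` are assumptions for illustration, not part of Kedro's or Dolt's API.

```python
from typing import Any, Dict

# Hypothetical: pick a Dolt branch from `kedro run --params branch:<name>`.
# A before_pipeline_run hook could pass the result to `dolt checkout`.
def branch_from_run_params(run_params: Dict[str, Any], default: str = "main") -> str:
    extra = run_params.get("extra_params") or {}
    return extra.get("branch", default)

print(branch_from_run_params({"extra_params": {"branch": "experiment-1"}}))  # experiment-1
print(branch_from_run_params({"extra_params": None}))  # main
```

Keeping the branch choice out of the catalog means the same pipeline definition can read from and write to any data branch, mirroring how Git users switch code branches.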
The previous comment was written from the perspective of a Kedro user. I want to write up a lot more thoughts from the perspective of a Kedro developer, because I think Dolt is a great tool and could solve a lot of thorny problems for us like Change Data Capture, but I still need to organise my thoughts a bit more. This is very much like a superpower that I'm still trying to wrap my head around.
Thanks for giving such a thorough response! It might take me a little while to go through this line by line, but my first reaction is that hooks sound like a promising direction. I did a write-up here where I wrap generic reads/writes with Dolt's features. Let me know if that helps clarify. I have my next week planned out, but I'd love to take another attempt using hooks.
@limdauto your demo is great! I'll do one iteration and try to write a blog draft next week if that's OK. Feel free to ask as many questions as you want in the meantime.
Hey Lim, I'm working on reproducing your demo right now. I'll push some changes after I work through setup issues. I forgot to respond to your questions earlier.
Hey, thank you for the update and the extra information. It all looks good from my side.
I agree. At the same time, Kedro is, among other things, a codification of workflow. I'd love to design a sensible workflow that makes sense for our users first; I'd imagine the implementation is the simpler issue here. So I'm happy to progress first with the simplest, most obvious workflow, as outlined in the hook-based approach. Then we can think of further improvements once there is adoption.
@max-hoffman relatedly, I need to sort out a proposal for https://ep2021.europython.eu/events/call-for-proposals/ this week. I'm thinking of submitting a proposal around a Dolt // Kedro integration because I truly believe there is a much bigger future here. I might send it over to you for comments. cc @yetudada
@limdauto We'd be excited to speak at events. Let me know if I can help. I got a Kedro-Dolt blog draft together last night if that's helpful at all. I was going to make a plugin package and do a second draft with hardened examples before sending it your way.
@limdauto I am going to spend time making Dolt handle "replace" diffs more intuitively. I'll try to get those changes in and a blog draft to you by Monday/Tuesday next week.
Hey @limdauto, it took me a bit longer than expected, but I fixed the diff bugs and added an initial blog doc here: https://github.com/dolthub/kedro-dolt-demo/blob/max/branches/blog.md. The diff fixes will roll into the next Dolt release on Monday. We will make a push to tighten up and publish this blog soon. We'd love your feedback!
I made and released a plugin here: https://github.com/dolthub/kedro-dolt.
I think we can close this :) Thanks @max-hoffman |
I'm also adding a ticket to our backlog to add some Dolt references in our docs |
Introduction
The Dolt team is interested in exposing DoltDB as a Kedro DataSet type. We are also excited about the idea of exposing diffing and other SQL features for change capture if useful to the Kedro team.
I briefly filled out the bullet points below, but the write-up in my draft PR is more straight-to-the-point.
Draft PR -> dolthub#1
Included in the PR -- brief tutorial notes/comments.
The starter integration is not heavily tested, and we don't intend these additions to make it into a final PR; we are most interested in design feedback.
Background
Dolt is a SQL database with Git versioning. Standalone, it can be a data source for workflow managers: without custom code it just does what MySQL or SQLite does. Varying levels of Git functionality can be included in integrations to provide versioning, diffing, merging, and reproducibility for tabular datasets, which is unique to our storage layer (we have quite a few blogs on this). We have gotten a lot of positive feedback so far in this space and hope we can help solve thorny versioning problems!
Problem
What's in scope
What's not in scope
Design
The Kedro remote-object interface I've focused on is the DataSet's `save` and `load` methods (and others). I made an example Dolt integration that behaves similarly to pandas DataFrames for end users, but uses Dolt to capture lineage and deltas of those tables. Metadata storage, remotes, and advanced branching logic are all optional extensions beyond an otherwise `pd.DataFrame` experience. Two friction points that I haven't addressed in my sample code: journaling scope is limited to data catalogs, and versioning has a different meaning in Dolt.
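The save/load shape described here can be sketched with an in-memory stand-in for Dolt (no kedro or doltpy dependency). `FakeDoltTableDataSet` and its commit list are purely illustrative of the "every save produces a commit" idea, not real APIs from either project.

```python
from typing import Any, Dict, List

class FakeDoltTableDataSet:
    """Illustrative stand-in: a DataSet whose save() records a commit,
    mimicking how a Dolt-backed DataSet could capture lineage per run."""

    def __init__(self, table_name: str):
        self.table_name = table_name
        self._rows: List[Dict[str, Any]] = []
        self.commits: List[str] = []  # stand-in for Dolt's commit graph

    def save(self, rows: List[Dict[str, Any]], message: str) -> None:
        # A real integration would write the table, then `dolt add` + `dolt commit`.
        self._rows = list(rows)
        self.commits.append(message)

    def load(self) -> List[Dict[str, Any]]:
        return list(self._rows)

ds = FakeDoltTableDataSet("example_test_x")
ds.save([{"id": 1}], message="Kedro run abc123")
print(ds.load(), ds.commits)
```

The design point is that users keep a plain save/load experience while the versioning side effect (the commit) happens transparently underneath.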
Alternatives considered
Two other integration patterns:
Neither of these struck me as particularly suited to Kedro's existing UX.
edit: SQL-server integration was mentioned as more appealing than an FS-based approach in our intro call. The two are interchangeable; FS is just easier to demo and test currently.
Testing
Explain the testing strategies to verify your design correctness (if possible).
TODO
Rollout strategy
Is the change backward compatible? If not, what is the migration strategy?
TODO (short answer: yes)
Future iterations
Will there be future iterations of this design?
Hopefully! We are excited for feedback!