Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to make the documentations works for kedro-datasets? #1651

Closed
Tracked by #1457
noklam opened this issue Jun 28, 2022 · 15 comments · Fixed by #2006
Closed
Tracked by #1457

How to make the documentations works for kedro-datasets? #1651

noklam opened this issue Jun 28, 2022 · 15 comments · Fixed by #2006
Assignees
Labels
Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation

Comments

@noklam
Copy link
Contributor

noklam commented Jun 28, 2022

Introduction

This is part of the effort of splitting out kedro.extras.datasets to kedro-datasets

Background

How will docs work for kedro-datasets? We will need to generate documentation (at least the dataset part) from the kedro-datasets repo. What will be the CI/CD process to generate it?

There are also more subtle questions like:

  • What does latest or stable means when we have two repositories in the doc?

Potential Solutions

  1. Having a dedicated doc repository that add kedro and kedro-datasets into a monorepo - http://fmarco76.github.io/git%20and%20related%20services/readthedocs. This is quite complicated but seems to be the more obvious choice when I Google around.
  2. Split the doc into two ReadTheDocs projects - may be easier technically, but not ideal for user experience and diverge our visits into two places.

Both of these solutions are far from ideal, is there any popular libraries doing similar things?

@antonymilne
Copy link
Contributor

Maybe we should take the opportunity to re-evaluate whether there's a better option than sphinx for us. Last time I checked it was tricky to find an alternative that did both markdown docs + API docs, but maybe worth investigating again. The current sphinx process is quite painful already, and I'm a bit worried this will make it even worse.

kedro datasets API documentation is definitely important though. I'd guess it's the most viewed part of our API docs by some margin.

@antonymilne
Copy link
Contributor

antonymilne commented Jun 29, 2022

My initial thought would be to keep docs in the kedro repo. make build-docs would then:

  • either build docs in kedro-datasets and copy over the built version into kedro built docs
  • or copy over the datasets from kedro-datasets and then build them on kedro

But I think both options will be a bit fiddly to get working (e.g. so sphinx links work correctly) 😬

Split the doc into two ReadTheDocs projects - may be easier technically, but not ideal for user experience and diverge our visits into two places.

Agree this would be technically easier but also not such a smooth user experience. I guess it wouldn't be so bad if we link to those docs from the kedro ones very clearly though. it would be easier to resolve the stable vs. latest question this way also.

@deepyaman any ideas here? You like fiddling round with docs 😀

@antonymilne antonymilne added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Jul 13, 2022
@antonymilne
Copy link
Contributor

antonymilne commented Jul 20, 2022

Note this is old. See #1651 (comment) for the latest scheme.


This was discussed in technical design on 13 July and the plan is as follows.

kedro-datasets and kedro docs should remain together on RTD as they do now. The only difference is that the API docs for kedro-datsets will appear under kedro.datasets rather than kedro.extras.datasets. All the document-building takes place on the kedro repo.

To achieve this we need the following flow when docs are built on kedro:

  1. Copy the kedro-datasets files into a datasets folder in kedro (maybe clone just the relevant branch of the repo?)
  2. make build-docs on the kedro repo should then work as it does now to generate API docs for both kedro core and kedro datasets
  3. the kedro-datasets files should not remain on the kedro repo - they're just temporarily copied across for docs generation

The tricky part here is ensuring that the different latest, stable, etc. versions show the right thing on RTD.
image

Ideally this is what we should have:

  • latest build corresponds to kedro branch main and kedro-datasets branch main
  • stable build corresponds to kedro latest release and kedro-datasets latest release
  • version tags like 0.18.1 just point to a fixed stable build. e.g. the version 0.18.1 would point to kedro version 0.18.1 and whatever the latest kedro-datasets release was when 0.18.1 was released. Version tags always give a kedro version; there are no tags for kedro-datasets version.

Note this is still a bit confusing because we could release kedro-datasets 2.0 while kedro points to kedro-datasets~=1.0, and the docs will then show the latest kedro-datasets 2.0 documentation even though that's not what pip install kedro[...] would be installing. I don't see any way around this though which doesn't cause other points of confusion.

To achieve this:

  • when kedro has a commit to main, it re-builds the kedro docs using kedro branch main and kedro-datasets branch main. This updates RTD latest build
  • when kedro has a release, it re-builds the kedro docs using kedro branch main (which is the same as the latest release) and kedro-datasets latest release (not main). This updates RTD stable and ${KEDRO_VERSION} build
  • when kedro-datasets has a commit to main, it re-builds the kedro docs using kedro branch main and kedro-datasets branch main. This updates RTD latest build
  • when kedro-datasets has a release, it re-builds the kedro docs using kedro branch latest release (not main) and kedro-datasets branch main (which is the same as the latest release). This updates RTD stable and ${KEDRO_VERSION} build

For the bottom two points where kedro-datasets needs to trigger a kedro CI job, we have something similar setup to trigger test jobs on kedro-viz whenever there's a new commit to main in kedro, so check how that works to see how to do this.

It would be nice to have a note of which version of docs for kedro-datasets is being used in the kedro docs. Probably the easiest way to do this is through an appropriate docstring in kedro.datasets. This gets rendered in the API docs as a description:
image

Possible catch/source of confusion: are we going to have both kedro.extras.datasets and kedro-datasets documentation in the kedro docs at the same time? Or should we just not render the kedro.extras.datasets docs as soon as we include the kedro-datasets ones? Note we will probably want to add some redirects to fix broken links anyway.

@antonymilne antonymilne removed the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Jul 20, 2022
@deepyaman
Copy link
Member

The tricky part here is ensuring that the different latest, stable, etc. versions show the right thing on RTD. image

Ideally this is what we should have:

  • latest build corresponds to kedro branch main and kedro-datasets branch main
  • stable build corresponds to kedro latest release and kedro-datasets latest release
  • version tags like 0.18.1 just point to a fixed stable build. e.g. the version 0.18.1 would point to kedro version 0.18.1 and whatever the latest kedro-datasets release was when 0.18.1 was released. Version tags always give a kedro version; there are no tags for kedro-datasets version.

Note this is still a bit confusing because we could release kedro-datasets 2.0 while kedro points to kedro-datasets~=1.0, and the docs will then show the latest kedro-datasets 2.0 documentation even though that's not what pip install kedro[...] would be installing. I don't see any way around this though which doesn't cause other points of confusion.

How about:

  • latest build corresponds to kedro branch main and latest kedro-datasets compatible with latest Kedro--branch main of kedro-datasets, as long as the version on main is not bumped such that it's no longer compatible with the bounds of latest Kedro, in which case it would fall back
  • stable build corresponds to kedro latest release and kedro-datasets latest release compatible with kedro latest release

More complicated, but more accurate, if you assume the entrypoint for users is through Kedro, not Kedro-Datasets.

@antonymilne
Copy link
Contributor

antonymilne commented Jul 21, 2022

Yeah, agree that is the even more accurate 👍 I just thought it might be more complicated than it was worth (given my "ideal" solution above is already kind of complicated). e.g. I think you would have to do something like this for latest build:

  • parse extras_require in kedro branch main to get the version specifier of kedro-datasets (or we could just write this somewhere less buried I guess?)
  • get kedro-datasets branch main current version, check if it's compatible with that version specifier (needs semver library or something like that to do the comparison?)
  • if it is then use that branch; otherwise work out what is the latest version of kedro-datasets that satisfies the version specifier (needs a call to PyPI?)
  • use the branch of kedro-datasets that corresponds to that

... which all sounds possible. I just don't know if it's worth the extra effort (or maybe there's a much easier way of doing it that I missed here).

Ah, also I just remembered what I thought one of the points of confusion here would be: the documentation for main branch of kedro-datasets doesn't necessarily get rendered anywhere. e.g. we're on kedro-datasets 2.0 and kedro still has kedro-datasets~=1.0. We update the docstrings of kedro-datasets, people can use the library but they can't see the documentation anywhere without looking at the source code. This is ok because maybe we just say people who update to the latest kedro-datasets are advanced users who don't need the built documentation, but it's a bit annoying. I don't know whether it's overall more or less annoying than my "ideal" solution.

@antonymilne
Copy link
Contributor

Important note. Using a namespace package should make stage 1 (copy the kedro-datasets files into a datasets folder in kedro) much easier: you just pip install -t into the right directory, and by the wonders of namespace packages it should just work. Probably it won't work in Sphinx without the __init__.py files though...

@antonymilne
Copy link
Contributor

Updated scheme following #1758 (comment).

kedro-datasets and kedro docs should remain together on RTD as they do now. The only difference is that the API docs for kedro-datsets will appear under kedro.datasets rather than kedro.extras.datasets. All the document-building takes place on the kedro repo.

To achieve this we need the following flow when docs are built on kedro:

  1. Copy the kedro-datasets files into a datasets folder in kedro (maybe clone just the relevant branch of the repo? Or pip install --no-deps -t into the right directory is maybe easier?)
  2. make build-docs on the kedro repo should then work as it does now to generate API docs for both kedro core and kedro datasets
  3. the kedro-datasets files should not remain on the kedro repo - they're just temporarily copied across for docs generation

The tricky part here is ensuring that the different latest, stable, etc. versions show the right thing on RTD.
image

Ideally this is what we should have:

  • latest build corresponds to kedro branch main and kedro-datasets branch main
  • stable build corresponds to kedro latest release and kedro-datasets latest release
  • version tags like 0.18.1 just point to a fixed stable build. e.g. the version 0.18.1 would point to kedro version 0.18.1 and whatever the latest kedro-datasets release was when 0.18.1 was released. Version tags always give a kedro version; there are no tags for kedro-datasets version.

To achieve this:

  • when kedro has a commit to main, it re-builds the kedro docs using kedro branch main and kedro-datasets branch main. This updates RTD latest build
  • when kedro has a release, it re-builds the kedro docs using kedro branch main (which is the same as the latest release) and kedro-datasets latest release (not main). This updates RTD stable and ${KEDRO_VERSION} build
  • when kedro-datasets has a commit to main, it re-builds the kedro docs using kedro branch main and kedro-datasets branch main. This updates RTD latest build
  • when kedro-datasets has a release, it re-builds the kedro docs using kedro branch latest release (not main) and kedro-datasets branch main (which is the same as the latest release). This updates RTD stable and ${KEDRO_VERSION} build

For the bottom two points where kedro-datasets needs to trigger a kedro CI job, we have something similar setup to trigger test jobs on kedro-viz whenever there's a new commit to main in kedro, so check how that works to see how to do this.

It would be nice to have a note of which version of docs for kedro-datasets is being used in the kedro docs. Probably the easiest way to do this is through an appropriate docstring in kedro.datasets. This gets rendered in the API docs as a description:
image

Possible catch/source of confusion: are we going to have both kedro.extras.datasets and kedro-datasets documentation in the kedro docs at the same time? Or should we just not render the kedro.extras.datasets docs as soon as we include the kedro-datasets ones? Note we will probably want to add some redirects to fix broken links anyway.

@merelcht merelcht added Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation and removed Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation labels Sep 12, 2022
@merelcht
Copy link
Member

The above solution was discussed in Tech Design on 14/9 and everyone agreed that it sounds like a good plan. We should:

  • Ensure that as much of the setup as possible is automated
  • Do an experiment to see if we can "have a note of which version of docs for kedro-datasets is being used in the kedro docs." Above this is described as a nice to have, but in the discussion we agreed this is quite important and we should make an effort to make it work.

@merelcht merelcht moved this to To Do in Kedro Framework Oct 24, 2022
@ankatiyar
Copy link
Contributor

Related to #1497 - Adding this here so we don't forget to update it once we decide on how to make docs work for kedro-datasets.

There are two special considerations when contributing a dataset:
   1. Add the dataset to `kedro.extras.datasets.rst` so it shows up in the API documentation.
   2. Add the dataset to `static/jsonschema/kedro-catalog-X.json` for IDE validation.

@SajidAlamQB
Copy link
Contributor

SajidAlamQB commented Nov 7, 2022

Notes on how to copy kedro-datasets to kedro:

To copy kedro-datasets main use this command:

pip install --no-deps -t TARGET_DIRECTORY git+https://github.com/kedro-org/kedro-plugins.git#subdirectory=kedro-datasets

To copy kedro-datasets latest release use this command:

pip install --no-deps -t TARGET_DIRECTORY kedro-datasets

Image

@noklam
Copy link
Contributor Author

noklam commented Nov 7, 2022

@SajidAlamQB For the kedro-datasets release - I imagine it would be the same as Workflow 2?

@SajidAlamQB
Copy link
Contributor

@noklam I thought so too but I was also thinking we would use the kedro's release workflow for workflow 2 which wouldn't make sense to do when kedro-datasets has a new release.

@SajidAlamQB
Copy link
Contributor

SajidAlamQB commented Nov 21, 2022

The current problem we are facing is that ReadTheDocs checks out Kedro main branch and build the docs directly. For this issue, we need to copy kedro-datasets into kedro.datasets I managed to do this on our CI, but that has no effect on RTD. CI has 2 ways of publishing to RTD, sync and publish_kedro jobs, and both just trigger the checkout by RTD from Kedro branch main.
So how are we meant to copy over kedro-datasets so that ReadTheDocs can use it?

3 possible solutions:

Solution 1: Commit way: We can trigger a commit with kedro-datasets through the CI whenever kedro/kedro-datasets has a release or update and remove it after with a following commit. We could do this on a separate branch or on main. If we decide to have it on a separate branch we would need to update the ReadTheDocs admin panel to use that branch instead.

Solution 2: .readthedocs.yml hooks: The YAML file for RTD provides hooks to inject some code at certain sequences during the RTD builds. But since it's only YAML no way to add conditional statements for complex logic to differentiate when it's a kedro/kedro-datasets release or commit.

Solution 3: sphinx conf.py: we can add some logic in the conf.py file to do this; the only issue again is how we would know if it's a kedro/kedro-datasets commit or release.

@merelcht merelcht added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Nov 21, 2022
@SajidAlamQB
Copy link
Contributor

SajidAlamQB commented Nov 23, 2022

The above 3 solutions were discussed in the technical design session, and we have decided to:

  • Implement solution 2, use .readthedocs.yml hooks to copy over ONLY the latest release of kedro-datasets to kedro.datasets just before the document generation step on RTD. This means that our latest and stable tags will be the same for the kedro-datasets API docs. We agreed on this being a temporary solution as we will be moving away from RTD and Sphinx. We'll mitigate the issue of the kedro-datasets docs falling behind by having frequent releases.

  • We will copy just latest release of kedro-datasets on CI, and test that the docs still pass.

@NeroOkwa NeroOkwa moved this from In Progress to In Review in Kedro Framework Nov 24, 2022
@NeroOkwa NeroOkwa moved this from In Review to Done in Kedro Framework Nov 30, 2022
@merelcht merelcht linked a pull request Nov 30, 2022 that will close this issue
5 tasks
@merelcht
Copy link
Member

Completed in #2006

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

6 participants