Better support SCM functionality #772

Closed
@mattseddon

From #318, the basic use cases are as follows:

Proposal

Notion.

SCM view

  1. View all directories and files that have changed when compared to HEAD or cache (Visibility / Situational awareness)
  2. Checkout / commit any files or directories that have changed (Actions) - exact UX TBD.
  3. Checkout / commit / push or pull the entire repository (Actions)

Statuses that we currently provide in the extension

| Status | SCM View | Decorations Provided ** | Sourced from | Notes |
| --- | --- | --- | --- | --- |
| added | Y | Y | diff + list | |
| deleted | Y | Y | diff + list | |
| modified | Y | Y | diff + list + status | |
| notInCache | Y | Y | diff + list | |
| renamed | Y | Y | diff + list | |
| stageModified | Y | Y | diff + list + status | For a detailed explanation of modified vs stageModified see #318 (comment) |
| untracked | Y | Y | git | this is untracked with respect to both git and dvc. We show these files because the user may want to dvc add them. |
| tracked | Y | Y | list | we decorate tracked because they are generally "git ignored" which will give them a "greyed out" decoration |

** Where possible we match the git extension's decorations because we are trying to make the extension feel as native as possible. Our SCM integration is designed to show the user the state of the workspace with respect to the most recent commit.

Current approach (parallel CLI Commands)

| name | command | reason |
| --- | --- | --- |
| list | `dvc list . --dvc-only -R --show-json` | provides a list of all tracked files that we use for both decoration and SCM purposes. In the SCM view all files that we show must be tracked by DVC. We do this because we end up with untracked but modified (duplicates) items in the tree from diff if we do not |
| diff | `dvc diff --show-json` | we map the output of diff directly to the list output to set all added, deleted, renamed, notInCache. We use it in combination with the status output to determine the difference between modified and "stage modified" |
| status | `dvc status --show-json` | only used to determine the difference between modified and "stage modified" |
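
To illustrate how the diff output is mapped onto the list output, here is a minimal sketch of the filtering step. The `DiffOutput` shape and the `filterDiffByTracked` name are assumptions made for illustration; the real JSON emitted by the CLI is richer and may be structured differently.

```typescript
// Assumed, simplified shapes: each status maps to a flat list of paths.
// The actual CLI output may differ.
type DiffStatus = "added" | "deleted" | "modified" | "renamed" | "notInCache";
type DiffOutput = Partial<Record<DiffStatus, string[]>>;

// Keep only the diff entries whose path also appears in the tracked list,
// so untracked paths reported by diff never reach the SCM view.
function filterDiffByTracked(diff: DiffOutput, tracked: string[]): DiffOutput {
  const trackedSet = new Set(tracked);
  const filtered: DiffOutput = {};
  for (const status of Object.keys(diff) as DiffStatus[]) {
    const paths = (diff[status] ?? []).filter((path) => trackedSet.has(path));
    if (paths.length > 0) {
      filtered[status] = paths;
    }
  }
  return filtered;
}
```

Using a `Set` keeps the lookup cheap even when diff reports a very large number of paths that are not tracked by DVC.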

We currently try to run all three of the above commands in parallel. If any of the commands fail then we will retry all three until they have all completed without error. We do this to best mitigate stale information ending up in the extension.
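
The grouped retry described above could look roughly like the following. This is a sketch, not the extension's actual code: `Runner`, `runAllWithRetry`, `maxAttempts` and `delayMs` are hypothetical names, and the attempt cap is an assumption added so the example terminates. In practice `run` would wrap a child process that invokes the `dvc` executable.

```typescript
// Hypothetical abstraction over "run dvc with these arguments and return stdout".
type Runner = (args: string[]) => Promise<string>;

// The three commands from the table above.
const COMMANDS: Record<string, string[]> = {
  list: ["list", ".", "--dvc-only", "-R", "--show-json"],
  diff: ["diff", "--show-json"],
  status: ["status", "--show-json"],
};

// Run all commands in parallel. If any one fails, discard every result and
// retry the whole group, so the three outputs always come from the same round.
async function runAllWithRetry(
  run: Runner,
  maxAttempts = 5, // assumption: the source does not state a retry limit
  delayMs = 500
): Promise<Record<string, unknown>> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const pairs = await Promise.all(
        Object.entries(COMMANDS).map(
          async ([name, args]) => [name, JSON.parse(await run(args))] as const
        )
      );
      const result: Record<string, unknown> = {};
      for (const [name, value] of pairs) {
        result[name] = value;
      }
      return result;
    } catch (error) {
      if (attempt === maxAttempts) {
        throw error;
      }
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error("maxAttempts must be at least 1");
}
```

Note that `Promise.all` rejects as soon as one command fails, but the other two keep running in the background; a real implementation would probably want to wait for or cancel them before retrying.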

Issues with the current approach

  1. It's still slow (General performance of trees #608)
  2. We are unsure of what the actual UI / UX should be (Decide on and implement UX for checkout / commit workflow in SCM view #609)
  3. A lot of data is sent between the CLI and extension that is unused (example: in get-started-experiments after first running an experiment the output of diff contains ~80k "added" files, none of these files are tracked by dvc so we filter all of the records out)
  4. We have issues running multiple commands in parallel (After reloading the window experiments part is stuck in the "loading" state #767 (comment)) <- this is particularly important because it means we cannot currently run the extension against get-started-experiments

Options for mitigation

| # | option | pros | cons |
| --- | --- | --- | --- |
| 1 | Run commands sequentially | locks should no longer be an issue | even slower |
| 2 | Only rerun failed commands | also mitigates lock issue | involves more complicated logic, possibility of stale data |
| 3 | Make all 3 commands lockless | allows us to continue to run all commands in parallel | involves work from the CLI team and is only an interim solution, complicates the internals of DVC |
| 4 | Combine commands into a single command that the integration can run | limits the amount of data needing to be transferred between the CLI and extension, should be faster, cuts out grouped retry logic | more effort required, unsure as to the benefit to general users |
| 5 | Replace CLI calls with event-driven architecture | eliminates the need to call the CLI, could serve multiple clients | requires even more work and is not a short or even medium term solution |
| 6 | Make commands "lightweight" (add --dvc-only) | would limit the amount of data being passed and could speed things up | unsure as to the benefit to general users, still requires effort, could still run into lock issues |
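
For comparison, option 2 (only rerun failed commands) might be sketched as below. All names here (`Runner`, `runRetryingFailed`, `maxAttempts`) are hypothetical, and the attempt cap is an assumption. The sketch also shows where the stale-data risk comes from: an output kept from an earlier round may no longer match the outputs collected in a later one.

```typescript
// Hypothetical abstraction over "run dvc with these arguments and return stdout".
type Runner = (args: string[]) => Promise<string>;

// Run every command once, then retry only the ones that failed.
// Successful outputs are kept as-is, even if a later round observes newer state.
async function runRetryingFailed(
  run: Runner,
  commands: Record<string, string[]>,
  maxAttempts = 5 // assumption: not specified in the source
): Promise<Record<string, string>> {
  const done: Record<string, string> = {};
  let pending = Object.entries(commands);

  for (let attempt = 1; attempt <= maxAttempts && pending.length > 0; attempt++) {
    const attempts = await Promise.all(
      pending.map(async ([name, args]) => {
        try {
          return { name, args, ok: true as const, output: await run(args) };
        } catch {
          return { name, args, ok: false as const };
        }
      })
    );
    pending = [];
    for (const result of attempts) {
      if (result.ok) {
        done[result.name] = result.output;
      } else {
        pending.push([result.name, result.args]);
      }
    }
  }

  if (pending.length > 0) {
    const names = pending.map(([name]) => name).join(", ");
    throw new Error(`commands still failing: ${names}`);
  }
  return done;
}
```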

My preference would be to start work on 4 as it would actually help us move towards 5.

Labels

A: trees (Area: SCM and DVC-tracked trees), discussion, enhancement (New feature or request), product (PR that affects product)
