[bikeshed] Discussion around cognitive load, aesthetics, wasting time #2819

Conversation

@ghost ghost commented Nov 20, 2019

⚠️ DO NOT MERGE! ⚠️


Backstory:

Over the weekend, while trying to finish the sprint, I found myself procrastinating -- well, not sure if procrastinating or just unable to focus -- but this time I decided to explore what was going on down there in the rabbit hole.

It all started with #2683. After some discussions with @efiop about the implementation, we agreed to support adding empty directories directly (e.g. mkdir empty && dvc add empty) but not to bother with empty directories inside another directory (e.g. mkdir -p data/empty && echo foo > data/foo && dvc add data -- here data/empty would be ignored on dvc checkout).

Anyways, understanding how dvc add empty/ worked internally was hard, mainly:

  • Following the trace requires the reader to navigate through several classes (i.e. Stage, State, RemoteBASE, RemoteLOCAL, OutputBASE) in a non-linear fashion.

Then, when I tried to pinpoint why dvc add empty/ was behaving differently from dvc add s3://bucket/empty/ (the same operation, handled by a different remote), it was hard to tell right away. I plugged in pudb, went through both executions, and it turned out that get_checksum didn't return the checksum with the .dir suffix for S3:

    def changed_cache_file(self, checksum):
        """Compare the given checksum with the (corresponding) actual one.

        - Use `State` as a cache for computed checksums
            + The entries are invalidated by taking into account the following:
                * mtime
                * inode
                * size
                * checksum

        - Remove the file from cache if it doesn't match the actual checksum
        """
        cache_info = self.checksum_to_path_info(checksum)
        actual = self.get_checksum(cache_info)

I spent a significant amount of time finding out that it was a bug in my own implementation.

A day passed; I merged master into my branch, tried a different implementation, and ran the tests to make sure it didn't break the code base.

A wild #2806 appears!

Tests weren't passing because disabling analytics affected the tests.

Instead of doing what a sane human being would do -- enabling analytics and continuing to work on the previous issue -- I tried to fix the issue myself. It shouldn't be that hard, I thought. Spoiler alert: it was.


Problem:

I'm having a hard time trying to understand the execution of the code. Sometimes, reading through it is not enough for me and I need to use an interactive debugger.

Not sure if I'm the only one having trouble with this and it's just a matter of Git Gud in general. Maybe my approach to reading/developing is not The Correct One ® (should I use a Mac or change my text editor?).

At this point, Arnold Schwarzenegger would yell out loud: "Stop whining".

So I tried to refactor Analytics to see if I could do better.


Refactoring:

First, I needed to understand what the analytics.py module does.

The first thing I found in the module was a class with several PARAM_* variables. This is a quite common pattern around the codebase, used to keep "consistency" while accessing elements inside a container (e.g. indexing a dictionary -- d.get(Stage.PARAM_ALWAYS_CHANGED, False)).
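
For illustration, a minimal sketch of the pattern (Stage.PARAM_ALWAYS_CHANGED is real; the surrounding code is a made-up example):

    # The PARAM_* pattern: class-level string constants used as dict keys,
    # so a typo becomes an AttributeError instead of a silent miss.
    class Stage:
        PARAM_ALWAYS_CHANGED = "always_changed"

    d = {"always_changed": True}
    d.get(Stage.PARAM_ALWAYS_CHANGED, False)  # vs. d.get("always_changed", False)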

Along with this pattern comes writing those containers (dictionaries) to a file and reading them back (e.g. dvcfiles, Config, Analytics), each with its own idiosyncrasies.

This is by no means a new pattern. PyYAML, JSON, and TOML support the same interface: dump & load.

We have defined the dump and load methods in such classes (Stage, State, Config, Analytics).

We also need validation upon object creation, and often conversion; that's why we added schemas (@Suor recently took the effort to switch to voluptuous in #2796).

In summary, you could view this workflow as:

unstructured -> open(file) -> json.load -> dict -> validation / conversion -> class -> structured

Sometimes you have to do several conversions to fully structure the data, as with Stage, which also has to structure its Dependencies and Outputs. Another example would be Remotes needing to gather information from Config.
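
As a minimal sketch of that workflow (User and USER_SCHEMA are hypothetical names; calling a voluptuous Schema validates the data and returns it):

    import json

    from voluptuous import Schema

    USER_SCHEMA = Schema({"user_id": str})


    class User:
        def __init__(self, user_id):
            self.user_id = user_id

        @classmethod
        def load(cls, path):
            # unstructured -> open(file) -> json.load -> dict
            with open(path) as fobj:
                data = json.load(fobj)
            # dict -> validation / conversion -> class -> structured
            return cls(**USER_SCHEMA(data))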

In the case of analytics, there's one file where the user_id is stored as JSON, although there's no UserID object with load/dump.

There's a load/dump for Analytics; however, load also unlinks the given path, and dump writes into a temporary file instead of accepting an argument (same interface, different side effects).

The whole idea of the class is to collect some reports by calling specific modules and classes and then send them home through HTTP.
I wanted to know the report structure, but I didn't see anything related to reports except for comments referencing them (in the code it was stored under the info variable).

PARAM_* variables gave me a clue about the keys, but not the structure (i.e. how those keys were nested), so I kept reading.

Inside the class initialization, there was some code creating a global directory:

cdir = Config.get_global_config_dir()
try:
    os.makedirs(cdir)
except OSError as exc:
    if exc.errno != errno.EEXIST:
        raise

It wasn't clear right away why this was needed.
Why was Analytics creating a directory that came from Config? Why is it in charge of handling it?
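
(As an aside, on Python 3 the whole try/except dance can be collapsed into one call; a minimal sketch, assuming nothing else depends on the errno check:)

    import os

    # Don't fail if the directory already exists.
    os.makedirs(cdir, exist_ok=True)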

Anyway, info was populated by the collect and collect_cmd methods.

It looked like this:

report = {
    "dvc_version": None,
    "user_id": None,
    "scm_class": None,
    "is_binary": None,
    "cmd_class": None,
    "cmd_return_code": None,
    "system_info": {
        "os": None,
        "linux_distro": None,
        "linux_distro_like": None,
        "linux_distro_version": None,
        "mac_version": None,
        "windows_version_build": None,
        "windows_version_major": None,
        "windows_version_minor": None,
        "windows_version_service_pack": None,
    },
}

I wondered whether there's a better way to deal with all this collected data.


Pull request:

I looked at the attrs library and it was promising; this PR uses it to create the data structure used for the Report.

  • Slim declaration and initialization.
  • Typed structures (with Python 3 type hints ❤️)
  • Decorate them with a json_serializer to formalize the from_file, to_file (also known as load & dump) methods using cattrs (see the sketch after this list).
  • Dot notation over PARAM_<key>
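
A minimal sketch of the idea (simplified; the from_file/to_file methods here are stand-ins for the PR's json_serializer decorator, and only two SystemInfo fields are shown):

    import json
    from typing import Optional

    import attr
    import cattr


    @attr.s(auto_attribs=True)
    class SystemInfo:
        os: Optional[str] = None
        linux_distro: Optional[str] = None


    @attr.s(auto_attribs=True)
    class Report:
        dvc_version: Optional[str] = None
        user_id: Optional[str] = None
        system_info: SystemInfo = attr.Factory(SystemInfo)

        def to_file(self, path):
            # structured -> dict -> JSON on disk (a.k.a. dump)
            with open(path, "w") as fobj:
                json.dump(cattr.unstructure(self), fobj)

        @classmethod
        def from_file(cls, path):
            # JSON on disk -> dict -> structured (a.k.a. load),
            # converting nested dicts back into classes on the way in
            with open(path) as fobj:
                return cattr.structure(json.load(fobj), cls)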

It also distributes the tasks among several classes. Before the change, everything lived in a single Analytics class:

Analytics:
  _collect_darwin(self)
  _collect_linux(self)
  _collect_system_info(self)
  _collect_windows(self)
  _get_user_id(self)
  _read_user_id(self)
  _write_user_id(self)
  collect(self)
  collect_cmd(self, args, ret)
  dump(self)
  is_enabled(cmd=None)
  load(path)
  send(self)
  send_cmd(cmd, args, ret)

After the refactoring, the responsibilities are split:

Analytics:
   dump(self)
   is_enabled()
   load(path)
   send(self)
   send_cmd(cmd, args, ret)

Report:
   collect(self)
   collect_cmd(self, args, ret)

SystemInfo:
   collect(self)
   darwin(self)
   linux(self)
   windows(self)

UserID:
   generate(cls)
   load(cls)

Also, the PR uses pathinfo.Path when dealing with paths, where possible.


Beyond PR:

We could do better than this.

I could argue that there's no need for a daemon.

Here's a proposal: #2826.

This implementation allows for easy debugging, since you can run the process in the foreground.


Aesthetics:

Even though I work on the code base, I don't feel entitled to make changes that don't affect the logic, which makes aesthetic changes feel inappropriate (compared to features / bug fixes).

This is in line with having cognitive empathy, since it doesn't frame others' abstractions as good or bad.

@efiop wrote the analytics module about a year ago, and he was dealing with a bunch of stuff at the same time (support, bug fixes, reviews, coordinating the efforts, discussions, etc.). Honestly, huge respect! (holds F)

So here are several controversial & opinionated comments that would reduce the code complexity for me.


Comments:

  • Be clear and explicit with the side effects.
  • Consistent wording between documentation/comments and code (when possible)
  • Use the standard library and common third-party libraries (when possible)
  • Choose modules over classes (when possible)
  • Use path-like objects instead of plain strings (a sketch follows this list)
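
A minimal sketch of that last point using the standard library's pathlib (the specific paths here are hypothetical, not DVC's actual layout):

    from pathlib import Path

    config_dir = Path.home() / ".config" / "dvc"  # composes with "/", no string concatenation
    user_id_file = config_dir / "user_id"
    if user_id_file.exists():
        user_id = user_id_file.read_text()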

Conclusion:

The intention is not to drag you into this plethora of nonsense; if you feel like these topics aren't worth discussing, I'm totally fine with putting more time into the grind (practice is the only way to git gud 🙂).

pared commented Nov 22, 2019

@MrOutis
I agree that analytics is not easy to work with. Definitely, converting it into a module would make it easier to debug and would increase the bus factor for this particular topic.

Suor commented Nov 22, 2019

All those string constants really make the code harder to read, I agree on that. Imagine how glanceable the schemas might be without those constants. We can't just drop them, because they protect us from typos in dict keys (though I wonder whether this is a real problem, since hordes of Python, JS, Ruby, etc. code are just fine using plain strings).

However, attrs is not the way to go; it adds one more intermediate representation for the same data. You also lose all the dict properties like key access, iteration, serialization, etc. And you still have the schema description, which would be duplicated by those attrs classes.

So maybe:

  • drop string constants
  • subclass dicts to disallow wrong keys, or even check against the schema (see the sketch below)
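
A minimal sketch of the second idea (StrictDict/ReportDict are hypothetical names, and the version string is illustrative):

    class StrictDict(dict):
        """dict that rejects keys outside a fixed set.

        Caveat: dict.__init__ and dict.update bypass __setitem__,
        so a real implementation would override those as well.
        """

        ALLOWED = frozenset()

        def __setitem__(self, key, value):
            if key not in self.ALLOWED:
                raise KeyError("unexpected key: {}".format(key))
            super().__setitem__(key, value)


    class ReportDict(StrictDict):
        ALLOWED = frozenset({"dvc_version", "user_id", "system_info"})


    report = ReportDict()
    report["dvc_version"] = "0.68.0"  # ok
    report["dvc_verison"] = "0.68.0"  # KeyError -- typo caught at assignment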

ghost commented Nov 26, 2019

Closing the discussion.

@ghost ghost closed this Nov 26, 2019
@ghost ghost deleted the aesthetics-stuff branch November 26, 2019 03:21
This pull request was closed.