[bikeshed] Discussion around cognitive load, aesthetics, wasting time #2819
Backstory:
During the weekend, while trying to finish the sprint, I found myself procrastinating -- well, not sure if it was procrastinating or just an inability to focus -- but this time I decided to explore what was going on down there in the rabbit hole.
It all started with #2683. After some discussions with @efiop about the implementation, we agreed to support adding empty directories directly (e.g. `mkdir empty && dvc add empty`) but not to bother with empty directories inside another directory (e.g. `mkdir -p data/empty && echo foo > data/foo && dvc add data` -- this will ignore `data/empty` when running `dvc checkout`).

Anyways, understanding how `dvc add empty/` worked internally was hard, mainly because the logic jumps between several classes (`Stage`, `State`, `RemoteBASE`, `RemoteLOCAL`, `OutputBASE`) in a non-linear fashion.

Then, when I tried to pinpoint why `dvc add empty/` was behaving differently than `dvc add s3://bucket/empty/` (same operation but handled by different remotes), it was hard to tell right away. I plugged in `pudb`, stepped through both executions, and it turned out `get_checksum` didn't return the checksum with the `.dir` suffix for S3. I spent a significant amount of time to find that it was a bug in my own implementation.
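For context, DVC distinguishes checksums of directories by a `.dir` suffix; a toy illustration of the convention (not the actual DVC code) shows why a remote that drops the suffix makes directories look like files:

```python
def dir_checksum(md5, is_dir):
    # toy stand-in: checksums of directories get a ".dir" suffix appended,
    # so two remotes must agree on this or the same path behaves differently
    return md5 + ".dir" if is_dir else md5
```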
A day passed; I merged master into my branch, tried a different implementation, and wanted to make sure it didn't affect the rest of the code base, so I ran the tests.

A wild #2806 appears!

Tests weren't passing because disabling analytics affected the tests.

Instead of doing what a sane human being would do -- enabling analytics to continue working on the previous issue -- I tried to fix the issue myself. It shouldn't be that hard, I thought. Spoiler alert: it was.

Problem:
I'm having a hard time trying to understand the execution of the code. Sometimes, reading through it is not enough for me and I need to use an interactive debugger.

Not sure if I'm the only one having trouble with this and it's just a matter of Git Gud in general. Maybe my approach to reading/developing is not The Correct One ® (should I use a Mac or change my text editor?).

At this point, Arnold Schwarzenegger would yell out loud: "Stop whining".
So I tried to refactor `Analytics` and see if I could do better.

Refactoring:
First, I needed to understand what the `analytics.py` module does.

The first thing I found in the module was a class with several `PARAM_*` variables. This is a quite common pattern around the code base, used to keep "consistency" while accessing elements inside a container (i.e. slicing a dictionary -- `d.get(Stage.PARAM_ALWAYS_CHANGED, False)`).

Along with this pattern comes writing those containers (dictionaries) into a file and reading them back (i.e. dvcfiles, `Config`, `Analytics`), each with their own idiosyncrasies.

This is by no means a new pattern. `PyYAML`, `JSON`, and `TOML` support the same interface: `dump` & `load`.
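A condensed sketch of that pattern (the class and field names here are made up, not actual DVC code):

```python
import json

class Thing:
    # keys referenced through constants, mirroring the PARAM_* convention
    PARAM_NAME = "name"
    PARAM_COUNT = "count"

    def __init__(self, name, count=0):
        self.name = name
        self.count = count

    def dump(self, path):
        # serialize the object as a dictionary keyed by the PARAM_* constants
        with open(path, "w") as fobj:
            json.dump({self.PARAM_NAME: self.name,
                       self.PARAM_COUNT: self.count}, fobj)

    @classmethod
    def load(cls, path):
        # read the dictionary back and rebuild the object from it
        with open(path) as fobj:
            d = json.load(fobj)
        return cls(d[cls.PARAM_NAME], d.get(cls.PARAM_COUNT, 0))
```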
We have defined the `dump` and `load` methods in such classes (`Stage`, `State`, `Config`, `Analytics`).

We also need validation upon object creation, and often conversion; that's why we added `schema` (@Suor recently took the effort to switch to `voluptuous` in #2796).

As a summary, you could view this workflow as:
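A rough sketch of that workflow (load, validate, use, dump); the validation step is hand-rolled here in place of a real `voluptuous` schema, and all names are illustrative:

```python
import json

# stand-in for a voluptuous Schema({...}); a real schema also coerces values
SCHEMA = {"user_id": str, "cmd": str}

def validate(data):
    # voluptuous would do this declaratively and raise Invalid on mismatch
    for key, typ in SCHEMA.items():
        if not isinstance(data.get(key), typ):
            raise ValueError("bad or missing field: %s" % key)
    return data

def load(path):
    # read -> validate -> hand back structured, trusted data
    with open(path) as fobj:
        return validate(json.load(fobj))

def dump(data, path):
    # validate -> serialize -> write
    with open(path, "w") as fobj:
        json.dump(validate(data), fobj)
```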
Sometimes you have to do several conversions to fully structure the data, as with `Stage`, where the `Dependencies` and `Outputs` have to be structured as well. Another example would be `Remote`s needing to gather information from `Config`.

In the case of analytics, there's one file where the `user_id` is stored as JSON, although there's no `UserID` object with `load`/`dump`.

There's a `load`/`dump` pair for `Analytics`; however, `load` also unlinks the given path, and `dump` writes into a temporary file instead of accepting an argument. (Same interface, different side effects.)

The whole idea of the class is to `collect` some reports by calling specific modules and classes and then `send` them home through HTTP.

I wanted to know the report structure, but I didn't see anything related to reports except comments referencing them (in the code it was stored under the `info` variable). The `PARAM_*` variables gave me a clue about the keys, but not the structure (i.e. how those keys were nested), so I kept reading.

Under the class initialization, there was some code creating a global directory. It wasn't clear right away why it was needed. Why was `Analytics` creating a directory that came from `Config`? Why is it in charge of handling it?

Anyways, the info was modified during the `collect` and `collect_cmd` methods. It looked like this:

I wondered whether there's a better way to deal with all this collected data.
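A minimal sketch of that collect-then-send idea (the field names and return value are hypothetical, not the actual report schema):

```python
import json
import platform

def collect():
    # gather a small report by asking other modules for details;
    # these keys are made up for illustration
    return {
        "os": platform.system(),
        "cmd_class": "CmdAdd",
        "cmd_return_code": 0,
    }

def send(report):
    # a real implementation would POST this body over HTTP (e.g. with requests);
    # it is returned here instead so the sketch stays side-effect free
    return json.dumps(report)
```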
Pull request:

I looked at the `attrs` library and it was promising; this PR uses it to create the data structure used for the `Report`.

It adds a `json_serializer` to formalize the `from_file` / `to_file` (also known as `load` & `dump`) methods using `cattrs`, so the `PARAM_<key>` variables are no longer needed.

It also distributes the tasks among several classes.

Also, it uses `pathinfo.Path` when possible when dealing with paths.

Beyond PR:
We could do better than this.

I could argue that there's no need for a `daemon`.

Here's a proposal: #2826

That implementation allows for easy debugging, since you can run the process in the foreground.
Aesthetics:
Even though I work on the code base, I don't feel entitled to make changes that don't affect the logic, which makes aesthetic changes feel inappropriate (compared to features / bug fixes).

This is in line with having cognitive empathy, since it doesn't frame others' abstractions as good or bad.
@efiop wrote the analytics module about a year ago, and he was dealing with a bunch of stuff at the same time (support, bug fixes, reviews, coordinating the efforts, discussions, etc.). Honestly, huge respect! (holds F)
So here are several controversial & opinionated comments that would reduce the code complexity for me.
Comments:
Conclusion:
The intention is not to drag you into this plethora of nonsense; if you feel like it's not worth discussing these topics, I'm totally fine with putting more time into the grind (practice is the only way to git gud 🙂).