Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review current implementation #9

Closed
gentzkow opened this issue Dec 5, 2018 · 25 comments
Closed

Review current implementation #9

gentzkow opened this issue Dec 5, 2018 · 25 comments
Assignees

Comments

@gentzkow
Copy link
Owner

gentzkow commented Dec 5, 2018

Review our current implementation of the template + gslab_make, and discuss next steps.

The goal is to figure out how to make the template as simple, intuitive, and robust / bug proof as possible.

@gentzkow
Copy link
Owner Author

gentzkow commented Dec 5, 2018

@z-y-huang @jcconway @danielagonzalez

Here are some small to dos I would propose. With these as with everything else on this thread you should holler if you disagree with any of these changes.

  • config

    • Rename /config/ -> /setup/ so we distinguish the files that define repository configuration (e.g., config_local.yaml) from tools you only use the first time you set up the repository
    • Move config.yaml to the root of the repo and rename to config_global.yaml
    • Rename config_r.r -> setup_r.r
    • Rename config_stata.do -> setup_stata.do
    • Rename configuration.py -> setup_check.py (and enforce that this script is only doing checks that the configuration is correct, not doing any installation etc.)
  • inputs.txt, externals.txt

    • Should allow pathnames to be placed in quotes (right now this creates an error)
    • Can we allow the delimiter to be general whitespace rather than a tab and then require quotes around pathnames if they include a space?
    • The links created by externals.txt should be in a directory called /external/
    • I would like us to not create the /input/ and /external/ directories in cases where these are not used. I think a nice way to do this would be to modify make.py so (i) the inputs to the clear_dir command are just 'output' and 'log'; (ii) Add a new command right after clear_dir to remove the external and input directories; (iii) the create_links command is responsible for creating the /input/ and /external/ directories if they don't exist (and it does so only if inputs.txt and externals.txt respectively define at least one link)
  • logs

    • The columns of link_map.log should be reversed
    • In the symlink colum in link_map.log we should not give the full path on the local machine but list everything relative to the module root (e.g., /input/data.txt)
  • raw

    • I think I've changed my mind and I'd rather have /raw/ be an optional subdirectory that can go inside any module rather than a directory at the top level of the repo. In the template, we can place a /raw/ directory inside /data/ (but omit it in the other modules). This is the place to put raw data as well as other input files that are not created by code (e.g., a .jpg image that we're going to use in slides).

@gentzkow
Copy link
Owner Author

gentzkow commented Dec 6, 2018

Here are some more major to dos to consider.

  1. I am wondering whether we might save a lot of headaches by storing the gslab_make library locally in each repository rather than having it installed on a user's machine using pip. We could have a top level repo directory called /lib/ which contains /gslab_make/. We would then change the import statement in make.py to import gslab_make from the local path.

    If we did this, we'd need to enforce a clear protocol for how this is updated. We'd want to keep a README.md in /lib/gslab_make/ giving the location it was downloaded from and the version. We'd need to enforce a rule against updating gslab_make; changes should be made to the repository version and then re-exported. We'd need to make sure the README.md is updated when someone downloads a new version.

  2. We could also think about doing something similar for the Stata and R dependencies. So we'd store these in /lib/stata/ and lib/r/ and then have a line at the top of do files that adds the local location to the do file path and its analogue in R.

  3. I would like to change the format of externals.txt. The config_user.yaml defines the name of the symlink and the path for all externals in the repository. Then externals.txt just lists the symlink name. We look up the path in config_user.yaml and make the link.

  4. We want to allow modules to be nested, so someone could have say /analysis/regressions/ and /analysis/figures/ as separate modules. Moving things from one level to another would cause a lot of problems right now, however, because it would mess up all the paths in inputs.txt. (E.g., I'd need to go in and change ../data to ../../data). We should design so things are robust to this. I think a good way would be to make the relative location of the repository root one of the paths we define at the top of make.py (e.g., in the current version this would just be equal to '../'). Then we can define a special character to use in inputs.txt to indicate the root, so this could be e.g.

# Symlink	Target
data.txt	$ROOT$/data/output/data.txt
  1. We should do a major push to make error handling as friendly as possible. E.g., right now if there is an error in inputs.txt (like there's a typo in one of the paths) I get a nasty error. This should be a friendly message telling me the target does not exist and suggesting that I check the spelling etc. We should brainstorm what we think will be the most common errors (calling a script that doesn't exist, not having started the logs properly, etc.) and trap them in a friendly way.

  2. We will want to add functionality to inputs.txt that checks that each file target, and every file in each directory target and its subdirectories, is tracked by Github and shows as unmodified in git status. If files are not tracked or modified we should produce a friendly warning (not an error). We may want to set a max number of files to check so we don't get stuck iterating through millions of files, and just produce a separate warning saying there were too many files to check

  3. Once all the substantive stuff is done, I'd like to create a fancier example to illustrate the functionality (so the data looks like real data, the plots look like nice plots, etc.). One idea would be to create the "potato chips and TV" example from our old Code and Data manual.

  4. Once we are done with that, we should do an intensive round of testing. We can ask all the lab members to install the template, try out various use cases, and generally try to break things. We should test on mac and windows and linux, and make sure we're testing on some "clean" machines so we can be sure there are no dependencies that we're using but not declaring.

@gentzkow
Copy link
Owner Author

gentzkow commented Dec 6, 2018

Note: in (4) above, I realize we can actually omit the $ROOT$ in inputs.txt. We want to require that inputs.txt can only create links to files in the same repository so we can define all paths relative to the root. So we'd just have:

# Symlink	Target
data.txt           data/output/data.txt

In parsing this we should make sure we're being friendly about whether you include leading or trailing /'s. I.e., all of the following should work fine

# Symlink	Target
data.txt           data/output/data.txt
data.txt           /data/output/data.txt
datadir           data/output
datadir           data/output/

@gentzkow
Copy link
Owner Author

gentzkow commented Dec 6, 2018

Note: (1) above might be a nice place to use git submodules. We'd just want to make sure doing so doesn't add too much complexity or create scope for extra bugs.

@z-y-huang
Copy link
Contributor

z-y-huang commented Dec 6, 2018

Thoughts about the changes:

Minor to dos

  • config
    • Rename /config/ -> /setup/ so we distinguish the files that define repository configuration (e.g., config_local.yaml) from tools you only use the first time you set up the repository
    • Move config.yaml to the root of the repo and rename to config_global.yaml
    • Rename config_r.r -> setup_r.r
    • Rename config_stata.do -> setup_stata.do
    • Rename configuration.py -> setup_check.py (and enforce that this script is only doing checks that the configuration is correct, not doing any installation etc.)

This is probably pedantic, but I would prefer using the prefix "install" instead of "setup" just because setup.py has a very specific functionality in Python.

  • inputs.txt, externals.txt
    • Can we allow the delimiter to be general whitespace rather than a tab and then require quotes around pathnames if they include a space?

It would be cleaner code-wise and less prone to errors if we stick to a non-space delimiter. We could use another delimiter (say '|') if we dislike tabs, which would still allow paths to contain quotations and spaces if desired.

Major to dos

(1): Agree that this would probably be more user-friendly. It looks like we can specify submodules to point to a specific commit, which removes the burden of documentation. Would require more git mastery, but that seems like a small cost.

(2): I prefer having users install software libraries on their machines just because these quickly add up and some libraries can be very large. Extreme example, but just Tensorflow alone is ~500 MB. This can also be burdensome for R, where the typical user is using at least a dozen packages per script.

@gentzkow
Copy link
Owner Author

gentzkow commented Dec 6, 2018

Great.

I'm fine with setup->install for the directory, but we don't want to name scripts e.g. install_stata since that sounds like we're installing the whole software package. We could also keep config for the directory but rename config_user and config_local to something else.

Let's keep tab delimiters then. It makes the files a lot easier to read than with '|'.

Let's do (1) but skip (2) then.

@z-y-huang Does this mean you're OK with all the items on my list? Any other to dos you want to add based on your review?

@jcconway, @danielagonzalez Let us know your reactions.

@z-y-huang
Copy link
Contributor

z-y-huang commented Dec 6, 2018

@gentzkow,

Yep! I'm still thinking about (6) more, but everything else makes sense.

To-dos I'd like to add (though these are lower priority)

  • Write README.md and add user guides where appropriate
  • Refactor tablefill and textfill (current code is messier than may be preferred)

In terms of timeline, I know our goal was to have something shippable by winter break and then continue working on long-term to-dos afterwards.

What are we currently imagining the version in a couple of week to look like? Do we want to focus on implementing as many of the to-dos as possible, or to have a completely stand alone product by then (so something stable with complete documentation).

@gentzkow
Copy link
Owner Author

gentzkow commented Dec 6, 2018

Great. I agree about your to dos.

I think we can be at the mark we want for winter break if we:

  1. Implement all or most of the small to dos
  2. Create a skeleton readme that mostly just contains an updated version of the installation instructions here

We could then release this as v0.1.0.

We may want to solicit feedback from a wider group of people in parallel with executing some of the longer-term to dos.

z-y-huang pushed a commit that referenced this issue Dec 6, 2018
@jcconway
Copy link

jcconway commented Dec 6, 2018

@gentzkow and @z-y-huang, I agree with the discussion on the minor to-dos above, with @z-y-huang's additional to-dos, and generally with the major to-dos. I'd also add to our to-do list that we take another pass through the unit tests, ideally before asking a wider group to test. Two thoughts on a couple of these major to-dos:

  1. I like the idea of storing the gslab_make library locally, and don't yet have strong views on the best method for doing so. The following quote from the page on git submodules gave me pause, but submodules might still be the way to go:

It’s quite likely that if you’re using submodules, you’re doing so because you really want to work on the code in the submodule at the same time as you’re working on the code in the main project (or across several submodules). Otherwise you would probably instead be using a simpler dependency management system (such as Maven or Rubygems).

  1. To clarify, will our /external/ directory always be flat rather than potentially grouping our symlinks into subfolders? I don't have a problem with this, just wanted to clarify as we remove the path from each symlink and specify a single path in config_user.yaml.

@gentzkow
Copy link
Owner Author

gentzkow commented Dec 6, 2018

Thanks @jcconway.

The submodule question is a good one. I think we should start with just hard-committing the gslab_make library, see how that goes, and then consider switching to the submodule approach if we see a benefit. I share the concern that that could add an unneeded layer of complexity / another failure point.

On /external/, I think the usual case will be that if there are many external files the target will be an entire directory so that will work fine. If we encounter a lot of cases where we wish we could create subdirectories we can consider adding that functionality.

@jcconway
Copy link

jcconway commented Dec 6, 2018

Great, I agree with both suggestions above. Flagging this gslab_make/submodule point for @z-y-huang as well.

@z-y-huang z-y-huang reopened this Dec 7, 2018
@z-y-huang
Copy link
Contributor

z-y-huang commented Dec 7, 2018

@gentzkow & @jcconway:

Played around with git submodule a bit. Basically the way it works is that git keeps track of three pieces of information for every submodule:

(1) Where to install the submodule
(2) The repository of the submodule (branches and tags can also be specified)
(3) The (submodule) commit that the submodule is pointing to

So given a clean clone of the parent repository:
git submodule update --init --remote would install the submodule to the latest commit of (2).
git submodule update --init would install the submodule to the commit in (3).

To update (3), you go the submodule directory, check out the (submodule) commit you would like the submodule to point to, then go back to the parent directory and stage the change.

Where git submodule gets much more complicated is if you want to make changes to the submodule code and push upstream. Submodules are checked out in detached mode by default--if you make changes to the submodule and push, but you're still in detached mode, your commit stays local. This means that no one else has access to your changes, and there's a possibility to lose the changes forever.

So if we're just using submodule so that users can easily import a version-specific gslab_make, I think it's relatively straightforward.

@jcconway
Copy link

jcconway commented Dec 7, 2018

@z-y-huang thanks for looking into git submodule. I take your point that this does not seem too complicated when one doesn't change the submodule, and one benefit of this over hard committing is that it would be somewhat easier to change the contained version of gslab_make. My takeaway from the quote above was that the main benefit of using git submodule as opposed to a simpler dependency system is in fact the ability to make changes and push upstream, which I don't think we care about or want. I don't yet have a very informed view of what would be the simplest such dependency system to implement, and it might very well be git submodule.

@gentzkow
Copy link
Owner Author

gentzkow commented Dec 7, 2018

Thanks both.

Other potential advantages of git submodule

  • It forces clear provenance of what repo + version the gslab_make code is coming from
  • It adds a hurdle to people updating the gslab_make code in place which is something we don't want them to do

I'll leave it to you guys to decide which we want to try first.

@z-y-huang
Copy link
Contributor

@jcconway:

Would we actually want users to be able to change something in their local version of gslab_make? That seems like it could introduce problems.

By simpler dependency system, I think your quote is probably referring to using some variation of pip.

@jcconway
Copy link

jcconway commented Dec 7, 2018

@z-y-huang, I agree we don't want people changing their local version of gslab_make. I don't have strong opinions here. It looks like you pushed a commit using git submodule already, and I'm fine with running with that and seeing how it works for people.

z-y-huang pushed a commit that referenced this issue Dec 7, 2018
@z-y-huang
Copy link
Contributor

@gentzkow:

Just to check, when you want external.txt to have its own /external/ directory, do you want it to have its own logging files as well?

@gentzkow
Copy link
Owner Author

gentzkow commented Dec 7, 2018

I thought about this and I lead toward not having separate logging files. So long as the logs are ordered so it's easy to see which are inputs.txt and which are externals.txt I don't see much cost to having them combined. The reality is we'll probably look at these only rarely (when we're trying to run down some unexpected change in results, for example) and the extra cost in those cases of having to sort through a longer log file doesn't seem significant.

z-y-huang pushed a commit that referenced this issue Dec 7, 2018
@danielagonzalez
Copy link

@gentzkow @z-y-huang @jcconway Sorry for the delay in responding, but these to-dos all sound good to me!

@gentzkow
Copy link
Owner Author

gentzkow commented Dec 8, 2018

Great.

@z-y-huang: Can you take charge of converting our to do list from this thread into new issue(s) for the things we want to do by Dec as well as a longer-term to do list on the outline? Then you can go ahead and close this tread.

Note that it's fine to bundle many smaller to dos into one or more issues.

@z-y-huang
Copy link
Contributor

@gentzkow, can do!

@z-y-huang
Copy link
Contributor

z-y-huang commented Dec 10, 2018

@gentzkow, @jcconway, @danielagonzalez:


General status update and summary, the following to-dos I already completed last week:

  • Config

    • Rename config.yaml to config_global.yaml and move to root of repo (with config_local.yaml)
    • Rename config and associated files to use setup prefix
  • Linking

    • Allow paths to contain quotes
    • Links created for externals.txt should be written to external
    • Only create the input and external directories when needed
  • Organization

    • Move raw to be in data
    • Store gslab_make locally

To-dos that we want to finish by winter break that I will make into new issues:

  • Link logging
    • Reverse columns in link_map.log
    • Symlinks in link_map.log should be printed relative to the module root
    • Targets in input.txt should be communicated relative to the repository root
    • Targets in external.txt should be communicated using keys for config_local.yaml

These include some of the to-dos that were previously in the long-term list, but that I think are feasible to finish by the deadline.


Remaining long-term to-dos have been moved to the project outline.

@danielagonzalez: I know you're currently working on making some additions to the unit tests. I was thinking you could continue working on that as well as make changes to the unit tests accordingly as we update gslab_make. Let me know if that works for you!

@jcconway: I think I can probably finish up the remaining to-dos tomorrow (I have my metrics final today unfortunately) and I'm not sure how efficient it is to split them given that they're all related to linking. Maybe you could start on documentation and something on the long-term to-do list? Open to thoughts on how we can best coordinate.

@jcconway
Copy link

@z-y-huang, thanks for the great work and good luck on the metrics final! I currently have some pressing work on other projects and with my own finals week, and so I don't think I'll get to anything on the long-term to-do list this week. I'm thinking that you might be able to write the documentation more efficiently given that you're more familiar with how you implemented (1), but I'll plan to check back in at the end of this week or early next week to see where things stand and where I can be most helpful.

@z-y-huang
Copy link
Contributor

@jcconway, sounds good! Good luck with finals!

@gentzkow
Copy link
Owner Author

@z-y-huang: Is this task ready to close?

z-y-huang added a commit that referenced this issue Dec 21, 2018
…emplate accordingly (#12)

* Begin updating config and analysis make.py for #8

* Update make.py scrips for #8

* Add functionalities to config for #8

* Clean up config for #8

* Change to requirements.txt for #8

* Add gslab_make submodule for #9

* Update gslab_make submodule to track issue8_update_template

* Test git submodule

* Test git submodule

* Test git submodule

* Change config to setup for #9

* Separate input and external folders for #9

* Implement linking changes for #10

* Clean up code

* Add to setup documentation

* Make local import more robust

* #10 Correct files inadvertenly committed directly instead of as LFS pointers

* #10 Bug fixes in setup_repository.py

* Bugfix setup_repository.py for #10

* Removed submodule for #10

* Commit gslab_make, update template for new external.txt formatting

* Bugfix gslab_make commit

* Fix formatting for input.txt and external.txt for #10

* Add *external/ to .gitignore and organize lib directory for #10

* #10 Update .gitignore

* #10 Fix LFS pointer error for gslab_make/tests/input/zip_test_file.zip

* #10 Fix LFS pointer errors for paper_slides PDFs

* #10 Add input directories to .gitignore

* Update path to look for gslab_make for #10

* Reorder error checking for bugfixing

* Escape special characters in path for run_stata for #10

* #10 Remove input directories -- will be ignored by git going forward

* #10 Correct format of external and input in .gitignore

* Fix typo

* Update documentation for #10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants