Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update linking format and functionalities #10

Closed
4 tasks done
z-y-huang opened this issue Dec 10, 2018 · 35 comments
Closed
4 tasks done

Update linking format and functionalities #10

z-y-huang opened this issue Dec 10, 2018 · 35 comments
Assignees

Comments

@z-y-huang
Copy link
Contributor

z-y-huang commented Dec 10, 2018

Continuation of to-dos discussed in #9.

  • Reverse columns in link_map.log
  • Symlinks in link_map.log should be printed relative to the module root
  • Targets in input.txt should be communicated relative to the repository root
  • Targets in external.txt should be communicated using keys for config_local.yaml
@z-y-huang
Copy link
Contributor Author

@gentzkow:

Progress update: I've implemented all of the to-dos above on this branch.

One note: I wrote the code such that targets in input.txt should be formatted as {root}/blah with {root} defaulting to the repository root, but something that the user can technically set. I decided to stick with the original concept because it seemed more transparent and allowed for simpler code.

I'm going to test the code some more just to make sure I haven't introduced any bugs. After that, I'll start a new issue to work on documentation and user guides.

@gentzkow
Copy link
Owner

@z-y-huang When I do a clean checkout of this repo and try to check out the issue 10 branch I get this error

$ git checkout issue10_update_linking 
Branch 'issue10_update_linking' set up to track remote branch 'issue10_update_linking' from 'origin'.
Switched to a new branch 'issue10_update_linking'
Encountered 5 file(s) that should have been pointers, but weren't:
	paper_slides/output/ondeck.pdf
	paper_slides/output/online_appendix.pdf
	paper_slides/output/paper.pdf
	paper_slides/output/slides.pdf
	paper_slides/output/text.pdf

My understanding is that this indicates files that should be tracked by git lfs were somehow committed directly (see e.g. here). Any idea what happened? This is perhaps a good point at which to make sure our git-lfs safeguards are working properly.

@gentzkow
Copy link
Owner

Note: The commit above re-commits those output files as pointers to correct the issue. So to reproduce it you'll want to roll back to the previous commit.

@gentzkow
Copy link
Owner

I don't seem to be able to install the Python library git listed in requirements.txt.

$ pip install --user git
Collecting git
  Could not find a version that satisfies the requirement git (from versions: )
No matching distribution found for git 

I can install GitPython just fine. On quick googling it wasn't obvious to me what library we're trying to install.

@z-y-huang
Copy link
Contributor Author

z-y-huang commented Dec 15, 2018

@gentzkow:

  1. I'll look into the git-lfs issue more. Not exactly sure at this point what happened. I'll also research more into git-lfs to see how we can potentially improve our safeguards.

  2. Yes, the package should be GitPython. The import command for the module is import git so I accidentally conflated the two for requirements.txt--I'll fix that.

gentzkow added a commit that referenced this issue Dec 15, 2018
@gentzkow
Copy link
Owner

Debugging setup_repository.py

  • Line 15: Looking for config.yaml in /setup/ but should be in repo root (i.e., ../config.yaml)
  • Line 40: We should not require that software w/ executable names listed in config_user.yaml be installed; the list in config.yaml should be what determines software requirements; I suggest just deleting lines 40-41
  • I get an error if I comment out lines 27-28 of config_user.yaml (so that external is empty). This should be valid and I think we should comment out these lines in the template. (Otherwise I get a warning that path_to_big_dataset is not valid.)
  • Let's remember that it's still on our to do list to make error handling in this setup script more friendly

I went ahead and pushed fixes to the first two issues above. Feel free to change these if you think I did it wrong.

@gentzkow
Copy link
Owner

On my clean checkout of the branch (following instructions in /setup/README.md to check out the submodule) git status gives

modified: lib/gslab_make (new commits)

Does this mean that we've updated gslab_make to a newer version than was committed before?
I'm thinking it will be simpler to skip the submodule approach at least for now and just commit gslab_make directly... one fewer setup step and one fewer failure point.

@gentzkow
Copy link
Owner

When I try to add a link to an external directory to the external object in config_user.yaml I get an error from setup_repository.py

File listed in 'config_user.yaml' but cannot be found: ~/Dropbox/temp/mydata

It works fine if I add a file rather than a directory, like ~/Dropbox/temp/mydata/temp.txt

@gentzkow
Copy link
Owner

I'm getting an error when I try to add an external file to /data/externals.txt. I updated config_user.yaml to have

external:
    mydata: "~/Dropbox/temp/mydata/mydata.txt"

And then I update data/externals.txt to have

# Symlink	Target
mydata  mydata

I also tried just

# Symlink	Target
mydata

Both of these produce the error

    raise CritError(messages.crit_error_bad_link % self.line)
gslab_make.private.exceptionclasses.CritError: ERROR! Link `mydata` incorrectly specified (check if tab-delimited

@z-y-huang
Copy link
Contributor Author

z-y-huang commented Dec 15, 2018

@gentzkow:

  1. We aren't requiring that software with executable names listed in config_user.yaml be installed. We are using the executable names listed in config_user.yaml to check that the software requirements in config.yaml have been installed since users may have different executable names.

  2. Yes, that status is because the current version of gslab_make committed on template isn't the newest one (was making some changes to warning messages).

  3. Ah, I was checking to see if the path was a file. I can change that to allow for directories as well (as well as non-inputs).

  4. Can you try:

# Symlink	Target
mydata  {mydata}

@gentzkow
Copy link
Owner

gentzkow commented Dec 15, 2018

(1) OK. That makes sense. But then there is some problem because it raises a warning for Matlab not being installed for me even though config.yaml says matlab is not required.

(4) OK. That fixed the immediate issue. Thanks!

Some additional externals.txt issues:

  • It does not seem happy with me using ~ as an alias for my home directory when I enter paths in config_user.yaml. Things work fine if I specify

    external:
        mydata: "/users/gentzkow/Dropbox/temp/mydata/temp.txt"
    

    but not if I specify

    external:
        mydata: "~/Dropbox/temp/mydata/temp.txt"
    

    In the latter case it seems to append the path to the working directory so I get

    (Personal)/git/template/data/~/Dropbox/temp/mydata/mydata.txt` not found

    We should specify this so it always interprets the paths in externals as absolute paths on the system, and so that they are parsed exactly as they would be if you entered them at the command line.

  • I realized I'm running into the error you flag in the comments where tabs I enter in my text editor (Sublime) are converted to spaces. I'm starting to think we should just switch to | as a delimiter. Agree?

  • Let's switch the format of externals.txt so that we don't let people specify the name of the symlink target separately. So I would just enter one value per line and the only valid entry is the name of a target specified in config_user.yaml. E.g.,

    # External Name
    mydata
    

    If we do it that way we can omit the curly braces. The external name can itself be used as the symlink target so the above will be equivalent to the current

    #  External Name
    mydata  {mydata}
    

@gentzkow
Copy link
Owner

Something in the run_program directives is still not playing nicely with spaces in filenames. When I run /data/make.py I get an error because when it tries to create the Stata log file at /Users/gentzkow/Dropbox (Personal)/git/template/data/create_graph_data.log it actually creates a file called ../Dropbox.

z-y-huang pushed a commit that referenced this issue Dec 15, 2018
@z-y-huang
Copy link
Contributor Author

z-y-huang commented Dec 15, 2018

@gentzkow:

I pushed a commit fixing the bugs in setup_repository.py.

In terms of the path bugs, I'm going to research more into how software engineers usually handle paths for Python. Our current implementation is probably more hacky than ideal.

Compiling changes we want to make:

  • Commit gslab_make locally
  • Change delimiter for linking files to |
  • Change formatting for externals.txt

@gentzkow
Copy link
Owner

Thanks. Looks good!

z-y-huang pushed a commit that referenced this issue Dec 18, 2018
@z-y-huang
Copy link
Contributor Author

@gentzkow:

I implemented the changes listed in this comment and fixed the issue with using ~ as an alias.

For the error involving spaces in paths, it looks like that was actually specific to Stata. Apparently, when you run Stata in batch mode and the path you call has a space (e.g., statamp -e do "example path/script.do"), the Stata doesn't name the log file generated correctly and instead just stops at the space. (So for the previous example, a log file named example.log would be saved.)

I've found a workaround, but thought it would be worth documenting this quirk.

@gentzkow
Copy link
Owner

Great. Thanks @z-y-huang. Just to be clear, you've implemented the workaround for the Stata paths issue? Or not yet?

@gentzkow
Copy link
Owner

Small to dos:

  • /external/ needs to be added to .gitignore
  • Looks like the gslab_make code is at /lib/gslab_make/gslab_make/. We should get rid of that redundant subdirectory. (This relates to the fact that we should get rid of the gslab_make_dev subdirectory in the gslab_make repo. We did this to prevent conflicts with the gslab_python version. I think it's now time to delete this from gslab_python; for backwards compatability, we can migrate it to a separate repo called gslab_make_legacy. Maybe you can add an item for this on the outline.)

@gentzkow
Copy link
Owner

I'm still getting frequent pointer errors. Is it possible something's wrong with your git LFS? The output below is after a clean checkout of the issue 10 branch.

$ git status
On branch issue10_update_linking
Your branch is up-to-date with 'origin/issue10_update_linking'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   paper_slides/output/ondeck.pdf
	modified:   paper_slides/output/online_appendix.pdf
	modified:   paper_slides/output/paper.pdf
	modified:   paper_slides/output/slides.pdf
	modified:   paper_slides/output/text.pdf

no changes added to commit (use "git add" and/or "git commit -a")
________________________________________________________________________________
~/temp/template 
$ git checkout paper_slides/output/ondeck.pdf
Pointer file error: Unable to parse pointer at: "paper_slides/output/ondeck.pdf"
________________________________________________________________________________
~/temp/template 
$ git lfs status
On branch issue10_update_linking
Git LFS objects to be pushed to origin/issue10_update_linking:


Git LFS objects to be committed:


Git LFS objects not staged for commit:

	paper_slides/output/ondeck.pdf (Git: 9295501 -> File: 9295501)
	paper_slides/output/online_appendix.pdf (Git: db0d55f -> File: db0d55f)
	paper_slides/output/paper.pdf (Git: 50de5a2 -> File: 50de5a2)
	paper_slides/output/slides.pdf (Git: eec0128 -> File: eec0128)
	paper_slides/output/text.pdf (Git: afd23f3 -> File: afd23f3)

@z-y-huang
Copy link
Contributor Author

@gentzkow:

  • Yes, I have implemented the Stata workaround.
  • I added the *external/ to .gitignore (though it looks like you also just pushed a commit adding to .gitignore).
  • I kept the structure like that in case we decided to switch back to submodule, but I just simplified the structure.
  • I'll check my git-lfs.
  • I added the gslab_python task to the outline.

@gentzkow
Copy link
Owner

@z-y-huang One other question. I just added /input as well to .gitignore thinking we would not want the /input/ directories to be committed. But I see that in the current version you have committed them. Is this intentional? Are we confident that git will play nice with committing simlinks? We've run into problems in the past where git can end up trying to operate on the link targets instead of the links themselves for some operations.

@gentzkow
Copy link
Owner

Sorry for the continuing hassle on this, but I still seem to be getting the error related to the space in the paths for the stata logs. See below.

~/Dropbox/git/template/data 
$ python make.py
Cleared: "output"
Cleared: "log"
Starting makelog file at: "/Users/gentzkow/Dropbox (Personal)/git/template/data/log/make.log"
Executing: "Rscript --no-save "/Users/gentzkow/Dropbox (Personal)/git/template/data/code/create_table_data.r""
Executing: "stata-mp -e do \"/Users/gentzkow/Dropbox (Personal)/git/template/data/code/create_graph_data.do\""
Traceback (most recent call last):
  File "make.py", line 48, in <module>
    gs.run_stata(PATHS, program = 'code/create_graph_data.do')
  File "/Users/gentzkow/Dropbox (Personal)/git/template/lib/gslab_make/gslab_make/run_program.py", line 64, in run_stata
    direct.move_program_output(program_log, direct.log)
  File "/Users/gentzkow/Dropbox (Personal)/git/template/lib/gslab_make/gslab_make/private/programdirective.py", line 259, in move_program_output
    with open(program_output, 'r') as f:
IOError: [Errno 2] No such file or directory: '/Users/gentzkow/Dropbox (Personal)/git/template/data/Dropbox.log'

@z-y-huang
Copy link
Contributor Author

@gentzkow,

My understanding was that git commits symlinks as links. Though given our current setup make.py deletes the input folders anyways, so it probably makes sense to add *input/ to .gitignore.

@z-y-huang
Copy link
Contributor Author

@gentzkow:

Just to double-check, what version Stata do you have?

When you run "stata-mp -e do \"/Users/gentzkow/Dropbox (Personal)/git/template/data/code/create_graph_data.do\" on the command line, what is the name of the .log file written in your current directory?

@gentzkow
Copy link
Owner

Stata 14.2. We probably don't want to do anything that's incompatible with recent Stata versions.

Running that command at the command line the log file in the cwd is called Dropbox.log.

@z-y-huang
Copy link
Contributor Author

@gentzkow:

Hm, it looks like Stata is naming the log file as I would expect (Dropbox.log).

The error message indicates that the code can't find file /Users/gentzkow/Dropbox (Personal)/git/template/data/Dropbox.log, which looks like where Stata should be writing the log file to me, but for some reason it isn't.

I'll try to see if I can reproduce the error.

@z-y-huang
Copy link
Contributor Author

z-y-huang commented Dec 19, 2018

@gentzkow:

Just to make sure, could you actually run the command:
stata-mp -e do \"/Users/gentzkow/Dropbox (Personal)/git/template/data/code/create_graph_data.do\"? I couldn't get an equivalent to run on my computer unless I escaped the parentheses. My suspicion is that the error is because the Stata command doesn't like it when special characters are in the path passed in.

If that is indeed the case, a potential solution is that for run_stata only, we try to sanitize the path as much as possible (e.g., by escaping all special characters). I was thinking we don't want to do this for other run functions to minimize potential bugs given that this appears to be a Stata idiosyncratic issue.

I also just pushed a commit to reorder the priority in error checking for run_stata to make debugging more transparent.

@gentzkow
Copy link
Owner

Yup. Your hypothesis seems correct and your fix looks like it works. See below. Can we escape the space character as well? I tested that and it seemed to work fine.

If the only thing we're adding to the code is one regex in run_stata that escapes all special characters that doesn't seem too hacky at all.

Note: The only reason this comes up is that Dropbox has this insane rule that once you connect to a business account you must name your directories "Dropbox (BusinessName)" and "Dropbox (Personal)". All kinds of people on forums complaining about it but seems to be no fix at the moment.

________________________________________________________________________________
~/temp/template 
$ stata-mp -e do \"/Users/gentzkow/Dropbox (Personal)/git/template/data/code/create_graph_data.do\"
-bash: syntax error near unexpected token `('
________________________________________________________________________________
~/temp/template 
$ stata-mp -e do "/Users/gentzkow/Dropbox (Personal)/git/template/data/code/create_graph_data.do"
________________________________________________________________________________
~/temp/template 
$ stata-mp -e do "/Users/gentzkow/Dropbox \ (Personal\)/git/template/data/code/create_graph_data.do"
________________________________________________________________________________
~/temp/template 

@z-y-huang
Copy link
Contributor Author

@gentzkow:

Escaping the space works, though Stata still saves the .log file such that the name of the log file only goes up to the space (e.g., Dropbox.log in the case above). I believe this is a bug with all of the recent versions of Stata, though I haven't been able to find official documentation for it yet. I'm currently working around it by looking for log files in accordance to that rule.

I also pushed a commit to escape special characters in the path for run_stata.

@gentzkow
Copy link
Owner

On the symlinks question here: I agree that we should add this to .gitignore.

@gentzkow
Copy link
Owner

I went ahead and updated .gitignore and deleted the previously committed /input/ directories from the Repo.

Tested the new Stata code and paths now work great!

As far as I can see, we have now finished all the outstanding to dos. I would propose that you open a separate task to clean up the README.md. I think we should migrate what is currently in /setup/README.md/ to the top level README.md. It needs to be updated to correspond to the current setup (e.g., deleting the submodule stuff).

Once that's done, we can go ahead and release beta v0.1!

@gentzkow
Copy link
Owner

Oooh, sorry, one more issue I just noticed. When I add some things to /data/externals.txt and run /data/make.py the /external/ directory w/ links is created correctly, but it doesn't look like these files are getting logged in the link_*.log files. I had thought we'd said we'd include in those files info on both the /input/ and /external/ files.

@z-y-huang
Copy link
Contributor Author

@gentzkow:

Hm, that should be the case and isn't an issue with my version. Do you mind committing your current setup to a test branch?

@gentzkow
Copy link
Owner

Sorry! Looks like it was my mistake. I think the last time I'd re-run make.py I had reset the externals.txt file. When I try it now it works fine.

@z-y-huang
Copy link
Contributor Author

@gentzkow:

Yay! I opened a task to update documentation and release.

I'll close this task and merge in the branch to master.

@z-y-huang
Copy link
Contributor Author

Summary:

  • Updated link formatting and functionalities
  • Fixed path issues (in particular for Stata)
  • Switched from submodule to hard commit for gslab_make

z-y-huang added a commit that referenced this issue Dec 21, 2018
…emplate accordingly (#12)

* Begin updating config and analysis make.py for #8

* Update make.py scrips for #8

* Add functionalities to config for #8

* Clean up config for #8

* Change to requirements.txt for #8

* Add gslab_make submodule for #9

* Update gslab_make submodule to track issue8_update_template

* Test git submodule

* Test git submodule

* Test git submodule

* Change config to setup for #9

* Separate input and external folders for #9

* Implement linking changes for #10

* Clean up code

* Add to setup documentation

* Make local import more robust

* #10 Correct files inadvertenly committed directly instead of as LFS pointers

* #10 Bug fixes in setup_repository.py

* Bugfix setup_repository.py for #10

* Removed submodule for #10

* Commit gslab_make, update template for new external.txt formatting

* Bugfix gslab_make commit

* Fix formatting for input.txt and external.txt for #10

* Add *external/ to .gitignore and organize lib directory for #10

* #10 Update .gitignore

* #10 Fix LFS pointer error for gslab_make/tests/input/zip_test_file.zip

* #10 Fix LFS pointer errors for paper_slides PDFs

* #10 Add input directories to .gitignore

* Update path to look for gslab_make for #10

* Reorder error checking for bugfixing

* Escape special characters in path for run_stata for #10

* #10 Remove input directories -- will be ignored by git going forward

* #10 Correct format of external and input in .gitignore

* Fix typo

* Update documentation for #10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants