This week covers:
- An intro to Git and GitHub for sharing code
- Command line tools
- Exploratory data analysis with R
Install tools: Ubuntu on Windows, GitHub for Windows, R, and RStudio
- Open http://aka.ms/wslstore and select Ubuntu on Windows
- If this seems like it's hanging, hit enter
- Create a username and password
- Update all packages with
sudo apt-get update
and
sudo apt-get upgrade
- Check that you have git under bash by typing
git --version
in the terminal
- Install GitHub for Windows
- Configure git to deal with line endings in a cross-platform friendly way:
git config --global core.autocrlf true
- Download and install R from a CRAN mirror
- Download and install RStudio
- Open RStudio and install the tidyverse package, which includes dplyr, ggplot2, and more:
install.packages('tidyverse', dependencies = TRUE)
- You'll need a plain text editing program
- Atom, Sublime, and Visual Studio Code are all good options
- Files that you create in Ubuntu on Windows get stored in a somewhat hidden location within the Windows filesystem
- To make it easier to find files you work on in Ubuntu, do the following:
- Open a bash shell
- Go to your home directory:
cd ~
- Create a symbolic link to your Documents folder:
ln -s /mnt/c/Users/<your name>/Documents ~/Documents
- Change to this directory:
cd ~/Documents
- Do all of your work, including the following section, from within this folder, which you'll be able to see under "Documents" in the Windows Explorer
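To see what the symbolic link does, the sketch below recreates the same setup in a throwaway temporary directory rather than your real home directory. The names fake_windows_docs, fake_home, and notes.txt are made up for the demonstration.

```shell
# Work in a scratch directory so nothing in your real home is touched.
scratch=$(mktemp -d)
mkdir -p "$scratch/fake_windows_docs" "$scratch/fake_home"

# Same idea as: ln -s /mnt/c/Users/<your name>/Documents ~/Documents
ln -s "$scratch/fake_windows_docs" "$scratch/fake_home/Documents"

# A file created through the link lands in the directory it points to.
echo "hello" > "$scratch/fake_home/Documents/notes.txt"
ls "$scratch/fake_windows_docs"

# readlink shows where the link points.
link_target=$(readlink "$scratch/fake_home/Documents")
echo "$link_target"
```

The link is only a pointer: removing it with rm leaves the files in the target directory in place.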
- Sign up for a free GitHub account
- Then follow this guide to fork your own copy of the course repository
- Clone a copy of your forked repository, which should be located at https://github.com/<yourusername>/coursework.git, to your local machine
- Once that's done, create a new file in the week1/students directory, <yourfirstname>.txt (e.g., jake.txt)
- Use git add to add the file to your local repository
- Use git commit and git push to commit and push your changes to your copy of the repository
- Then issue a pull request to send the changes back to the original course repository
- Finally, configure a remote repository called upstream to point here:
git remote add upstream git@github.com:msr-ds3/coursework
- This will allow you to sync future changes to your fork with:
git fetch upstream
git merge upstream/master
- Note: this is equivalent to
git pull upstream master
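The sync described above can be tried out safely with two local repositories standing in for the course repository and your clone. This is a minimal sketch under stated assumptions: the directory names course and mycopy, the file README.txt, and the helper function g (which supplies a throwaway identity so commits work on an unconfigured machine) are all made up for illustration.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Hypothetical helper: run git with a throwaway identity.
g() { git -c user.name="Demo" -c user.email="demo@example.com" "$@"; }

# A stand-in for the course repository, with master as its branch.
g init -q course
cd course
git symbolic-ref HEAD refs/heads/master
echo "week 1" > README.txt
g add README.txt
g commit -q -m "Initial commit"
cd ..

# A stand-in for your local clone, with a remote called upstream.
g clone -q course mycopy
cd mycopy
g remote add upstream "$scratch/course"

# The course repository gains a commit after you cloned it.
(cd ../course && echo "week 2" >> README.txt && g commit -q -am "Add week 2")

# fetch downloads the new commit; merge applies it to your branch.
g fetch -q upstream
g merge -q upstream/master
synced_lines=$(grep -c "week" README.txt)
echo "$synced_lines"
```

Running git fetch on its own is a low-risk way to see what has changed upstream, since it does not touch your working files until you merge.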
- DataCamp's interactive online git course
- A full hour-long introductory video
- More resources from GitHub available here and here
- And here's a handy cheatsheet
- See the intro to the command line notebook we discussed together
- Read through Lifehacker's command line primer
- Do Codecademy's interactive command line tutorial
- See this crash course for more details on commonly used commands
- Check out Software Carpentry's guide to the Unix shell
- Review this wikibook on data analysis on the command line, covering cut, grep, wc, uniq, sort, etc.
- Learn awk in 20 minutes
- Check out some more advanced tools for Data Science at the Command Line
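As a small taste of how cut, grep, wc, uniq, and sort combine, here is a sketch that counts values in a few lines of made-up data. The station names and user types are invented for the example and are not taken from the course files.

```shell
# Made-up sample data: one trip per line as "station,usertype".
trips=$(printf '%s\n' \
  "Broadway,Subscriber" \
  "Broadway,Customer" \
  "Lafayette,Subscriber" \
  "Broadway,Subscriber" \
  "Lafayette,Subscriber")

# cut pulls out the first comma-separated field; sort groups identical
# values together so that uniq -c can count them; the final sort -rn
# puts the most frequent station first.
station_counts=$(echo "$trips" | cut -d, -f1 | sort | uniq -c | sort -rn)
echo "$station_counts"

# grep keeps only the matching lines; wc -l counts them.
subscriber_trips=$(echo "$trips" | grep Subscriber | wc -l)
echo "$subscriber_trips"
```

Note that uniq only collapses adjacent duplicates, which is why the sort before it matters.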
- See these Introduction to Counting and Data Wrangling in R slides
- Review intro_to_r.ipynb for an introduction to R
- Pull changes from the msr-ds3/coursework repo:
git pull upstream master
- Use the musical pairs script we wrote together yesterday to determine your programming partner each day
- Review intro_command_line.ipynb for an introduction to the command line
- Download one month of the Citibike data:
wget https://s3.amazonaws.com/tripdata/201402-citibike-tripdata.zip
- Decompress it:
unzip 201402-citibike-tripdata.zip
- Rename the resulting file to get rid of ugly spaces:
mv 2014-02*.csv 201402-citibike-tripdata.csv
- See the download_trips.sh file, which automates this and can be run using
bash download_trips.sh
or
./download_trips.sh
- Fill in solutions of your own under each comment in citibike.sh
- Commit and push your work to your fork of the coursework repository:
git push origin master
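The rename step above works even though the unzipped filename contains spaces. The sketch below shows why, using an empty stand-in file in a temporary directory; the stand-in name is an assumption and the real file in the archive may be named differently.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Stand-in for the unzipped file; the real name may differ.
touch "2014-02 - example trip data.csv"

# The shell expands the glob to the matching filename and passes it to mv
# as a single argument, so the spaces need no quoting or escaping here.
mv 2014-02*.csv 201402-citibike-tripdata.csv
ls
```

This only works when exactly one file matches the pattern. If several files matched, mv would treat the last argument as a destination directory and fail.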
- Have a look at DataCamp's introduction to R tutorials (or Hadley's Advanced R if you're a pro)
- Do the free portion of DataCamp's Data Manipulation in R and Introduction to the Tidyverse tutorials
- Fill in solutions to the counting exercises under each comment in citibike.R
- Read chapters 2 and 3 of R for Data Science
- Take a look at The Anatomy of the Long Tail and think about how to generate Figures 1 and 2
- References:
- Basic types: (numeric, character, logical, factor)
- Vectors, lists, dataframes: a one page reference and more details
- Cyclismo's more extensive tutorial
- Hadley Wickham's style guide
- The dplyr vignette
- Sean Anderson's dplyr and pipes examples (code on github)
- RStudio's data wrangling cheatsheet
- Make sure to save your work and push it to GitHub. Do this in three steps:
- git add and git commit any new files to your local repository. (Omit large data files.)
- git pull upstream master to grab changes from this repository, and resolve any merge conflicts, committing the final results.
- git push origin master to push things back up to your GitHub fork of the course repository.
- Complete yesterday's intro to R assignment and the counting exercises under each comment in citibike.R
- Review visualization_with_ggplot2.ipynb for an introduction to data visualization with ggplot2
- Do DataCamp's Data Visualization with ggplot2 (part 1) tutorial
- Read chapter 1 of R for Data Science
- Modify and run the download_trips.sh script to grab all trip data from 2014 (use dos2unix to fix carriage return issues if they arise)
- Run the load_trips.R file to generate trips.RData
- Write code in plot_trips.R to reproduce and extend the visualizations we made this morning using trips.RData
- Additional references:
- RStudio's ggplot2 cheatsheet
- Sean Anderson's ggplot2 slides (code) for more examples
- The R Graphics Cookbook
- The official ggplot2 docs
- Videos on Visualizing Data with ggplot2
- Complete yesterday's plotting assignment for the Citibike data in plot_trips.R
- Review combine_and_reshape_in_r.ipynb on joins with dplyr and reshaping with tidyr
- Read chapters 9 and 10 of R for Data Science on tidyr and joins
- Do part 1 of DataCamp's Cleaning Data in R tutorial
- Do the following exercises from R for Data Science:
- Exercise 2 on page 151
- Exercises 1 and 3 on page 156
- Additional references:
- The tidyr vignette on tidy data
- The dplyr vignette on two-table verbs for joins
- A visual guide to joins
- Review the slides for reproducible code
- Read Chapter 21 of R for Data Science on R Markdown
- Do the following exercises:
- Exercises 1 and 2 on page 426 (try keyboard shortcuts: ctrl-shift-enter to run chunks, and ctrl-shift-k to knit the document)
- Exercise 3 on page 428, using this file
- Exercise 1 on page 434
- Read this Introduction to Make and Make for Data Scientists
- Use the download_movielens.sh script to download the MovieLens data
- Fill in code in the movielens.Rmd file to reproduce the plots from Wednesday's slides
- Sketch out (on paper) how to generate figure 2 from The Anatomy of the Long Tail
- Write code to do this in the last section of movielens.Rmd