Fernando Brito edited this page Aug 13, 2015 · 1 revision

Getting the repositories list to build the dataset

TODO (Fernando): explain the workflow and how to run the ruby script, what it outputs, how to use GitMetricsExtractor and so on

Useful commands

How to see which projects started but did not finish

cat info.log | grep START | awk -F " " '{print $9}' | sort > projects_started
cat info.log | grep END | awk -F " " '{print $9}' | sed 's/.$//' | sort > projects_ended
diff projects_started projects_ended 

The sed command is necessary to remove the dot at the end of each URL in the END lines. We should fix the logger so that it no longer appends this final dot.
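The diff output also includes line markers and any ordering noise. As a possible alternative, the sketch below uses comm to print only the URLs that appear in the started list but not in the ended list. The function name unfinished_projects is hypothetical; it assumes both files hold one URL per line, as produced by the commands above, and it re-sorts them in the C locale because comm requires consistently sorted input.

```shell
# unfinished_projects STARTED_FILE ENDED_FILE
# Prints every line that is in STARTED_FILE but not in ENDED_FILE.
# (Hypothetical helper; not part of the project's scripts.)
unfinished_projects() {
    tmp_started=$(mktemp) || return 1
    tmp_ended=$(mktemp) || { rm -f "$tmp_started"; return 1; }

    # comm needs both inputs sorted the same way, so sort them here
    # in the C locale; -u also drops duplicate lines.
    LC_ALL=C sort -u "$1" > "$tmp_started"
    LC_ALL=C sort -u "$2" > "$tmp_ended"

    # -2 hides lines unique to the second file, -3 hides common lines,
    # leaving only lines unique to the first file.
    LC_ALL=C comm -23 "$tmp_started" "$tmp_ended"

    rm -f "$tmp_started" "$tmp_ended"
}
```

Usage, with the files generated above: unfinished_projects projects_started projects_ended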

How to get running time for each project

This is tricky. Normally we would sort our metrics spreadsheet by project URL, get the running times sorted by project URL, and paste them into a new column of the spreadsheet. However, some projects did not finish, so first we have to find them!

  • Use the commands above to find which projects did not finish
  • Create a new sheet in our spreadsheet, then copy the column of sorted project URLs and paste it there
  • Run the following command
cat info.log | grep END | grep -o 'github\.com/.*/.*' | sort > projects_and_times

This gives you a list of project URLs with their running times. We use it only to check that each time ends up next to the right project. Paste it into the new sheet you created, then insert empty cells in this new column so that each project lines up with the same project in the first column.

This may sound confusing, but just look at the example sheet (running_times) in input/apps_and_modules_metrics.ods.

  • Run the following command to get just the times
cat info.log | grep END | grep -o 'github\.com/.*/.*' | sort | grep -o '(in seconds): .*' | cut -c 15- > running_times
  • Paste the times into a new column and insert the empty cells again so everything matches. Verify that the data is correct!
  • Now that you have a column with just the times, and empty cells for the projects that did not finish, copy this column and paste it into the main sheet. Make sure the rows there are also sorted by project URL, so everything matches.
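Inserting the empty cells by hand is error-prone. As a rough sketch of how the alignment could be automated, the function below uses join to pair every project URL with its time and leaves the time empty for projects that have none. The name align_times and both input layouts are assumptions: the first file holds one URL per line, and the second holds "URL, tab, seconds" per line. Producing that second file depends on the exact format of the END lines in info.log, which is not covered here.

```shell
# align_times URLS_FILE TIMES_FILE
# URLS_FILE:  one project URL per line.
# TIMES_FILE: "url<TAB>seconds" per line (assumed layout).
# Prints "url<TAB>seconds" for every URL in URLS_FILE; the seconds
# field is empty when the URL has no entry in TIMES_FILE.
# (Hypothetical helper; not part of the project's scripts.)
align_times() {
    tab=$(printf '\t')
    tmp_urls=$(mktemp) || return 1
    tmp_times=$(mktemp) || { rm -f "$tmp_urls"; return 1; }

    # join needs both inputs sorted on the join field in the same locale.
    LC_ALL=C sort "$1" > "$tmp_urls"
    LC_ALL=C sort -t "$tab" -k 1,1 "$2" > "$tmp_times"

    # -a 1 keeps URLs without a match; -o 0,2.2 prints the join field
    # followed by the second field of TIMES_FILE (empty if missing).
    LC_ALL=C join -t "$tab" -a 1 -o 0,2.2 "$tmp_urls" "$tmp_times"

    rm -f "$tmp_urls" "$tmp_times"
}
```

The output is tab-separated, so it can be pasted into the spreadsheet as two columns. It is still worth checking a few rows against the log before copying the times into the main sheet.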

How to get total running time

  • Get the start time of the first project
cat info.log | grep START | head -n 1
  • Get the end time of the last project to finish
cat info.log | grep END | tail -n 1
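Once you have the two timestamps, the difference can be computed instead of worked out by hand. The sketch below is one way to do it, assuming GNU date (for the -d option) and timestamps in a format that date can parse, such as "2015-08-13 10:00:00". The actual timestamp format in info.log is not shown on this page, so you may need to cut the timestamp out of the log line first. The function name elapsed_seconds is hypothetical.

```shell
# elapsed_seconds START_TIMESTAMP END_TIMESTAMP
# Prints the number of seconds between the two timestamps.
# Assumes GNU date; timestamps are interpreted as UTC (-u) so that
# daylight-saving changes do not affect the result.
# (Hypothetical helper; not part of the project's scripts.)
elapsed_seconds() {
    start_epoch=$(date -u -d "$1" +%s) || return 1
    end_epoch=$(date -u -d "$2" +%s) || return 1
    echo $((end_epoch - start_epoch))
}
```

Usage: elapsed_seconds "2015-08-13 10:00:00" "2015-08-13 11:30:15"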

Ideas

Ideas for future work:

  • Add an option to ignore test case files when running the analysis. It seems that ignoring "/test/" and "/spec/" would cover lots of cases.
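To illustrate the idea only, here is a minimal sketch of such a filter as a shell function that reads file paths on standard input and drops those containing a "/test/" or "/spec/" directory component. This is not how the analysis currently selects files; the name non_test_files is hypothetical, and a real option would need to live inside the analysis tool itself.

```shell
# non_test_files
# Reads file paths from stdin, one per line, and prints only those
# that do not contain a /test/ or /spec/ directory component.
# (Hypothetical helper; not part of the project's scripts.)
non_test_files() {
    grep -v -E '/(test|spec)/'
}
```

Usage, from the root of a cloned project: find . -type f | non_test_files

Note that a relative path such as test/foo.rb, with no leading slash, would not be filtered; the paths printed by find . start with "./", so they are matched.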