# Documentation: datasets, source files, clean up & mention to Jarbas #54
## The basics

A lot of discussions about ideas take place in the [Issues](https://github.com/datasciencebr/serenata-de-amor/issues) section. There you can catch up with what's going on and also suggest new ideas.
1. _Fork_ this repository
2. Create your branch: `$ git checkout -b xx-new-stuff`, where `xx` are your initials (e.g. `jd` for John Doe)
3. Commit your changes: `$ git commit -am 'My cool contribution'`
4. Push the branch to your fork: `$ git push origin xx-new-stuff`
5. Create a new _Pull Request_ https://github.com/datasciencebr/jarbas/issues
> **Review comment:** Missing spacing/formatting/typo?
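Putting the steps above together, here is a minimal sketch of the whole flow; the fork URL and the `jd` initials are purely illustrative, so replace them with your own:

```console
$ git clone https://github.com/<your-username>/serenata-de-amor.git   # clone your fork
$ cd serenata-de-amor
$ git checkout -b jd-new-stuff            # branch named with your initials
$ git commit -am 'My cool contribution'   # commit after editing files
$ git push origin jd-new-stuff            # push the branch to your fork
```

Then open the _Pull Request_ from the GitHub web interface.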
## Environment
The recommended way of setting up your environment is with [Anaconda](https://www.continuum.io/), a Python distribution with useful packages for Data Science. [Download it](https://www.continuum.io/downloads) and create an _environment_ for the project.
```console
$ conda update conda
$ conda create --name serenata_de_amor python=3
$ source activate serenata_de_amor
```
Basically we have four big directories with different purposes:

| Directory | Purpose | Naming convention |
|-----------|---------|--------------------|
| **`develop/`** | This is where we _explore_ data; feel free to create your own notebook for your exploration. | `[ISO 8601 date]-[author-initials]-[2-4 word description].ipynb` (e.g. `2016-05-13-ec-air-tickets.ipynb`) |
| **`report/`** | This is where we write up the findings and results; here we put together different data, analyses and strategies to make a point. Feel free to jump in. | Meaningful title for the report (e.g. `Transport-allowances.ipynb`) |
| **`src/`** | This is where our auxiliary scripts lie: code to scrape data, to convert files etc. | Small caps, no special characters, `-` instead of spaces. |
| **`data/`** | This is not supposed to be committed, but it is where saved databases will be stored locally (scripts from `src/` should be able to get this data for you); a copy of this data will be available elsewhere (_just in case_). | Small caps, no special characters, `-` instead of spaces. |
### Source files (`src/`)
Here we explain what each script from `src/` does for you:
##### One script to rule them all
1. `src/fetch_datasets.py` downloads all the available datasets to `data/` in `.xz` compressed CSV format, with headers translated to English.
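For instance, assuming the Anaconda environment above is active and that the script runs as a plain Python script with no required arguments (an assumption based on the description above):

```console
$ source activate serenata_de_amor
$ python src/fetch_datasets.py   # saves the .xz compressed CSVs into data/
```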
##### Quota for Exercising Parliamentary Activity (CEAP)
1. `src/fetch_datasets.py --from-source` downloads all CEAP datasets to `data/` from the official source (in XML format, in Portuguese).
1. `src/fetch_datasets.py` downloads the CEAP datasets into `data/`; it can download them from the official source (in XML format, in Portuguese) or from our backup server (`.xz` compressed CSV format, with headers translated to English).
> **Review comment:** "oficial" -> "official"
1. `src/xml2csv.py` converts the original XML datasets to `.xz` compressed CSV format.
1. `src/translate_datasets.py` translates the dataset file names and the labels of the variables within these files.
1. `translation_table.py` creates a `data/YYYY-MM-DD-ceap_datasets.md` file with details of the meaning and translation of each variable from the _Quota for Exercising Parliamentary Activity_ datasets.
> **Review comment:** The correct location is …
> **Review comment:** From this line you refer …
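If you prefer to start from the official XML files, the scripts above can be chained. This is only a sketch of the order implied by the descriptions, with paths exactly as listed above and assuming each script runs with no extra arguments:

```console
$ python src/fetch_datasets.py --from-source   # official XML datasets (in Portuguese)
$ python src/xml2csv.py                        # convert the XML to .xz compressed CSV
$ python src/translate_datasets.py             # translate file names and variable labels
$ python translation_table.py                  # write data/YYYY-MM-DD-ceap_datasets.md
```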
##### Suppliers information (CNPJ)
1. `fetch_cnpj_info.py` iterates over the CEAP datasets looking for supplier unique documents (CNPJ) and creates a local dataset with each supplier's info.
1. `clean_cnpj_info_dataset.py` cleans up and translates the supplier info dataset.
1. `geocode_addresses.py` iterates over the supplier info dataset and adds geolocation data to it (it uses the Google Maps API set in `config.ini`).
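Again only a sketch, assuming these scripts live under `src/` like the others and that the Google Maps API settings are already in `config.ini`:

```console
$ python src/fetch_cnpj_info.py           # collect supplier (CNPJ) info from the CEAP datasets
$ python src/clean_cnpj_info_dataset.py   # clean up and translate the supplier info dataset
$ python src/geocode_addresses.py         # add latitude/longitude via the Google Maps API
```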
##### Miscellaneous

1. `src/backup_data.py` uploads files from `data/` to an Amazon S3 bucket set in `config.ini`.
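A sketch of a run, assuming the bucket name and credentials are already set in `config.ini`:

```console
$ python src/backup_data.py   # upload files from data/ to the configured S3 bucket
```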
### Datasets (`data/`)
Here we explain what the datasets inside `data/` are. They are not part of this repository, but are downloaded with the scripts from `src/`. Most files are `.xz` compressed CSV.
All files are named with an [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) date suffix.
1. `current-year`, `last-year` and `previous-years`: Datasets from the _Quota for Exercising Parliamentary Activity_; for details on their variables and meaning, check `translation_table`.
> **Review comment:** I would recommend checking …
1. `translation_table`: Table comparing contents from `datasets_format` and our translation of variable names and descriptions.
1. `companies`: Dataset with supplier info containing all the fields offered in the [Federal Revenue alternative API](http://receitaws.com.br) and complemented with geolocation (latitude and longitude) gathered from Google Maps.
1. `datasets_format`: Original HTML in Portuguese from the Chamber of Deputies explaining the CEAP dataset variables.
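A quick way to peek at any of these files once they are downloaded (`xz` and `head` are standard command-line tools; `<dataset-file>` is a placeholder for an actual file name in `data/`):

```console
$ ls data/
$ xz --decompress --stdout data/<dataset-file>.xz | head -n 3
```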
## Four moments
The project basically happens in four moments, and contributions are welcome in all of them:
| Moment | Description | Focus | Target |
|--------|-------------|-------|--------|
| **Possibilities** | To structure hypotheses and strategies taking into account (a) the source of the data, (b) how feasible it is to get this data, and (c) what is the purpose of bringing this data into the project. | Contributions here require more sagacity than technical skills. | [GitHub Issues](https://github.com/codelandev/serenata-de-amor/issues) |
| **Data collection** | Once one agrees that a certain _possibility_ is worth it, one might want to start writing code to get the data (these scripts go into `src/`). | Technical skills in scraping data and using APIs. | `src/` and `data/` |
| **Exploring** | Once data is ready to be used, one might want to start exploring and analyzing it. | Here what matters is mostly data science skills. | `develop/` |
| **Reporting** | Once a relevant finding emerges from the previous stages, it might be gathered with other similar findings (e.g. putting together explorations on air tickets, car rentals and geolocation under a report on transportation) in a report. | Contributions here require good communication skills and a very basic understanding of quantitative methods. | `report/` |
## Jarbas
As soon as we started _Serenata de Amor_, [we felt the need for a simple web service](https://github.com/datasciencebr/serenata-de-amor/issues/34) to browse our data and refer to the documents we analyze. This is how [Jarbas](https://github.com/datasciencebr/jarbas) was created.
If you fancy web development, feel free to check Jarbas's source code, to check [Jarbas's own Issues](https://github.com/datasciencebr/jarbas/issues), and to contribute there too.
> **Review comment:** When forking, the branch becomes unnecessary since it's already owned by the contributor and no other user.