Documentation: datasets, source files, clean up & mention to Jarbas (#54)

* Clean up

* 💅

* Add Jarbas

* Add documentation related to src/

* Add documentation ref. available datasets

* Typos & minor fixes

* Correct filename
cuducos authored and Irio committed Sep 12, 2016
1 parent 07358fb commit 0c0bb1e
Showing 3 changed files with 55 additions and 49 deletions.
65 changes: 55 additions & 10 deletions CONTRIBUTING.md
@@ -2,17 +2,19 @@

## The basics

A lot of discussions about ideas take place in the [Issues](https://github.com/datasciencebr/serenata-de-amor/issues) section. There you can catch up with what's going on and also suggest new ideas.

1. _Fork_ this repository
2. Create your feature branch: `$ git checkout -b my-new-feature`
3. Commit your changes: `$ git commit -am 'Add some feature'`
4. Push to the branch to your fork: `$ git push origin my-new-feature`
2. Create your branch: `$ git checkout -b new-stuff`
3. Commit your changes: `$ git commit -am 'My cool contribution'`
4. Push the branch to your fork: `$ git push origin new-stuff`
5. Create a new _Pull Request_

## Environment

The recommended way of setting your environment up is with [Anaconda](https://www.continuum.io/downloads), a Python distribution with packages useful for Data Science already preinstalled. Download it from the link above and create an "environment" for the project.
The recommended way of setting your environment up is with [Anaconda](https://www.continuum.io/), a Python distribution with useful packages for Data Science. [Download it](https://www.continuum.io/downloads) and create an _environment_ for the project.

```
```console
$ conda update conda
$ conda create --name serenata_de_amor python=3
$ source activate serenata_de_amor
@@ -32,15 +34,58 @@ Basically we have four big directories with different purposes:
| **`develop/`** | This is where we _explore_ data, feel free to create your own notebook for your exploration. | `[ISO 8601 date]-[author-initials]-[2-4 word description].ipynb` (e.g. `2016-05-13-ec-air-tickets.ipynb`) |
|**`report/`** | This is where we write up the findings and results: here we put together different data, analyses and strategies to make a point. Feel free to jump in. | Meaningful title for the report (e.g. `Transport-allowances.ipynb`) |
| **`src/`** | This is where our auxiliary scripts lie: code to scrape data, to convert files, etc. | Lowercase, no special characters, `-` instead of spaces. |
| **`data/`** | This is not suppose to be commit, but it is where saved databases will be stored locally (scripts from `src/` should be able to get this data for you); a copy of this data will be available elsewhere (_just in case_…). | Small caps, no special character, `-` instead of spaces. |
| **`data/`** | This is not supposed to be committed, but it is where saved databases will be stored locally (scripts from `src/` should be able to get this data for you); a copy of this data will be available elsewhere (_just in case_). | Lowercase, no special characters, `-` instead of spaces. |
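
The naming convention for notebooks in `develop/` can be applied mechanically. The snippet below is a minimal sketch, not part of the repository: the helper name `notebook_filename` is hypothetical, and it assumes the description is passed as plain words separated by spaces.

```python
import re
from datetime import date


def notebook_filename(day, initials, description):
    """Build a name such as `2016-05-13-ec-air-tickets.ipynb` (hypothetical helper)."""
    # Keep only lowercase alphanumeric words; they are joined with `-` below.
    words = re.findall(r"[a-z0-9]+", description.lower())
    if not 2 <= len(words) <= 4:
        raise ValueError("description should have 2 to 4 words")
    parts = [day.isoformat(), initials.lower()] + words
    return "-".join(parts) + ".ipynb"


print(notebook_filename(date(2016, 5, 13), "ec", "air tickets"))
```

Using `date.isoformat()` keeps the prefix in the ISO 8601 `YYYY-MM-DD` form that the table asks for.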

### Source files (`src/`)

Here we explain what each script from `src/` does for you:

##### One script to rule them all

1. `src/fetch_datasets.py` downloads all the available datasets to `data/` in `.xz` compressed CSV format with headers translated to English.


##### Quota for Exercising Parliamentary Activity (CEAP)

1. `src/fetch_datasets.py --from-source` downloads all CEAP datasets to `data/` from the official source (in XML format, in Portuguese).
1. `src/fetch_datasets.py` downloads the CEAP datasets into `data/`; it can download them from the official source (in XML format, in Portuguese) or from our backup server (`.xz` compressed CSV format, with headers translated to English).
1. `src/xml2csv.py` converts the original XML datasets to `.xz` compressed CSV format.
1. `src/translate_datasets.py` translates the dataset file names and the labels of the variables within these files.
1. `src/translation_table.py` creates a `data/YYYY-MM-DD-ceap-datasets.md` file with details of the meaning and of the translation of each variable from the _Quota for Exercising Parliamentary Activity_ datasets.

##### Suppliers information (CNPJ)

1. `src/fetch_cnpj_info.py` iterates over the CEAP datasets looking for supplier unique documents (CNPJ) and creates a local dataset with each supplier's info.
1. `src/clean_cnpj_info_dataset.py` cleans up and translates the supplier info dataset.
1. `src/geocode_addresses.py` iterates over the supplier info dataset and adds geolocation data to it (it uses the Google Maps API set in `config.ini`).
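
The first step above, collecting the unique supplier documents, might look like the sketch below. It is an illustration only, not the code in `src/fetch_cnpj_info.py`: the column name `cnpj_cpf` and the helper `unique_cnpjs` are assumptions. It relies on the fact that a CNPJ (company document) has 14 digits while a CPF (personal document) has 11.

```python
import re


def unique_cnpjs(rows, column="cnpj_cpf"):
    """Return the sorted unique 14-digit documents found in `rows`.

    `rows` is any iterable of dicts; `column` is a hypothetical column name.
    """
    found = set()
    for row in rows:
        # Strip punctuation such as `.`, `/` and `-` before counting digits.
        digits = re.sub(r"\D", "", row.get(column) or "")
        if len(digits) == 14:  # CNPJ: 14 digits; CPF: 11 digits
            found.add(digits)
    return sorted(found)


sample = [
    {"cnpj_cpf": "12.345.678/0001-95"},
    {"cnpj_cpf": "12345678000195"},  # same document, unformatted
    {"cnpj_cpf": "123.456.789-09"},  # 11 digits, so treated as a CPF and skipped
    {"cnpj_cpf": ""},
]
print(unique_cnpjs(sample))
```

Normalizing to digits first means formatted and unformatted versions of the same document are counted once.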

##### Miscellaneous
1. `src/backup_data.py` uploads files from `data/` to an Amazon S3 bucket set in `config.ini`.

### Datasets (`data/`)

Here we explain the datasets inside `data/`. They are not part of this repository; they are downloaded with the scripts from `src/`. Most files are `.xz` compressed CSV.
All files are named with an [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) date prefix.

1. `data/YYYY-MM-DD-current-year.xz`, `data/YYYY-MM-DD-last-year.xz` and `data/YYYY-MM-DD-previous-years.xz`: Datasets from the _Quota for Exercising Parliamentary Activity_; for details on their variables and meaning, check `data/YYYY-MM-DD-ceap-datasets.md`.
1. `data/datasets-format.html`: Original HTML in Portuguese from the Chamber of Deputies explaining CEAP dataset variables.
1. `data/YYYY-MM-DD-ceap-datasets.md`: Table comparing contents from `data/YYYY-MM-DD-datasets_format.html` and our translation of variable names and descriptions.
1. `data/YYYY-MM-DD-companies.xz`: Dataset with suppliers info containing all the fields offered in the [Federal Revenue alternative API](http://receitaws.com.br) and complemented with geolocation (latitude and longitude) gathered from Google Maps.
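
Because the date in each file name is ISO 8601, sorting the names alphabetically also sorts them chronologically, and Python's standard library can read `.xz` compressed CSV directly. The following is a minimal sketch under those assumptions; the helper names `latest_dataset` and `read_xz_csv` are hypothetical and are not part of `src/`.

```python
import csv
import lzma
from pathlib import Path


def latest_dataset(data_dir, name):
    """Return the newest `YYYY-MM-DD-<name>.xz` in `data_dir`, or None."""
    pattern = "????-??-??-{}.xz".format(name)
    # ISO 8601 dates sort lexicographically, so the last entry is the newest.
    candidates = sorted(Path(data_dir).glob(pattern))
    return candidates[-1] if candidates else None


def read_xz_csv(path):
    """Read an `.xz` compressed CSV with a header row into a list of dicts."""
    with lzma.open(str(path), mode="rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    path = latest_dataset("data", "companies")
    if path is not None:
        print(path.name, len(read_xz_csv(path)))
```

Loading everything into a list is fine for a quick look; for the larger CEAP files, iterating over the `csv.DictReader` row by row would use less memory.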

## Four moments

The project basically happens in four moments, and contributions are welcomed in all of them:

| Moment | Description | Focus | Target |
|--------|-------------|-------|--------|
| **Possibilities** | To Structure hypothesis and strategies taking into account (a) the source of the data, (b) how feasible it is to get this data, and (c) what is the purpose bringing this data into the project.| Contributions here require more sagacity than technical skills.| [GitHub Issues](https://github.com/codelandev/serenata-de-amor/issues) |
| **Getting the data** | Once one agrees that a certain _possibility_ is worth it, one might want to start writing code to get the data; this script goes into the src directory. | Technical skills in scrapping data and using APIs. | `src/` and `data/` |
| **Exploring** | Once data is ready to be used, one might want to start to analyze it. | Here what matters is mostly data science skills. | `develop/` |
| **Reporting** | Once a relevante finding emerge from the previous stages, this finding might be gathered with similar other findings (e.g. put together explorations on air tickets, car rentals and geolocation under a report on transportation) on a report. | Contributions here requires basic understanding of quantitative methods and also communication skills. | `report/` |
| **Possibilities** | To structure hypotheses and strategies taking into account (a) the source of the data, (b) how feasible it is to get this data, and (c) what is the purpose of bringing this data into the project.| Contributions here require more sagacity than technical skills.| [GitHub Issues](https://github.com/codelandev/serenata-de-amor/issues) |
| **Data collection** | Once one agrees that a certain _possibility_ is worth it, one might want to start writing code to get the data (these scripts go into `src/`). | Technical skills in scraping data and using APIs. | `src/` and `data/` |
| **Exploring** | Once data is ready to be used, one might want to start to explore and analyze it. | Here what matters is mostly data science skills. | `develop/` |
| **Reporting** | Once a relevant finding emerges from the previous stages, this finding might be gathered with other similar findings in a report (e.g. put together explorations on air tickets, car rentals and geolocation under a report on transportation). | Contributions here require good communication skills and a very basic understanding of quantitative methods. | `report/` |

## Jarbas

As soon as we started _Serenata de Amor_ [we felt the need for a simple webservice](https://github.com/datasciencebr/serenata-de-amor/issues/34) to browse our data and refer to the documents we analyze. This is how [Jarbas](https://github.com/datasciencebr/jarbas) was created.

If you fancy web development, feel free to check Jarbas's source code, to check [Jarbas's own Issues](https://github.com/datasciencebr/jarbas/issues) and to contribute there too.

20 changes: 0 additions & 20 deletions README-en.md
@@ -27,25 +27,5 @@ Unlike the politicians we investigate, we don't get fortunes in a daily basis. B

* [Bitcoin Wallet](bitcoin:1Gg9CVZNYmzMTAjGfMg62w3b6MM7D1UAUV?amount=0.01&message=Supporting%20project%20Serenata%20de%20Amor) `1Gg9CVZNYmzMTAjGfMg62w3b6MM7D1UAUV`


## Public Data Lists and Strategies for Mass Collection

### Data scraping
| Information | Reason | Strategy |
|------------|--------|------------|
| [Quota for Exercising Parliamentary Activity](http://www.camara.gov.br/cota-parlamentar/) (Federal Deputies) | List expenses of federal deputies | Data scraping |
| [Quota for Exercising Parliamentary Activity](http://www25.senado.leg.br/web/transparencia/sen/) (Senators) | List expenses of senators | Data scraping |
| [Lease sale estimation of real estate](ftp://ftp.ibge.gov.br/Contas_Nacionais/Sistema_de_Contas_Nacionais/Notas_Metodologicas_2010/06_aluguel.pdf) | Compare the price on invoices to the value of square meter in the area. | Data scraping from [a study by IBGE](http://seriesestatisticas.ibge.gov.br/series.aspx?vcodigo=PRECO415). |

### API
| Information | Reason | Strategy |
|------------|--------|------------|
| [Google Street View](https://developers.google.com/maps/documentation/streetview/) | Identify whether the address is residential or commercial | API |
| [CNPJ consultation with the Federal Revenue](http://www.receita.fazenda.gov.br/pessoajuridica/cnpj/cnpjreva/cnpjreva_solicitacao.asp) | Gather information on where public money was spent | Captcha. (test [“alternative” API without captcha](http://receitaws.com.br) and assess viability) |
| [Facebook Graph API](https://developers.facebook.com/docs/graph-api) | Identify if two people know eachother | _Pending_: [API shows just friends with the same app installed.](https://developers.facebook.com/docs/graph-api/reference/user/friends/).

## Contribute
Please read and follow our **[contributing guidelines](CONTRIBUTING.md)**.

## Joining the conversation
The conversation about the project (always in English) happens in a [Telegram](https://telegram.org/) group. [Click here](https://telegram.me/joinchat/AKDWcwgjD0QPd6KqEG11tg) to join and get to know everyone involved.
19 changes: 0 additions & 19 deletions README.md
@@ -27,25 +27,6 @@ Unlike the politicians we investigate, we don't earn daily fortunes.

* [Bitcoin Wallet](bitcoin:1Gg9CVZNYmzMTAjGfMg62w3b6MM7D1UAUV?amount=0.01&message=Supporting%20project%20Serenata%20de%20Amor) `1Gg9CVZNYmzMTAjGfMg62w3b6MM7D1UAUV`

## Public Data Lists and Strategies for Mass Collection

### Data scraping
| Information | Reason | Strategy |
|------------|--------|------------|
| [Quota for Exercising Parliamentary Activity](http://www.camara.gov.br/cota-parlamentar/) (Federal Deputies) | List expenses of federal deputies | Data scraping |
| [Quota for Exercising Parliamentary Activity](http://www25.senado.leg.br/web/transparencia/sen/) (Senators) | List expenses of senators | Data scraping |
| [Estimate of real estate rent](ftp://ftp.ibge.gov.br/Contas_Nacionais/Sistema_de_Contas_Nacionais/Notas_Metodologicas_2010/06_aluguel.pdf) | Compare the price on invoices with the value of the square meter in the area. | Data scraping from [an IBGE study](http://seriesestatisticas.ibge.gov.br/series.aspx?vcodigo=PRECO415). |

### API
| Information | Reason | Strategy |
|------------|--------|------------|
| [Facebook Graph API](https://developers.facebook.com/docs/graph-api) | Identify whether two people know each other | _Pending_: [API shows only friends with the same _app_ installed](https://developers.facebook.com/docs/graph-api/reference/user/friends/).
| [CNPJ consultation with the Federal Revenue](http://www.receita.fazenda.gov.br/pessoajuridica/cnpj/cnpjreva/cnpjreva_solicitacao.asp) | Gather information on where public money was spent | Captcha. (test [“alternative” API without captcha](http://receitaws.com.br) and assess viability) |
| [Google Street View](https://developers.google.com/maps/documentation/streetview/) | Identify whether the address is commercial or residential | API |

## Contributing
To contribute, just follow our **[contributing guide](CONTRIBUTING.md)**.

## Joining the conversation
The conversation about the project happens in a [Telegram](https://telegram.org/) group, **all in English**, since we have contributors from other countries and want to contribute to other countries as well.
