Feature request: Data aggregation #13

awerlang · 2016-09-22T02:43:34Z

One way to quickly explore huge amounts of data is through data aggregation. For instance, what are all the CNPJ/CPFs found in expenses? Which deputies spent the most money?

Does this API aim to provide such a feature? If you intend to address this in any other fashion, please let me know.

cuducos · 2016-09-22T12:08:01Z

Hi @awerlang, many thanks for your request and questions ; )

The use case that gave birth to Jarbas was a simple API to easily share documents found during data exploration and analysis. With that in mind, Jarbas wasn't designed to explore data, but to bring you the data of receipts you found while exploring the datasets (presumably with Jupyter Notebooks inside Serenata de Amor's src/ directory).

IMHO Jupyter Notebooks and the quantitative analysis tools packed with Anaconda (the main Python distribution at Serenata de Amor) are way better for exploring data than a standard Python distribution running Django (what we have here at Jarbas).

That said, I'd address your questions in these terms:

What are all the CNPJ/CPFs found in expenses? Which deputies spent the most money?

Feel free to create a Jupyter Notebook within the Serenata de Amor repo to explore that; it will be a better choice in terms of performance.

Also, working with Jupyter Notebooks lets you reflect on the bias of each exploration (e.g. asking which deputy spends more will probably highlight deputies from northern states, as they have a higher allowance).
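As a rough illustration of how the two quoted questions could be approached in a notebook, here is a minimal pandas sketch. The sample rows are made up, and the column names (`congressperson_name`, `cnpj_cpf`, `net_value`) are assumptions for illustration; check them against the dataset you actually load.

```python
import pandas as pd

# Hypothetical sample standing in for an expenses dataset; the column
# names and values below are assumptions, not real data.
expenses = pd.DataFrame(
    {
        "congressperson_name": ["Deputy A", "Deputy B", "Deputy A", "Deputy C"],
        "cnpj_cpf": ["11111111000111", "22222222000122", "11111111000111", "33333333333"],
        "net_value": [100.0, 250.5, 40.0, 10.0],
    }
)

# All distinct CNPJ/CPFs found in the expenses.
suppliers = sorted(expenses["cnpj_cpf"].unique())

# Total spent per deputy, highest first.
totals = (
    expenses.groupby("congressperson_name")["net_value"]
    .sum()
    .sort_values(ascending=False)
)

print(suppliers)
print(totals.head(10))
```

With a real dataset, the `DataFrame` would typically come from something like `pd.read_csv` on the source file rather than being built inline.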

Does this API aim to provide such a feature?

I would say that's not on our radar. The API is useful to list the data of the receipt/s (and, in the future, of the supplier/s) related to the documents found in Serenata de Amor's exploration and analysis.

But, again, this is just my humble opinion.

In spite of all that, I acknowledge that delegating this kind of analysis to Jupyter Notebooks makes these data less accessible. But I do believe it's a temporary condition: once we find relevant answers to questions such as "which deputies spent the most money?", they will probably feed communication and PR material, and the numbers will then become accessible to anyone. I also acknowledge that this route carries the bias of our own curatorial layer.

I would like to hear from more people (including you, André) about two specific things related to the question raised by André:

Are there use cases for Jarbas that we (I?) might be missing in these comments? Which ones?
Should Jarbas's focus be on internal usage (to support Serenata de Amor) or on a public interface to the datasets we have? If the latter, how different from the OPS interface would it be?

awerlang · 2016-09-22T18:58:28Z

I have yet to try out Jupyter. Until I'm able to do that, I'll say that this API can keep its current focus. My original question was more about the API being able to crunch numbers than about providing a nice UI for exploring. I do agree we don't need to reinvent the wheel for that.

Also, I should have added this earlier, but anyway: I filed this issue because it's more time-efficient to let the database work on the data than to fetch every single result from the API and then process it in memory. So, if we want to work on a dataset, should we load all the data from the source files into memory inside Jupyter, then do anything we want with a programming language, all in memory? If so, have you gotten good performance?
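To make the trade-off in the previous paragraph concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `expenses` table, its columns, and its rows are hypothetical and only stand in for whatever the real database holds; they are not Jarbas's actual schema.

```python
import sqlite3

# In-memory database with a hypothetical expenses table; the schema and
# rows are made up for illustration and are not Jarbas's actual models.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expenses (deputy TEXT, cnpj_cpf TEXT, net_value REAL)")
conn.executemany(
    "INSERT INTO expenses VALUES (?, ?, ?)",
    [
        ("Deputy A", "11111111000111", 100.0),
        ("Deputy B", "22222222000122", 250.5),
        ("Deputy A", "11111111000111", 40.0),
    ],
)

# Option 1: let the database aggregate; only one row per deputy comes back.
db_totals = dict(
    conn.execute("SELECT deputy, SUM(net_value) FROM expenses GROUP BY deputy")
)

# Option 2: fetch every row and aggregate in memory on the client side.
client_totals = {}
for deputy, net_value in conn.execute("SELECT deputy, net_value FROM expenses"):
    client_totals[deputy] = client_totals.get(deputy, 0.0) + net_value

print(db_totals == client_totals)
conn.close()
```

Both options give the same totals; the difference is how many rows have to travel from the data store to the code doing the analysis.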

I did not know about the OPS tool you linked; it's a good starting point. I would like to see charts. Maps to compare flight ticket expenses would be awesome too.

PS: As I said, I have yet to try out Jupyter. Perhaps I'll be satisfied once I try it.

cuducos · 2016-09-22T22:39:10Z

If so, have you gotten good performance?

Sure, that's what Anaconda (Python, NumPy, Pandas & co.) is for : ) They are pretty clever at managing datasets, so go for it. And Jupyter Notebooks are awesome to explore, to analyze and, last but not least, to share your work.

cuducos added enhancement question and removed enhancement labels Sep 22, 2016

Irio closed this as completed Sep 24, 2016
