This repository has been archived by the owner on Feb 28, 2018. It is now read-only.

Feature request: Data aggregation #13

Closed
awerlang opened this issue Sep 22, 2016 · 3 comments

@awerlang

One way to quickly explore huge amounts of data is through data aggregation. For instance, what are all CNPJ/CPFs found in expenses? Which deputies spent the most money?

Does this API aim to provide such a feature? If you intend to address this in any other fashion, please let me know.

@cuducos
Collaborator

cuducos commented Sep 22, 2016

Hi @awerlang, many thanks for your request and questions ; )

The use case that gave birth to Jarbas was a simple API to easily share documents found during data exploration and analysis. With that in mind, Jarbas wasn't designed to explore data, but to bring you the data of receipts you found while exploring the datasets (presumably with Jupyter Notebooks inside Serenata de Amor's src/ directory).

IMHO Jupyter Notebooks and the quantitative analysis tools packed with Anaconda (the main Python at Serenata de Amor) are way better for exploring data than a standard Python distribution running Django (what we have here at Jarbas).

That said, I'd address your questions in these terms:

what are all CNPJ/CPFs found in expenses? Which deputies spent the most money?

Feel free to create a Jupyter Notebook within the Serenata de Amor repo to explore that; it will be a better choice in terms of performance.

Also, working with Jupyter Notebooks allows you to ponder the bias of each exploration (e.g. asking which deputy spends more probably tends to highlight deputies from Northern states, as they have a higher allowance).

Does this API aim to provide such a feature?

I would say that's not on our radar. The API is useful to list the data of the receipt/s (and, in the future, of the supplier/s) related to the documents found in Serenata de Amor's exploration and analysis.

But, again, this is just my humble opinion.

In spite of all that, I acknowledge that delegating this kind of analysis to Jupyter Notebooks makes these data less accessible. But I do believe it's a temporary condition: once we find relevant data from questions such as "which deputies spent the most money?", they will probably foster communication and PR material, and the numbers will then become accessible to anyone. And I also acknowledge that this route carries the bias of our own curatorial layer.

I would like to hear from more people (including you, André) about two specific things related to the question André raised:

  • Are there use cases for Jarbas that we (I?) might be missing in these comments? Which ones?
  • Should Jarbas's focus be on internal usage (to support Serenata de Amor) or on a public interface to the datasets we have? If the latter, how different from the OPS interface would it be?

@awerlang
Author

I have yet to try out Jupyter. Until I'm able to do that, I'd say this API can keep its current focus. My original question was more about the API being able to crunch numbers than about providing a nice UI for exploring. I do agree we don't need to reinvent the wheel for that.

Also, I should have added this earlier, but anyway: I filed this issue because it's more time-efficient to let the database work on the data than to fetch every single result from the API and then process it in memory. So, if we want to work on a dataset, should we load all the data from the source files into memory inside Jupyter, then do anything we want with a programming language, all in memory? If so, do you get good performance?

I did not know about the OPS tool you linked; it's a good starting point. I would like to see charts. Maps comparing flight ticket expenses would be awesome too.

PS: As I said, I have yet to try out Jupyter. Perhaps I'll be satisfied once I do.
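
To make the trade-off in this comment concrete, here is a minimal sketch contrasting the two approaches: letting the database aggregate versus fetching every row and summing on the client. It uses an in-memory SQLite database purely as a stand-in (not Jarbas's actual backend), and the table name, column names, and rows are all made up for illustration.

```python
import sqlite3

# In-memory SQLite as a stand-in database; the schema and rows below are
# hypothetical and only exist to illustrate the two approaches.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE expenses (deputy TEXT, cnpj_cpf TEXT, net_value REAL)"
)
conn.executemany(
    "INSERT INTO expenses VALUES (?, ?, ?)",
    [
        ("Deputy A", "11111111000101", 100.0),
        ("Deputy A", "22222222000102", 50.5),
        ("Deputy B", "11111111000101", 300.0),
        ("Deputy C", "33333333333", 20.0),
    ],
)

# Approach 1: the database does the aggregation and returns one row per deputy.
db_totals = dict(
    conn.execute("SELECT deputy, SUM(net_value) FROM expenses GROUP BY deputy")
)

# Approach 2: fetch every single row, then aggregate in memory on the client.
client_totals = {}
for deputy, _cnpj_cpf, net_value in conn.execute(
    "SELECT deputy, cnpj_cpf, net_value FROM expenses"
):
    client_totals[deputy] = client_totals.get(deputy, 0.0) + net_value

print(db_totals)
print(client_totals)
```

Both approaches produce the same totals; the difference is how many rows have to travel from the data store to the client, which is what matters once the rows sit behind a paginated API.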

@cuducos
Collaborator

cuducos commented Sep 22, 2016

If so, do you get good performance?

Sure, that's what Anaconda (Python, NumPy, Pandas & co.) is for : ) They are pretty clever at managing datasets, so go for it. And Jupyter Notebooks are awesome to explore, to analyze and, last but not least, to share your work.
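
As a rough idea of what such a notebook cell could look like, here is a minimal Pandas sketch answering the two questions from the top of the thread (all CNPJ/CPFs found in expenses, and which deputies spent the most). The column names `congressperson_name`, `cnpj_cpf`, and `net_value` and the sample rows are assumptions for illustration; a real notebook would read one of the project's dataset files instead of the inline string.

```python
import io

import pandas as pd

# Made-up sample rows standing in for a reimbursements CSV; the column
# names are assumptions and may not match the real dataset.
SAMPLE_CSV = """congressperson_name,cnpj_cpf,net_value
Deputy A,11111111000101,100.0
Deputy A,22222222000102,50.5
Deputy B,11111111000101,300.0
Deputy C,33333333333,20.0
"""

# Read cnpj_cpf as a string so identifiers keep any leading zeros.
expenses = pd.read_csv(io.StringIO(SAMPLE_CSV), dtype={"cnpj_cpf": str})

# Question 1: all distinct CNPJ/CPFs found in expenses.
unique_suppliers = sorted(expenses["cnpj_cpf"].unique())

# Question 2: total spent per deputy, largest first.
total_by_deputy = (
    expenses.groupby("congressperson_name")["net_value"]
    .sum()
    .sort_values(ascending=False)
)

print(unique_suppliers)
print(total_by_deputy.head())
```

As noted earlier in the thread, a raw ranking like this tends to favor deputies with higher allowances, so it is a starting point for exploration rather than a conclusion.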

@Irio closed this as completed Sep 24, 2016