Script to Download and include Supervised Learning #126
Hooray. Let's avoid reinventing the wheel. Actually these 3 points are already covered: we have methods for that in the toolbox. Also, I don't think that by now we need a special directory, method, or class for supervised learning files. Finally, downloading the PDFs is already automated. That said, the focus of this issue is simply to foster the development of a script/method able to generate this file (i.e. port it from the other repository/notebook to the toolbox, in case anyone willing to reproduce our steps wants to start from scratch).
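For reference, a minimal sketch of the kind of existing download path being referred to, assuming the `serenata_toolbox.datasets.fetch` helper is the entry point (the exact call may differ):

```python
# Sketch only: assumes serenata_toolbox.datasets exposes a fetch helper
from serenata_toolbox.datasets import fetch

# Download an already published dataset from the project storage into data/
fetch('2016-12-06-reimbursements.xz', 'data/')
```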
Hi @cuducos
I totally agree with you that all the methods to download a given quantity of reimbursements or full datasets are already implemented, as well as the methods to upload them using the proper credentials. Therefore, the focus of this issue is not to reinvent the wheel, but to integrate them.
I agree that it may be too early for this right now. We only have this one curated dataset and one classifier. However, if more people build such models, the future architecture of the serenata-toolbox should be able to deal with this demand smoothly. Furthermore, it could open the opportunity to receive new contributions to Rosie.
As I mentioned in point 2, and from my experience building these curated files, the removal of files by the Chamber is always a surprise. While I was publishing the CSV on Google Docs, Felipe Cabral and I saw some reimbursements without the link to the PDF, even though I had all these files on my computer because I had downloaded them one month before. So, beyond being unable to reproduce the experiments, we are losing curated reimbursements. I mean, we only have the link, not the PDF or image to show to the Chamber and ask for clarifications. Maybe we could discuss this and restructure it better.
Ok… first a minor comment:
I agree that we could have these receipts converted to PNG and stored. And we can upload anything we want to some storage like S3. The point is that anything uploaded there must be produced by scripts available in our repos ; ) That said, the issue is very straightforward IMHO:
Am I missing anything? Many thanks for all the effort and clarification, @silviodc — let's tackle it!
Continuing...
Do you mean you would like to run this workflow on our machine in order to upload the result to the storage?
Second workflow:
I'm okay with it. Even if some receipts can no longer be downloaded, we are still left with a representative quantity of data.
I didn't get your point here. Could you explain it better? I mean, once I have the reference data, generating a new model on my machine is easy. The point I mentioned is: we should avoid the overhead of training from the references every time, since the model hardly changes. I guess we have to allow people to download Rosie's official models. I provided mine as a baseline, but we can generate others.
It sounds much better now.
Version controlling the trained model is perfectly fine. However, if we change the images we also have to change the reference. That means curating again some parts of our gold standard, and it is a laborious task to ask people to curate again something they have already done. I think we could include new references, but never change the previously curated ones. What do you think?
Just clarifying the last part of my previous comment.
Considering this part, take a look at this suspicious reimbursement found by Rosie. Probably, at the beginning, we would build the references classifying it as a positive (suspicious). We would have a stored image and our models would use it during training, which is fine. However, as you can see, the Chamber changed the receipt: the deputy included a new page clarifying that this receipt is not suspicious. That is completely fine for the Chamber's API, but for our robot or contributors it can bring some problems in the future if we always download the PDFs or use the CSV reference alone. That said, I also think we could provide the images in small blocks, e.g., ~100MB each (see the sketch after the next paragraph). To solve this:
Furthermore, we have to state that these images reflect the CSV reference files, not whether a reimbursement is currently suspicious or not in the Chamber. That will avoid future surprises caused by unclear terminology/statements and by the live data from the Chamber's API. We could go for a second phase of reclassification of these references once a year and then change the entire reference. IMHO it avoids inconsistency problems that would otherwise be reflected in our training/predictions for supervised classifiers.
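As an illustration only of the "~100MB blocks" idea mentioned above (the paths, names, and block size are placeholders, not a decided layout), a sketch that packs the PNG receipts into tar.gz archives of roughly 100MB each:

```python
import os
import tarfile
from pathlib import Path

MAX_BLOCK_BYTES = 100 * 1024 * 1024  # ~100MB per block, as suggested above


def pack_receipts(source_dir, output_prefix='receipts-block'):
    """Group PNG receipts into tar.gz archives of roughly MAX_BLOCK_BYTES each
    (measured on the uncompressed files, which is enough for a rough split)."""
    block, block_size, archive = 0, 0, None
    for png in sorted(Path(source_dir).glob('*.png')):
        if archive is None or block_size >= MAX_BLOCK_BYTES:
            if archive is not None:
                archive.close()
            block += 1
            block_size = 0
            archive = tarfile.open('{}-{:03d}.tar.gz'.format(output_prefix, block), 'w:gz')
        archive.add(str(png), arcname=png.name)
        block_size += os.path.getsize(png)
    if archive is not None:
        archive.close()


# pack_receipts('data/meal_generalization_receipts/')
```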
Hi @willianpaixao, I suggest you have a look at these 3 pull requests: okfn-brasil/serenata-de-amor#238 and okfn-brasil/serenata-de-amor#286. I think they are a good start to implement what we discussed here. Let me know if you have any questions. Best,
After the contribution of many people, we built a gold standard reference to indicate whether a reimbursement is a generalization or not.
Example of generalization:
5635048.pdf
Not a generalization:
5506259.pdf
Our reference sample consists of 1,691 suspicious and 1,691 not-suspicious reimbursements (link). It was manually curated, as explained in this video made by Felipe Cabral (apoia.se).
The goal of this dataset is to deal with this part of CEAP:
Thus, this issue aims at the following:
First objective: find below the files I have:
PNG images
CSV reference
Regarding the CSV files, we have to include the direct link to the Chamber of Deputies. Right now they only have the link to Jarbas.
To do that easily, you take the document id from the CSV file and build the full link using a method like the one sketched below:
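A minimal sketch of such a method, assuming the receipt URL pattern used elsewhere in the project and that the reference CSV carries (or can be joined with) the `applicant_id`, `year`, and `document_id` columns:

```python
# Sketch only: assumes the Chamber of Deputies receipt URL pattern used
# elsewhere in the project, and that applicant_id, year and document_id
# are available for each row of the reference CSV.
def receipt_url(applicant_id, year, document_id):
    """Build the direct Chamber of Deputies link for a reimbursement receipt."""
    base = 'http://www.camara.gov.br/cota-parlamentar/documentos/publ'
    return '{}/{}/{}/{}.pdf'.format(base, applicant_id, year, document_id)


# With a pandas DataFrame loaded from the reference CSV:
# df['receipt_url'] = df.apply(
#     lambda row: receipt_url(row['applicant_id'], row['year'], row['document_id']),
#     axis=1,
# )
```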
The dataset I used was this one:
The first objective will allow more people to access these curated files in order to replicate our experiments and create new ones!
Second objective: the goal is to be able to call some method like the one sketched below:
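A sketch of what that call could look like; the file names below are hypothetical placeholders, and it simply reuses the toolbox's existing `fetch` helper:

```python
# Hypothetical usage; the file names are placeholders, not published datasets.
from serenata_toolbox.datasets import fetch

# Grab the curated reference CSV and the archived PNG receipts into data/
fetch('meal-generalization-reference.csv', 'data/')
fetch('meal-generalization-receipts.tar.gz', 'data/')
```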
It makes it easy to integrate the mentioned files into other parts of the project, e.g., a classifier using these files or analyses using these data.
Third objective, as you can see in the mentioned link: [Classifier using these data](okfn-brasil/rosie#66)
Uploading big files to git is not good practice. Therefore, to facilitate the contribution of new models to Rosie, we have to create a method that specifies which model we would like to retrieve.
Example right now:
Proposed:
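A hypothetical sketch of the proposed call; the function name, default path, and model file name are placeholders, only to illustrate retrieving a specific pre-trained model instead of committing it to git:

```python
# Hypothetical interface; names below are illustrative only.
from serenata_toolbox.datasets import fetch


def fetch_model(name, destination='rosie/chamber_of_deputies/models/'):
    """Download a named pre-trained model from the project storage,
    keeping large binaries out of version control."""
    fetch(name, destination)


# Rosie, or a contributor, asks for the exact model it needs:
fetch_model('meal_generalization.pkl')
```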
It will allow us, in the future, to include more models and re-train the existing ones to be more robust.
For this task, find my model below:
Meal Generalization
PS: to upload files, we have this method in the toolbox: remote.py