Detect multiple reimbursements using the same receipt #32
Comments
If you remember by heart (otherwise I look for it in the Just asking because if the real |
Sounds great. SIFT is new to me, but it looks very effective for this kind of problem. Awesome! |
I feel like SIFT is great for finding similar things (e.g., receipts with the same layout), but it is probably not a good option for deciding whether 2 receipts are the same or not. |
Check the paper "Region Duplication Forgery Detection Technique Based on SURF and HAC" for references (https://sci-hub.cc/ is your friend). Here's an example of Python code to run SIFT. |
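The example linked in the comment above did not survive in this thread. As a stand-in, here is a minimal sketch of running SIFT on two receipt images in Python. It assumes OpenCV 4.4 or later (`opencv-python`), where `cv2.SIFT_create` is available. The function names `lowe_ratio_filter` and `count_sift_matches` and the 0.75 ratio are illustrative assumptions, not code from this project.

```python
def lowe_ratio_filter(knn_pairs, ratio=0.75):
    """Keep a match only if it is clearly better than the runner-up.

    `knn_pairs` is an iterable of (best, second_best) match objects that
    expose a `.distance` attribute, as returned by OpenCV's knnMatch(k=2).
    """
    good = []
    for pair in knn_pairs:
        if len(pair) < 2:
            # Not enough neighbours to compare; skip this keypoint.
            continue
        best, second = pair
        if best.distance < ratio * second.distance:
            good.append(best)
    return good


def count_sift_matches(path_a, path_b, ratio=0.75):
    """Count SIFT matches between two images that pass the ratio test."""
    import cv2  # imported lazily; assumes opencv-python >= 4.4

    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    if img_a is None or img_b is None:
        raise FileNotFoundError("could not read one of the images")

    sift = cv2.SIFT_create()
    _, desc_a = sift.detectAndCompute(img_a, None)
    _, desc_b = sift.detectAndCompute(img_b, None)
    if desc_a is None or desc_b is None:
        return 0  # at least one image produced no descriptors

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn_pairs = matcher.knnMatch(desc_a, desc_b, k=2)
    return len(lowe_ratio_filter(knn_pairs, ratio))
```

A high match count only says the two images share many local features. As noted above, receipts with the same layout can score high without being the same receipt.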
Came across 2 examples where 2 distinct reimbursements have the same On the first one the value that is presented as the Here are the A similar thing happens with these other 2 documents: 5780419 and 5880166. Where the operator's number (t00408151) of a highway toll is used as the |
I'm not sure this is a problem per se. I mean, AFAIK the That said, it seems to me that it's a matter of typing the wrong data, not sure if it's compromising… |
Understood. I didn't know that the number of the receipt could be duplicated just by coincidence. My thought at the time was that typing the wrong data might be a very common mistake, thus making |
Good point! |
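Since a receipt number alone can repeat by coincidence or through a typing mistake, one way to narrow the candidates is to group reimbursements on several fields at once. The sketch below is only an illustration: the field names `supplier_id`, `document_number`, `value` and `document_id`, as well as the sample rows, are hypothetical and do not come from the dataset schema.

```python
from collections import defaultdict


def candidate_duplicates(reimbursements,
                         keys=("supplier_id", "document_number", "value")):
    """Group rows that agree on every field in `keys`.

    Returns a list of lists of document ids; each inner list holds two or
    more reimbursements that share the same key values.
    """
    groups = defaultdict(list)
    for row in reimbursements:
        group_key = tuple(row[k] for k in keys)
        groups[group_key].append(row["document_id"])
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical rows, for illustration only.
rows = [
    {"document_id": 1, "supplier_id": "A", "document_number": "100", "value": 50.0},
    {"document_id": 2, "supplier_id": "A", "document_number": "100", "value": 50.0},
    {"document_id": 3, "supplier_id": "B", "document_number": "100", "value": 12.0},
]
print(candidate_duplicates(rows))
```

Row 3 shares the number with the others but differs in supplier and value, so it is not grouped with them. Groups found this way are candidates for a closer look, not proof of a duplicated receipt.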
Good news for the detection of duplicate reimbursements. I made a notebook that converts PDF files to PNG and then detects common regions with SIFT. Receipts used: 5645173 and 5645177. Matched regions: [image] However, after some experiments I found that using SIFT alone gives us a lot of false positives. Here are the document_ids: 5886345 and 5886361. SIFT keypoints: [image] Common regions between them: [image] So I'm still working on the script to predict multiple reimbursements. I will try to combine the SIFT output with the OCR data we have in issue #188 to achieve better results. I will share more news with you as soon as possible :D |
That's awesome progress @silviodc! Many thanks for that. Even if the results still include lots of false positives, IMHO it would be great to have this notebook of yours in our |
Hi @cuducos, yes, I can open a PR. Just let me play with this data a little this weekend :D |
Hi everyone, PR #238, covering the conversion of PDF to image and the use of SIFT, is up. |
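The idea mentioned above of combining SIFT output with OCR data could look roughly like the sketch below: a pair of receipts is flagged only when both the image score and the text similarity are high. This is an assumption about how the two signals might be combined, not the approach taken in the PR. The function names, the `sift_score` input (assumed to be a value between 0 and 1, such as the share of keypoints that matched) and both thresholds are hypothetical.

```python
from difflib import SequenceMatcher


def text_similarity(text_a, text_b):
    """Similarity in [0, 1] between two OCR texts, ignoring case and spacing."""
    norm_a = " ".join(text_a.lower().split())
    norm_b = " ".join(text_b.lower().split())
    if not norm_a or not norm_b:
        return 0.0  # empty OCR output gives no evidence either way
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def looks_like_same_receipt(sift_score, text_a, text_b,
                            sift_min=0.5, text_min=0.9):
    """Flag a pair only when image and text evidence agree.

    `sift_min` and `text_min` are placeholder thresholds that would need
    tuning against labelled examples.
    """
    return (sift_score >= sift_min
            and text_similarity(text_a, text_b) >= text_min)
```

Requiring both signals is meant to cut the false positives that SIFT alone produces on receipts that share a layout but carry different text.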
For instance, we detected the same receipt being used in 2 distinct reimbursements:
http://www.camara.gov.br/cota-parlamentar/documentos/publ/2437/2015/5645173.pdf
http://www.camara.gov.br/cota-parlamentar/documentos/publ/2437/2015/5645177.pdf