
Partner list of companies receiving money from politicians #16

Closed
Irio opened this issue Aug 21, 2016 · 55 comments

Comments

@Irio
Collaborator

Irio commented Aug 21, 2016

No description provided.

@anapaulagomes
Contributor

Hi @Irio! Could you please clarify what the sources are for collecting the data? Or do we just need to be creative? :)
Thank you.

@cuducos
Collaborator

cuducos commented Sep 6, 2016

Hi @anapaulagomes, the short answer is: we just need to be creative hahaha…

The long answer is that we have talked about some possibilities: some info is available on the Federal Revenue website (search for a CNPJ, then click on “Consulta QSA / Capital Social” or something like “Certidão de Baixa de Inscrição” if the company is inactive).

Unfortunately this is behind a CAPTCHA. We have been in touch with people trying to code a workaround, with a 10% success rate, if that helps. The juntas comerciais also have this information, but they are state-level bodies, so their APIs might differ considerably.

We could try to scrape LinkedIn or Facebook for some data, but that might be difficult (no CNPJ to match, different names, different job titles, outdated and unofficial info, etc.).

And there are also alternative sites to look up CNPJ info, but I'm not sure whether they offer any info on the partners.

@Irio
Collaborator Author

Irio commented Sep 6, 2016

Sure @anapaulagomes.

For the main info about companies, we've been using ReceitaWS; it's fairly reliable but, as you can see in this example, it does not include the partner list:

http://receitaws.com.br/v1/cnpj/02703510000150

{
    "atividade_principal": [{
        "text": "Restaurantes e similares",
        "code": "56.11-2-01"
    }],
    "data_situacao": "14/12/2002",
    "tipo": "MATRIZ",
    "nome": "FRANCISCO RESTAURANTE LTDA - EPP",
    "telefone": "(61) 3226-2626",
    "situacao": "ATIVA",
    "bairro": "ASA SUL",
    "logradouro": "Q SHC/SUL CL QUADRA 402 BLOCO B LOJA 05, 09, 15",
    "numero": "S/N",
    "cep": "70.237-500",
    "municipio": "BRASILIA",
    "uf": "DF",
    "abertura": "27/06/1988",
    "natureza_juridica": "206-2 - SOCIEDADE EMPRESARIA LIMITADA",
    "cnpj": "02.703.510/0001-50",
    "ultima_atualizacao": "2016-08-24T16:58:50.057Z",
    "status": "OK",
    "fantasia": "",
    "complemento": "",
    "email": "",
    "efr": "",
    "motivo_situacao": "",
    "situacao_especial": "",
    "data_situacao_especial": "",
    "atividades_secundarias": [{
        "code": "00.00-0-00",
        "text": "Não informada"
    }]
}
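
In case it helps anyone poking at the API, here's a minimal sketch of fetching that endpoint in Python (the requests library and the helper name are my additions; the endpoint is the one above):

import requests

def fetch_company(cnpj):
    """Fetch basic company info from ReceitaWS; cnpj is digits only."""
    response = requests.get("http://receitaws.com.br/v1/cnpj/" + cnpj, timeout=30)
    response.raise_for_status()
    return response.json()

info = fetch_company("02703510000150")
print(info["nome"], "-", info["situacao"])  # FRANCISCO RESTAURANTE LTDA - EPP - ATIVA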

Here's a step-by-step to get them (as a user) from the Federal Revenue's website, probably the best of the official sources:

  1. Fill form with CNPJ and captcha.
    https://cloud.githubusercontent.com/assets/667753/18273509/fa56d942-7413-11e6-8ac5-868f899aa5e5.png
  2. Click on button "Consulta QSA / Capital Social".
    https://cloud.githubusercontent.com/assets/667753/18273526/0b4e8d08-7414-11e6-8391-ffd7d23a1555.png
  3. All yours.
    https://cloud.githubusercontent.com/assets/667753/18273533/1352f818-7414-11e6-95ed-92acea5b9848.png

As mentioned by @cuducos, we know of people breaking it using Tesseract (OCR), but they frequently get blocked by the Federal Revenue's servers given its low accuracy of ~10%. Another way I can think of to break it is Machine Learning; computer vision is one of the most researched areas in Deep Learning nowadays, e.g. https://deepmlblog.wordpress.com/2016/01/03/how-to-break-a-captcha-system/ (has a paper at the end)
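
For reference, the Tesseract route usually looks something like this minimal sketch (assuming the pytesseract and Pillow packages; the preprocessing and config are illustrative, and the poor results on real CAPTCHAs are exactly the ~10% problem mentioned above):

from PIL import Image
import pytesseract

# Grayscale, then binarize; real CAPTCHAs need heavier noise removal
image = Image.open("captcha.png").convert("L")
image = image.point(lambda px: 255 if px > 128 else 0)
# --psm 8 tells Tesseract to treat the image as a single word
guess = pytesseract.image_to_string(image, config="--psm 8")
print(guess.strip())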

@josircg

josircg commented Sep 6, 2016

Hi folks, first of all, I still don't get why we are talking in English...

The main idea is to find the biggest amounts: those will be for print bureaus, video producers, and advertising companies. Restaurants or brothels would yield just tips and cents, and we would not discover anything interesting.

So my suggestion is to first discover who the big suppliers are and what activities they performed. After that, try to discover the partners.

An interesting buzz post: https://www.facebook.com/teofb/posts/1186854511378646

He did it by hand, without any programming ;)

@lucasrcezimbra
Contributor

lucasrcezimbra commented Sep 6, 2016

If we use voice recognition on the audio CAPTCHA instead of OCR on the image, wouldn't it be easier to recognize and more accurate?

It's just an idea; I don't know which is better.
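
In case someone wants to test the idea, a minimal sketch with the SpeechRecognition package could look like this (the captcha.wav file and the pt-BR language are my assumptions):

import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("captcha.wav") as source:
    audio = recognizer.record(source)  # read the whole audio challenge
# Free Google Web Speech API; other engines (e.g. Sphinx) are also supported
print(recognizer.recognize_google(audio, language="pt-BR"))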

@cuducos
Collaborator

cuducos commented Sep 6, 2016

Hi @josircg,

Welcome to Serenata de Amor. I'll try to address all your points, but let me know if I forget any of them, ok?

Hi folks, first of all, I still don't get why we are talking in English...

It's at the bottom of our README.md, the homepage of the project here on GitHub: "The conversation about the project happens in a Telegram group, all in English, since we have contributors from other countries and we want to contribute to other countries as well."

Does that make sense?

The main idea is to find the biggest amounts […]

To keep it short: the main idea is to use computing power to find more cases than humans, doing it manually, would be able to find. That's the purpose of the project. Surely big cases are eye-catching, but we start from the assumption that corruption starts small; accordingly, there is important value in focusing on the small cases too.

He did it by hand, without any programming ;)

This post is amazing, and so is OPS: doing it all manually, they denounced and succeeded in cases summing more than R$5 million. We do not compete with or replace these examples. We are inspired by them and try to expand their investigative power ; )

@cuducos
Collaborator

cuducos commented Sep 6, 2016

@Lrcezimbra's idea looks amazing! Is there any project/script using this strategy to break CAPTCHAs?

@anapaulagomes
Contributor

Good inputs! I'd like to work on it (I can't assign it to myself). I was looking for different sources and I found this interesting tool called Câmara Transparente, developed by FGV. I'll take a look at other sources and keep you updated.

@cuducos cuducos assigned cuducos and unassigned cuducos Sep 6, 2016
@urubatan

Not sure if we can get the data we want for this, but I just found this site: http://www.consultasocio.com/

It allows listing the companies in which someone is a partner, and the other partners in those companies, for example http://www.consultasocio.com/q/sa/abel-salvador-mesquita-junior

We can try to scrape that data starting with the politicians' names and create a database of which companies belong to each deputy, who else is a partner in those companies, and which companies belong to those partners.

If this would help, I can assign the issue to myself and write the scraper.
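
Something like this minimal sketch could be a starting point (assuming requests + BeautifulSoup; the CSS selector is a placeholder, since the real page structure must be inspected first):

import requests
from bs4 import BeautifulSoup

def partners_for(slug):
    url = "http://www.consultasocio.com/q/sa/" + slug
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Placeholder selector; replace it after inspecting the actual HTML
    return [el.get_text(strip=True) for el in soup.select(".socio")]

print(partners_for("abel-salvador-mesquita-junior"))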

@anapaulagomes
Contributor

@urubatan I'm starting to work on this issue, but you can certainly help! :) Maybe you can do the scraper for this website and I'll work on the other one.

@tomascamargo

@Irio we have about 1 million companies with information about partners (QSA) in our database. Such data is not accessible via an API, but if you provide us with a list of names, we will be happy to run a search for it. All the information was obtained from the Receita Federal website and is public. Please note that the QSA information does not include the CPF number, so searches are based on the name and are therefore subject to namesakes.

@Irio
Collaborator Author

Irio commented Sep 14, 2016

@urubatan That sounds very promising.

@tomascamargo Would it be possible to query 433 names? 👀 These are all the unique congresspeople listed since the creation of CEAP.

@urubatan

@Irio great, my Python is not great (I'm planning to help the project to learn Python and data science :P ), but I'll write a scraper for that and send a pull request.

@tomascamargo

@Irio yes, we can try with this. Can you please provide us with the names?

@mtrovo
Contributor

mtrovo commented Sep 18, 2016

Hey guys, I was able to get the information from the main site without needing the CAPTCHA. I'm finishing a PoC here and will send it to you.

@awerlang

@mtrovo have you published the results?

@awerlang

awerlang commented Sep 24, 2016

I published a tool to fetch a company's partner list (names only) at https://github.com/awerlang/cnpj-rfb. There's a manual step requiring you to visit the RFB website; then the rest is automated. I found that some companies break the process, in which case you'll need to repeat it. It would be best to filter these companies out ("baixada", "natureza jurídica inválida", some "S.A.", "filial"). I guess this is our best shot atm.
If anyone has a list of all the companies we need to query, let's run it through this tool.

@awerlang

This is a list with 8417 CNPJs and CPFs found in expenses in 2016 up to now. I'm also attaching ~80 CNPJs I was able to fetch with the tool I announced the other day. Currently I'm experiencing a connection reset every ~40 operations (ECONNRESET, and I'm not able to access the RFB web page for some time); that would mean about 200 runs, i.e. several days. And we would want to run on 2015 data too.

I also updated the tool to ignore CPFs, as well as CNPJ entries with certain conditions that would lead to errors.

Help is most appreciated! How can we unblock HTTP access in this situation?

https://gist.github.com/awerlang/3a8b3f286a0bcceb2ae367ad2e09af21

@cuducos
Collaborator

cuducos commented Nov 8, 2016

Summarizing this topic

I'm sorry if my statements look like I'm pointing at failures, but I want to highlight what the pain points really are.

@tomascamargo has a great private database. If we ask you for a query for a long list (60k) of CNPJs, would you generate a simple dump for us (cnpj, partner) — in case of more than one partner, one cnpj could be repeated in subsequent rows? If so, I'll export a list of CNPJs today.
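
For instance, the dump could look like this (both the CNPJ and the partner names are made-up placeholders, just showing the repeated-CNPJ layout):

cnpj,partner
00.000.000/0001-91,FULANO DE TAL
00.000.000/0001-91,BELTRANO DA SILVA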

@awerlang Your solution looks great — it's the best we have so far, but restarting it, plus latency due to a probable server-side block, plus the manual session ID copying and pasting, is still an issue. Maybe the barrier is as great as breaking the CAPTCHA (10% success rate and successive blocks). If @tomascamargo can bootstrap this dataset with a query in his system, we can use your solution to update the dataset when we get new data from the Lower House.

@urubatan Consulta Sócio is useful to get companies registered in politicians' names, but it would also be very interesting to have the full list of partners (to get companies held by politicians' relatives, as in #107 for example). That's why I'm still writing in this Issue ; )

I'm willing to put some effort into this issue to create this dataset. I'm glad for all the discussion, references, and opportunities. Let's put the pieces together to make it work ; )

@josircg

josircg commented Dec 7, 2016

Fresh news about Consulta Sócio:

http://convergenciadigital.uol.com.br/cgi/cgilua.exe/sys/start.htm?UserActiveTemplate=site&infoid=44189&sid=4&utm_source=dlvr.it&utm_medium=twitter

If we get this database, it should be obfuscated somehow, or it could be a Trojan horse against the whole project.

@marcusrehm
Contributor

Hey guys, is there anyone working on this issue?

@cuducos
Collaborator

cuducos commented Apr 2, 2017

AFAIK there isn't. @jtemporal and I took a look at the two scripts collecting data from ReceitaWS and we felt that they could be refactored before collecting data again — but this is not our priority right now. Feel free to jump in, coding or discussing ; )

@jtemporal
Collaborator

Same thing here, AFAIK there isn't. Feel free to adopt it, @marcusrehm ;) It would be much appreciated.

@marcusrehm
Contributor

Yes! I can take this one. Actually, I was waiting for this one to be done so we could play with that data on neo4j... 😄

Could you please point out which scripts are related to this issue? If you can, we can also discuss the refactoring, and then see what can be done along with the inclusion of the partner list.

/cc @jtemporal @cuducos

@cuducos
Collaborator

cuducos commented Apr 3, 2017

Two scripts, basically: fetch_cnpj_info.py and clean_cnpj_info_dataset.py. My comments in favor of a massive refactor:

  • The script is not so effective: it quickly starts getting blocked by ReceitaWS and we have to re-run it several times
  • Maybe the cleaning could be done in the parser logic, not in an external script

@cuducos
Collaborator

cuducos commented Apr 4, 2017

The script is not so effective: it quickly starts getting blocked by ReceitaWS and we have to re-run it several times

Just checked: Try to better handle 429 too many requests responses

Maybe the cleaning could be done in the parser logic, not in an external script

And maybe move it to the toolbox
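
For the 429 handling, something along these lines might do (a rough sketch with requests; the helper name and the delays are illustrative):

import time
import requests

def get_with_backoff(url, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        response = requests.get(url, timeout=30)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        # Honor the server's Retry-After hint when present, else back off exponentially
        delay = float(response.headers.get("Retry-After", base_delay * 2 ** attempt))
        time.sleep(delay)
    raise RuntimeError("Still rate limited after retries: " + url)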

@marcusrehm
Contributor

I think we can work on the data acquisition, coding the script to fetch the new data. After that we can work on the request-handling issue; in the end, this one is needed in order to grab the data.

About the requests: do you know of, or have you already tried, using Tor? I was thinking about using it, and in case it isn't allowed on the client network, the script could then use batch processing to make requests within a specific time frame.

Then we can refactor the whole script, but we will already have the logic to get this running.

What do you think?

@cuducos
Collaborator

cuducos commented Apr 5, 2017

About the requests: do you know of, or have you already tried, using Tor?

TBH I have only used the Tor browser manually, that is to say, never integrated it into a Python script. If this is possible and doesn't require too specific a setup (so contributors can get started with the project easily), I have no concerns about it. Otherwise I think that proper handling of HTTP 429 and maybe a semaphore controlling the amount and frequency of requests might be enough.
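
A rough sketch of the semaphore idea, just to illustrate (the limits are made up, not tuned for ReceitaWS):

import threading
import time
import requests

MAX_CONCURRENT = 4   # at most 4 requests in flight at once
MIN_INTERVAL = 0.5   # at least half a second between request starts
slots = threading.BoundedSemaphore(MAX_CONCURRENT)
pace_lock = threading.Lock()
last_start = [0.0]

def throttled_get(url):
    with slots:
        with pace_lock:
            wait = last_start[0] + MIN_INTERVAL - time.monotonic()
            if wait > 0:
                time.sleep(wait)
            last_start[0] = time.monotonic()
        return requests.get(url, timeout=30)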

I think we can work on the data acquisition, coding the script to fetch the new data.

Sure thing. I think it's good to start with the data collection, and once we can gather the information we're looking for, we can look at the script and figure out the best way to deal with the request traffic ; )

@marcusrehm
Contributor

Hey guys! I just got the script fetching and saving the partner list. It's available at my repo. It's the same script, but saving the partner data; now I'm working on better handling of the 429 error related to too many requests.

About the 429 error, I think Tor is not a good option, as people would need to install it apart from Serenata de Amor, and I think it is out of scope for the project (as @cuducos mentioned, it should be easy for contributors to set up). So I started working on handling the request errors, and I have some concerns about the script:

  • I think it's better to use a sequential approach for the requests rather than parallelizing (as the script does now), because parallelism increases the chance of getting 429 too fast; besides that, I'm also getting [Errno 24] Too many open files after handling the 429.
  • If we remove the parallelism, there's no need to save data in .pkl files and then put it in a DataFrame; we could do it right after getting the data from ReceitaWS (see the sketch after this list).
  • Another point would be to put the companies.xz and companies-partners.xz datasets in the naming format yyyy-mm-dd-datasetname.xz, as we use in other datasets. It helps in versioning the data and preserving the data used in some analyses. For this we need to know the impact on other scripts.
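
A rough sketch of the sequential approach from the first two bullets (function and column names are illustrative):

import requests
import pandas as pd

def fetch_all(cnpjs):
    rows = []
    for cnpj in cnpjs:
        response = requests.get("http://receitaws.com.br/v1/cnpj/" + cnpj, timeout=30)
        if response.status_code == 429:  # rate limited; a real run should back off and retry
            break
        info = response.json()
        rows.append({"cnpj": cnpj, "nome": info.get("nome"), "situacao": info.get("situacao")})
    return pd.DataFrame(rows)  # straight into a DataFrame, no intermediate .pkl files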

Do you have any concerns about the points above, or can I continue with this approach?

/cc @cuducos @jtemporal

@jtemporal
Collaborator

Yaay, this is awesome! I believe you are about right on that approach. AFAIK the parallelism was used so we could generate the dataset faster; if we remove the parallelism for now and get it to work properly, we can think about parallelizing it again later. =)

@marcusrehm
Contributor

Cool, @jtemporal! I'm gonna do it this way then.

And what about the file naming convention? Do you think it is ok as well?

@cuducos
Collaborator

cuducos commented Apr 13, 2017

Good points, @marcusrehm — I agree with you and @jtemporal on everything you've said. About the naming convention: we could have just YYYY-MM-DD-companies.xz I guess, addressing the partner names the same way we did with secondary_activity_XX[_code_]. Maybe the script wasn't versioning companies.xz, but we were doing it manually — it would be great to have it done automatically.
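
For the automatic versioning, a one-liner along these lines would be enough (a sketch, not the actual script):

from datetime import date

filename = "{:%Y-%m-%d}-companies.xz".format(date.today())  # e.g. 2017-04-13-companies.xz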

@jtemporal
Collaborator

it would be great to have it done automatically

Automate all the things o/ 🎉

Maybe the script wasn't versioning companies.xz

That's about right! Versioning was happening when the dataset was being uploaded to S3.

@marcusrehm
Contributor

Hey folks!

I got a new version of the fetch_cnpj_info.py script here.

I brought the threading back, but with some improvements. Now we can pass as arguments how many threads to use and a list of HTTP proxies, so each request uses a randomly chosen proxy from the list, or none (using the local IP). The script still takes time, but it fetches faster than the old version.

Also, for each batch of 100 requests it saves the cnpj-info.xz dataframe, so if we have some issue or it interrupts abruptly, we don't lose the work and the script can restart from where it stopped.

The script can be called the same way as it is now (so nothing will break because of that), or it can be called as python ./src/fetch_cnpj_info.py ./data/2016-11-19-current-year.xz -p 177.67.84.135:8080 177.67.82.80:8080 177.67.82.80:3128 179.185.54.114:8080 -t 20, where -p or --proxies receives the list of proxies and -t or --threads the number of threads.
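
For reference, the -p/--proxies and -t/--threads interface described above could be parsed roughly like this (a sketch, not the actual script's code):

import argparse

parser = argparse.ArgumentParser(description="Fetch CNPJ info from ReceitaWS")
parser.add_argument("dataset", help="path to the dataset, e.g. ./data/2016-11-19-current-year.xz")
parser.add_argument("-p", "--proxies", nargs="*", default=[],
                    help="HTTP proxies (host:port), one picked at random per request")
parser.add_argument("-t", "--threads", type=int, default=1,
                    help="number of worker threads")
args = parser.parse_args()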

For clean_cnpj_info_dataset.py, I applied the file naming convention when saving the companies.xz file and adjusted the partner list as @cuducos suggested here:

I guess, addressing the partner names the same way we did with secondary_activity_XX[code]. Maybe the script wasn't versioning companies.xz, but we were doing it manually — it would be great to have it done automatically.

My only concern about it: wouldn't it be easier to analyze the data if we got it in a separate file? Having multiple columns with the "same" data can make filters and joins harder, don't you think?

I can create a PR so you guys can review and use it. Now I think we can work on bringing the cleaning logic into the fetch script and have just one script handle the whole process, ok?

/cc @cuducos @jtemporal

@cuducos
Collaborator

cuducos commented Apr 21, 2017

Now we can pass as arguments how many threads to use and a list of HTTP proxies

That's awesome!

The script still takes time, but it fetches faster than the old version.

🎉 Many thanks, @marcusrehm!

My only concern about it: wouldn't it be easier to analyze the data if we got it in a separate file?

That's totally fine IMHO ; )

I can create a PR so you guys can review and use it.

Sure thing. Once the PR is opened I can offer a proper code review, but at a quick look it seems like a very good improvement — again, many thanks for that.

Once you open the PR, if you have a generated dataset, you can add a link to it if you like ; )

@marcusrehm
Contributor

Hey guys! It took more days than I thought, but PR #218 is there. When the dataset download finishes, I'll post the link here, ok?

@marcusrehm
Contributor

Here's the link to download the dataset: https://we.tl/Zsw7zPhV6a

jtemporal added a commit that referenced this issue May 3, 2017
Companies partners list in companies dataset - Issue #16
@cuducos
Collaborator

cuducos commented May 10, 2017

Does #218 (merged) close this Issue? cc @jtemporal

@cuducos
Collaborator

cuducos commented May 10, 2017

Also, the dataset is not available on S3 yet, is it? Does anyone still have a copy so we can upload it? cc @jtemporal

@marcusrehm
Contributor

I uploaded the file again https://we.tl/TTYzFTk5d8.

About closing the issue: if this was just about acquiring the partner list, then I think it's done, but I haven't done any analysis with it (yet). :)

cc @cuducos @jtemporal

@cuducos
Collaborator

cuducos commented May 11, 2017

I think this is mostly related to the partner list. I'm pondering two issues about this dataset before bringing it to S3:

  • For some reason it has 7% fewer companies than the last one (no idea why)
  • It misses the geo coordinates we added using the Google Places API

So before making it available I would like to know about best practices in versioning (arguably similar) datasets:

  1. Should we rename it to companies-no-geolocation?
  2. Should we add geolocation to it?
  3. Should we strip off everything but the CNPJ and partner list (making it a complementary dataset to the former companies dataset)?

What do you think @Irio?

@Irio
Collaborator Author

Irio commented May 11, 2017

When we first generated it, the companies.xz file already had geolocation (using src/geocode_addresses.py). I'm good with option number 1 if we work on number 2 later. @cuducos

@jtemporal
Collaborator

I'm good with option number 1 if we work on number 2 later

I agree with this approach ;)

@marcusrehm
Contributor

For some reason it has 7% fewer companies than the last one (no idea why)

I think I ran it using only the reimbursements dataset. Another reason could be that the previous script was filling lines that had only blank info with the error message for "CNPJ inválido".

@cuducos
Collaborator

cuducos commented May 11, 2017

Renaming it, opening an issue to add geolocation… and closing this issue! Hell yeah ; ) Thank you so much @marcusrehm 🎉

Closed by #218

@cuducos cuducos closed this as completed May 11, 2017
Irio pushed a commit that referenced this issue Feb 27, 2018
cuducos pushed a commit that referenced this issue Feb 28, 2018
Add Code Climate, Coveralls and badges