
Partner list of companies receiving money from politicians #16

Closed
Irio opened this issue Aug 21, 2016 · 55 comments

Comments

@Irio
Collaborator

Irio commented Aug 21, 2016

No description provided.

@anapaulagomes
Contributor

Hi @Irio! Could you please clarify what the sources are for collecting the data? Or do we just need to be creative? :)
Thank you.

@cuducos
Collaborator

cuducos commented Sep 6, 2016

Hi @anapaulagomes, the short answer is: we just need to be creative hahaha…

The long answer is that we have talked about some possibilities: some info is available on the Federal Revenue website (search for a CNPJ, then click on “Consulta QSA / Capital Social” or something like “Certidão de Baixa de Inscrição” if the company is inactive).

Unfortunately this is behind a CAPTCHA. We have been in touch with people trying to code a workaround, with a 10% success rate, if that helps. The juntas comerciais also have this information, but they are state-level bodies, so their APIs might differ considerably.

We could try to scrape LinkedIn or Facebook for some data, but that might be difficult (no CNPJ to match, different names, different job titles, outdated and unofficial info, etc.).

And there are also alternative sites to look up CNPJ info, but I'm not sure whether they offer any info on the partners.

@Irio
Collaborator Author

Irio commented Sep 6, 2016

Sure @anapaulagomes.

For the main info about companies, we've been using ReceitaWS; it's fairly reliable but, as you can see in this example, it does not include the partner list:

http://receitaws.com.br/v1/cnpj/02703510000150

{
    "atividade_principal": [{
        "text": "Restaurantes e similares",
        "code": "56.11-2-01"
    }],
    "data_situacao": "14/12/2002",
    "tipo": "MATRIZ",
    "nome": "FRANCISCO RESTAURANTE LTDA - EPP",
    "telefone": "(61) 3226-2626",
    "situacao": "ATIVA",
    "bairro": "ASA SUL",
    "logradouro": "Q SHC/SUL CL QUADRA 402 BLOCO B LOJA 05, 09, 15",
    "numero": "S/N",
    "cep": "70.237-500",
    "municipio": "BRASILIA",
    "uf": "DF",
    "abertura": "27/06/1988",
    "natureza_juridica": "206-2 - SOCIEDADE EMPRESARIA LIMITADA",
    "cnpj": "02.703.510/0001-50",
    "ultima_atualizacao": "2016-08-24T16:58:50.057Z",
    "status": "OK",
    "fantasia": "",
    "complemento": "",
    "email": "",
    "efr": "",
    "motivo_situacao": "",
    "situacao_especial": "",
    "data_situacao_especial": "",
    "atividades_secundarias": [{
        "code": "00.00-0-00",
        "text": "Não informada"
    }]
}
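
In case it helps anyone poking at the API, here's a minimal sketch of fetching that endpoint in Python (the requests library and the helper name are my additions; the endpoint is the one above):

import requests

def fetch_company(cnpj):
    """Fetch basic company info from ReceitaWS; cnpj is digits only."""
    response = requests.get("http://receitaws.com.br/v1/cnpj/" + cnpj, timeout=30)
    response.raise_for_status()
    return response.json()

info = fetch_company("02703510000150")
print(info["nome"], "-", info["situacao"])  # FRANCISCO RESTAURANTE LTDA - EPP - ATIVA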

Here's a step-by-step to get them (as a user) from the Federal Revenue's website, probably the best of the official sources:

  1. Fill form with CNPJ and captcha.
    https://cloud.githubusercontent.com/assets/667753/18273509/fa56d942-7413-11e6-8ac5-868f899aa5e5.png
  2. Click on button "Consulta QSA / Capital Social".
    https://cloud.githubusercontent.com/assets/667753/18273526/0b4e8d08-7414-11e6-8391-ffd7d23a1555.png
  3. All yours.
    https://cloud.githubusercontent.com/assets/667753/18273533/1352f818-7414-11e6-95ed-92acea5b9848.png

As mentioned by @cuducos, we know of people breaking it using Tesseract (OCR), but they frequently get blocked by the Federal Revenue's servers given its low accuracy of ~10%. Another way I can think of to break it is Machine Learning; computer vision is one of the most researched areas in Deep Learning nowadays, e.g. https://deepmlblog.wordpress.com/2016/01/03/how-to-break-a-captcha-system/ (has a paper at the end)
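
For reference, the Tesseract route usually looks something like this minimal sketch (assuming the pytesseract and Pillow packages; the preprocessing and config are illustrative, and the poor results on real CAPTCHAs are exactly the ~10% problem mentioned above):

from PIL import Image
import pytesseract

# Grayscale, then binarize; real CAPTCHAs need heavier noise removal
image = Image.open("captcha.png").convert("L")
image = image.point(lambda px: 255 if px > 128 else 0)
# --psm 8 tells Tesseract to treat the image as a single word
guess = pytesseract.image_to_string(image, config="--psm 8")
print(guess.strip())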

@josircg

josircg commented Sep 6, 2016

Hi folks, first of all, I still don't get why we are talking in English...

The main idea is to find the biggest amounts: those will be for print bureaus, video producers, and advertising companies. Restaurants or brothels would yield just tips and cents, and we would not discover anything interesting.

So my suggestion is to first discover who the big suppliers are and what activities they performed. After that, try to discover the partners.

An interesting buzz post: https://www.facebook.com/teofb/posts/1186854511378646

He did it by hand, without any programming ;)

@lucasrcezimbra
Contributor

lucasrcezimbra commented Sep 6, 2016

If we use voice recognition on the audio CAPTCHA instead of OCR on the image, wouldn't it be easier to recognize and more accurate?

It's just an idea; I don't know which is better.
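
In case someone wants to test the idea, a minimal sketch with the SpeechRecognition package could look like this (the captcha.wav file and the pt-BR language are my assumptions):

import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("captcha.wav") as source:
    audio = recognizer.record(source)  # read the whole audio challenge
# Free Google Web Speech API; other engines (e.g. Sphinx) are also supported
print(recognizer.recognize_google(audio, language="pt-BR"))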

@cuducos
Collaborator

cuducos commented Sep 6, 2016

Hi @josircg,

Welcome to Serenata de Amor. I'll try to address all your points, but let me know if I forget any of them, ok?

Hi folks, first of all, I still don't get why we are talking in English...

It's at the bottom of our README.md, the homepage of the project here on GitHub: "The conversation about the project happens in a Telegram group, all in English, since we have contributors from other countries and we want to contribute to other countries as well."

Does that make sense?

The main idea is to find the biggest amounts […]

To keep it short: the main idea is to use computing power to find more cases than humans, doing it manually, would be able to find. That's the purpose of the project. Surely big cases are eye-catching, but we start from the assumption that corruption starts small; accordingly, there is important value in focusing on the small cases too.

He did it by hand, without any programming ;)

This post is amazing, and so is OPS: doing it all manually, they denounced and succeeded in cases summing more than R$5 million. We do not compete with or replace these examples. We are inspired by them and try to expand their investigative power ; )

@cuducos
Collaborator

cuducos commented Sep 6, 2016

@Lrcezimbra's idea looks amazing! Is there any project/script using this strategy to break CAPTCHAs?

@anapaulagomes
Contributor

Good inputs! I'd like to work on it (I can't assign it to myself). I was looking for different sources and I found this interesting tool called Câmara Transparente, developed by FGV. I'll take a look at other sources and keep you updated.

@cuducos cuducos assigned cuducos and unassigned cuducos Sep 6, 2016
@urubatan

Not sure if we can get the data we want for this, but I just found this site: http://www.consultasocio.com/

It allows listing the companies in which someone is a partner, and the other partners in those companies, for example http://www.consultasocio.com/q/sa/abel-salvador-mesquita-junior

We can try to scrape that data starting with the politicians' names and create a database of which companies belong to each deputy, who else is a partner in those companies, and which companies belong to those partners.

If this would help, I can assign the issue to myself and write the scraper.
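
Something like this minimal sketch could be a starting point (assuming requests + BeautifulSoup; the CSS selector is a placeholder, since the real page structure must be inspected first):

import requests
from bs4 import BeautifulSoup

def partners_for(slug):
    url = "http://www.consultasocio.com/q/sa/" + slug
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Placeholder selector; replace it after inspecting the actual HTML
    return [el.get_text(strip=True) for el in soup.select(".socio")]

print(partners_for("abel-salvador-mesquita-junior"))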

@anapaulagomes
Contributor

@urubatan I'm starting to work on this issue, but you can certainly help! :) Maybe you can do the scraper for this website and I'll work on the other one.

@tomascamargo

@Irio we have about 1 million companies with information about partners (QSA) in our database. Such data is not accessible via an API, but if you provide us with a list of names, we will be happy to run a search for it. All the information was obtained from the Receita Federal website and is public. Please note that the QSA information does not include the CPF number, so searches are based on the name and are therefore subject to namesakes.

@Irio
Collaborator Author

Irio commented Sep 14, 2016

@urubatan That sounds very promising.

@tomascamargo Would it be possible to query 433 names? 👀 These are all the unique congresspeople listed since the creation of CEAP.

@urubatan

@Irio great, my Python is not great (I'm planning to help the project to learn Python and data science :P ), but I'll write a scraper for that and send a pull request.

@tomascamargo

@Irio yes, we can try with this. Can you please provide us with the names?

@mtrovo
Contributor

mtrovo commented Sep 18, 2016

Hey guys, I was able to get the information from the main site without needing the CAPTCHA. I'm finishing a PoC here and will send it to you.

@awerlang

@mtrovo have you published the results?

@awerlang

awerlang commented Sep 24, 2016

I published a tool to fetch a company's partner list (names only) at https://github.com/awerlang/cnpj-rfb. There's a manual step requiring you to visit the RFB website; then the rest is automated. I found that some companies break the process, in which case you'll need to repeat it. It would be best to filter these companies out ("baixada", "natureza jurídica inválida", some "S.A.", "filial"). I guess this is our best shot atm.
If anyone has a list of all the companies we need to query, let's run it through this tool.

@awerlang

This is a list with 8417 CNPJs and CPFs found in expenses in 2016 up to now. I'm also attaching ~80 CNPJs I was able to fetch with the tool I announced the other day. Currently I'm experiencing a connection reset every ~40 operations (ECONNRESET, and I'm not able to access the RFB web page for some time); that would mean about 200 runs, i.e. several days. And we would want to run on 2015 data too.

I also updated the tool to ignore CPFs, as well as CNPJ entries with certain conditions that would lead to errors.

Help is most appreciated! How can we unblock HTTP access in this situation?

https://gist.github.com/awerlang/3a8b3f286a0bcceb2ae367ad2e09af21

@cuducos
Collaborator

cuducos commented Nov 8, 2016

Summarizing this topic

I'm sorry if my statements look like I'm pointing at failures, but I want to highlight what the pain points really are.

@tomascamargo has a great private database. If we ask you for a query for a long list (60k) of CNPJs, would you generate a simple dump for us (cnpj, partner) — in case of more than one partner, one cnpj could be repeated in subsequent rows? If so, I'll export a list of CNPJs today.
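
For instance, the dump could look like this (both the CNPJ and the partner names are made-up placeholders, just showing the repeated-CNPJ layout):

cnpj,partner
00.000.000/0001-91,FULANO DE TAL
00.000.000/0001-91,BELTRANO DA SILVA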

@awerlang Your solution looks great — it's the best we have so far, but restarting it, plus latency due to a probable server-side block, plus the manual session ID copying and pasting, is still an issue. Maybe the barrier is as great as breaking the CAPTCHA (10% success rate and successive blocks). If @tomascamargo can bootstrap this dataset with a query in his system, we can use your solution to update the dataset when we get new data from the Lower House.

@urubatan Consulta Sócio is useful to get companies registered in politicians' names, but it would also be very interesting to have the full list of partners (to get companies held by politicians' relatives, as in #107 for example). That's why I'm still writing in this Issue ; )

I'm willing to put some effort into this issue to create this dataset. I'm glad for all the discussion, references, and opportunities. Let's put the pieces together to make it work ; )

@josircg

josircg commented Dec 7, 2016

Fresh news about Consulta Sócio:

http://convergenciadigital.uol.com.br/cgi/cgilua.exe/sys/start.htm?UserActiveTemplate=site&infoid=44189&sid=4&utm_source=dlvr.it&utm_medium=twitter

If we get this database, it should be obfuscated somehow, or it could be a Trojan horse against the whole project.

@marcusrehm
Contributor

Hey guys, is there anyone working on this issue?

@cuducos
Collaborator

cuducos commented Apr 2, 2017

AFAIK there isn't. @jtemporal and I took a look at the two scripts collecting data from ReceitaWS and we felt that they could be refactored before collecting data again — but this is not our priority right now. Feel free to jump in, coding or discussing ; )

@jtemporal
Collaborator

Same thing here, AFAIK there isn't. Feel free to adopt it, @marcusrehm ;) It would be much appreciated.

@marcusrehm
Contributor

Yes! I can take this one. Actually, I was waiting for this one to be done so we could play with that data on neo4j... 😄

Could you please point out which scripts are related to this issue? If you can, we can also discuss the refactoring, and then see what can be done along with the inclusion of the partner list.

/cc @jtemporal @cuducos

@cuducos
Collaborator

cuducos commented Apr 3, 2017

Two scripts, basically: fetch_cnpj_info.py and clean_cnpj_info_dataset.py. My comments in favor of a massive refactor:

  • The script is not so effective: it quickly starts getting blocked by ReceitaWS and we have to re-run it several times
  • Maybe the cleaning could be done in the parser logic, not in an external script

@cuducos
Collaborator

cuducos commented Apr 4, 2017

The script is not so effective: it quickly starts getting blocked by ReceitaWS and we have to re-run it several times

Just checked: Try to better handle 429 too many requests responses

Maybe the cleaning could be done in the parser logic, not in an external script

And maybe move it to the toolbox
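
For the 429 handling, something along these lines might do (a rough sketch with requests; the helper name and the delays are illustrative):

import time
import requests

def get_with_backoff(url, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        response = requests.get(url, timeout=30)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        # Honor the server's Retry-After hint when present, else back off exponentially
        delay = float(response.headers.get("Retry-After", base_delay * 2 ** attempt))
        time.sleep(delay)
    raise RuntimeError("Still rate limited after retries: " + url)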

@marcusrehm
Contributor

I think we can work on the data acquisition, coding the script to fetch the new data. After that we can work on the request-handling issue; in the end, this one is needed in order to grab the data.

About the requests: do you know of, or have you already tried, using Tor? I was thinking about using it, and in case it isn't allowed on the client network, the script could then use batch processing to make requests within a specific time frame.

Then we can refactor the whole script, but we will already have the logic to get this running.

What do you think?

@cuducos
Collaborator

cuducos commented Apr 5, 2017

About the requests: do you know of, or have you already tried, using Tor?

TBH I have only used the Tor browser manually, that is to say, never integrated it into a Python script. If this is possible and doesn't require too specific a setup (so contributors can get started with the project easily), I have no concerns about it. Otherwise I think that proper handling of HTTP 429 and maybe a semaphore controlling the amount and frequency of requests might be enough.
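
A rough sketch of the semaphore idea, just to illustrate (the limits are made up, not tuned for ReceitaWS):

import threading
import time
import requests

MAX_CONCURRENT = 4   # at most 4 requests in flight at once
MIN_INTERVAL = 0.5   # at least half a second between request starts
slots = threading.BoundedSemaphore(MAX_CONCURRENT)
pace_lock = threading.Lock()
last_start = [0.0]

def throttled_get(url):
    with slots:
        with pace_lock:
            wait = last_start[0] + MIN_INTERVAL - time.monotonic()
            if wait > 0:
                time.sleep(wait)
            last_start[0] = time.monotonic()
        return requests.get(url, timeout=30)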

I think we can work on the data acquisition, coding the script to fetch the new data.

Sure thing. I think it's good to start with the data collection, and once we can gather the information we're looking for, we can look at the script and figure out the best way to deal with the request traffic ; )

@marcusrehm
Contributor

Hey guys! I just got the script fetching and saving the partner list. It's available at my repo. It's the same script, but saving the partner data; now I'm working on better handling of the 429 error related to too many requests.

About the 429 error, I think Tor is not a good option, as people would need to install it apart from Serenata de Amor, and I think it is out of scope for the project (as @cuducos mentioned, it should be easy for contributors to set up). So I started working on handling the request errors, and I have some concerns about the script:

  • I think it's better to use a sequential approach for the requests rather than parallelizing (as the script does now), because parallelism increases the chance of getting 429 too fast; besides that, I'm also getting [Errno 24] Too many open files after handling the 429.
  • If we remove the parallelism, there's no need to save data in .pkl files and then put it in a DataFrame; we could do it right after getting the data from ReceitaWS (see the sketch after this list).
  • Another point would be to put the companies.xz and companies-partners.xz datasets in the naming format yyyy-mm-dd-datasetname.xz, as we use in other datasets. It helps in versioning the data and preserving the data used in some analyses. For this we need to know the impact on other scripts.
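
A rough sketch of the sequential approach from the first two bullets (function and column names are illustrative):

import requests
import pandas as pd

def fetch_all(cnpjs):
    rows = []
    for cnpj in cnpjs:
        response = requests.get("http://receitaws.com.br/v1/cnpj/" + cnpj, timeout=30)
        if response.status_code == 429:  # rate limited; a real run should back off and retry
            break
        info = response.json()
        rows.append({"cnpj": cnpj, "nome": info.get("nome"), "situacao": info.get("situacao")})
    return pd.DataFrame(rows)  # straight into a DataFrame, no intermediate .pkl files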

Do you have any concerns about the points above, or can I continue with this approach?

/cc @cuducos @jtemporal

@jtemporal
Collaborator

Yaay, this is awesome! I believe you are about right on that approach. AFAIK the parallelism was used so we could generate the dataset faster; if we remove the parallelism for now and get it to work properly, we can think about parallelizing it again later. =)

@marcusrehm
Contributor

Cool, @jtemporal! I'm gonna do it this way then.

And what about the file naming convention? Do you think it is ok as well?

@cuducos
Collaborator

cuducos commented Apr 13, 2017

Good points, @marcusrehm — I agree with you and @jtemporal on everything you've said. About the naming convention: we could have just YYYY-MM-DD-companies.xz I guess, addressing the partner names the same way we did with secondary_activity_XX[_code_]. Maybe the script wasn't versioning companies.xz, but we were doing it manually — it would be great to have it done automatically.
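
For the automatic versioning, a one-liner along these lines would be enough (a sketch, not the actual script):

from datetime import date

filename = "{:%Y-%m-%d}-companies.xz".format(date.today())  # e.g. 2017-04-13-companies.xz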

@jtemporal
Collaborator

it would be great to have it done automatically

Automate all the things o/ 🎉

Maybe the script wasn't versioning companies.xz

That's about right! Versioning was happening when the dataset was being uploaded to S3.

@marcusrehm
Contributor

Hey folks!

I got a new version of the fetch_cnpj_info.py script here.

I brought the threading back, but with some improvements. Now we can pass as arguments how many threads to use and a list of HTTP proxies, so each request uses a randomly chosen proxy from the list, or none (using the local IP). The script still takes time, but it fetches faster than the old version.

Also, for each batch of 100 requests it saves the cnpj-info.xz dataframe, so if we have some issue or it interrupts abruptly, we don't lose the work and the script can restart from where it stopped.

The script can be called the same way as it is now (so nothing will break because of that), or it can be called as python ./src/fetch_cnpj_info.py ./data/2016-11-19-current-year.xz -p 177.67.84.135:8080 177.67.82.80:8080 177.67.82.80:3128 179.185.54.114:8080 -t 20, where -p or --proxies receives the list of proxies and -t or --threads the number of threads.
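
For reference, the -p/--proxies and -t/--threads interface described above could be parsed roughly like this (a sketch, not the actual script's code):

import argparse

parser = argparse.ArgumentParser(description="Fetch CNPJ info from ReceitaWS")
parser.add_argument("dataset", help="path to the dataset, e.g. ./data/2016-11-19-current-year.xz")
parser.add_argument("-p", "--proxies", nargs="*", default=[],
                    help="HTTP proxies (host:port), one picked at random per request")
parser.add_argument("-t", "--threads", type=int, default=1,
                    help="number of worker threads")
args = parser.parse_args()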

For clean_cnpj_info_dataset.py, I applied the file naming convention when saving the companies.xz file and adjusted the partner list as @cuducos suggested here:

I guess, addressing the partner names the same way we did with secondary_activity_XX[code]. Maybe the script wasn't versioning companies.xz, but we were doing it manually — it would be great to have it done automatically.

My only concern about it: wouldn't it be easier to analyze the data if we got it in a separate file? Having multiple columns with the "same" data can make filters and joins harder, don't you think?

I can create a PR so you guys can review and use it. Now I think we can work on bringing the cleaning logic into the fetch script and have just one script handle the whole process, ok?

/cc @cuducos @jtemporal

@cuducos
Collaborator

cuducos commented Apr 21, 2017

Now we can pass as arguments how many threads to use and a list of HTTP proxies

That's awesome!

The script still takes time, but it fetches faster than the old version.

🎉 Many thanks, @marcusrehm!

My only concern about it: wouldn't it be easier to analyze the data if we got it in a separate file?

That's totally fine IMHO ; )

I can create a PR so you guys can review and use it.

Sure thing. Once the PR is opened I can offer a proper code review, but at a quick look it seems like a very good improvement — again, many thanks for that.

Once you open the PR, if you have a generated dataset, you can add a link to it if you like ; )

@marcusrehm
Contributor

Hey guys! It took more days than I thought, but PR #218 is there. When the dataset download finishes, I'll post the link here, ok?

@marcusrehm
Contributor

Here's the link to download the dataset: https://we.tl/Zsw7zPhV6a

jtemporal added a commit that referenced this issue May 3, 2017
Companies partners list in companies dataset - Issue #16
@cuducos
Collaborator

cuducos commented May 10, 2017

Does #218 (merged) close this Issue? cc @jtemporal

@cuducos
Collaborator

cuducos commented May 10, 2017

Also, the dataset is not available on S3 yet, is it? Does anyone still have a copy so we can upload it? cc @jtemporal

@marcusrehm
Contributor

I uploaded the file again https://we.tl/TTYzFTk5d8.

About closing the issue: if this was just about acquiring the partner list, then I think it's done, but I haven't done any analysis with it (yet). :)

cc @cuducos @jtemporal

@cuducos
Collaborator

cuducos commented May 11, 2017

I think this is mostly related to the partner list. I'm pondering two issues about this dataset before bringing it to S3:

  • For some reason it has 7% fewer companies than the last one (no idea why)
  • It misses the geo coordinates we added using the Google Places API

So before making it available I would like to know about best practices in versioning (arguably similar) datasets:

  1. Should we rename it to companies-no-geolocation?
  2. Should we add geolocation to it?
  3. Should we strip off everything but the CNPJ and partner list (making it a complementary dataset to the former companies dataset)?

What do you think @Irio?

@Irio
Collaborator Author

Irio commented May 11, 2017

When we first generated it, the companies.xz file already had geolocation (using src/geocode_addresses.py). I'm good with option number 1 if we work on number 2 later. @cuducos

@jtemporal
Collaborator

I'm good with option number 1 if we work on number 2 later

I agree with this approach ;)

@marcusrehm
Contributor

For some reason it has 7% fewer companies than the last one (no idea why)

I think I ran it using only the reimbursements dataset. Another reason could be that the previous script was filling lines that had only blank info with the error message for "CNPJ inválido".

@cuducos
Collaborator

cuducos commented May 11, 2017

Renaming it, opening an issue to add geolocation… and closing this issue! Hell yeah ; ) Thank you so much @marcusrehm 🎉

Closed by #218

@cuducos cuducos closed this as completed May 11, 2017
Irio pushed a commit that referenced this issue Feb 27, 2018
cuducos pushed a commit that referenced this issue Feb 28, 2018
Add Code Climate, Coveralls and badges