[Retroactive review] Campos dos Goytacazes-RJ #637

Open
rennerocha opened this issue Sep 5, 2022 · 9 comments · May be fixed by #1323
Assignees
Labels
dificuldade:media (medium development difficulty) · incompativel (this issue or pull request does not fit the current state of the project) · spider (adds a scraper for municipality(ies))

Comments

@rennerocha
Collaborator

rennerocha commented Sep 5, 2022

The existing spider works, but it has no date filter (start_date and end_date) to reduce the number of requests and extract only the requested periods.
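A minimal sketch of what such a date filter could look like, not the project's actual implementation. It assumes, hypothetically, that the gazette site lists editions on one page per month; `months_to_request` and `in_requested_range` are illustrative names that do not come from the repository.

```python
from datetime import date


def in_requested_range(gazette_date, start_date, end_date):
    """Return True when a gazette date falls inside the inclusive range."""
    return start_date <= gazette_date <= end_date


def months_to_request(start_date, end_date):
    """Yield (year, month) pairs covering the inclusive range.

    Assumption: the site exposes one listing page per month, so only
    these pages need to be requested instead of the whole archive.
    """
    year, month = start_date.year, start_date.month
    while (year, month) <= (end_date.year, end_date.month):
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1
```

With this shape, a spider would request only the listing pages returned by `months_to_request` and then drop any entry for which `in_requested_range` is false.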

@rennerocha rennerocha added the spider (adds a scraper for municipality(ies)) label Sep 11, 2022
@rennerocha rennerocha added the dificuldade:media (medium development difficulty) label Sep 20, 2022
@ayharano
Contributor

ayharano commented Oct 4, 2022

I haven't opened a PR yet, but I'd like to let you know that I'm working on this issue.

ayharano pushed a commit to ayharano/querido-diario that referenced this issue Oct 5, 2022
@ayharano
Contributor

ayharano commented Oct 5, 2022

As mentioned, I opened PR #702. I went beyond what was requested, because some changes could be made to cover cases that perhaps did not yet exist in 2020, when the version preceding this PR was written. I added reference examples in comments to justify how each case is handled.

@trevineju trevineju linked a pull request Oct 5, 2022 that will close this issue
5 tasks
ayharano pushed a commit to ayharano/querido-diario that referenced this issue Oct 5, 2022
ayharano pushed a commit to ayharano/querido-diario that referenced this issue Oct 5, 2022
The way the spider was implemented assumed that there could only be a single file_url per day per is_extra_edition value, which was not always true.

This refactoring gathers all the various files per day and is_extra_edition.

The existing code did not address the text format for Saturday gazettes to be considered is_extra_edition.

We also included the start_date and end_date handling.

resolve okfn-brasil#637
@trevineju trevineju moved this to 🔴 Não desenvolvido (not developed) in [Querido Diário] Municípios Oct 8, 2022
@trevineju trevineju moved this from 🔴 Não desenvolvido (not developed) to 🟡 Em revisão (in review) in [Querido Diário] Municípios Oct 8, 2022
ayharano pushed a commit to ayharano/querido-diario that referenced this issue Oct 9, 2022
The way the spider was implemented assumed that there could only be a single file_url per day per is_extra_edition value, which was not always true.

This refactoring gathers all the various files per day and is_extra_edition.

We addressed the text format for Saturday gazettes to be considered is_extra_edition.

We also included the start_date and end_date handling, and edition_number when applicable.

resolve okfn-brasil#637
@ayharano
Contributor

ayharano commented Oct 9, 2022

I have finally rewritten the PR with the changes I wanted to make. Since the rewrite ended up quite different from the usual Spider implementation in this repo, please take your time with the review.

@trevineju trevineju moved this from 🟡 Em revisão (in review) to 🟠 Revisão retroativa (retroactive review) in [Querido Diário] Municípios Oct 10, 2022
ayharano pushed a commit to ayharano/querido-diario that referenced this issue Oct 11, 2022
ayharano pushed a commit to ayharano/querido-diario that referenced this issue Oct 13, 2022
ayharano pushed a commit to ayharano/querido-diario that referenced this issue Oct 17, 2022
@ayharano
Contributor

Repeating the comment I left on the PR:

As discussed with @trevineju and @giuliocc, the spider needs to sort out the issue of the .rar files from October 2012 to October 2013.
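
One way to at least detect the problematic files is to check the extension in each file URL. The sketch below is only an illustration of that check; `is_unsupported_archive` and `UNSUPPORTED_SUFFIXES` are hypothetical names, and the thread does not say how the PR actually handles these files.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumption: .rar is the only archive format that causes trouble here.
UNSUPPORTED_SUFFIXES = {".rar"}


def is_unsupported_archive(file_url):
    """Return True when the URL path ends with an unsupported suffix.

    Only the path component is inspected, so query strings are ignored,
    and the comparison is case-insensitive.
    """
    suffix = PurePosixPath(urlparse(file_url).path).suffix.lower()
    return suffix in UNSUPPORTED_SUFFIXES
```

A spider could use a check like this to log or skip such files explicitly instead of failing on them silently.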

ayharano added a commit to ayharano/querido-diario that referenced this issue Oct 19, 2022
@trevineju trevineju moved this from 🟠 Revisão retroativa (retroactive review) to 🔴 Não desenvolvido (not developed) in [Querido Diário] Municípios Oct 23, 2022
ayharano added a commit to ayharano/querido-diario that referenced this issue Nov 1, 2022
ayharano added a commit to ayharano/querido-diario that referenced this issue Dec 31, 2022
@trevineju trevineju moved this from 🔴 Não desenvolvido (not developed) to 🟡 Em revisão (in review) in [Querido Diário] Municípios Feb 10, 2023
ayharano added a commit to ayharano/querido-diario that referenced this issue Oct 10, 2023
ayharano added a commit to ayharano/querido-diario that referenced this issue Oct 17, 2023
@trevineju trevineju changed the title from "Campos dos Goytacazes-RJ" to "[Retroactive review] Campos dos Goytacazes-RJ" Mar 10, 2024
@samueldsiqueira
Contributor

samueldsiqueira commented Oct 23, 2024

Hi, I searched for the city of Campos dos Goytacazes and got no results.
Can I help with this issue?

@trevineju
Member

@samueldsiqueira, this issue already has a linked PR, so there isn't really a way to help because it's "done"; it was awaiting review.

@trevineju
Member

However, I'm going to close the PR and the issue for incompatibility.

@ayharano's comment about part of the documents being in .rar makes it impossible for us to add the scraper. I checked the period that was mentioned (October 2012 to 2013) and it is still the same... I'll open an issue so we can discuss whether it makes sense, or whether we are able, to add a solution for this situation, and then we can resume the task from what has been built up.

@trevineju trevineju added the incompativel (this issue or pull request does not fit the current state of the project) label Oct 23, 2024
@trevineju trevineju closed this as not planned Oct 23, 2024
@trevineju trevineju reopened this Oct 30, 2024
@github-project-automation github-project-automation bot moved this from em revisão (in review) to novo (new) in [Querido Diário] Municípios Oct 30, 2024
@trevineju
Member

@slfabio suggested that we ignore the interval and go ahead with integrating the scraper. I'm reopening the issue so we can discuss the idea.

Fabio can of course argue further for his suggestion, but in principle I don't quite agree, because it would build into the scraper (and ultimately into the project) a logic of deliberately leaving out certain stretches of official gazettes, hardcoding these workarounds. And we are committed to offering the database in a reliable and sequential way.

However, as a provisional measure, I think we can set the scraper's start_date to October 2013. That way the incompatible interval is left out, along with everything before it, but coverage from October 2013 to today is consistent.

Forcing the "wrong" start_date would be a new decision in the project, but it would be similar in nature to what we do with discontinued sites: what matters is having the current interval first, and then expanding coverage back toward the older gazettes. Then we could resume the PR that was closed...
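
The provisional floor on start_date discussed above could be sketched as a simple clamp. This is an illustration only: `EARLIEST_SUPPORTED_DATE` and `effective_start_date` are hypothetical names, and the day of the month is an assumption, since the thread only says October 2013.

```python
from datetime import date

# Assumption: the thread only says "October 2013"; the exact day is a guess.
EARLIEST_SUPPORTED_DATE = date(2013, 10, 1)


def effective_start_date(requested_start_date=None):
    """Clamp a requested start date to the earliest supported date.

    Requests for earlier periods (including the interval with .rar
    files) are moved forward to the floor; later dates pass through.
    """
    if requested_start_date is None:
        return EARLIEST_SUPPORTED_DATE
    return max(requested_start_date, EARLIEST_SUPPORTED_DATE)
```

Keeping the floor in a single constant would make it easy to move it back later if a solution for the older files is found.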

What do you think?
@ayharano @slfabio @rennerocha @ogecece

@slfabio
Collaborator

slfabio commented Oct 30, 2024

I prefer your proposal too, @trevineju. It will already bring the last 11 years to the platform.

For now we have no intern, and I'm also not finding the time to pick up any issue.
But we are selecting new interns, and I believe that from mid-November we will start picking up issues from the board again.

Thank you very much for reopening the issue. This is one of the largest municipalities in the State, and we are very interested in including it in QD.

@trevineju trevineju moved this from novo (new) to em revisão (in review) in [Querido Diário] Municípios Nov 10, 2024
@slfabio slfabio self-assigned this Nov 13, 2024
@slfabio slfabio linked a pull request Nov 13, 2024 that will close this issue
14 tasks
slfabio added a commit to slfabio/querido-diario that referenced this issue Nov 16, 2024
Projects
Status: em revisão (in review)
5 participants