[4/4] Retrofit the pg / s3 / macros helpers etc. #173

Merged
merged 7 commits into main from vperron/retrofit-helpers
Dec 14, 2023

Conversation

@vperron (Contributor) commented Dec 11, 2023

You stick your little finger in, and it takes your whole arm.

I'm retrofitting the helpers introduced in 2/4 into the other DAGs and models.

Along the way I touched on two or three other things, but it's just code rewritten differently to be more DRY.

@vmttn (Contributor) left a comment:

👍

Review threads:
pipeline/dags/import_admin_express.py (outdated, resolved)
pipeline/dags/import_brevo.py (outdated, resolved)
pipeline/dags/import_insee_firstnames.py (outdated, resolved)
pipeline/dags/import_odspep.py (resolved)
pipeline/dags/import_sources.py (resolved)
pipeline/dags/import_sources.py (outdated, resolved)
@@ -292,7 +226,7 @@ def _load(
extract = python.ExternalPythonOperator(
@vmttn (Contributor) commented:

To avoid raising too many questions, can we keep these in sync:

  • the name of the python.ExternalPythonOperator task variable
  • the name of the function passed to python_callable
  • the task id

@vperron (Contributor, Author) commented Dec 13, 2023:

Yes and "no"... I tried, but in reality the problem is that I can't keep them EXACTLY in sync (otherwise there's a naming conflict between the method name and the task).

So I have to do like you did and add "at least one underscore".

Except that, doing so, I find that between having:

  • load for the task name / id and _load for the callable
  • load for the task name / id and load_s3_to_datawarehouse
  • load_s3_to_datawarehouse for the task name / id and _load_s3_to_datawarehouse

... option 2 remains the nicest / most concise in the Airflow UI and in the DAG code, while being more explicit about the callable's name than option 1 (and starting with the same identifier). A sketch of it follows below.

Option 3 seems to pollute the Airflow task names and needlessly bloat the DAG code; we're already explicit somewhere.
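For illustration, a minimal sketch of option 2 (the DAG id, schedule, interpreter path, and callable body here are hypothetical placeholders, not the PR's actual code):

```python
import pendulum
from airflow import DAG
from airflow.operators import python


def load_s3_to_datawarehouse():
    """Hypothetical callable; the real logic lives in the retrofitted helpers."""


with DAG(
    dag_id="import_example",  # hypothetical DAG id
    start_date=pendulum.datetime(2023, 12, 1, tz="UTC"),
    schedule=None,
):
    # Option 2: the task variable and its task_id share the short name
    # "load", while the callable keeps the descriptive name.
    load = python.ExternalPythonOperator(
        task_id="load",
        python="/usr/bin/python3",  # hypothetical interpreter path
        python_callable=load_s3_to_datawarehouse,
    )
```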

@vperron (Contributor, Author) commented Dec 13, 2023:

@vmttn if that works for you, I'm OK to merge; I've implemented the requested fixes (except one, I explained why).

@vperron marked this pull request as ready for review December 13, 2023 16:30
Commits

'pass' cannot be used, it's 'password'. I don't entirely understand how
it could have worked so far.
TODOS:
- rewrite settings.py to use a DataSource class defining HTTP
  extractors and loaders, streams, schedule intervals, etc.
  (see the sketch after this list).
- split the DAG.
- place mediation numerique somewhere else.
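A rough sketch of what that DataSource class could look like (every name and field here is hypothetical, inferred from this TODO rather than from existing code):

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class DataSource:
    """Hypothetical declaration of a data source and how to handle it."""

    id: str
    schedule_interval: str  # e.g. "@daily"
    streams: list[str] = field(default_factory=list)
    extract: Optional[Callable[..., bytes]] = None  # HTTP extractor
    load: Optional[Callable[..., None]] = None  # S3 -> datawarehouse loader
```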

About the tests:
- split the tests that should run on CI from the others.
- if we want to test the DAGs we need:
  * a dedicated test database that exists just for the test run
  * some cleanup before and after
  * etc.
  ==> "just running" pytest without that orchestration has zero
  chance of working, hence the CI/non-CI split above (a fixture
  sketch follows below).
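A minimal sketch of that setup as a session-scoped pytest fixture (the database URL is a hypothetical placeholder, and this assumes the sqlalchemy-utils package is available):

```python
import pytest
from sqlalchemy_utils import create_database, drop_database


@pytest.fixture(scope="session")
def test_database():
    # Hypothetical URL: this database exists only for the test session.
    url = "postgresql://localhost/test_pipeline"
    create_database(url)
    yield url
    drop_database(url)  # cleanup once the session is done
```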

Don't forget to re-run the DAGs before and after the changes and
check whether the sources or datawarehouses have changed...

Some of the DAGs (for instance all the INSEE ones) could maybe be
combined?

I am a little surprised that for those we don't have the datalake
layer. I agree those are mostly "sources" in the DI sense, but
technically it would be nice to assume that we can reproduce
everything from a single S3 dump.

Maybe that's too much? I still think I would sleep better at night
if the Extract + Load step was always the same.

In which case we could separate the subfolders, though:
- di-sources
- seeds
- you name it
@vperron merged commit 54fc160 into main on Dec 14, 2023
6 checks passed
@vperron deleted the vperron/retrofit-helpers branch December 14, 2023 09:59