[4/4] Retrofit the pg / s3 / macros helpers etc. #173
Conversation
Force-pushed from 5e2ff5a to faede54
👍
@@ -292,7 +226,7 @@ def _load(
extract = python.ExternalPythonOperator(
To avoid unnecessary head-scratching, can we keep these in sync:
- the name of the python.ExternalPythonOperator task
- the name of the function passed to python_callable
- the task id
Yes and "no"... I tried, but the problem is that I cannot keep them EXACTLY in sync (otherwise the method name and the task name clash).
So I have to do the same as you and add "at least one underscore".
Given that, the choice comes down to:
1. `load` for the task name/id and `_load` for the callable
2. `load` for the task name/id and `load_s3_to_datawarehouse` for the callable
3. `load_s3_to_datawarehouse` for the task name/id and `_load_s3_to_datawarehouse` for the callable
... option 2 remains the nicest / most concise in the Airflow UI and in the DAG code, while being more explicit about the callable name than option 1 (and it starts with the same identifier).
Option 3 seems to pollute the Airflow task names and bloat the DAG code for no good reason; we are already explicit somewhere. A sketch of option 2 is shown below.
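For illustration, a minimal sketch of option 2, assuming the DAG module imports `python` from `airflow.operators` as in the diff above; the interpreter path and the callable body are placeholders, not the actual implementation:

```python
from airflow.operators import python


def load_s3_to_datawarehouse():
    # Placeholder body: the real callable reads the raw files from S3
    # and loads them into the datawarehouse.
    ...


# Option 2: the task variable / task_id stay short in the Airflow UI,
# while the callable keeps the explicit name, so the function and the
# task variable never clash.
load = python.ExternalPythonOperator(
    task_id="load",
    python="/path/to/venv/bin/python",  # hypothetical interpreter path
    python_callable=load_s3_to_datawarehouse,
)
```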
Force-pushed from faede54 to ce0b06a
@vmttn if that works for you, I'm OK with merging this; I've implemented the requested fixes (except one, and I explained why).
'pass' cannot be used; it has to be 'password'. I don't entirely understand how this could have worked so far.
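For context, a minimal sketch of the kind of connection call where this matters, assuming a psycopg2 / libpq-style helper; the connection values are purely hypothetical:

```python
import psycopg2

# libpq / psycopg2 only understand the keyword "password"; an unknown
# "pass" key is rejected, hence the surprise that it worked until now.
conn = psycopg2.connect(
    host="localhost",        # hypothetical values
    port=5432,
    dbname="datawarehouse",
    user="di",
    password="s3cret",
)
```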
TODOs:
- rewrite settings.py around a DataSource class defining HTTP extractors and loaders, streams, schedule intervals, etc. (a sketch follows below)
- split the DAG
- move mediation numerique somewhere else

About the tests:
- split the tests that should run on CI from the others
- if we want to test the DAGs we need:
  * a dedicated test database that exists only for the duration of the tests
  * some cleanup before and after
  * etc.

==> "just running" pytest without orchestration has zero chance of working, so we should split the tests run on CI from the others.

Don't forget to re-run the DAGs before and after the changes and check whether the sources or the datawarehouse have changed...
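To make the settings.py TODO more concrete, a minimal sketch of what such a DataSource class could look like; every field name here (extractor, loader, streams, schedule_interval) is an assumption, not the final design:

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DataSource:
    """Declarative description of one source, used to generate its DAG."""

    name: str
    extractor: Callable[[], bytes]        # e.g. an HTTP extractor
    loader: Callable[[bytes], None]       # e.g. an S3-to-datawarehouse loader
    streams: list[str] = field(default_factory=list)
    schedule_interval: str = "@daily"


# Hypothetical usage in settings.py (stub callables, for illustration only).
SOURCES = [
    DataSource(
        name="mediation-numerique",
        extractor=lambda: b"{}",
        loader=lambda data: None,
        streams=["structures", "services"],
        schedule_interval="@daily",
    ),
]
```

Each DAG could then be generated by iterating over SOURCES rather than being written by hand.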
Some of the DAGs (for instance all the INSEE ones) could maybe be combined? I am a little surprised that for those we don't have the datalake layer. I agree they are mostly "sources" in the DI sense, but technically it would be nice to be able to reproduce everything from a single S3 dump. Maybe that's too much? I still think I would sleep better at night if the Extract + Load was always the same. In that case we could still separate the subfolders (see the sketch after this comment), e.g.:
- di-sources
- seeds
- you name it
In the name of DRY.
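To illustrate the "same Extract + Load everywhere, separate subfolders" idea, a possible key-building helper; only the di-sources / seeds prefixes come from the comment above, the function name and layout are assumptions:

```python
from datetime import date
from typing import Optional


def datalake_key(kind: str, source: str, filename: str, day: Optional[date] = None) -> str:
    """Build the datalake key so every source, INSEE ones included, lands
    in the same layout and can be replayed from a single S3 dump.

    `kind` is the subfolder family mentioned above, e.g. "di-sources" or "seeds".
    """
    day = day or date.today()
    return f"{kind}/{source}/{day.isoformat()}/{filename}"


# e.g. "seeds/insee/2024-01-01/communes.csv" (illustrative values)
print(datalake_key("seeds", "insee", "communes.csv", date(2024, 1, 1)))
```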
Force-pushed from ce0b06a to b394b01
You stick your little finger in, and it grabs your whole arm.
I'm retrofitting the helpers introduced in 2/4 into the other DAGs and models.
Along the way I noticed two or three other things, but it's just code rewritten differently to be more DRY.