-
Notifications
You must be signed in to change notification settings - Fork 14.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Random "duplicate key value violates unique constraint" errors when initializing the postgres database #23512
Comments
@ephraimbuddy sorry man, i actually have no idea what could cause the issue. i have only wild guesses 😅 |
Oh, I assigned it to you because you checked that you are willing to submit a PR. I'll unassign you now :) |
further investigation shows that we are removing the same "permissions" and "permission views" repeatedly and trying to add them back repeatedly:
notice the 2 I don't understand why we The database screenshots which shows the permissions and permission views are being deleted and added back hence their IDs are increasing interestingly previous airflow versions also have the same behavior but there were no errors |
This sounds like a bug in sync_appbuilder_roles implementation - when we have more than 1 webserver or when they are crashing during first start they then subsequently fail on starting another webserver via wsgi. @jhtimmins - maybe you can take a look at that ? |
i see those repeated log messages right after booting up 4 worker sub-processes, are these workers trying to update the database at the same time using the same |
Yeah. I think the problem is that the webserver spawns 4 parallel webserver processes (workers) and each process tries to update the permissions at the same time) |
Which library I should downgrade to get my upgrade going. Any suggestions, please? |
I don't think downgrade library will work? Seems like we have to wait for v2.3.1. I am also facing exact thing for v2.3.0 using docker-compose installation |
both Line 136 in ee100a5
and Line 146 in ee100a5
are doing clashing reads/writes to the database once this line
spins up 4 (by default) gunicorn workers. so i guess if i do this os.environ['SKIP_DAGS_PARSING'] = 'True'
app = create_app(None)
os.environ.pop('SKIP_DAGS_PARSING')
# set webserver.update_fab_perms to "False" here
# would cause the workers not to do the clashing reads/writes
with subprocess.Popen(run_args, close_fds=True) as gunicorn_master_proc:
monitor_gunicorn(gunicorn_master_proc.pid) i know it's weird to alter it on the fly. an alternative is to set an env var such as os.environ['UPDATE_FAB_PERMS_DONE'] = 'True' and then check it along side with Line 67 in ee100a5
i tried the env var solution, it worked. but it's janky. it requires an extra env var 🤔 |
This seems somewhat related to #23926. cc @jhtimmins |
aha, i think it's exactly the same as #23926: all gunicorn workers are executing the same code causing clashes on the database maybe i'll move my comments there |
@jhtimmins #24065 didn't fix this issue:
|
still seeing
and
|
Seeing this in 2.4.1 also. Workaround is to set config |
@hterik sadly sometimes we do need that config to be |
I believe This is going to be fixed in 2.4.2 #26885 |
how does #26885 solve the clashing reads/writes? 🤔 |
I've seen this from time to time and this is a separate issue to the null constraints violation. |
Ah. My mistake. |
so |
I also see this issue every time when the web server starts up. Env |
I can see the same problem after enabling Okta. Unfortunately, for me it happens every time Kubernetes tries to start a new webserver pod. Airflow 2.3.3. Logs: https://gist.github.com/mjkonarski-b/8dd973b5a5ed99446508bf2ca74f967e |
Saw this after upgrading from 2.2.5 to 2.3.2. Dropping |
i checked out wget -qO - https://github.com/apache/airflow/pull/27297.patch | git apply -v -3 then built my own airflow breeze release-management prepare-airflow-package --package-format=wheel --verbose then installed it pip install apache_airflow-2.4.2-py3-none-any.whl[...] --constraint ... no more "duplicate key value violates unique constraint" errors |
Thanks for the good news. I tried to test in our Azure deployment (using the official Helm chart) and Airflow 2.4.2 and it seems in our deployment it is still showing the same constraint errors. I am a bit confused because the milestoone was now modfied to be Airflow 2.4.3 but above the success is reported on 2.4.2... which release it it assumed to be in? Our logs still look like attached (don't want to spoil the message thread with hundrets of lines...,):
|
No confusion. There were two similar issues - one of them (for some users) fixed in 2.4.2 and another similar issue for some other users that will be released in 2.4.3. The milestone is always "current" thinking and if it moves, it moves. |
@jens-scheffler-bosch also you'd have to patch 2.4.2 manually |
if the preload fix is indeed the fix, you can suppress this for now by setting the |
Great point @mdering ! |
Thanks for the clarification, looking forward for 2.4.3 :-D |
We are still having this issue in 2.4.3. We set [webserver].workers to 1 for a workaround. Here were the logs we were seeing: airflow-web 127.0.0.1 - - [23/Nov/2022:15:35:59 +0000] "GET /login/ HTTP/1.1" 200 16330 "https://login.microsoftonline.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.52" |
Can you please open a new issue for 2.4.3 @zachliu with more details? I have a feeling this is somehow different issue. |
@potiuk you probably @ahipp13 Can you please open a new issue for 2.4.3 with more details? I also have a feeling this is a different issue. issue #23512 is centered around the |
Not sure why - but yes. I ping the wrong person :) |
Issue submitted #27977 |
Apache Airflow version
2.3.0 (latest released)
What happened
while testing airflow 2.3.0 locally (using postgresql 12.4), the webserver container shows random errors:
notes:
permission_view_id
but always the same'role_id': 1
)permission_view_id
each timeHowever such exceptions are rare and pure random, i can't find a way to reproduce them consistently.
What you think should happen instead
prior to 2.3.0 there were no such errors
How to reproduce
No response
Operating System
Linux Mint 20.3
Versions of Apache Airflow Providers
No response
Deployment
Docker-Compose
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
Code of Conduct
The text was updated successfully, but these errors were encountered: