conda-forge repodata is missing new packages as of 2024-09-07 #1024

Closed
2 tasks done
beckermr opened this issue Sep 8, 2024 · 10 comments
Labels
type::bug describes erroneous operation, use severity::* to classify the type

Comments

@beckermr

beckermr commented Sep 8, 2024

Checklist

  • I added a descriptive title
  • I searched open reports and couldn't find a duplicate

What happened?

New packages are not appearing in the conda-forge repodata. See for example https://anaconda.org/conda-forge/conda-forge-feedstock-ops/files, which is at 0.5.0, but conda search yields

% conda search conda-forge-feedstock-ops
Loading channels: done
# Name                       Version           Build  Channel             
conda-forge-feedstock-ops           0.3.0    pyhd8ed1ab_1  conda-forge       

Additional Context

No response

@beckermr beckermr added the type::bug describes erroneous operation, use severity::* to classify the type label Sep 8, 2024
@beckermr
Author

beckermr commented Sep 8, 2024

cc @jezdez @chenghlee

@jezdez
Member

jezdez commented Sep 8, 2024

Looking

@jezdez
Member

jezdez commented Sep 8, 2024

I've raised an incident internally to investigate ASAP, will keep you up-to-date

@morremeyer

Hey everyone,

At 2024-09-07T01:29:29 UTC, for reasons we have not yet determined, the cloning process got stuck. The clone job never terminated, so no new jobs were started.

We have terminated the stuck job, and the next job updated the clone to 2024-09-08T18:52:26 UTC.

@morremeyer

The cloning has stabilized now. We will continue our root cause investigation tomorrow and implement measures to prevent this from happening again.

@morremeyer

To prevent the process from getting stuck and blocking subsequent runs, earlier today we implemented a mechanism that terminates stuck jobs after 30 minutes, so that if a job gets stuck again, the next job is unblocked.

We have not yet identified the root cause for this issue and are continuing to look into it.

@morremeyer

We finished the root cause analysis for this issue and concluded that the likely cause was a network issue. Combined with the requests for repodata.json files not having explicitly configured timeouts, this led to the clone job waiting for an HTTP response that never arrived.

When it started at 2024-09-07T01:24:09Z, the clone job began downloading the repodata.json files for all 19 default subdirs configured in conda-index.
17 of these downloads were successful and processed, but 2 of them never show up as processed in the logs: osx-arm64 and win-64.
The first step of the clone process, downloading the repodata, only finishes when all subdirs are processed.
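
As an illustration of why two missing subdirs hold up the whole step, here is a minimal sketch (not the actual clone code) that waits for every subdir download but gives up at a deadline and reports which subdirs are still outstanding. The names download_all, download and deadline_seconds are hypothetical, and the use of a thread pool is an assumption made for the example.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def download_all(subdirs, download, deadline_seconds):
    """Run download(subdir) for every subdir and report which ones stalled.

    Returns (results, stalled): results maps each finished subdir to its
    return value, stalled lists the subdirs still running at the deadline.
    All names here are hypothetical; this is not the real clone code.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(subdirs)))
    futures = {pool.submit(download, subdir): subdir for subdir in subdirs}
    # wait() returns once every future is done or the deadline has passed.
    done, not_done = wait(futures, timeout=deadline_seconds)
    # Do not block on stalled workers. Threads cannot be killed, so a real
    # job would still need per-request timeouts or an external watchdog.
    pool.shutdown(wait=False, cancel_futures=True)
    results = {futures[future]: future.result() for future in done}
    stalled = sorted(futures[future] for future in not_done)
    return results, stalled
```

Without the deadline, the step would only return once every future completes, which matches the behaviour described above: one response that never arrives is enough to block the job. The cancel_futures argument needs Python 3.9 or newer.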

The anaconda.org backend logs the requests for the repodata of these two subdirs as successful with an HTTP 200; however, the clone job never seems to have received these responses and was therefore waiting for them indefinitely.

We have already implemented measures to terminate stuck jobs after 30 minutes for conda-forge, and have applied this to all other cloned channels, too.
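
For readers wondering what such a safeguard can look like, here is a minimal sketch, assuming the clone job runs as a separate process. The thread does not say how the termination is actually implemented, so the function name run_clone_job and the command passed to it are hypothetical. Only the 30 minute limit comes from the description above.

```python
import subprocess

# 30 minutes, as described in the thread.
JOB_TIMEOUT_SECONDS = 30 * 60


def run_clone_job(command, timeout=JOB_TIMEOUT_SECONDS):
    """Run one clone job; return True if it finished, False if it was killed.

    command is a hypothetical argument list for whatever starts the job.
    """
    try:
        # subprocess.run kills the child process when the timeout expires
        # and then raises TimeoutExpired.
        subprocess.run(command, check=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return True
```

A scheduler built on this would start the next job regardless of the return value, so a single stuck run costs at most one timeout period.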

Additionally, we have introduced timeouts to the HTTP requests for the repodata.json files, so that the job aborts if these responses time out, leading to a new job being started.
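
To make the timeout idea concrete, here is a minimal sketch using only the Python standard library (urllib) rather than the library the clone actually uses. The function name fetch_repodata, the opener parameter and the 90 second default are assumptions for illustration. The requests library accepts a comparable timeout argument on its request calls.

```python
import socket
import urllib.error
import urllib.request

# Hypothetical default; the thread does not state the value that was chosen.
TIMEOUT_SECONDS = 90.0


def fetch_repodata(url, timeout=TIMEOUT_SECONDS, opener=urllib.request.urlopen):
    """Download one repodata.json, raising TimeoutError if the server stalls.

    opener is injectable so the function can be exercised without a network.
    """
    try:
        # The timeout applies to each blocking socket operation, so a
        # response that never arrives raises instead of hanging forever.
        with opener(url, timeout=timeout) as response:
            return response.read()
    except socket.timeout as exc:
        raise TimeoutError(f"timed out after {timeout}s: {url}") from exc
    except urllib.error.URLError as exc:
        # Timeouts while connecting can arrive wrapped in a URLError.
        if isinstance(exc.reason, socket.timeout):
            raise TimeoutError(f"timed out after {timeout}s: {url}") from exc
        raise
```

Note that this bounds each socket operation, not the total download time, so a server that trickles data slowly could still keep a job busy. That is one reason to keep a job-level limit as well.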

@dholth
Contributor

dholth commented Sep 10, 2024

We used to have a 60/90s timeout for repodata.json downloads; have we shuffled the implementation?

@morremeyer

Yes, with the code changes for the current version (v3) of the cloning, the implementation was changed to Python requests, where we did not implement the timeouts initially.

This has been fixed with the changes I mentioned above.

@jezdez
Member

jezdez commented Sep 12, 2024

Closing as resolved

@jezdez jezdez closed this as completed Sep 12, 2024