Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Codegen] Codegen fails to finish with many nested fragments #3434

Open
pm-dev opened this issue Aug 29, 2024 · 10 comments
Open

[Codegen] Codegen fails to finish with many nested fragments #3434

pm-dev opened this issue Aug 29, 2024 · 10 comments
Labels
bug Generally incorrect behavior codegen Issues related to or arising from code generation

Comments

@pm-dev
Copy link

pm-dev commented Aug 29, 2024

Summary

When running codegen on a large project with many operations with nested fragments, codegen hangs and never finishes. We were hoping the disable fragment merging feature would solve this, sadly it does not.

In an email exchange with @AnthonyMDev it sounded like this is a known issue:

This is also a known issue. It’s pretty rare, but some users with a LOT of nested fragments, we’re seeing this issue currently.

It’s due to the collections that hold all the computed merged fragment fields being so large that they are causing these issues. My current hypothesis is that the performance problem is mostly seen when the collection exceeds its capacity and is copied into a larger buffer.

I’ve been dealing with other issues, so I haven’t gotten around to this one yet, but my thought is that I can replace the array with a more suitable data structure. I think a custom tree implementation is going to be able to resolve it. Once I get the fragment merging resolved, I’ll try to work on that...

Wanted to open an issue so we can track it and prioritize it! Thanks, looking forward to finally adopting v1!

Version

1.0

Steps to reproduce the behavior

I've previously submitted a schema and .graphql operation files that allow reproducing this. Happy to send it again.

Logs

No response

Anything else?

No response

@pm-dev pm-dev added bug Generally incorrect behavior needs investigation labels Aug 29, 2024
@calvincestari
Copy link
Member

Thanks for creating the issue @pm-dev. I thought we already had an issue open for this work but I can't seem to find it so I'll add this one to the relevant milestone for tracking the work.

@calvincestari calvincestari added codegen Issues related to or arising from code generation and removed needs investigation labels Aug 29, 2024
@calvincestari calvincestari added this to the Minor Releases (1.x) milestone Aug 29, 2024
@pm-dev
Copy link
Author

pm-dev commented Dec 12, 2024

Any progress on this? The roadmap has said this is in progress for a while.

(in progress) Fix retain cycles and memory issues causing code generation to take very long on certain large, complex schemas with deeply nested fragment composition

@AnthonyMDev
Copy link
Contributor

Well, I've made progress in the Edisonian sense. Which is to say, I've continued to rule out things that have don't solve the problem. I just recently had a meeting with another team that was exploring some possibilities for resolving this, but nothing that they have tried has worked either.

I do continue to come back to this issue as I have new ideas for things to try though.

One idea that I have is to develop some sort of short-lived SQLLite database to hold the data that is currently stored in-memory in the EntitySelectionTree and DefinitionEntityStorage. It seems like the bottleneck is likely related to the exponential growth of those data structures. This is not a small effort though, and I'm currently focusing on other projects.

As soon as I get another chance to take a crack at this, that is what I'll be exploring next.

@pm-dev
Copy link
Author

pm-dev commented Dec 13, 2024

Thanks for keeping this high priority. Considering the js version runs in about a second, do you know what the difference in algorithm is?

@AnthonyMDev
Copy link
Contributor

@pm-dev I cannot believe it... but I think I finally cracked it!!!

This only works if you disable fragment field merging in your generated models, but I imagine that's not a problem for you. Can you try running the codegen again using the codegen libary in This PR: apollographql/apollo-ios-dev#571

@pm-dev
Copy link
Author

pm-dev commented Jan 6, 2025

@AnthonyMDev ON IT!!

@pm-dev
Copy link
Author

pm-dev commented Jan 6, 2025

It finished! Nice work!
It took 1min and 15secs on my M1 with 32GB ram and 10 CPU cores. The memory usage was low ~50mb and wasn't increasing so it doesn't seem like there are any leaks. The CPU usage was maxing out my 10 cores for the entire duration. Our schema file is 275,000 lines when in json format and we have around 400 operations that make use of some heavily nested fragments, so this is a pretty large/complex codegen, but I'm still curious how the js cli tool (from before v1) can run in 1 second and this takes over a minute..?

@AnthonyMDev
Copy link
Contributor

I don't think I can give you a precise answer on why that is. But the JS codegen tool was very naive and did not create models that are nearly as feature rich as the new ones. The templating code was also written with a focus on making it easier for us to modify the templates in the future over performance. That said, most projects finish code generation in less than 15 seconds, so it hasn't even something we've felt the need to work on performance tuning for. The issues you were experiencing before were a separate issue that is now resolved finally.

@pm-dev
Copy link
Author

pm-dev commented Jan 8, 2025

Thanks again for all the work on this issue. The reason I ask is to understand the tradeoffs before committing to an upgrade from 0.50 to v1. If codegen takes 75x longer it's important to know what additional features come with. My understanding is the main goal/benefit of v1 was the smart merging of fields from parents and siblings in model objects. Since we have this turned off and there's still a slowdown I was curious what else might be going on and whether there are benefits outside the reduced code footprint.
And the reduction in generated code is real.. the amount of generated code in my project goes from ~430,000 lines to ~150,000 lines. I haven't yet compared how this affects compile time or binary size, but I can guess it's not trivial and would be a nice win.
I'm using v1 in some personal side projects and codegen runs great, but wondering if maybe there should be some warnings published about using this library for larger projects, if that's not a priority.

@AnthonyMDev
Copy link
Contributor

You are correct that the primary new feature offered in 1.0 was the field merging. In addition to that, it also made multi-module projects work with codegen a lot better. But I'd say that more importantly, the benefit of upgrading is that you are on the actively maintained version of the library. There are a lot of features and bug fixes that we have implemented since 1.0 and a ton more on the way.

Since 1.0 we have added a test mocking library, @defer support, and a ton of other minor features, improvements, and bug fixes.

Right now, we are in the process of adding the @typePolicy directive for cache control, support for the Swift 6 concurrency model, and beginning exploration on the next generation of the GraphQL Normalized Cache which will bring support for facet search of local data, cache eviction with TTL and cascading deletes, hanging entity detection and cleanup, and more.

The legacy version is EOL, so you aren't going to get any non-critical bug fixes or any of the new features.


This is not necessarily a problem with "large projects" we work closely with many customers with very large projects at some of the biggest companies in the world. There are always some complexities and frustrations that come with scale, but Apollo iOS does support these projects. Codegen time is higher than it was in the legacy version, but even with full field merging enabled, these projects completed their code generation runs in 5-15 minutes. Once we implemented parallelism in the codegen engine, along with some performance improvements we found, that time was reduced significantly. And now, if they turn off field merging, it should be even faster. The problem you were experiencing is specifically with projects that use the pattern of many deeply nested fragments, which just overloads the field merging algorithm.

We've thought about this a lot internally, and we think it's unlikely that projects that use this pattern would actually want to have fragment field merging enabled anyways. It creates massive, difficult to understand operation models. Its more likely that you are using the individual fragments themselves as models in this design, so having isolated fragments models without merging would probably be preferred.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Generally incorrect behavior codegen Issues related to or arising from code generation
Projects
None yet
Development

No branches or pull requests

3 participants