Refactor indexer #25174
I don't see the index changing any time soon so much that it becomes completely incompatible with the previous version. Most probably we would add fields rather than change them completely, because on large instances that would mean reindexing all code search data, which could take multiple days, and code search would not work at all during that time. So code changes must be made in a way that avoids this. With an alias, if we ever need a full reindex (for example to populate new fields), search can keep working against the old index while the reindex is in progress (even with limited functionality, using only the data in the old index), and that is far better than having no code search for multiple days...
Hmm, I never thought it would take days. Yes, it could happen. I think you are right. I need some time to keep the aliases, because the old issue indexer used the index name (which should be the alias name) directly to create the index, so I need to do something like: rename
Update: I ran into some trouble; it's not easy to "rename" an index in ES.
Update: Wait, I found that the old code doesn't work this way: "while it is in progress it can still work with old index". Gitea creates a new index and points the alias to it immediately. gitea/modules/indexer/code/elastic_search.go Lines 144 to 176 in 51c2aeb
So it uses the new index even while "it is in progress". @lafriks I think your ideas are reasonable. But maybe we could do something about it in another PR, because the old code doesn't work this way either. Maybe:
All the logic can be implemented within Gitea, and it doesn't require an alias. @lafriks What do you think?
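The behaviour discussed above (keep serving searches from the old index while a new one is being populated, and switch only once it is ready) could be sketched in Go roughly as follows. This is an in-memory illustration only: `indexRegistry`, `StartReindex`, `FinishReindex`, and the index names are hypothetical and are not Gitea or Elasticsearch APIs.

```go
package main

import (
	"fmt"
	"sync"
)

// indexRegistry is a hypothetical in-memory stand-in for "which physical
// index do searches use right now". It is not a Gitea or Elasticsearch type.
type indexRegistry struct {
	mu       sync.RWMutex
	active   string // index that currently serves searches
	building string // index being populated; empty when idle
}

// StartReindex records a new index as "in progress" without touching the
// active one, so searches keep hitting the old data.
func (r *indexRegistry) StartReindex(newIndex string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.building = newIndex
}

// FinishReindex switches searches to the new index once it is populated.
func (r *indexRegistry) FinishReindex() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.building != "" {
		r.active, r.building = r.building, ""
	}
}

// Active returns the index that searches should use.
func (r *indexRegistry) Active() string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.active
}

func main() {
	r := &indexRegistry{active: "code.v1"} // hypothetical index names
	r.StartReindex("code.v2")
	fmt.Println(r.Active()) // still the old index during the rebuild
	r.FinishReindex()
	fmt.Println(r.Active()) // the new index after the switch
}
```

The point of the sketch is only the ordering: the switch happens after population finishes, not when the new index is created.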
@lafriks Kindly ping. 👀
I think this refactoring PR can't wait too long. Since it has already received two approvals, I will wait another 2 days. If no one objects, I will merge it. If this PR causes any regression, please let me know by mentioning me.
@lafriks Please review again
I think @lafriks had the chance to review already and didn't use it. |
Fix regression of #5363 (so long ago).

The old code defined a document mapping for `issueIndexerDocType`, and assigned it to `BleveIndexerData` as its type. (`BleveIndexerData` has been renamed to `IndexerData` in #25174, but nothing more.) But the old code never used `BleveIndexerData`; it wrote the index with an anonymous struct type. Nonetheless, bleve would use the default auto-mapping for a struct it didn't know, so the indexer still worked. This means the custom document mapping was always dead code.

The custom document mapping is not useless: it can reduce index storage. This PR brings it back and disables the default mapping to prevent this from happening again. Since `IndexerData` (`BleveIndexerData`) has JSON tags, and bleve uses them first, we should use `repo_id` as the field name instead of `RepoID`.

I did a test to compare the storage size before and after this, with about 3k real comments that were migrated from some public repos.

Before:

```text
[ 160]  .
├── [  42]  index_meta.json
├── [  13]  rupture_meta.json
└── [ 128]  store
    ├── [6.9M]  00000000005d.zap
    └── [256K]  root.bolt
```

After:

```text
[ 160]  .
├── [  42]  index_meta.json
├── [  13]  rupture_meta.json
└── [ 128]  store
    ├── [3.5M]  000000000065.zap
    └── [256K]  root.bolt
```

It saves about half the storage space.

---------

Co-authored-by: Giteabot <teabot@gitea.io>
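To illustrate why the JSON tag name (`repo_id`) rather than the Go field name (`RepoID`) is the one that matters, here is a minimal sketch using only the standard library's `encoding/json`. It does not call bleve; `indexerData` is a simplified, hypothetical stand-in for `IndexerData`, and the `Comment` field and the `jsonKeys` helper are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// indexerData is a simplified, hypothetical stand-in for IndexerData.
// Only RepoID and its repo_id tag come from the commit message.
type indexerData struct {
	RepoID  int64  `json:"repo_id"`
	Comment string `json:"comment"` // illustrative field, an assumption
}

// jsonKeys returns the key names that appear when v is serialized to JSON.
func jsonKeys(v any) (map[string]bool, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	var m map[string]json.RawMessage
	if err := json.Unmarshal(raw, &m); err != nil {
		return nil, err
	}
	keys := make(map[string]bool, len(m))
	for k := range m {
		keys[k] = true
	}
	return keys, nil
}

func main() {
	keys, err := jsonKeys(indexerData{RepoID: 1, Comment: "hi"})
	if err != nil {
		panic(err)
	}
	// The tag name is present; the Go field name is not.
	fmt.Println(keys["repo_id"], keys["RepoID"])
}
```

A mapping keyed on `RepoID` would therefore not match a document whose field is exposed under its JSON tag name.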
Refactor `modules/indexer` to make it more maintainable, and to make it easier to support more features. I'm trying to solve some issue-searching problems; this is a precursor to making functional changes.
Current supported engines and the index versions:
Changes
Split
Split it into multiple packages:
- `indexer/internal`: Internal shared package for indexer.
- `indexer/internal/[engine]`: Internal shared package for each engine (bleve/db/elasticsearch/meilisearch).
- `indexer/code`: Implementations for code indexer.
- `indexer/code/internal`: Internal shared package for code indexer.
- `indexer/code/[engine]`: Implementation via each engine for code indexer.
- `indexer/issues`: Implementations for issues indexer.
Deduplication
- `Init/Ping/Close` for code indexer and issues indexer.
- Combine `issues.indexerHolder` and `code.wrappedIndexer` to `internal.IndexHolder`. Remove it, use dummy indexer instead when the indexer is not ready.
- `indexerID()`.
Enhancement
- `elastic_search/ElasticSearch`, it should be `Elasticsearch`.
- `Aliases`:
- `Ping` has been called, don't ping periodically and cache the status.