Refactor indexer #25174

wolfogre · 2023-06-09T10:55:12Z

Refactor modules/indexer to make it more maintainable. And it can be easier to support more features. I'm trying to solve some of issue searching, this is a precursor to making functional changes.

Current supported engines and the index versions:

engines	issues	code
db	Just a wrapper for database queries, doesn't need version	-
bleve	The version of index is 2	The version of index is 6
elasticsearch	The old index has no version, will be treated as version 0 in this PR	The version of index is 1
meilisearch	The old index has no version, will be treated as version 0 in this PR	-

Changes

Split

Splited it into mutiple packages

indexer
├── internal
│   ├── bleve
│   ├── db
│   ├── elasticsearch
│   └── meilisearch
├── code
│   ├── bleve
│   ├── elasticsearch
│   └── internal
└── issues
    ├── bleve
    ├── db
    ├── elasticsearch
    ├── internal
    └── meilisearch

indexer/interanal: Internal shared package for indexer.
indexer/interanal/[engine]: Internal shared package for each engine (bleve/db/elasticsearch/meilisearch).
indexer/code: Implementations for code indexer.
indexer/code/internal: Internal shared package for code indexer.
indexer/code/[engine]: Implementation via each engine for code indexer.
indexer/issues: Implementations for issues indexer.

Deduplication

Combine Init/Ping/Close for code indexer and issues indexer.
~~Combine issues.indexerHolder and code.wrappedIndexer to internal.IndexHolder.~~ Remove it, use dummy indexer instead when the indexer is not ready.
Duplicate two copies of creating ES clients.
Duplicate two copies of indexerID().

Enhancement

Support index version for elasticsearch issues indexer, the old index without version will be treated as version 0.
Fix spell of elastic_search/ElasticSearch, it should be Elasticsearch.
Improve versioning of ES index. We don't need Aliases:
- Gitea does't need aliases for "Zero Downtime" because it never delete old indexes.
- The old code of issues indexer uses the orignal name to create issue index, so it's tricky to convert it to an alias.
Support index version for meilisearch issues indexer, the old index without version will be treated as version 0.
Do "ping" only when Ping has been called, don't ping periodically and cache the status.
Support the context parameter whenever possible.
Fix outdated example config.
Give up the requeue logic of issues indexer: When indexing fails, call Ping to check if it was caused by the engine being unavailable, and only requeue the task if the engine is unavailable.
- It is fragile and tricky, could cause data losing (It did happen when I was doing some tests for this PR). And it works for ES only.
- Just always requeue the failed task, if it caused by bad data, it's a bug of Gitea which should be fixed.

lafriks · 2023-06-18T15:36:33Z

I don't see anytime soon having index changed so much that it would be completely incompatible with old previous version - most probably we would add fields etc not completely change them because for large instances that would mean completely reindex all code search that could probably take multiple days and in that time would not work at all. So code changes must be made so that it is not so.

With alias if ever we need to make full reindex (like populating new fields etc) while it is in progress it can still work with old index (even if in limited functionality with data that is in old index) and that is way better than not having code search functionality for multiple days...

wolfogre · 2023-06-19T03:30:05Z

I don't see anytime soon having index changed so much that it would be completely incompatible with old previous version - most probably we would add fields etc not completely change them because for large instances that would mean completely reindex all code search that could probably take multiple days and in that time would not work at all. So code changes must be made so that it is not so.

With alias if ever we need to make full reindex (like populating new fields etc) while it is in progress it can still work with old index (even if in limited functionality with data that is in old index) and that is way better than not having code search functionality for multiple days...

Hmm, I never thought it would take days. Yes, it could happen. I think you are right.

I need some time to keep the aliases, because the old issue indexer used the index name (which should be alias name) directly to create index, so I need do some thing like: rename index_name to index_name.v0, remove index_name, create alias index_name, bind the alias to index_name.v0.

Update:

Got some trouble, it's not easy to "rename" an index for ES.

Reindex: It copies data to a new index. I'm worried that it would take too long for a larger instance.
Clone: It's unsupported before v7.4.

Update:

Wait, I found that the old code doesn't work this way: "while it is in progress it can still work with old index". Gitea creates a new index and points the alias to it immediately.

gitea/modules/indexer/code/elastic_search.go

Lines 144 to 176 in 51c2aeb

    
           	createIndex, err := b.client.CreateIndex(b.realIndexerName()).BodyString(mapping).Do(ctx) 
        
           	if err != nil { 
        
           		return false, b.checkError(err) 
        
           	} 
        
           	if !createIndex.Acknowledged { 
        
           		return false, fmt.Errorf("create index %s with %s failed", b.realIndexerName(), mapping) 
        
           	} 
        
           } 
        
           // check version 
        
           r, err := b.client.Aliases().Do(ctx) 
        
           if err != nil { 
        
           	return false, b.checkError(err) 
        
           } 
        
           realIndexerNames := r.IndicesByAlias(b.indexerAliasName) 
        
           if len(realIndexerNames) < 1 { 
        
           	res, err := b.client.Alias(). 
        
           		Add(b.realIndexerName(), b.indexerAliasName). 
        
           		Do(ctx) 
        
           	if err != nil { 
        
           		return false, b.checkError(err) 
        
           	} 
        
           	if !res.Acknowledged { 
        
           		return false, fmt.Errorf("create alias %s to index %s failed", b.indexerAliasName, b.realIndexerName()) 
        
           	} 
        
           } else if len(realIndexerNames) >= 1 && realIndexerNames[0] < b.realIndexerName() { 
        
           	log.Warn("Found older gitea indexer named %s, but we will create a new one %s and keep the old NOT DELETED. You can delete the old version after the upgrade succeed.", 
        
           		realIndexerNames[0], b.realIndexerName()) 
        
           	res, err := b.client.Alias(). 
        
           		Remove(realIndexerNames[0], b.indexerAliasName). 
        
           		Add(b.realIndexerName(), b.indexerAliasName). 
        
           		Do(ctx)

So it use the new index even "it is in progress".

@lafriks I think your ideas are reasonable. But maybe we could do something for it in another PR because the old code doesn't work too.

Maybe:

Find a way to determine whether the new index is in progress.
If it is, find an older version index to read (just searching with name v6, v5, v4 ...), so it works with all engines even they don't support aliases like meilisearch.
When the new index is ready, read from it.

All logic can be implemented within GItea, and it doesn't require alias.

@lafriks What do you think?

wolfogre · 2023-06-20T02:15:08Z

@lafriks Kindly ping. 👀

wolfogre · 2023-06-21T06:35:21Z

I think this refactoring PR can't wait too long. Since it has already received two approvals, I will wait another 2 days. If no one objects, I will merge it. If this PR causes any regression, please let me know by @ me.

lunny · 2023-06-23T03:34:06Z

@lafriks Please review again

delvh · 2023-06-23T07:55:41Z

I think @lafriks had the chance to review already and didn't use it.
The "2 day deadline" was already three days ago.

Fix regression of #25174. The `Init` of the db indexer should return true to indicate that the index was opened/existed, or the indexer will try to populate the index (not really populate, just fill the queue, `Index` method of the db indexer is a dummy).

Fix regression of #5363 (so long ago). The old code definded a document mapping for `issueIndexerDocType`, and assigned it to `BleveIndexerData` as its type. (`BleveIndexerData` has been renamed to `IndexerData` in #25174, but nothing more.) But the old code never used `BleveIndexerData`, it wrote the index with an anonymous struct type. Nonetheless, bleve would use the default auto-mapping for struct it didn't know, so the indexer still worked. This means the custom document mapping was always dead code. The custom document mapping is not useless, it can reduce index storage, this PR brings it back and disable default mapping to prevent it from happening again. Since `IndexerData`(`BleveIndexerData`) has JSON tags, and bleve uses them first, so we should use `repo_id` as the field name instead of `RepoID`. I did a test to compare the storage size before and after this, with about 3k real comments that were migrated from some public repos. Before: ```text [ 160] . ├── [ 42] index_meta.json ├── [ 13] rupture_meta.json └── [ 128] store ├── [6.9M] 00000000005d.zap └── [256K] root.bolt ``` After: ```text [ 160] . ├── [ 42] index_meta.json ├── [ 13] rupture_meta.json └── [ 128] store ├── [3.5M] 000000000065.zap └── [256K] root.bolt ``` It saves about half the storage space. --------- Co-authored-by: Giteabot <teabot@gitea.io>

wolfogre added 25 commits June 8, 2023 14:08

fix: mark stats

22777af

feat: bleve issue basic

534945b

feat: elasticsearch issue

15d6247

fix: better composite

afde928

feat: db issue

94a5736

fix: inner

242095f

fix: move gitea_bleve

e3190e6

feat: code indexer

3dbc2e4

feat: new IndexerHolder

3a2873a

fix: remove generic

5748343

feat: use holder in code indexer

e609a36

fix: be safe with nil indexer

c43676e

fix: Get never return nil

72b1a39

chore: meilisearch

34d136d

fix: spell of Elasticsearch

94a6620

chore: remove useless code for DB

7e033b9

feat: bleve mapping

f7aff40

fix: elasticsearch with version

b6f20d3

feat: base32

a1ce99b

fix: esRepoIndexerLatestVersion

d062296

fix: split issues

4f1d3b7

fix: split codes

a29ecee

chore: issues/internal

cc6d0aa

fix: indexer_internal for code

a1d48a6

fix: meilisearch

1bc8d3e

wolfogre added the type/refactoring Existing code has been cleaned up. There should be no new functionality. label Jun 9, 2023

GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Jun 9, 2023

pull-request-size bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 9, 2023

wolfogre added 2 commits June 9, 2023 18:57

Merge branch 'main' into feature/refactor_indexer

ecfbf77

fix: test

419ea3f

wolfogre added pr/wip This PR is not ready for review and removed pr/wip This PR is not ready for review labels Jun 19, 2023

delvh approved these changes Jun 22, 2023

View reviewed changes

delvh added the reviewed/wait-merge This pull request is part of the merge queue. It will be merged soon. label Jun 22, 2023

GiteaBot added 4 commits June 22, 2023 17:59

Merge branch 'main' into feature/refactor_indexer

0fb08ff

Merge branch 'main' into feature/refactor_indexer

bf63432

Merge branch 'main' into feature/refactor_indexer

5f2f088

Merge branch 'main' into feature/refactor_indexer

3ef0f64

Merge branch 'main' into feature/refactor_indexer

1c19169

GiteaBot added 3 commits June 23, 2023 06:33

Merge branch 'main' into feature/refactor_indexer

c98a801

Merge branch 'main' into feature/refactor_indexer

ebf89ad

Merge branch 'main' into feature/refactor_indexer

a5a8244

silverwind enabled auto-merge (squash) June 23, 2023 12:25

silverwind merged commit 375fd15 into go-gitea:main Jun 23, 2023

GiteaBot removed the reviewed/wait-merge This pull request is part of the merge queue. It will be merged soon. label Jun 23, 2023

wolfogre mentioned this pull request Jun 23, 2023

Fix Init of db indexer #25470

Merged

wolfogre mentioned this pull request Jul 1, 2023

Fix issues indexer document mapping #25619

Merged

lunny mentioned this pull request Jul 5, 2023

code search with meilisearch #24149

Closed

go-gitea locked as resolved and limited conversation to collaborators Sep 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor indexer #25174

Refactor indexer #25174

wolfogre commented Jun 9, 2023 •

edited

Loading

lafriks commented Jun 18, 2023 •

edited

Loading

wolfogre commented Jun 19, 2023 •

edited

Loading

wolfogre commented Jun 20, 2023

wolfogre commented Jun 21, 2023

lunny commented Jun 23, 2023

delvh commented Jun 23, 2023

Refactor indexer #25174

Refactor indexer #25174

Conversation

wolfogre commented Jun 9, 2023 • edited Loading

Changes

Split

Deduplication

Enhancement

lafriks commented Jun 18, 2023 • edited Loading

wolfogre commented Jun 19, 2023 • edited Loading

wolfogre commented Jun 20, 2023

wolfogre commented Jun 21, 2023

lunny commented Jun 23, 2023

delvh commented Jun 23, 2023

wolfogre commented Jun 9, 2023 •

edited

Loading

lafriks commented Jun 18, 2023 •

edited

Loading

wolfogre commented Jun 19, 2023 •

edited

Loading