Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor indexer #25174

Merged
merged 54 commits into from
Jun 23, 2023
Merged

Refactor indexer #25174

merged 54 commits into from
Jun 23, 2023

Conversation

wolfogre
Copy link
Member

@wolfogre wolfogre commented Jun 9, 2023

Refactor modules/indexer to make it more maintainable. And it can be easier to support more features. I'm trying to solve some of issue searching, this is a precursor to making functional changes.

Current supported engines and the index versions:

engines issues code
db Just a wrapper for database queries, doesn't need version -
bleve The version of index is 2 The version of index is 6
elasticsearch The old index has no version, will be treated as version 0 in this PR The version of index is 1
meilisearch The old index has no version, will be treated as version 0 in this PR -

Changes

Split

Splited it into mutiple packages

indexer
├── internal
│   ├── bleve
│   ├── db
│   ├── elasticsearch
│   └── meilisearch
├── code
│   ├── bleve
│   ├── elasticsearch
│   └── internal
└── issues
    ├── bleve
    ├── db
    ├── elasticsearch
    ├── internal
    └── meilisearch
  • indexer/interanal: Internal shared package for indexer.
  • indexer/interanal/[engine]: Internal shared package for each engine (bleve/db/elasticsearch/meilisearch).
  • indexer/code: Implementations for code indexer.
  • indexer/code/internal: Internal shared package for code indexer.
  • indexer/code/[engine]: Implementation via each engine for code indexer.
  • indexer/issues: Implementations for issues indexer.

Deduplication

  • Combine Init/Ping/Close for code indexer and issues indexer.
  • Combine issues.indexerHolder and code.wrappedIndexer to internal.IndexHolder. Remove it, use dummy indexer instead when the indexer is not ready.
  • Duplicate two copies of creating ES clients.
  • Duplicate two copies of indexerID().

Enhancement

  • Support index version for elasticsearch issues indexer, the old index without version will be treated as version 0.
  • Fix spell of elastic_search/ElasticSearch, it should be Elasticsearch.
  • Improve versioning of ES index. We don't need Aliases:
    • Gitea does't need aliases for "Zero Downtime" because it never delete old indexes.
    • The old code of issues indexer uses the orignal name to create issue index, so it's tricky to convert it to an alias.
  • Support index version for meilisearch issues indexer, the old index without version will be treated as version 0.
  • Do "ping" only when Ping has been called, don't ping periodically and cache the status.
  • Support the context parameter whenever possible.
  • Fix outdated example config.
  • Give up the requeue logic of issues indexer: When indexing fails, call Ping to check if it was caused by the engine being unavailable, and only requeue the task if the engine is unavailable.
    • It is fragile and tricky, could cause data losing (It did happen when I was doing some tests for this PR). And it works for ES only.
    • Just always requeue the failed task, if it caused by bad data, it's a bug of Gitea which should be fixed.

@wolfogre wolfogre added the type/refactoring Existing code has been cleaned up. There should be no new functionality. label Jun 9, 2023
@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Jun 9, 2023
@pull-request-size pull-request-size bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 9, 2023
@lafriks
Copy link
Member

lafriks commented Jun 18, 2023

I don't see anytime soon having index changed so much that it would be completely incompatible with old previous version - most probably we would add fields etc not completely change them because for large instances that would mean completely reindex all code search that could probably take multiple days and in that time would not work at all. So code changes must be made so that it is not so.

With alias if ever we need to make full reindex (like populating new fields etc) while it is in progress it can still work with old index (even if in limited functionality with data that is in old index) and that is way better than not having code search functionality for multiple days...

@wolfogre
Copy link
Member Author

wolfogre commented Jun 19, 2023

I don't see anytime soon having index changed so much that it would be completely incompatible with old previous version - most probably we would add fields etc not completely change them because for large instances that would mean completely reindex all code search that could probably take multiple days and in that time would not work at all. So code changes must be made so that it is not so.

With alias if ever we need to make full reindex (like populating new fields etc) while it is in progress it can still work with old index (even if in limited functionality with data that is in old index) and that is way better than not having code search functionality for multiple days...

Hmm, I never thought it would take days. Yes, it could happen. I think you are right.

I need some time to keep the aliases, because the old issue indexer used the index name (which should be alias name) directly to create index, so I need do some thing like: rename index_name to index_name.v0, remove index_name, create alias index_name, bind the alias to index_name.v0.


Update:

Got some trouble, it's not easy to "rename" an index for ES.

  • Reindex: It copies data to a new index. I'm worried that it would take too long for a larger instance.
  • Clone: It's unsupported before v7.4.

Update:

Wait, I found that the old code doesn't work this way: "while it is in progress it can still work with old index". Gitea creates a new index and points the alias to it immediately.

createIndex, err := b.client.CreateIndex(b.realIndexerName()).BodyString(mapping).Do(ctx)
if err != nil {
return false, b.checkError(err)
}
if !createIndex.Acknowledged {
return false, fmt.Errorf("create index %s with %s failed", b.realIndexerName(), mapping)
}
}
// check version
r, err := b.client.Aliases().Do(ctx)
if err != nil {
return false, b.checkError(err)
}
realIndexerNames := r.IndicesByAlias(b.indexerAliasName)
if len(realIndexerNames) < 1 {
res, err := b.client.Alias().
Add(b.realIndexerName(), b.indexerAliasName).
Do(ctx)
if err != nil {
return false, b.checkError(err)
}
if !res.Acknowledged {
return false, fmt.Errorf("create alias %s to index %s failed", b.indexerAliasName, b.realIndexerName())
}
} else if len(realIndexerNames) >= 1 && realIndexerNames[0] < b.realIndexerName() {
log.Warn("Found older gitea indexer named %s, but we will create a new one %s and keep the old NOT DELETED. You can delete the old version after the upgrade succeed.",
realIndexerNames[0], b.realIndexerName())
res, err := b.client.Alias().
Remove(realIndexerNames[0], b.indexerAliasName).
Add(b.realIndexerName(), b.indexerAliasName).
Do(ctx)

So it use the new index even "it is in progress".

@lafriks I think your ideas are reasonable. But maybe we could do something for it in another PR because the old code doesn't work too.

Maybe:

  • Find a way to determine whether the new index is in progress.
  • If it is, find an older version index to read (just searching with name v6, v5, v4 ...), so it works with all engines even they don't support aliases like meilisearch.
  • When the new index is ready, read from it.

All logic can be implemented within GItea, and it doesn't require alias.

@lafriks What do you think?

@wolfogre wolfogre added pr/wip This PR is not ready for review and removed pr/wip This PR is not ready for review labels Jun 19, 2023
@wolfogre
Copy link
Member Author

@lafriks Kindly ping. 👀

@wolfogre
Copy link
Member Author

I think this refactoring PR can't wait too long. Since it has already received two approvals, I will wait another 2 days. If no one objects, I will merge it. If this PR causes any regression, please let me know by @ me.

@delvh delvh added the reviewed/wait-merge This pull request is part of the merge queue. It will be merged soon. label Jun 22, 2023
@lunny
Copy link
Member

lunny commented Jun 23, 2023

@lafriks Please review again

@delvh
Copy link
Member

delvh commented Jun 23, 2023

I think @lafriks had the chance to review already and didn't use it.
The "2 day deadline" was already three days ago.

@silverwind silverwind enabled auto-merge (squash) June 23, 2023 12:25
@silverwind silverwind merged commit 375fd15 into go-gitea:main Jun 23, 2023
@GiteaBot GiteaBot removed the reviewed/wait-merge This pull request is part of the merge queue. It will be merged soon. label Jun 23, 2023
techknowlogick pushed a commit that referenced this pull request Jun 23, 2023
Fix regression of #25174.

The `Init` of the db indexer should return true to indicate that the
index was opened/existed, or the indexer will try to populate the index
(not really populate, just fill the queue, `Index` method of the db
indexer is a dummy).
silverwind pushed a commit that referenced this pull request Jul 4, 2023
Fix regression of #5363 (so long ago).

The old code definded a document mapping for `issueIndexerDocType`, and
assigned it to `BleveIndexerData` as its type. (`BleveIndexerData` has
been renamed to `IndexerData` in #25174, but nothing more.) But the old
code never used `BleveIndexerData`, it wrote the index with an anonymous
struct type. Nonetheless, bleve would use the default auto-mapping for
struct it didn't know, so the indexer still worked. This means the
custom document mapping was always dead code.

The custom document mapping is not useless, it can reduce index storage,
this PR brings it back and disable default mapping to prevent it from
happening again. Since `IndexerData`(`BleveIndexerData`) has JSON tags,
and bleve uses them first, so we should use `repo_id` as the field name
instead of `RepoID`.

I did a test to compare the storage size before and after this, with
about 3k real comments that were migrated from some public repos.

Before:

```text
[ 160]  .
├── [  42]  index_meta.json
├── [  13]  rupture_meta.json
└── [ 128]  store
    ├── [6.9M]  00000000005d.zap
    └── [256K]  root.bolt
```

After:

```text
[ 160]  .
├── [  42]  index_meta.json
├── [  13]  rupture_meta.json
└── [ 128]  store
    ├── [3.5M]  000000000065.zap
    └── [256K]  root.bolt
```

It saves about half the storage space.

---------

Co-authored-by: Giteabot <teabot@gitea.io>
@go-gitea go-gitea locked as resolved and limited conversation to collaborators Sep 21, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. type/refactoring Existing code has been cleaned up. There should be no new functionality.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants