feat(search): support code search by zoekt #33850
There are already so many search engines built into Gitea, and many of them have various bugs. So the questions are:
To be honest, I prefer this Zoekt search engine to the existing search engine.
Maybe this can replace Bleve, but we need some comparison tests.
That's understandable. But then a few months later someone else feels "yoekt" is better and introduces "yoekt"; a few months after that, someone feels "xoekt" is better and introduces "xoekt"; and then "woekt", "voekt", "uoekt" ... "coekt", "boekt", "aoekt". Eventually Gitea contains every search engine on the internet. I don't object to introducing improvements. But actually it needs to:
So a clear roadmap for the "search engine plan" is necessary.
In my opinion, supporting multiple search engines is a good thing, as users may have different needs. Even GitLab now supports both ES and Zoekt search engines. see https://docs.gitlab.com/user/search
I'm not too worried about this; Gitea should have good community maintenance. It might be that, because the code search functionality is not exposed by default, many bugs haven't been discovered.
Well, do you know how many search engines are in Gitea now, and what long-standing bugs they have? See https://github.com/go-gitea/gitea/issues?q=is%3Aissue%20state%3Aopen%20code%20search Some bugs haven't been fixed in months, for example "Search Functionality Issues with Bleve Engine #31565". I don't see "good community maintenance".
You don't need to worry about this: Zoekt is a popular code search engine, currently used by code platforms such as Gerrit, Sourcegraph, and GitLab. It was written by a Gerrit author and is maintained by Sourcegraph. Zoekt has advantages that traditional search engines (like ES) do not: support for regex matching, substring search, etc. I don't think any new open-source code search engine will be able to replace it in the short term.
You are right. Where should the roadmap be written? I don't have experience with this. I will add documentation once the Zoekt functionality is more complete.
Yep, if Zoekt wins, we need to drop some others.
Sure, it's regrettable that this part of the content is unmaintained. However, for the Zoekt code search, I can commit to maintaining it thoroughly.
Yeah, I hope this can be divided into at least two steps:
Zoekt may also have some issues, as GitLab has not completely deprecated ES and fully switched to Zoekt...
To make the code clear, we need to refactor the related code first: "Refactor issue & code search #33860". Each "indexer" should itself declare the "search modes" it supports. And we need to remove the "fuzzy" search for code.
Please note that I have many other commitments over the next two weeks, so I may only be able to dedicate time to this MR in a couple of weeks.
@wxiaoguang @lunny I think this CR is ready for an initial review. :)
Can you simply tell me how to enable Zoekt? I would like to try it.
Add this to the [indexer] section of app.ini:
REPO_INDEXER_TYPE = zoekt
REPO_INDEXER_ENABLED = true
REPO_INDEXER_PATH = indexers/repos.zoekt
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Abstract
Zoekt is an open-source search engine designed specifically for code search. It uses 3-gram (trigram) indexing to segment content efficiently. Replacing Elasticsearch/Bleve with Zoekt gives Gitea precise code search and support for regular expression searches.
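To illustrate why 3-gram indexing suits exact substring search, here is a minimal standard-library sketch. It is not Zoekt's implementation: the `TrigramIndex` type and its methods are hypothetical names invented for this example, and real engines store posting lists far more compactly. The idea shown is only the general one: index every 3-byte window, intersect the posting lists of the query's trigrams to get candidates, then verify each candidate against the full text.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// TrigramIndex is a hypothetical, in-memory illustration of trigram indexing.
type TrigramIndex struct {
	docs     map[string]string              // file name -> content
	postings map[string]map[string]struct{} // trigram -> set of file names
}

func NewTrigramIndex() *TrigramIndex {
	return &TrigramIndex{
		docs:     make(map[string]string),
		postings: make(map[string]map[string]struct{}),
	}
}

// Add indexes every 3-byte window of content under the given file name.
// Re-adding a name does not clean up old postings; Search still verifies hits.
func (idx *TrigramIndex) Add(name, content string) {
	idx.docs[name] = content
	for i := 0; i+3 <= len(content); i++ {
		tri := content[i : i+3]
		set := idx.postings[tri]
		if set == nil {
			set = make(map[string]struct{})
			idx.postings[tri] = set
		}
		set[name] = struct{}{}
	}
}

// Search returns the sorted names of files that contain query as an exact substring.
func (idx *TrigramIndex) Search(query string) []string {
	var candidates []string
	if len(query) < 3 {
		// Too short to produce a trigram: every file is a candidate.
		for name := range idx.docs {
			candidates = append(candidates, name)
		}
	} else {
		// A file is a candidate only if it contains every distinct trigram of the query.
		distinct := make(map[string]struct{})
		counts := make(map[string]int)
		for i := 0; i+3 <= len(query); i++ {
			tri := query[i : i+3]
			if _, dup := distinct[tri]; dup {
				continue
			}
			distinct[tri] = struct{}{}
			for name := range idx.postings[tri] {
				counts[name]++
			}
		}
		for name, n := range counts {
			if n == len(distinct) {
				candidates = append(candidates, name)
			}
		}
	}
	// Verification step: trigram presence does not guarantee adjacency.
	var hits []string
	for _, name := range candidates {
		if strings.Contains(idx.docs[name], query) {
			hits = append(hits, name)
		}
	}
	sort.Strings(hits)
	return hits
}

func main() {
	idx := NewTrigramIndex()
	idx.Add("a.go", "func NewDirectorySearcher() {}")
	idx.Add("b.go", "func Search() {}")
	fmt.Println(idx.Search("Searcher"))
	fmt.Println(idx.Search("Search"))
}
```

A tokenizing full-text engine would typically match whole words, whereas this approach can find a query that starts or ends in the middle of an identifier.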
Motivation
The existing code search functionality is implemented using Elasticsearch/Bleve. Although Elasticsearch and Bleve excel in general search domains, their disadvantages in code search are obvious:
Proposal
Goals
Support precise substring searches
Support regex searches
Non-Goals
Support multi-branch searches
Support code symbol syntax searches
Competitive Product Analysis
Design
Index
Since Zoekt is written in Go, it can be integrated directly through its Go package, using indexBuilder.Add() and indexBuilder.MarkFileAsChangedOrRemoved() to add or remove indexed files. The basic process for full and incremental repository indexing with Zoekt does not differ significantly from that for Elasticsearch (ES) or Bleve.
Search
We can use shards.NewDirectorySearcher() or shards.NewDirectorySearcherFast() to build a searcher for searching. The search modes will support:
Since the search is currently limited to a single repository, we will retrieve all the content first and then handle pagination.
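The "retrieve everything, then paginate" approach can be sketched with the standard library alone. This is an illustration, not the PR's code: `paginate` and its parameters are hypothetical names, and the result list stands in for whatever match type the searcher returns.

```go
package main

import "fmt"

// paginate returns the slice of results on the given 1-based page,
// together with the total number of results for rendering page links.
// It is a hypothetical helper written for this example.
func paginate(all []string, page, pageSize int) ([]string, int) {
	total := len(all)
	if pageSize < 1 {
		return nil, total
	}
	if page < 1 {
		page = 1
	}
	start := (page - 1) * pageSize
	if start >= total {
		// Requested page is past the end of the result set.
		return nil, total
	}
	end := start + pageSize
	if end > total {
		end = total
	}
	return all[start:end], total
}

func main() {
	// Pretend these are all matches returned for a single repository.
	matches := []string{"a.go", "b.go", "c.go", "d.go", "e.go"}
	pageItems, total := paginate(matches, 2, 2)
	fmt.Println(pageItems, total)
}
```

This keeps the pagination logic trivial at the cost of holding the full result set in memory, which is acceptable only while searches are scoped to one repository.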
Usage
Enable Zoekt in the [indexer] section of app.ini.
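For reference, the settings quoted in the conversation above would look like this in app.ini. The index path is the one given in the thread; adjust it to your installation.

```ini
[indexer]
; Use Zoekt as the repository (code) indexer
REPO_INDEXER_ENABLED = true
REPO_INDEXER_TYPE = zoekt
REPO_INDEXER_PATH = indexers/repos.zoekt
```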
Resource Usage
Building the index in Zoekt requires 1.2 times the corpus size in RAM, and the index storage size is about three times the corpus size. Maybe we should expose some of Zoekt's internal Prometheus metrics in the future?
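As a rough sizing aid, the ratios stated above (about 1.2 times the corpus size in RAM while indexing, about 3 times on disk) can be turned into a small calculation. The `estimate` function is a hypothetical helper, and the ratios are only the approximate figures quoted in this description, not measured values.

```go
package main

import "fmt"

// estimate applies the approximate ratios quoted in the text:
// RAM during indexing is about 1.2x the corpus, index storage about 3x.
// Integer arithmetic is used, so results are rounded down to whole bytes.
func estimate(corpusBytes int64) (ramBytes, diskBytes int64) {
	ramBytes = corpusBytes * 12 / 10
	diskBytes = corpusBytes * 3
	return ramBytes, diskBytes
}

func main() {
	const gib = int64(1) << 30
	ram, disk := estimate(10 * gib) // a 10 GiB corpus
	fmt.Printf("RAM: %d GiB, disk: %d GiB\n", ram/gib, disk/gib)
}
```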
Existing Issues
Try to support #33702