
Implement basic Storage Indexer #814

Merged
mtojek merged 34 commits into elastic:main on Jun 7, 2022
Conversation

@mtojek (Contributor) commented May 19, 2022

Issue: #767

This PR implements a storage indexer that periodically polls Cloud Storage for cursor updates. When the cursor changes, the indexer fetches the latest search-index-all index and updates the in-memory data. The marshaled index weighs ~21 MB.
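Below is a minimal sketch of the polling mechanism described here, assuming the cursor is stored as a small object in the bucket; `watchCursor`, `readObject`, the object name, and the interval are illustrative, not the PR's actual code:

```go
package indexer

import (
	"context"
	"io"
	"log"
	"time"

	"cloud.google.com/go/storage"
)

// watchCursor periodically reads the "cursor" object and triggers a reload
// of the search-index-all index whenever the cursor value changes.
func watchCursor(ctx context.Context, bucket *storage.BucketHandle, interval time.Duration) {
	var current string
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			cursor, err := readObject(ctx, bucket, "cursor")
			if err != nil {
				log.Printf("reading cursor failed: %v", err)
				continue
			}
			if cursor != current {
				current = cursor
				// Cursor changed: fetch the latest search-index-all here
				// and swap the in-memory packages list.
			}
		}
	}
}

// readObject downloads a whole object from the bucket as a string.
func readObject(ctx context.Context, bucket *storage.BucketHandle, name string) (string, error) {
	r, err := bucket.Object(name).NewReader(ctx)
	if err != nil {
		return "", err
	}
	defer r.Close()
	data, err := io.ReadAll(r)
	if err != nil {
		return "", err
	}
	return string(data), nil
}
```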

Follow-ups:

  • implement integration tests (main_test.go)
  • adjust ServeFile to stream file content (HTTP 200, HTTP 303, HTTP 503)

@mtojek mtojek self-assigned this May 19, 2022
@elasticmachine commented May 19, 2022

💚 Build Succeeded


Build stats

  • Start Time: 2022-06-06T18:19:20.677+0000

  • Duration: 7 min 35 sec

Test stats 🧪

  • Failed: 0
  • Passed: 167
  • Skipped: 0
  • Total: 167

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

@mtojek mtojek changed the title Implement basic storage indexer Implement basic Storage Indexer May 19, 2022
```go
}
logger.Info("Downloaded new search-index-all index", zap.String("index.packages.size", fmt.Sprintf("%d", len(anIndex.Packages))))

refreshedList, err := transformSearchIndexAllToPackages(*anIndex)
```
@mtojek (Contributor, Author) commented:
@jsoriano I was wondering if you could take a look in the middle of the implementation to challenge/comment on the idea.

The Storage Indexer uses a goroutine to regularly poll the cloud bucket for cursor changes. When the cursor changes, the indexer pulls a new index from the bucket and transforms it into packages.Packages.

I wanted to skip writing yet another model class and depend on json.RawMessage instead. I will try to apply the same approach to the Package Storage infra. With manifests present in the index, the indexer will create a ProxyPackageFileSystem and use packages.NewPackage to build the Package structure (a hackish method, but no need to rewrite the model).
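A minimal sketch of the json.RawMessage idea, with illustrative field names (not the PR's actual schema):

```go
package indexer

import "encoding/json"

// searchIndexAll keeps each package manifest as raw JSON instead of a
// fully typed struct, so no new model class is needed; the raw bytes can
// be parsed lazily when a package is actually built.
type searchIndexAll struct {
	Packages []struct {
		PackageManifest json.RawMessage `json:"package_manifest"`
	} `json:"packages"`
}

func loadIndex(data []byte) (*searchIndexAll, error) {
	var index searchIndexAll
	if err := json.Unmarshal(data, &index); err != nil {
		return nil, err
	}
	return &index, nil
}
```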

I'm not sure what to do about streaming other files. I plan for the ProxyPackageFileSystem to contain streaming logic so that it can pull content from the public bucket. So far, the hardest part is wiring the spaghetti through packages.NewPackage and ServeFile.

Let me know what you think and please share other ideas.

@jsoriano (Member) replied:

> The Storage Indexer uses a goroutine to regularly poll the cloud bucket for cursor changes. When the cursor changes, the indexer pulls a new index from the bucket and transforms it into packages.Packages.

Oh, after #813 (comment) I understood that we were going to use some kind of watch API 🙂 If we are going to be polling the cursor, I would think again about using authenticated HTTP instead of the storage-specific client.

> I wanted to skip writing yet another model class and depend on json.RawMessage instead. I will try to apply the same approach to the Package Storage infra. With manifests present in the index, the indexer will create a ProxyPackageFileSystem and use packages.NewPackage to build the Package structure (a hackish method, but no need to rewrite the model).

Umm, the current model we have for packages is quite similar to the model in the index; have you tried to unmarshal directly into it? Maybe it works. I think I would prefer that to faking a file system over the index JSON.
Even if it doesn't work as is now, we could still evaluate the changes needed to adapt the current model, or the index, so the index works with the current model in the registry.

If the model in the index cannot be aligned with the model here in the registry, I think I would still prefer to keep both models, unmarshal to the index one, and then convert it to the one in the registry. It may be cumbersome at first but easier to understand and maintain in the future.
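A sketch of this two-model option, assuming illustrative index-side and registry-side types (the PR's transformSearchIndexAllToPackages follows this shape):

```go
package indexer

// indexPkg is an illustrative index-side model, unmarshaled from
// search-index-all.
type indexPkg struct {
	Name    string `json:"name"`
	Version string `json:"version"`
}

// registryPkg stands in for the registry's own package model.
type registryPkg struct {
	Name    string
	Version string
}

// toRegistryPackages converts the index model into the registry model,
// keeping the two models decoupled.
func toRegistryPackages(in []indexPkg) []registryPkg {
	out := make([]registryPkg, 0, len(in))
	for _, p := range in {
		out = append(out, registryPkg{Name: p.Name, Version: p.Version})
	}
	return out
}
```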

> I'm not sure what to do about streaming other files. I plan for the ProxyPackageFileSystem to contain streaming logic so that it can pull content from the public bucket. So far, the hardest part is wiring the spaghetti through packages.NewPackage and ServeFile.

For files in the storage, ideally the registry could answer with a redirect to the resource in GCP, so it doesn't serve files directly. We would need to check, though, whether the HTTP client in current versions of Kibana follows these redirects.
We may need to add something to the current model to differentiate local from "external" packages; "external" packages wouldn't have an fs builder, but would have some way of translating file names to URLs and doing the redirect.

If the HTTP client in Kibana follows redirects, I would try this approach: use the current model, but adapt it for "external" packages to serve files via redirects instead of from fake filesystems. Serving files with redirects could be done in a separate PR.
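A minimal sketch of the redirect idea, assuming Kibana's HTTP client follows redirects; the bucket URL and handler signature are illustrative:

```go
package storage

import "net/http"

// publicBucketURL is an illustrative base URL for the public bucket.
const publicBucketURL = "https://storage.googleapis.com/example-public-bucket"

// serveExternalFile answers with 303 See Other instead of streaming the
// file, so "external" packages are served straight from the bucket.
func serveExternalFile(w http.ResponseWriter, r *http.Request, packagePath string) {
	http.Redirect(w, r, publicBucketURL+"/"+packagePath, http.StatusSeeOther)
}
```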

@mtojek (Contributor, Author) replied:

Let me share an update on this:

  1. I used JSON unmarshalers to load the index fetched from the storage bucket.
  2. I will focus on streaming package data in a follow-up; otherwise, the size of this PR would grow too much.
  3. Re: the watch API:

> Should you use Object change notification?
> Generally, you should not use Object change notification. The recommended tool for generating notifications that track changes to objects in your Cloud Storage buckets is Pub/Sub notifications for Cloud Storage, because it's faster, more flexible, easier to set up, and more cost-effective. For a step-by-step guide to setting up Pub/Sub notifications for Cloud Storage, see Configuring Pub/Sub notifications for Cloud Storage.

I understand that a Pub/Sub mechanism is recommended, but I'm not sure we want to use another cloud service (stronger coupling with GCP). Do you think we should rather use a basic HTTP client with OAuth? I admit that the GCP Go client is really convenient, and we might introduce bugs if we go with signed HTTP requests. Is this an attempt to make it universal/storage-independent?

Let me know what you think.

@jsoriano (Member) replied:

> I admit that the GCP Go client is really convenient, and we might introduce bugs if we go with signed HTTP requests. Is this an attempt to make it universal/storage-independent?

Using more generic HTTP would allow us to use more generic HTTP tooling and libraries, reducing dependencies and possibly simplifying development and testing. We could still use the library to initiate the HTTP client, keep its tokens updated, and so on, but the operations themselves could be generic.
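A sketch of that idea, assuming golang.org/x/oauth2/google for credentials; the bucket and object names are illustrative:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log"

	"golang.org/x/oauth2/google"
)

func main() {
	ctx := context.Background()
	// The OAuth library initiates the client and keeps tokens refreshed;
	// the requests themselves stay generic HTTP.
	client, err := google.DefaultClient(ctx, "https://www.googleapis.com/auth/devstorage.read_only")
	if err != nil {
		log.Fatal(err)
	}
	// Plain GET against the bucket's HTTP endpoint, no storage-specific API.
	resp, err := client.Get("https://storage.googleapis.com/example-bucket/cursor")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("cursor: %s\n", body)
}
```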

A consequence of this would effectively be that it is more storage-independent, yes. I wouldn't consider that a priority, or a motivation for this, but it would be a nice-to-have 🙂

But I agree that if this complicates things, and the GCP library doesn't expose an authenticated client we can reuse, the effort is probably not worth it.

Regarding pub/sub, we can leave this for a future optimization, if ever needed.

@mtojek (Contributor, Author) replied:

Thanks for the ideas!

storage/proxypackage.go (review thread, outdated, resolved)
@mtojek mtojek marked this pull request as ready for review June 6, 2022 16:04
@jsoriano (Member) left a comment:

Looks good, let's go on with this if the GCP client doesn't expose an authenticated generic HTTP client.

@mtojek mtojek merged commit 9785e9b into elastic:main Jun 7, 2022