
Implement basic Storage Indexer #814

Merged
mtojek merged 34 commits into elastic:main on Jun 7, 2022
Conversation

@mtojek (Contributor) commented May 19, 2022

Issue: #767

This PR implements a storage indexer that periodically polls Cloud Storage for cursor updates. When the cursor changes, the indexer fetches the latest search-index-all index and updates the in-memory data. The marshaled index weighs ~21 MB.
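Below is a minimal sketch of the polling mechanism described here, assuming the cursor is stored as a small object in the bucket; `watchCursor`, `readObject`, the object name, and the interval are illustrative, not the PR's actual code:

```go
package indexer

import (
	"context"
	"io"
	"log"
	"time"

	"cloud.google.com/go/storage"
)

// watchCursor periodically reads the "cursor" object and triggers a reload
// of the search-index-all index whenever the cursor value changes.
func watchCursor(ctx context.Context, bucket *storage.BucketHandle, interval time.Duration) {
	var current string
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			cursor, err := readObject(ctx, bucket, "cursor")
			if err != nil {
				log.Printf("reading cursor failed: %v", err)
				continue
			}
			if cursor != current {
				current = cursor
				// Cursor changed: fetch the latest search-index-all here
				// and swap the in-memory packages list.
			}
		}
	}
}

// readObject downloads a whole object from the bucket as a string.
func readObject(ctx context.Context, bucket *storage.BucketHandle, name string) (string, error) {
	r, err := bucket.Object(name).NewReader(ctx)
	if err != nil {
		return "", err
	}
	defer r.Close()
	data, err := io.ReadAll(r)
	if err != nil {
		return "", err
	}
	return string(data), nil
}
```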

Follow-ups:

  • implement integration tests (main_test.go)
  • adjust ServeFile to stream file content (HTTP 200, HTTP 303, HTTP 503)

@mtojek mtojek self-assigned this May 19, 2022
@elasticmachine commented May 19, 2022

💚 Build Succeeded


Build stats

  • Start Time: 2022-06-06T18:19:20.677+0000

  • Duration: 7 min 35 sec

Test stats 🧪

  • Failed: 0
  • Passed: 167
  • Skipped: 0
  • Total: 167

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

@mtojek mtojek changed the title Implement basic storage indexer Implement basic Storage Indexer May 19, 2022
```go
}
logger.Info("Downloaded new search-index-all index", zap.String("index.packages.size", fmt.Sprintf("%d", len(anIndex.Packages))))

refreshedList, err := transformSearchIndexAllToPackages(*anIndex)
```
@mtojek (Contributor, Author) commented:
@jsoriano I was wondering if you could take a look in the middle of the implementation to challenge/comment on the idea.

The Storage Indexer uses a goroutine to regularly poll the cloud bucket for cursor changes. When the cursor changes, the indexer pulls a new index from the bucket and transforms it into packages.Packages.

I wanted to skip writing yet another model class and depend on json.RawMessage instead. I will try to apply the same approach to the Package Storage infra. With manifests present in the index, the indexer will create a ProxyPackageFileSystem and use packages.NewPackage to build the Package structure (a hackish method, but no need to rewrite the model).
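A minimal sketch of the json.RawMessage idea, with illustrative field names (not the PR's actual schema):

```go
package indexer

import "encoding/json"

// searchIndexAll keeps each package manifest as raw JSON instead of a
// fully typed struct, so no new model class is needed; the raw bytes can
// be parsed lazily when a package is actually built.
type searchIndexAll struct {
	Packages []struct {
		PackageManifest json.RawMessage `json:"package_manifest"`
	} `json:"packages"`
}

func loadIndex(data []byte) (*searchIndexAll, error) {
	var index searchIndexAll
	if err := json.Unmarshal(data, &index); err != nil {
		return nil, err
	}
	return &index, nil
}
```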

I'm not sure what to do about streaming other files. I plan for the ProxyPackageFileSystem to contain streaming logic so that it can pull content from the public bucket. So far, the hardest part is wiring the spaghetti through packages.NewPackage and ServeFile.

Let me know what you think and please share other ideas.

@jsoriano (Member) replied:

> The Storage Indexer uses a goroutine to regularly poll the cloud bucket for cursor changes. When the cursor changes, the indexer pulls a new index from the bucket and transforms it into packages.Packages.

Oh, after #813 (comment) I understood that we were going to use some kind of watch API 🙂 If we are going to be polling the cursor, I would think again about using authenticated HTTP instead of the storage-specific client.

> I wanted to skip writing yet another model class and depend on json.RawMessage instead. I will try to apply the same approach to the Package Storage infra. With manifests present in the index, the indexer will create a ProxyPackageFileSystem and use packages.NewPackage to build the Package structure (a hackish method, but no need to rewrite the model).

Umm, the current model we have for packages is quite similar to the model in the index; have you tried to unmarshal directly into it? Maybe it works. I think I would prefer that to faking a file system over the index JSON.
Even if it doesn't work as is now, we could still evaluate the changes needed to adapt the current model, or the index, so the index works with the current model in the registry.

If the model in the index cannot be aligned with the model here in the registry, I think I would still prefer to keep both models, unmarshal to the index one, and then convert it to the one in the registry. It may be cumbersome at first but easier to understand and maintain in the future.
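A sketch of this two-model option, assuming illustrative index-side and registry-side types (the PR's transformSearchIndexAllToPackages follows this shape):

```go
package indexer

// indexPkg is an illustrative index-side model, unmarshaled from
// search-index-all.
type indexPkg struct {
	Name    string `json:"name"`
	Version string `json:"version"`
}

// registryPkg stands in for the registry's own package model.
type registryPkg struct {
	Name    string
	Version string
}

// toRegistryPackages converts the index model into the registry model,
// keeping the two models decoupled.
func toRegistryPackages(in []indexPkg) []registryPkg {
	out := make([]registryPkg, 0, len(in))
	for _, p := range in {
		out = append(out, registryPkg{Name: p.Name, Version: p.Version})
	}
	return out
}
```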

> I'm not sure what to do about streaming other files. I plan for the ProxyPackageFileSystem to contain streaming logic so that it can pull content from the public bucket. So far, the hardest part is wiring the spaghetti through packages.NewPackage and ServeFile.

For files in the storage, ideally the registry could answer with a redirect to the resource in GCP, so it doesn't serve files directly. We would need to check, though, whether the HTTP client in current versions of Kibana follows these redirects.
We may need to add something to the current model to differentiate local from "external" packages; "external" packages wouldn't have an fs builder, but would have some way of translating file names to URLs and doing the redirect.

If the HTTP client in Kibana follows redirects, I would try this approach: use the current model, but adapt it for "external" packages to serve files via redirects instead of from fake filesystems. Serving files with redirects could be done in a separate PR.
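A minimal sketch of the redirect idea, assuming Kibana's HTTP client follows redirects; the bucket URL and handler signature are illustrative:

```go
package storage

import "net/http"

// publicBucketURL is an illustrative base URL for the public bucket.
const publicBucketURL = "https://storage.googleapis.com/example-public-bucket"

// serveExternalFile answers with 303 See Other instead of streaming the
// file, so "external" packages are served straight from the bucket.
func serveExternalFile(w http.ResponseWriter, r *http.Request, packagePath string) {
	http.Redirect(w, r, publicBucketURL+"/"+packagePath, http.StatusSeeOther)
}
```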

@mtojek (Contributor, Author) replied:

Let me share an update on this:

  1. I used JSON unmarshalers to load the index fetched from the storage bucket.
  2. I will focus on streaming package data in a follow-up; otherwise, the size of this PR would grow too much.
  3. Re: the watch API:

> Should you use Object change notification?
> Generally, you should not use Object change notification. The recommended tool for generating notifications that track changes to objects in your Cloud Storage buckets is Pub/Sub notifications for Cloud Storage, because it's faster, more flexible, easier to set up, and more cost-effective. For a step-by-step guide to setting up Pub/Sub notifications for Cloud Storage, see Configuring Pub/Sub notifications for Cloud Storage.

I understand that a Pub/Sub mechanism is recommended, but I'm not sure we want to use another cloud service (stronger coupling with GCP). Do you think we should rather use a basic HTTP client with OAuth? I admit that the GCP Go client is really convenient, and we might introduce bugs if we go with signed HTTP requests. Is this an attempt to make it universal/storage-independent?

Let me know what you think.

@jsoriano (Member) replied:

> I admit that the GCP Go client is really convenient, and we might introduce bugs if we go with signed HTTP requests. Is this an attempt to make it universal/storage-independent?

Using more generic HTTP would allow us to use more generic HTTP tooling and libraries, reducing dependencies and possibly simplifying development and testing. We could still use the library to initiate the HTTP client, keep its tokens updated, and so on, but the operations themselves could be generic.
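A sketch of that idea, assuming golang.org/x/oauth2/google for credentials; the bucket and object names are illustrative:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log"

	"golang.org/x/oauth2/google"
)

func main() {
	ctx := context.Background()
	// The OAuth library initiates the client and keeps tokens refreshed;
	// the requests themselves stay generic HTTP.
	client, err := google.DefaultClient(ctx, "https://www.googleapis.com/auth/devstorage.read_only")
	if err != nil {
		log.Fatal(err)
	}
	// Plain GET against the bucket's HTTP endpoint, no storage-specific API.
	resp, err := client.Get("https://storage.googleapis.com/example-bucket/cursor")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("cursor: %s\n", body)
}
```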

A consequence of this would effectively be that it is more storage-independent, yes. I wouldn't consider that a priority, or a motivation for this, but it would be a nice-to-have 🙂

But I agree that if this complicates things, and the GCP library doesn't expose an authenticated client we can reuse, the effort is probably not worth it.

Regarding pub/sub, we can leave this for a future optimization, if ever needed.

@mtojek (Contributor, Author) replied:

Thanks for the ideas!

storage/proxypackage.go (review thread, outdated, resolved)
@mtojek mtojek marked this pull request as ready for review June 6, 2022 16:04
@jsoriano (Member) left a comment:

Looks good, let's go on with this if the GCP client doesn't expose an authenticated generic HTTP client.

@mtojek mtojek merged commit 9785e9b into elastic:main Jun 7, 2022