Query crawler #686

Draft
wants to merge 3 commits into
base: main

Conversation

kompotkot
Contributor

Changes

Fetch a query from the journal, validate it, execute it, and push the result to the bucket if required.
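
A minimal sketch of that flow, for orientation only. The `QueryJob` shape, the `validate_query` rule (a single read-only `SELECT`), and the `execute`/`push` callables are all hypothetical stand-ins; the PR's actual journal schema, validation rules, and storage calls are not shown in this conversation and may differ.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class QueryJob:
    # Hypothetical shape of a query entry; the real journal schema may differ.
    name: str
    sql: str
    push_to_bucket: bool = False


def validate_query(sql: str) -> str:
    """Illustrative check only: accept a single SELECT statement."""
    stripped = sql.strip().rstrip(";").strip()
    if not stripped.lower().startswith("select"):
        raise ValueError("only SELECT queries are allowed")
    if ";" in stripped:
        raise ValueError("multiple statements are not allowed")
    return stripped


def run_job(
    job: QueryJob,
    execute: Callable[[str], List[Any]],
    push: Callable[[str, List[Any]], None],
) -> List[Any]:
    """Validate, execute, and optionally push the rows for one job."""
    sql = validate_query(job.sql)
    rows = execute(sql)
    if job.push_to_bucket:
        # Assumption: pushing is opt-in per job ("if required").
        push(job.name, rows)
    return rows
```

Passing `execute` and `push` in as callables keeps the database and bucket out of the control flow, so the steps can be exercised locally without either.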

How to test these changes?

Tested locally

Related issues

@kompotkot kompotkot added the crawlers Crawlers module label Oct 25, 2022
@kompotkot kompotkot requested a review from a team October 25, 2022 09:57
@kompotkot
Contributor Author

kompotkot commented Oct 25, 2022

@bugout-dev check

  • Add env MOONSTREAM_S3_DATA_BUCKET
  • Add env MOONSTREAM_S3_DATA_BUCKET_PREFIX
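
For whoever wires these up, one way the two variables might be read at startup is sketched below. Only the variable names come from this PR; the helper `load_bucket_settings`, the choice to treat the bucket as required and the prefix as optional, and the slash trimming are assumptions.

```python
import os
from typing import Mapping, Tuple


def load_bucket_settings(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (bucket, prefix) from the environment (hypothetical helper)."""
    bucket = env.get("MOONSTREAM_S3_DATA_BUCKET", "")
    if not bucket:
        # Assumption: a missing bucket should fail fast rather than default.
        raise ValueError("MOONSTREAM_S3_DATA_BUCKET is not set")
    # Assumption: the prefix is optional; trim slashes so keys join cleanly.
    prefix = env.get("MOONSTREAM_S3_DATA_BUCKET_PREFIX", "").strip("/")
    return bucket, prefix
```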

Contributor

@zomglings zomglings left a comment

Reconsidered crawler architecture.

logger = logging.getLogger(__name__)


def parser_queries_execute_handler(args: argparse.Namespace) -> None:
Contributor

After discussing with @kompotkot and @Andrei-Dolgolev:

We realized that we want this crawler to go through the Query API like any other user would.

The plan is to remove this mooncrawl/queries_crawler code and update the Moonstream Python client to provide this functionality similarly to how we used it in autocorns biologist:
https://github.com/bugout-dev/autocorns/blob/cf00fb492de254821730a256d238d5a332810db6/autocorns/biologist.py#L371

We can remove the existing Moonstream Python client, bump the client version, and publish the new client.

@kompotkot kompotkot closed this Oct 25, 2022
@kompotkot kompotkot deleted the crawler-queries-cu branch October 25, 2022 11:55
@kompotkot
Contributor Author

We need to cherry-pick from this PR later.

@zomglings zomglings restored the crawler-queries-cu branch October 25, 2022 11:57
@zomglings
Contributor

Although we will not use queries_crawler to crawl public data, we will use it as a replacement for the existing Query API data producer (which is currently in mooncrawl/stats_worker/queries.py).

The queries_crawler should be a CLI which exposes the same functionality but in a more modular way (and should be invoked from systemd on prod).
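
A rough sketch of what a modular CLI could look like with `argparse` subcommands, so that a systemd unit can invoke one subcommand per job. The handler name `parser_queries_execute_handler` is taken from the diff under review; the `execute` subcommand, the `--name` and `--upload` flags, and the placeholder handler body are assumptions made for illustration.

```python
import argparse
from typing import List, Optional


def parser_queries_execute_handler(args: argparse.Namespace) -> None:
    # Placeholder body: the real handler would run the query and upload results.
    print(f"would execute query {args.name!r} (upload={args.upload})")


def generate_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="queries_crawler (sketch)")
    # With no subcommand given, fall back to printing help.
    parser.set_defaults(func=lambda _: parser.print_help())
    subparsers = parser.add_subparsers(description="Query commands")

    # Hypothetical subcommand and flags; each new command gets its own parser.
    execute = subparsers.add_parser("execute", help="Execute a stored query")
    execute.add_argument("--name", required=True, help="Query name")
    execute.add_argument("--upload", action="store_true", help="Push results to the bucket")
    execute.set_defaults(func=parser_queries_execute_handler)
    return parser


def main(argv: Optional[List[str]] = None) -> None:
    args = generate_parser().parse_args(argv)
    args.func(args)
```

Calling `main(["execute", "--name", "my_query", "--upload"])` dispatches to the handler; a systemd unit would pass the same arguments on its command line.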

We will revisit this after our current batch of urgent work.

@zomglings zomglings reopened this Oct 25, 2022
@kompotkot kompotkot marked this pull request as draft October 28, 2022 12:22
Labels
crawlers Crawlers module
2 participants