Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use timestamp in aggregate table #2

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

zachthompson
Copy link
Contributor

No longer need the crawl ID since we are aggregating directly. We usually need to decode it into a timestamp anyhow.

Copy link

@aaronharsh aaronharsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

Copy link

@aaronharsh aaronharsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zach - Sorry to retract my approval, but I just realized we're still using max_https_crawl_id in the internal repo. Is there a ddg.git PR to go along with this one?

Also, CONTRIBUTING.md in this repo still references max_https_crawl_id.

@zachthompson
Copy link
Contributor Author

We will alter table and restart the service(s). Shouldn't need an additional PR.

Updated CONTRIBUTING.

@aaronharsh
Copy link

aaronharsh commented Nov 21, 2019

My point was that some internal code still works with the removed field. If you drop https_crawl_aggregate.max_https_crawl_id, won't that code break? For example:

https://dub.duckduckgo.com/duckduckgo/ddg/blob/bttf/components/https/create-https-queue.pl#L75

@zachthompson
Copy link
Contributor Author

That specific script is on deck for an update beyond the ID. We can run it prior to merging the internal PR to give us a little leeway but, if necessary, the queue can be populated manually.

@aaronharsh
Copy link

It sounds like it's expected that this change makes the schema incompatible with the internal scripts like create-https-queue.pl. And you're trying to rush this change out to the public repo so that anyone else who forks the repo will get the updated scheme. Am I reading that right?

I get the motivation. It would be nice if the code was in a working state, at the very least because it makes it tougher for me to review. Would you like help cleaning it up? In particular, I'm thinking things like:

  • Get rid of the duplicate https_crawl.pl in components/https
  • Make all the scripts work with the new schema
  • Get rid of any unused scripts

@zachthompson
Copy link
Contributor Author

@aaronharsh Let me see if I can clear up the confusion.

  1. The code is in a working state. You can test this on a local Postgresql instance and it works fine.
  2. It doesn't affect anything internal yet. The integration PR has the submodule pointed at a previous commit with crawl ID still there.
  3. Once this is merged, I will open another PR internally to address any legacy references to crawl ID, bump the submodule commit, and we can verify it end-to-end internally.

Copy link

@Ishcode63 Ishcode63 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

Copy link

@tonkla10032533 tonkla10032533 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • [ ]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants