Add method ListURLs to list all URLs known in the frontier with their next fetch date #93

klockla · 2024-08-08T14:08:41Z

Added method ListURLs in API and client to list all URLs known in the frontier with their next fetch date
(RocksDB & MemoryFrontier implementations only)

jnioche · 2024-08-27T17:15:36Z

I see that this contains the changes from #92. It would be better to make it independent if possible (I appreciate that this has draft status).

klockla · 2024-09-04T14:03:02Z

I rebased and updated this PR; I think it is ready to be reviewed now.

jnioche · 2024-09-04T18:10:45Z

I had misread what this does: I thought it would return all the URLs within a queue, which would be useful for debugging.
This will stream the entire content of the frontier in pages of e.g. 100 URLs. What do you want to achieve with this? Backups? Debugging?
Should we consider an export endpoint instead? If so, what format would it use to represent all the data including the metadata? Should the service implementations be responsible for writing the data and communicating the location of the file to the client?

klockla · 2024-09-05T07:05:23Z

Its main purpose was debugging, but we will probably also use it in our UI to let the user browse the frontier. I didn't consider it an export/backup feature (which would mean we would also need an import feature).

jnioche · 2024-09-05T07:52:13Z

Thanks for the explanation @klockla. In a sense PutURLs is the import feature, but we could have one where a file is made available to the server locally so that it can ingest it as a batch. This would be quicker than streaming from the client; I think @michaeldinzinger did something similar with the OpenSearch backend he uses at OpenWebSearch.

Going back to our primary topic, would it be OK to list the URLs per queue only? From a client's perspective you can list the queues separately and get the URLs for each one. This would be equivalent to paging in a sense.

Doing so would make more sense to me as in most cases you'd want to debug per queue. What do you think?

klockla · 2024-09-05T09:58:21Z

What do you think of the following:

We keep the pagination and add the possibility to restrict to a given queue (if none specified, we will go over all of them).

Parameters would look something like:

message ListUrlParams {
    // position of the first result in the list; defaults to 0
    uint32 start = 1;
    // max number of values; defaults to 100
    uint32 size = 2;
    // ID for the queue
    string key = 3;
    // crawl ID
    string crawlID = 4;
    // only for the current local instance
    bool local = 5;
}
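
To make the intended semantics concrete, here is a minimal Java sketch of how a service might interpret these parameters. It is an illustration only: `ListUrlParamsSketch`, `DEFAULT_SIZE`, and `selectQueues` are hypothetical names, not the generated protobuf classes, and treating an unset `size` (0) as 100 and an empty `key` as "all queues" is an assumption drawn from the comments in the message above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the proposed ListUrlParams message (not generated code). */
final class ListUrlParamsSketch {
    static final int DEFAULT_SIZE = 100;

    final int start;
    final int size;
    final String key; // empty means "no queue restriction"

    ListUrlParamsSketch(int start, int size, String key) {
        this.start = Math.max(0, start);
        // Assumption: an unset scalar arrives as 0, so 0 is read as "use the default".
        this.size = size <= 0 ? DEFAULT_SIZE : size;
        this.key = key == null ? "" : key;
    }

    /** True when the queue should be visited for this request. */
    boolean matchesQueue(String queueId) {
        return key.isEmpty() || key.equals(queueId);
    }

    /** IDs of the queues this request would go over, in the map's iteration order. */
    List<String> selectQueues(Map<String, List<String>> queues) {
        List<String> selected = new ArrayList<>();
        for (String id : queues.keySet()) {
            if (matchesQueue(id)) {
                selected.add(id);
            }
        }
        return selected;
    }
}
```

With no key specified the request goes over every queue; with a key it is restricted to that single queue.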

jnioche · 2024-09-05T16:58:28Z

yes that would work I think

klockla · 2024-09-06T15:39:16Z

Updated the PR to include queue parameter

jnioche

Looks great but given how similar the code is for the memory and rocks implementations, should we have the logic in the abstract super class instead? This is what we do when we retrieve URLs. What do you think?

API/urlfrontier.proto

jnioche · 2024-09-07T14:35:31Z

client/src/main/java/crawlercommons/urlfrontier/client/ListURLs.java

+ fetchDate = String.valueOf(item.getKnown().getRefetchableFromDate());
+ }
+
+ outstream.println(item.getKnown().getInfo().getUrl() + ";" + fetchDate);


It would be good to be able to see any metadata associated with the URL.
IIRC it is possible to serialize to and deserialize from JSON pretty easily, see the PutURL clients. It would be good to have a way of choosing the output format, JSON or character-separated.
Maybe we could have a utility class to share the logic of reading from / writing to various formats? This could be done later.
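
As a rough sketch of the two output formats discussed here, the following dependency-free Java class prints one entry either as `url;date` (as in the diff above) or as JSON including the metadata. `UrlEntryFormatter`, its `Format` enum, and the JSON field names are hypothetical, and the JSON is hand-built for illustration only; this is not the client's actual implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

/** Hypothetical helper showing character-separated versus JSON output for one URL entry. */
final class UrlEntryFormatter {
    enum Format { CHAR_SEPARATED, JSON }

    static String format(
            String url, long refetchFrom, Map<String, List<String>> metadata, Format f) {
        if (f == Format.CHAR_SEPARATED) {
            // Same shape as the current client output: URL and next fetch date.
            return url + ";" + refetchFrom;
        }
        // Sort the keys so the output is stable across runs.
        StringJoiner md = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, List<String>> e : new TreeMap<>(metadata).entrySet()) {
            StringJoiner values = new StringJoiner(",", "[", "]");
            for (String v : e.getValue()) {
                values.add(quote(v));
            }
            md.add(quote(e.getKey()) + ":" + values);
        }
        return "{\"url\":" + quote(url)
                + ",\"refetchableFromDate\":" + refetchFrom
                + ",\"metadata\":" + md + "}";
    }

    /** Minimal JSON string escaping: quotes, backslashes, and control characters. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }
}
```

Only the JSON form has room for the metadata, which is one reason to offer it alongside the compact character-separated form.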

klockla · 2024-09-09

I see one problem though: whether it is in MemoryFrontier or in RocksDB, we lose the metadata once a URL is completed.

jnioche · 2024-09-09

Ah, maybe something to fix separately. I suppose we could still display the metadata where possible.

klockla · 2024-09-10

Refactored to have the (small) common logic in AbstractFrontierService.
Added the option to print the output in JSON format.

jnioche

That's great, thanks. Please see comments.

jnioche · 2024-09-10T19:12:21Z

client/src/main/java/crawlercommons/urlfrontier/client/ListURLs.java

+ names = {"-c", "--crawlID"},
+ defaultValue = "DEFAULT",
+ paramLabel = "STRING",
+ description = "crawl to get the queues for")


Could the description state what the default value is?

jnioche · 2024-09-10T19:22:15Z

service/src/main/java/crawlercommons/urlfrontier/service/AbstractFrontierService.java

+ * @return
+ */
+ public static URLItem buildURLItem(
+ URLItem.Builder builder, KnownURLItem.Builder kbuilder, URLInfo info, long refetch) {


this is a nice way of simplifying the code in the sub classes + calling clear is also a safe way of making sure no data is carried through from a previous entry
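
The point about `clear` can be illustrated with a small stand-in. This is a sketch under assumptions: `ItemBuilder` is a hypothetical plain-Java builder that mimics the `clear()` offered by generated protobuf builders, and `buildItem` is not the real `buildURLItem`.

```java
/** Sketch of reusing one builder across entries; all names here are hypothetical. */
final class BuilderReuseSketch {
    /** Mutable builder with an optional field, similar in spirit to a protobuf builder. */
    static final class ItemBuilder {
        private String url;
        private Long refetch; // optional: only set for some entries

        ItemBuilder clear() {
            url = null;
            refetch = null;
            return this;
        }

        ItemBuilder setUrl(String u) {
            url = u;
            return this;
        }

        ItemBuilder setRefetch(long r) {
            refetch = r;
            return this;
        }

        String build() {
            return url + ";" + (refetch == null ? "" : refetch);
        }
    }

    /** Shared helper: resets the builder first so nothing leaks from the previous entry. */
    static String buildItem(ItemBuilder builder, String url, long refetch) {
        builder.clear();
        builder.setUrl(url);
        if (refetch > 0) {
            builder.setRefetch(refetch);
        }
        return builder.build();
    }
}
```

Without the `clear()` call, an entry that never sets the optional field would silently inherit the value left over from the previous entry.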

jnioche · 2024-09-10T19:25:30Z

service/src/main/java/crawlercommons/urlfrontier/service/AbstractFrontierService.java

+ if (key != null && !key.isEmpty() && !e.getKey().getQueue().equals(key)) {
+ continue;
+ }
+ Iterator<URLItem> iter = urlIterator(e, start, maxURLs);


I see that maxURLs applies within a queue; I thought it applied overall.
That's fine, but let's make this more explicit in the .proto and the client, and perhaps add a maxNumberQueues param? For instance, I have 1M queues in my test and calling getURLs takes forever.
The same seems to apply to start: it is within a queue, which makes sense, but the doc should make that clear.

Alternatively, we go back to my original suggestion of returning the list only for a specific queue, in which case the pagination as it is would be fine.
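
The difference between the two readings of `start` and `maxURLs` can be sketched as follows. This is a simplified illustration over in-memory lists, not the service code; `PaginationSketch`, `perQueue`, and `global` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch contrasting per-queue and global pagination over a map of queues. */
final class PaginationSketch {
    /** start and size are applied independently inside every queue. */
    static List<String> perQueue(Map<String, List<String>> queues, int start, int size) {
        List<String> out = new ArrayList<>();
        for (List<String> urls : queues.values()) {
            int from = Math.min(start, urls.size());
            int to = (int) Math.min((long) from + size, urls.size());
            out.addAll(urls.subList(from, to));
        }
        return out;
    }

    /** start and size are applied once, over all queues taken in iteration order. */
    static List<String> global(Map<String, List<String>> queues, int start, int size) {
        List<String> out = new ArrayList<>();
        if (size <= 0) {
            return out;
        }
        long skipped = 0;
        for (List<String> urls : queues.values()) {
            for (String url : urls) {
                if (skipped < start) {
                    skipped++;
                    continue;
                }
                out.add(url);
                if (out.size() == size) {
                    return out; // page is full: stop without visiting the remaining queues
                }
            }
        }
        return out;
    }
}
```

With per-queue semantics the response grows with the number of queues, which is what makes a call expensive when there are very many of them; with global semantics the scan can stop as soon as one page is filled.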

klockla · 2024-09-16T15:28:22Z

I have rebased the PR following your merge and updated it so that pagination is global over all queues

jnioche

Looks good - see comment

service/src/main/java/crawlercommons/urlfrontier/service/AbstractFrontierService.java

jnioche

Could we reuse these iterators in other parts of the code?
(These are my last comments on this issue - I promise)

jnioche · 2024-09-17T19:29:14Z

service/src/main/java/crawlercommons/urlfrontier/service/AbstractFrontierService.java

+ responseObserver.onCompleted();
+ }
+
+ protected abstract Iterator<URLItem> urlIterator(


This is only used once, I think. Can't we simply have the one below, called with 0 and Integer.MAX_VALUE as values?

Yes, I removed the redundant iterator constructors
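
For illustration, the simplification could look roughly like the sketch below: a single abstract iterator method taking `start` and `maxURLs`, with "everything" expressed as `0` and `Integer.MAX_VALUE` at the call site. `QueueLister`, `InMemoryLister`, and `allUrls` are hypothetical names, and queues are modelled as plain lists of strings rather than the real queue types.

```java
import java.util.Iterator;
import java.util.List;

/** Sketch: one abstract iterator method instead of several overloads. */
abstract class QueueLister {
    /** The only method subclasses have to implement. */
    protected abstract Iterator<String> urlIterator(List<String> queue, long start, long maxURLs);

    /** "Iterate over everything" is just a call with 0 and Integer.MAX_VALUE. */
    Iterator<String> allUrls(List<String> queue) {
        return urlIterator(queue, 0, Integer.MAX_VALUE);
    }
}

/** Hypothetical in-memory implementation backed by a list. */
final class InMemoryLister extends QueueLister {
    @Override
    protected Iterator<String> urlIterator(List<String> queue, long start, long maxURLs) {
        return queue.stream().skip(start).limit(maxURLs).iterator();
    }
}
```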

klockla · 2024-09-18T13:23:11Z

I didn't see any direct opportunity for reuse; maybe in the future...

jnioche

Thanks @klockla

klockla force-pushed the listurl_github branch 3 times, most recently from 50ece65 to b314367 Compare September 4, 2024 13:59

klockla marked this pull request as ready for review September 4, 2024 14:03

jnioche requested changes Sep 7, 2024

jnioche requested changes Sep 10, 2024

jnioche reviewed Sep 10, 2024

klockla added 4 commits September 16, 2024 12:03

Added method ListURLs

6877951

Signed-off-by: Laurent Klock <Laurent.Klock@arhs-cube.com>

Updated ListURLs to handle queue key parameter

dcb5753

Signed-off-by: Laurent Klock <Laurent.Klock@arhs-cube.com>

Refactored to handle common logic in AbstractFrontierService

eb3d236

Added JSON output option for ListURLs Signed-off-by: Laurent Klock <Laurent.Klock@arhs-cube.com>

Update to handle start & maxURLs parameters globally and not per queue

095433d

Signed-off-by: Laurent Klock <Laurent.Klock@arhs-cube.com>

klockla force-pushed the listurl_github branch from 4c33643 to 095433d Compare September 16, 2024 14:51

jnioche requested changes Sep 16, 2024

Fixed breaking out of main loop in listURLs

92466f6

Signed-off-by: Laurent Klock <Laurent.Klock@arhs-cube.com>

jnioche added this to the 2.3 milestone Sep 17, 2024

jnioche added enhancement New feature or request API Client labels Sep 17, 2024

jnioche reviewed Sep 17, 2024

Removed redundant iterator constructors

ea9090e

Signed-off-by: Laurent Klock <Laurent.Klock@arhs-cube.com>

jnioche approved these changes Sep 18, 2024

jnioche merged commit 247b201 into crawler-commons:master Sep 18, 2024
2 checks passed

klockla deleted the listurl_github branch September 18, 2024 16:34

klockla restored the listurl_github branch September 18, 2024 16:34

klockla deleted the listurl_github branch September 18, 2024 16:35

jnioche modified the milestones: 2.3, 2.4.0 Sep 20, 2024
