Scroll search with long scroll_id #1044

Closed
@c-leroy

The reference documentation at https://www.elastic.co/guide/en/elasticsearch/client/php-api/current/search_operations.html tells us to pass scroll_id as a top-level parameter of scroll(). This causes an Elasticsearch error when the scroll_id is too large: in that case it is sent in the request URI, which then exceeds Elasticsearch's limits.
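To make the failure mode concrete, here is a rough sketch that estimates how long the HTTP request line becomes when the scroll_id travels in the URI. The helper names are hypothetical, the URI shape `/_search/scroll/{scroll_id}` is what the 7.x Scroll endpoint appears to build, and the 4096-byte threshold assumes the default of Elasticsearch's `http.max_initial_line_length` setting (4kb); a cluster may be configured differently.

```php
<?php
// Assumption: http.max_initial_line_length is left at its default of 4kb.
const ASSUMED_MAX_INITIAL_LINE_BYTES = 4096;

// Hypothetical helper: estimates the length of the HTTP request line
// if the scroll_id were placed in the URI path.
function scrollRequestLineLength(string $scrollId, string $keepAlive = '30s'): int
{
    $uri = '/_search/scroll/' . rawurlencode($scrollId)
         . '?scroll=' . rawurlencode($keepAlive);

    return strlen('GET ' . $uri . ' HTTP/1.1');
}

// Hypothetical helper: true when the scroll_id would still fit in the URI
// under the assumed limit, false when it should be sent in the body instead.
function scrollIdFitsInUri(string $scrollId): bool
{
    return scrollRequestLineLength($scrollId) <= ASSUMED_MAX_INITIAL_LINE_BYTES;
}
```

Scroll IDs tend to grow with the number of shards involved in the search, so an index pattern spanning many shards is where the limit is most likely to be hit.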

The example from https://www.elastic.co/guide/en/elasticsearch/client/php-api/current/search_operations.html is:

$client = ClientBuilder::create()->build();
$params = [
    'scroll' => '30s',          // how long between scroll requests. should be small!
    'size'   => 50,             // how many results *per shard* you want back
    'index'  => 'my_index',
    'body'   => [
        'query' => [
            'match_all' => new \stdClass()
        ]
    ]
];

// Execute the search
// The response will contain the first batch of documents
// and a scroll_id
$response = $client->search($params);

// Now we loop until the scroll "cursors" are exhausted
while (isset($response['hits']['hits']) && count($response['hits']['hits']) > 0) {

    // **
    // Do your work here, on the $response['hits']['hits'] array
    // **

    // When done, get the new scroll_id
    // You must always refresh your _scroll_id!  It can change sometimes
    $scroll_id = $response['_scroll_id'];

    // Execute a Scroll request and repeat
    $response = $client->scroll([
            'scroll_id' => $scroll_id,  //...using our previously obtained _scroll_id
            'scroll'    => '30s'        // and the same timeout window
        ]
    );
}

After reading src/Elasticsearch/Endpoints/Scroll.php, we can see that:

  1. there is a warning about using the scroll_id main parameter, which we never saw (we are using Kubernetes, so maybe it is misconfigured?)
  2. after reading a little further, we guessed that we could use the "body" parameter instead, which results in a sane URI, the POST method, and a body containing the needed scroll_id. It worked.

The correct form of scroll() with the current version, which avoids the URI overflow:

      $response = $client->scroll([
        'body' => [
          'scroll_id' => $scroll_id,
          'scroll' => '30s'
        ]
      ]);

The same goes for deletion. The only way to delete a large scroll_id is:

      $client->clearScroll([
        'body' => [
          'scroll_id' => $scroll_id
        ]
      ]);
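Putting the two calls together, a full scroll loop that never puts the scroll_id in the URI could look roughly like this. `scrollAllHits` is a hypothetical helper, not part of elasticsearch-php; it only assumes a client object whose search(), scroll() and clearScroll() methods accept and return the array shapes shown above, as we observed with 7.4.1.

```php
<?php
// Hypothetical helper: yields every hit of a scrolled search, always passing
// the scroll_id in the request body rather than in the URI.
function scrollAllHits($client, array $searchParams, string $keepAlive = '30s'): \Generator
{
    $searchParams['scroll'] = $keepAlive;
    $response = $client->search($searchParams);
    $scrollId = $response['_scroll_id'] ?? null;

    try {
        while (!empty($response['hits']['hits'])) {
            foreach ($response['hits']['hits'] as $hit) {
                yield $hit;
            }

            // Body form: the request URI stays short whatever the ID length.
            $response = $client->scroll([
                'body' => [
                    'scroll_id' => $scrollId,
                    'scroll'    => $keepAlive,
                ],
            ]);

            // Refresh the scroll_id after every round trip; it can change.
            $scrollId = $response['_scroll_id'] ?? $scrollId;
        }
    } finally {
        // Release the scroll context, again through the body.
        if ($scrollId !== null) {
            $client->clearScroll(['body' => ['scroll_id' => $scrollId]]);
        }
    }
}
```

Usage would be along the lines of `foreach (scrollAllHits($client, ['index' => 'my_index', 'size' => 50, 'body' => [...]]) as $hit) { ... }`. The `finally` block means the scroll context is cleared even if the caller stops iterating early or the work on a hit throws.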

We are using elasticsearch-php 7.4.1.

Regards
