Updated ES search functions and authentication #22

Open
wants to merge 1 commit into
base: main
from

vitaliyok

Hi,

Here are the proposed changes to the ES search and authentication options:

  1. Renamed and refactored the original search function
  2. Added a function that lets users reuse the scroll ID if the search fails
  3. Added a function that uses sorting options, as recommended by ES
  4. Added support for the API key object (with encoded API key) as generated by ES
  5. Added more options for users to view the index mapping and available indices in the search template
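
For readers unfamiliar with item 3, here is a minimal sketch of sort-based pagination, assuming the item refers to Elasticsearch's `search_after` mechanism (the description above does not name it). `iter_hits`, `SearchPage` and `fake_search_page` are hypothetical names for illustration, not functions from this PR; the fake page function stands in for a real search call so the sketch runs without a cluster.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

# Hypothetical callable: receives the previous page's last sort values
# (None for the first page) and returns a raw ES search response dict.
SearchPage = Callable[[Optional[List[Any]]], Dict[str, Any]]


def iter_hits(search_page: SearchPage) -> Iterator[Dict[str, Any]]:
    """Yield hits page by page, resuming from the last hit's sort values."""
    search_after: Optional[List[Any]] = None
    while True:
        response = search_page(search_after)
        hits = response["hits"]["hits"]
        if not hits:
            return  # an empty page means there is nothing left to fetch
        yield from hits
        # When the request has a sort clause, each hit carries a "sort" array.
        search_after = hits[-1]["sort"]


def fake_search_page(search_after: Optional[List[Any]]) -> Dict[str, Any]:
    """In-memory stand-in for a search call: five documents, two per page."""
    docs = [{"_id": str(i), "sort": [i]} for i in range(5)]
    start = 0 if search_after is None else search_after[0] + 1
    return {"hits": {"hits": docs[start:start + 2]}}


ids = [hit["_id"] for hit in iter_hits(fake_search_page)]
print(ids)
```

Because the caller keeps the last sort values, a failed request can be retried from the same position instead of restarting the whole search.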

Collaborator

@mart-r mart-r left a comment

There are a few minor issues I see, mostly to do with removed methods that are still used in other parts of the project.

Though the gist of it seems to be fine. I haven't tested it out, but I'm sure it'll work if you've been using it at GSTT.


def get_docs_generator(self, index: List, query: Dict, es_gen_size: int=800, request_timeout: Optional[int] = 300):
Collaborator

Author

The method has been removed but can be brought back.
All of the new methods return a Pandas DataFrame, whereas this function currently returns raw JSON, which is converted to a list of tuples. It also uses the ES _source object. In the new functions, I have excluded the _source object from the search results and return only "fields", as recommended by Elastic. The problem is that all fields are arrays, so the values need to be joined for the resulting DataFrame.
I think it would be possible to change the implementation to use the new methods and create tuples from the DataFrame instead of bringing the old function back.
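
To illustrate the array issue described above, here is a rough sketch of flattening the "fields" arrays into plain rows and then into tuples. `hit_to_row` and `rows_to_tuples` are hypothetical helpers, not code from this PR, and the sample hit is made up. The rows are plain dicts, so they could be passed to `pd.DataFrame(rows)` if a DataFrame is wanted.

```python
from typing import Any, Dict, List, Sequence, Tuple


def hit_to_row(hit: Dict[str, Any], sep: str = ", ") -> Dict[str, Any]:
    """Flatten one search hit: unwrap single values, join multi-valued fields."""
    row: Dict[str, Any] = {"_id": hit.get("_id")}
    for name, values in hit.get("fields", {}).items():
        # Every entry under "fields" is a list, even for single-valued fields.
        row[name] = values[0] if len(values) == 1 else sep.join(map(str, values))
    return row


def rows_to_tuples(rows: List[Dict[str, Any]], columns: Sequence[str]) -> List[Tuple]:
    """Produce tuples in a fixed column order, e.g. for older tuple-based callers."""
    return [tuple(row.get(col) for col in columns) for row in rows]


# Made-up hit, shaped like a response that requests "fields" without _source.
hits = [{"_id": "1", "fields": {"patient_id": [42], "tags": ["a", "b"]}}]
rows = [hit_to_row(hit) for hit in hits]
tuples = rows_to_tuples(rows, ["_id", "tags"])
print(rows)
print(tuples)
```

Joining multi-valued fields into one string is only one option; keeping them as lists may suit some callers better.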

Collaborator

Yes, I think that's the right approach here.

I'm not saying that the methods I've tagged need to be reimplemented. All I'm trying to do is make sure that the code that uses them (i.e. in the other notebooks and/or scripts) gets updated alongside the changes to cogstack.py. That is, if someone uses the scripts we provide after this change, they shouldn't error out because they are out of sync with the loaded module(s).

df = pd.DataFrame(temp_results)
return df

def DataFrame(self, index: str, columns: Optional[List[str]] = None):
Collaborator

Author

This uses the eland DataFrame, which offers a Pandas-compatible API, and it can be re-implemented without eland.

The username to use when connecting to Elasticsearch. If not provided, the user will be prompted to enter a username.
password : str, optional
The password to use when connecting to Elasticsearch. If not provided, the user will be prompted to enter a password.
apiKey : Dict, optional
Collaborator

Generally, we want snake_case names for variables, so api_key would make more sense.

Author

Ok. I have changed this.
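
As a rough illustration of how a snake_case `api_key` dict could be handled, the sketch below resolves the key object generated by ES into a single encoded string. It assumes the object has the `id`, `api_key` and `encoded` keys returned by the Elasticsearch create API key endpoint; `encoded_api_key` is a hypothetical helper and the sample values are made up, so this is not necessarily how the PR implements it.

```python
import base64
from typing import Dict


def encoded_api_key(api_key: Dict[str, str]) -> str:
    """Return the base64 credential for an ES-generated API key object."""
    if "encoded" in api_key:
        # Newer ES versions already include the ready-to-use value.
        return api_key["encoded"]
    # Otherwise build it: base64 of "<id>:<api_key>" encoded as UTF-8.
    raw = f"{api_key['id']}:{api_key['api_key']}"
    return base64.b64encode(raw.encode("utf-8")).decode("ascii")


# Made-up credentials for illustration only.
credential = encoded_api_key({"id": "abc", "api_key": "xyz"})
print(credential)
```

To my knowledge, the resulting string can be passed as the `api_key` argument of `elasticsearch.Elasticsearch`, but that should be checked against the client version in use.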

api_key=api_key,
verify_certs=False,
timeout=timeout)
apiKey: Dict = None):
Collaborator

Generally, we want snake_case names for variables, so api_key would make more sense.


if api_key and api:
self.elastic = elasticsearch.Elasticsearch(hosts=hosts,
api_key=api_key,
Collaborator

Author

This can be changed to match the new parameters. I can implement this too.

@vitaliyok
Author

It looks like the current implementation is still using the old CogStack text field: "body_analysed". It should probably be renamed to "document_Content" or not use any specific field names in the code here.

@mart-r
Collaborator

mart-r commented Jul 15, 2025

> It looks like the current implementation is still using the old CogStack text field: "body_analysed". It should probably be renamed to "document_Content" or not use any specific field names in the code here.

The expectation is generally that the user provides the correct fields they're interested in. I'm pretty sure body_analysed just serves as an example.

That said, if there's a more relevant, up-to-date example, we'd indeed be better off using that.
