Updated ES search functions and authentication #22
base: main
There are a few minor issues I see.
Mostly to do with the removed methods having been used in other parts of the project.
Though the gist of it seems to be fine. I haven't tested it out, but I'm sure it'll work if you've been using it at GSTT.
def get_docs_generator(self, index: List, query: Dict, es_gen_size: int=800, request_timeout: Optional[int] = 300):
There's a bit that's using this:
The method has been removed but can be brought back.
All of the new methods return a pandas DataFrame, but this function currently returns raw JSON, which is converted to a list of tuples. It also uses the ES `_source` object. In the new functions, I have excluded the `_source` object from search results and return only "fields", as recommended by Elastic. The problem is that all field values are arrays, so the values need to be joined for the resulting DataFrame.
I think it would be possible to change the implementation to use the new methods and create tuples from the DataFrame instead of bringing the old function back.
Yes, I think that's the right approach here.
I'm not saying that the methods I've tagged need to be reimplemented. All I'm trying to do is make sure that the code that uses them (i.e. in the other notebooks and/or scripts) gets updated alongside the changes to `cogstack.py`. That is, if someone uses the scripts we provide after this change, they shouldn't error out because they are out of sync with the loaded module(s).
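For the callers that still expect tuples, the approach discussed above could look roughly like the sketch below. It assumes hits shaped like an Elasticsearch search response in which each hit carries a `fields` object whose values are arrays. The helper names `flatten_fields` and `records_to_tuples` and the sample data are hypothetical and not part of `cogstack.py`; the flattened records could just as well be passed to `pd.DataFrame` before the tuples are created.

```python
from typing import Any, Dict, Iterable, List, Tuple


def flatten_fields(hit: Dict[str, Any], sep: str = ", ") -> Dict[str, Any]:
    """Collapse the array-valued `fields` of one hit into scalar values."""
    flat: Dict[str, Any] = {"_id": hit.get("_id")}
    for name, values in hit.get("fields", {}).items():
        if len(values) == 1:
            # Single-element arrays keep their original type.
            flat[name] = values[0]
        else:
            # Multi-valued (or empty) arrays are joined into one string.
            flat[name] = sep.join(str(v) for v in values)
    return flat


def records_to_tuples(records: Iterable[Dict[str, Any]],
                      columns: List[str]) -> List[Tuple[Any, ...]]:
    """Build one tuple per record, in the given column order (None if missing)."""
    return [tuple(rec.get(col) for col in columns) for rec in records]


# Illustrative hits only; real hits come from the search response.
hits = [
    {"_id": "1", "fields": {"document_Content": ["first note"], "tags": ["a", "b"]}},
    {"_id": "2", "fields": {"document_Content": ["second note"]}},
]
records = [flatten_fields(hit) for hit in hits]
rows = records_to_tuples(records, ["_id", "document_Content"])
```

Scripts that iterate over `(id, text)` pairs would then only need to switch to `rows`, without the old method being brought back.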
df = pd.DataFrame(temp_results)
return df

def DataFrame(self, index: str, columns: Optional[List[str]] = None):
There are a few bits that are using this:
https://github.com/CogStack/working_with_cogstack/blob/main/medcat/3_run_model/run_model.py#L32
https://github.com/CogStack/working_with_cogstack/blob/main/medcat/2_train_model/1_unsupervised_training/unsupervised_medcattraining.py#L28
This uses the eland DataFrame, which mirrors the pandas DataFrame API, so it can be re-implemented without eland.
The username to use when connecting to Elasticsearch. If not provided, the user will be prompted to enter a username.
password : str, optional
The password to use when connecting to Elasticsearch. If not provided, the user will be prompted to enter a password.
apiKey : Dict, optional
Generally, we want `snake_case` names for variables. So `api_key` would make more sense.
Ok. I have changed this.
api_key=api_key,
verify_certs=False,
timeout=timeout)
apiKey: Dict = None):
Generally, we want `snake_case` names for variables. So `api_key` would make more sense.
if api_key and api:
self.elastic = elasticsearch.Elasticsearch(hosts=hosts,
api_key=api_key,
This can be changed to match the new parameters. I can implement this too.
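To make the `api_key` branch in the diff above concrete, here is a hedged sketch of how the connection arguments might be assembled with snake_case parameters. The helper `build_es_kwargs` is hypothetical, and the accepted dictionary shapes (`encoded`, or `id` plus `api_key`) are an assumption about how the `Dict` is meant to look. `basic_auth` is the keyword used by the 8.x Python client (7.x uses `http_auth`), and the interactive prompt for missing credentials is left out for brevity.

```python
from typing import Any, Dict, List, Optional


def build_es_kwargs(hosts: List[str],
                    username: Optional[str] = None,
                    password: Optional[str] = None,
                    api_key: Optional[Dict[str, str]] = None,
                    timeout: int = 60) -> Dict[str, Any]:
    """Assemble keyword arguments for elasticsearch.Elasticsearch(**kwargs)."""
    # verify_certs=False mirrors the diff under review.
    kwargs: Dict[str, Any] = {"hosts": hosts, "verify_certs": False, "timeout": timeout}
    if api_key:
        if "encoded" in api_key:
            # Assumed shape: {"encoded": "<base64 key>"}
            kwargs["api_key"] = api_key["encoded"]
        elif "id" in api_key and "api_key" in api_key:
            # Assumed shape: {"id": "...", "api_key": "..."}
            kwargs["api_key"] = (api_key["id"], api_key["api_key"])
        else:
            raise ValueError("api_key needs 'encoded' or both 'id' and 'api_key'")
    elif username and password:
        kwargs["basic_auth"] = (username, password)
    else:
        raise ValueError("Provide either api_key or username and password")
    return kwargs
```

The result would be passed on as `elasticsearch.Elasticsearch(**build_es_kwargs(...))`, which keeps the choice between the two authentication modes in one place.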
It looks like the current implementation is still using the old CogStack text field: "body_analysed". It should probably be renamed to "document_Content", or the code here should not use any specific field names.
The expectation is generally that the user provides the correct fields they're interested in. I'm pretty sure With that said, if there's a more relevant, up to date example, we'd be better off using that indeed.
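One way to avoid hard-coded field names is to have the caller pass them in explicitly. The sketch below is only an illustration under that assumption: `build_search_body` is a hypothetical helper, and the query shown is sample data. It relies on the standard search request options `fields` and `_source`.

```python
from typing import Any, Dict, List


def build_search_body(query: Dict[str, Any], fields: List[str]) -> Dict[str, Any]:
    """Build a search request body that returns only the requested fields."""
    if not fields:
        raise ValueError("At least one field name must be provided")
    return {
        "query": query,
        "fields": list(fields),  # caller decides which fields come back
        "_source": False,        # skip the raw _source object
    }


# Illustrative call; the field name is supplied by the user, not by the library.
body = build_search_body({"match": {"document_Content": "example"}},
                         ["document_Content"])
```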
Hi,
Here are the proposed changes to the ES search and authentication options: