Add retrieval evaluation pipeline #2
Conversation
)
tokens_user_prompt = get_total_tokens_from_string(user_prompt_formatted)

if tokens_system_prompt + tokens_user_prompt < 10000:
this 10k could be a constant defined at the top of the file
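The suggested fix can be sketched as follows. `MAX_PROMPT_TOKENS` and `fits_in_budget` are hypothetical names for illustration, not identifiers from the PR:

```python
# Hoist the magic number into a module-level constant so the budget is
# documented and changed in one place. MAX_PROMPT_TOKENS is a hypothetical name.
MAX_PROMPT_TOKENS = 10_000


def fits_in_budget(tokens_system_prompt: int, tokens_user_prompt: int) -> bool:
    """Return True when the combined prompt stays under the token budget."""
    return tokens_system_prompt + tokens_user_prompt < MAX_PROMPT_TOKENS


print(fits_in_budget(4000, 5000))  # True
```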
Fixed
fhir_file = '../data/Abraham100_Oberbrunner298_9dbb826d-0be6-e8f9-3254-dbac25d83be6.json'
flat_file_path = '../data/flat_files'

output_json_file = '../data/FHIR_chunks.json'
output_report_file = '../data/FHIR_report.txt'
output_openai_batch_requests_file = '../data/openai_requests.jsonl'
All these could be parameters parsed using argparse. You can set these values as defaults
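A minimal sketch of that suggestion, keeping the hard-coded paths from the diff as the argparse defaults (the option names are an assumption, not from the PR):

```python
import argparse


def parse_args(argv=None):
    """Turn the hard-coded paths into CLI options with the old values as defaults."""
    parser = argparse.ArgumentParser(description="Chunk a FHIR bundle")
    parser.add_argument(
        "--fhir-file",
        default="../data/Abraham100_Oberbrunner298_9dbb826d-0be6-e8f9-3254-dbac25d83be6.json")
    parser.add_argument("--flat-file-path", default="../data/flat_files")
    parser.add_argument("--output-json-file", default="../data/FHIR_chunks.json")
    parser.add_argument("--output-report-file", default="../data/FHIR_report.txt")
    parser.add_argument("--output-openai-batch-requests-file",
                        default="../data/openai_requests.jsonl")
    return parser.parse_args(argv)


args = parse_args([])          # no CLI args: defaults apply
print(args.flat_file_path)     # ../data/flat_files
```

Running the script with no flags behaves exactly as before, while any path can now be overridden per run.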
Total openai queries: 1915
Total approximate costs: 0.23
Total approximate input costs: 0.054
Total approximate output costs: 0.172
You could update your estimate of output tokens to reflect the current costs
metrics["Total contexts"] = total_contexts

for series_name, value in metrics.items():
    task.get_logger().report_scalar(
You can also use logger.report_single_value(metric, value) and results will be shown in a table of metrics, instead of a chart by iteration, as you have no iterations here
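The difference in call shape can be sketched with a stand-in logger (a `FakeLogger` stub is used here so the snippet runs without a ClearML server; with a real task you would call `task.get_logger().report_single_value(...)`):

```python
# report_single_value(name, value) produces one row per metric in a summary
# table, whereas report_scalar(title, series, value, iteration) plots a point
# on an iteration axis -- unhelpful when there is only one "iteration".
class FakeLogger:
    """Stand-in for clearml.Logger, used only to illustrate the call shape."""
    def __init__(self):
        self.single_values = {}

    def report_single_value(self, name, value):
        self.single_values[name] = value


metrics = {"Total contexts": 42, "Hit rate": 0.87}  # illustrative values
logger = FakeLogger()
for name, value in metrics.items():
    logger.report_single_value(name, value)  # one table row per metric

print(logger.single_values)
```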
@@ -0,0 +1,109 @@
from dotenv import load_dotenv
What is the purpose of this file?
The purpose is to test the speed of the generation process with different context lengths
app/services/search_documents.py
Outdated
def search_query(query_text, embedding_model,
                 es_client, index_name=settings.index_name,
-                k=3, threshold=0.2):
+                k=5, threshold=0.2,
+                text_boost=1.0, embedding_boost=1.0):
These should be updated to the tuned values
with open(output_file, "w") as out:
    json.dump(documents, out, indent=2)

measure_tokens_lenghts("../data/tokens_lenghts.png", tokens_list)
This path should be a config, constant or parameter
chunk_size=500,
chunk_overlap=50,
These are relevant parameters which should also be stored in the experiment tracker, which also means that you should pass them in as parameters
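One way to do both at once is to gather the knobs in a dict, pass them into the chunking call, and register the same dict with the tracker. The `chunk_text` helper below is illustrative, not the PR's implementation; with a real ClearML task the dict would be logged via `task.connect(chunk_params)`:

```python
# Chunking knobs collected in one place; defaults taken from the diff above.
chunk_params = {"chunk_size": 500, "chunk_overlap": 50}


def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into overlapping character chunks (illustrative only)."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# With a real ClearML task: task.connect(chunk_params) records them per run,
# so experiments with different chunk sizes stay comparable.
chunks = chunk_text("a" * 1200, **chunk_params)
print(len(chunks))  # 3
```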
@@ -0,0 +1,140 @@
import base64
If this logic is taken from another open-source project, you should add a comment here saying where you got the code
fhir_file = '../data/Abraham100_Oberbrunner298_9dbb826d-0be6-e8f9-3254-dbac25d83be6.json'
flat_file_path = '../data/flat_files'

output_json_file = '../data/FHIR_chunks.json'
output_report_file = '../data/FHIR_report.txt'
output_openai_batch_requests_file = '../data/openai_requests.jsonl'
Would be good to have all of these as argparse parameters
return len(encoding.encode(string))


def measure_tokens_lenghts(file_path, tokens_lengths):
this function seems to be duplicated several times in different files. You should prefer to implement it once and reuse it
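The usual shape of that refactor is a small shared module, say a hypothetical `scripts/token_utils.py`, which every script imports instead of re-defining the helper. The tokenizer below defaults to whitespace splitting purely so the sketch is self-contained; the real scripts encode with their tiktoken encoding:

```python
# token_utils.py -- hypothetical shared module. Each script replaces its local
# copy with: from token_utils import get_total_tokens_from_string
def get_total_tokens_from_string(string, encode=str.split):
    """Count tokens in a string.

    `encode` defaults to whitespace splitting here only for illustration;
    pass the tiktoken encoding's .encode in the real scripts.
    """
    return len(encode(string))


print(get_total_tokens_from_string("three word prompt"))  # 3
```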
return resource


def remove_urls_from_fhir(data):
This also looks like a function which could be reused in multiple strategies
Retrieval evaluation pipeline in scripts folder