Add retrieval evaluation pipeline #2

Merged: 15 commits into main from llamaindex on Aug 30, 2024

Conversation

dgbaenar (Contributor):

Adds a retrieval evaluation pipeline in the scripts folder.

app/db/index_documents.py (outdated, resolved)
llm/llama.py (outdated, resolved)
llm/llama.py (outdated, resolved)
llm/main.py (outdated, resolved)
)
tokens_user_prompt = get_total_tokens_from_string(user_prompt_formatted)

if tokens_system_prompt + tokens_user_prompt < 10000:

Contributor:

This 10k threshold could be a constant defined at the top of the file
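
For instance, a minimal sketch (MAX_PROMPT_TOKENS is a hypothetical name, not from this PR):

MAX_PROMPT_TOKENS = 10_000  # hypothetical constant, defined at the top of the file

if tokens_system_prompt + tokens_user_prompt < MAX_PROMPT_TOKENS:
    ...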

Contributor Author:

Fixed

Comment on lines 8 to 13
fhir_file = '../data/Abraham100_Oberbrunner298_9dbb826d-0be6-e8f9-3254-dbac25d83be6.json'
flat_file_path = '../data/flat_files'

output_json_file = '../data/FHIR_chunks.json'
output_report_file = '../data/FHIR_report.txt'
output_openai_batch_requests_file = '../data/openai_requests.jsonl'

Contributor:

All of these could be parameters parsed with argparse; you can set the current values as defaults
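
A minimal sketch of that approach (the flag names are illustrative):

import argparse

parser = argparse.ArgumentParser(description="Chunk FHIR data for retrieval evaluation.")
parser.add_argument("--fhir-file", default="../data/Abraham100_Oberbrunner298_9dbb826d-0be6-e8f9-3254-dbac25d83be6.json")
parser.add_argument("--flat-file-path", default="../data/flat_files")
parser.add_argument("--output-json-file", default="../data/FHIR_chunks.json")
parser.add_argument("--output-report-file", default="../data/FHIR_report.txt")
parser.add_argument("--output-openai-batch-requests-file", default="../data/openai_requests.jsonl")
args = parser.parse_args()

fhir_file = args.fhir_file  # and so on for the other paths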

scripts/rag_evaluation/data/tokens_lenghts.png (outdated, resolved)
Total openai queries: 1915
Total aproximate costs: 0.23
Total aproximate input costs: 0.054
Total aproximate output costs: 0.172

Contributor:

You could update your estimate of output tokens to reflect the current costs
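
A sketch of how the estimate might be computed (the per-1M-token prices below are placeholders, not current OpenAI rates, and total_input_tokens / total_output_tokens are assumed to be tracked elsewhere):

# Placeholder prices per 1M tokens; replace with the current model pricing
INPUT_PRICE_PER_1M = 0.15
OUTPUT_PRICE_PER_1M = 0.60

input_cost = total_input_tokens / 1_000_000 * INPUT_PRICE_PER_1M
output_cost = total_output_tokens / 1_000_000 * OUTPUT_PRICE_PER_1M
total_cost = input_cost + output_cost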

metrics["Total contexts"] = total_contexts

for series_name, value in metrics.items():
    task.get_logger().report_scalar(

Contributor:

You can also use logger.report_single_value(metric, value); the results will then be shown in a table of metrics instead of a chart by iteration, since you have no iterations here
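
A minimal sketch of that suggestion, assuming task is an initialized ClearML Task:

logger = task.get_logger()
for series_name, value in metrics.items():
    # Single values are grouped into one metrics table in the ClearML UI
    logger.report_single_value(name=series_name, value=value)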

@@ -0,0 +1,109 @@
from dotenv import load_dotenv

Contributor:

What is the purpose of this file?

Contributor Author:

The purpose is to test the speed of the generation process using different context lengths

def search_query(query_text, embedding_model,
                 es_client, index_name=settings.index_name,
-                k=3, threshold=0.2):
+                k=5, threshold=0.2,
+                text_boost=1.0, embedding_boost=1.0):

Contributor:

These should be updated to the tuned values

with open(output_file, "w") as out:
    json.dump(documents, out, indent=2)

measure_tokens_lenghts("../data/tokens_lenghts.png", tokens_list)

Contributor:

This path should be a config value, a constant, or a parameter

Comment on lines 13 to 14
chunk_size=500,
chunk_overlap=50,

Contributor:

These are relevant parameters that should also be stored in the experiment tracker, which means you should pass them in as parameters
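
A minimal sketch of that pattern with ClearML (split_documents is a hypothetical stand-in for the actual chunking call):

# Register the chunking parameters with the experiment tracker
params = {"chunk_size": 500, "chunk_overlap": 50}
params = task.connect(params)  # logged in ClearML and editable from the UI

chunks = split_documents(documents,
                         chunk_size=params["chunk_size"],
                         chunk_overlap=params["chunk_overlap"])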

@@ -0,0 +1,140 @@
import base64

Contributor:

If this logic is taken from another open-source codebase, you should add a comment here saying where you got the code

Comment on lines 8 to 13
fhir_file = '../data/Abraham100_Oberbrunner298_9dbb826d-0be6-e8f9-3254-dbac25d83be6.json'
flat_file_path = '../data/flat_files'

output_json_file = '../data/FHIR_chunks.json'
output_report_file = '../data/FHIR_report.txt'
output_openai_batch_requests_file = '../data/openai_requests.jsonl'

Contributor:

It would be good to have all of these as argparse parameters

    return len(encoding.encode(string))


def measure_tokens_lenghts(file_path, tokens_lengths):

Contributor:

This function seems to be duplicated several times in different files; you should prefer to implement it once and reuse it
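
A minimal sketch of the shared-module approach (the module path and encoding name are assumptions):

# e.g. scripts/rag_evaluation/utils.py, imported by every script that needs it
import tiktoken

def get_total_tokens_from_string(string: str, encoding_name: str = "cl100k_base") -> int:
    """Return the number of tokens in a string for the given tiktoken encoding."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(string))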

    return resource


def remove_urls_from_fhir(data):

Contributor:

This also looks like a function that could be reused across multiple strategies

@dgbaenar merged commit e4b518b into main on Aug 30, 2024
2 checks passed
@dgbaenar deleted the llamaindex branch on September 3, 2024