Added extra experiments - mainly around macro chunking #16
Thank you for contributing all the code. I added some comments; it would be nice if you could address them before we merge this.
# overwrite_results=True,
# batch_size=BATCH_SIZE,
# encode_kwargs={'batch_size': BATCH_SIZE},
# )
I think this script needs a bit of a cleanup; maybe we can also integrate it into the main script.
I have integrated it into the main script and also renamed macro chunking to long late chunking. Long late chunking is off by default and is enabled by defining e.g. --long-late-chunking-embed-size 8192. I hope this is what you meant.
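A minimal sketch of how an "off by default, on when a size is given" flag could behave. Only the flag name --long-late-chunking-embed-size comes from the comment above; the use of argparse, the default of 0, and the helper names are assumptions for illustration, and the actual script may define its CLI differently.

```python
import argparse


def build_parser():
    # Hypothetical parser; only the flag name is taken from the PR comment.
    parser = argparse.ArgumentParser(description="Chunked evaluation (sketch)")
    parser.add_argument(
        "--long-late-chunking-embed-size",
        type=int,
        default=0,  # assumption: 0 means long late chunking is disabled
        help="Token window size for long late chunking; 0 disables it.",
    )
    return parser


def long_late_chunking_enabled(args):
    # The feature is treated as enabled only when a positive size is given.
    return args.long_late_chunking_embed_size > 0
```

With no arguments the feature stays off; passing --long-late-chunking-embed-size 8192 switches it on with an 8192-token window.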
Co-authored-by: Michael Günther <michael.guenther@jina.ai>
all comments hopefully addressed
just added some minor comments
explanatory_contextual_retrieval.py
Outdated
# self.llm = pipeline(
# "text-generation", model=llm_name, model_kwargs={"torch_dtype": torch.bfloat16}, device_map="auto",
# max_length = 1000
# )
# self.llm = pipeline(
# "text-generation", model=llm_name, model_kwargs={"torch_dtype": torch.bfloat16}, device_map="auto",
# max_length = 1000
# )
explanatory_contextual_retrieval.py
Outdated
# to late chunking to see if the similarities are similar (which they appear to be)
#
# pip requirements:
# accelerate?
can you add this to the pyproject.toml if necessary?
explanatory_contextual_retrieval.py
Outdated
""".strip().replace("\n", "")

# llm_model_name = "microsoft/Phi-3.5-mini-instruct"
# llm_model_name = "microsoft/Phi-3.5-mini-instruct"
Overview
Experiments added:
LongEmbed examples against chunk size (nDCG@10 and mAP@10)
Like run_chunked_eval.py, run_chunked_eval_with_macro_chunking.py can be run directly from the command line.
To reproduce easily
I recommend using the bash file to run all the experiments at once. Running plot_chunk_size_experiments.py then displays the results in a matplotlib plot.
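For readers unfamiliar with the two metrics named above, here is a minimal sketch of nDCG@10 and average precision at 10 for a single query (mAP@10 is the mean of the latter over all queries). This is not the evaluation code used by the experiments; the function names are illustrative, and the AP normalisation by min(number of relevant documents, k) is one common convention that differs between toolkits.

```python
import math


def dcg_at_k(gains, k=10):
    # Discounted cumulative gain: each gain is discounted by log2(rank + 1), ranks starting at 1.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, relevance, k=10):
    # relevance maps document id -> graded relevance (0 or missing means not relevant).
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def average_precision_at_k(ranked_ids, relevance, k=10):
    # Assumption: normalise by min(#relevant, k); some toolkits divide by #relevant instead.
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if relevance.get(doc_id, 0) > 0:
            hits += 1
            total += hits / rank
    n_relevant = sum(1 for r in relevance.values() if r > 0)
    denom = min(n_relevant, k)
    return total / denom if denom else 0.0


def mean_average_precision_at_k(runs, k=10):
    # runs is a list of (ranked_ids, relevance) pairs, one per query.
    if not runs:
        return 0.0
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)
```

Both metrics reward placing relevant chunks near the top of the ranking, which is why they are a natural choice for plotting retrieval quality against chunk size.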
Macro chunking approach vs. 'hard' boundary approach with 0 overlap
The experiment file run_macro_chunking_experiments.py and the plot file plot_macro_chunking_experiments.py work as above, comparing macro chunking with non-macro chunking.
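The difference between the two splitting strategies can be sketched as follows. This assumes that macro chunking means splitting a long token sequence into overlapping windows of a fixed embed size, and that the 'hard' boundary baseline is the same split with an overlap of 0. The function name and the overlap parameter are hypothetical and not taken from the repository.

```python
def macro_chunk_spans(n_tokens, embed_size, overlap=0):
    """Return (start, end) token spans that cover a sequence of n_tokens.

    overlap=0 gives the 'hard' boundary split; overlap>0 gives overlapping
    windows (the assumed meaning of macro chunking in this sketch).
    """
    if embed_size <= 0:
        raise ValueError("embed_size must be positive")
    if not 0 <= overlap < embed_size:
        raise ValueError("overlap must satisfy 0 <= overlap < embed_size")
    step = embed_size - overlap
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + embed_size, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # the last window already reaches the end of the sequence
        start += step
    return spans
```

With overlapping windows, tokens near a window boundary still see context from both sides, which is the motivation for comparing against the zero-overlap split.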
Example with Anthropic's contextual retrieval
Run explanatory_contextual_retrieval.py to see a comparison between Anthropic's contextual retrieval (which manually adds context to each chunk), late chunking, and naive chunking. The script runs on a generated document that deliberately leaves context out of later sentences (using 'Its' instead of a company name). The comparison is based on cosine similarities computed on the chunks and the corresponding embeddings from jina-embeddings-v2-base-en.
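A minimal sketch of the cosine-similarity comparison, written in plain Python so it runs without any model. It assumes that each strategy yields one embedding per chunk and that these are compared against a single reference embedding (for example, a query). The function names and the dictionary layout are illustrative and are not taken from explanatory_contextual_retrieval.py.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros.
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def compare_strategies(reference_embedding, chunk_embeddings_by_strategy):
    # Maps each strategy name to the list of similarities, one per chunk.
    return {
        name: [cosine_similarity(reference_embedding, emb) for emb in embeddings]
        for name, embeddings in chunk_embeddings_by_strategy.items()
    }
```

Passing a dictionary with keys such as "naive", "late", and "contextual" gives per-chunk similarity lists that can be compared side by side.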