TF-IDF based search for ZK init #1
Nice, let me check this out :) Funny, I was just reading about cosine similarity 30 min ago (reading up on search). 👍
This is so simple. Did not know you could do semantic similarity this easily. You must have some data science experience since you were able to write it this concisely?
I've been contemplating moving zk to Lucene instead of SQLite. It's so powerful and does things like this out of the box... I just worry about the startup time for that. Would ruin some of the beautiful simplicity.
I ran this against a few notes, and was delighted with the results. I'll move my Vim script over to use this when it's in master.
bin/tfidf_search.py
print(file)

def run(self):
    parser = argparse.ArgumentParser(description='Process some integers.')
Helpful if it prints --help when no args are passed!
Will do!
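A minimal sketch of the "print --help when no args are passed" behaviour, assuming the script keeps using argparse. The parser description, the `build_parser` and `main` names, and the return codes are illustrative, not taken from the PR.

```python
import argparse
import sys


def build_parser():
    # Hypothetical parser; the description and option are illustrative only.
    parser = argparse.ArgumentParser(
        description="Find notes similar to a given note."
    )
    parser.add_argument("--filename", help="note to compare against")
    return parser


def main(argv=None):
    # Fall back to the real command line when no list is passed in.
    argv = sys.argv[1:] if argv is None else argv
    parser = build_parser()
    if not argv:
        # No arguments at all: show the usage text instead of doing nothing.
        parser.print_help()
        return 1
    args = parser.parse_args(argv)
    print(args.filename)
    return 0
```

Calling `main([])` prints the help text and returns a non-zero code, so a wrapper script can tell that nothing was searched.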
bin/tfidf_search.py
@@ -0,0 +1,97 @@
import argparse
Let's rename this to zksim (zettelkasten-similar) or zkadj (zettelkasten-adjacent) or something of that nature, to make it more consistent. You can then also add it to the README. FWIW, I'm fine with multiple languages (Ruby, Python, ..)
bin/tfidf_search.py
import sqlite3
import pandas as pd
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer
Not sure if you can do this in Python, but in Ruby you can rescue on an import and print a message, like: Please run pip3 install scikit-learn pandas before using this script.
Will look into this... It definitely would be helpful!
bin/tfidf_search.py
        self.num_files_to_show = 20

    def application_logic(self, filename):
        df = pd.read_sql("select * from zettelkasten", con=self.conn)
There's support for importing raw highlights (may get renamed to "literature notes" or "litnotes" later; you'll see this in fts-search). Can you add NOT LIKE 'highlights/%' here so it doesn't match that directory? 🆗
Is the NOT LIKE operator operating on the title column for that?
Yep, sorry!
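For reference, a small sketch of the filter discussed above, run against an in-memory database. The `zettelkasten` table name and the `title` column come from this thread; the `body` column and the sample rows are assumptions made up for the example.

```python
import sqlite3

# In-memory stand-in for the real index; the body column is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE zettelkasten (title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO zettelkasten VALUES (?, ?)",
    [
        ("202001010000 search.md", "notes on tf-idf"),
        ("highlights/some-book.md", "raw highlights"),
        ("202001020000 lucene.md", "notes on lucene"),
    ],
)

# Exclude everything under highlights/ by filtering on the title column.
query = (
    "SELECT title FROM zettelkasten "
    "WHERE title NOT LIKE 'highlights/%' ORDER BY title"
)
titles = [row[0] for row in conn.execute(query)]
print(titles)
```

The same WHERE clause can be appended to the string passed to `pd.read_sql`.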
bin/tfidf_search.py
def run(self):
    parser = argparse.ArgumentParser(description='Process some integers.')
    parser.add_argument('--filename', dest='filename', action=CustomAction, type=str, nargs='+',
                        help='filename to search for similarity')
Can you make it just take a filename as the default argument, instead of as an option?
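One way this could look with argparse, as a sketch: a positional `filename` argument replaces the `--filename` option. Keeping `nargs='+'` and joining the words is an assumption about why the original used a custom action (filenames with unquoted spaces); `parse_filename` is a hypothetical helper name.

```python
import argparse


def parse_filename(argv):
    parser = argparse.ArgumentParser(
        description="Find notes similar to a given note."
    )
    # Positional argument: no --filename flag needed on the command line.
    parser.add_argument(
        "filename",
        nargs="+",
        help="filename to search for similarity",
    )
    args = parser.parse_args(argv)
    # Assumption: unquoted words are pieces of one filename containing spaces.
    return " ".join(args.filename)


print(parse_filename(["zk", "filename.md"]))
```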
bin/tfidf_search.py
vectors = vectorize_text(df[text_col])
searching_index = index_from_title(df[title_col], title)
sim_index = similarity_index(searching_index, vectors)
return df.iloc[sim_index][title_col].values
Exposing the cosine similarity score would be sweet, similar to what zkt does with the count of tags.
That would be cool! I don't have a ton of experience with bash and the tools you're using, just dug in enough to implement what you wrote and re-wire a few things. I'll take a look at your zkt-raw and see if I can apply that somehow.
Not a requirement to get this in I'd say :)
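If the score does get exposed later, a sketch of the output side could look like this. It assumes the similarity step already yields one score per note in the same order as the titles; `rank_with_scores` and the sample data are hypothetical.

```python
def rank_with_scores(titles, scores, query_index, top_n=20):
    """Return (score, title) pairs, best match first, skipping the query note."""
    pairs = [
        (score, title)
        for i, (score, title) in enumerate(zip(scores, titles))
        if i != query_index
    ]
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs[:top_n]


# Made-up scores for three notes; note 0 is the one being searched for.
titles = ["a.md", "b.md", "c.md"]
scores = [1.0, 0.2, 0.7]
ranked = rank_with_scores(titles, scores, query_index=0)
for score, title in ranked:
    # Score first, then the title, one note per line.
    print(f"{score:.2f}\t{title}")
```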
bin/tfidf_search.py
:param series: Pandas Series object
:return: matrix of tf-idf features
"""
return TfidfVectorizer().fit_transform(series.values)
Presumably this is the most expensive operation. This takes ~1s on my machine with roughly 1,000 notes. That's OK for me, but if you wanted to nerd out, I think you can use fts4aux to get the raw tokenization from SQLite's FTS to avoid computing it again.
Yeah, that was definitely a next step for me, to try to make this a bit faster. There are a few things we can store so they don't have to be recomputed. I'll probably need to get lower-level and program the TF-IDF preprocessing stuff myself instead of using the sklearn vectorizer (which is fine, it will just take a bit).
Ya, not a requirement to get this in
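A rough sketch of what "programming the TF-IDF stuff myself" might involve, in plain Python. The weighting follows the smoothed-IDF, L2-normalized scheme that `TfidfVectorizer` uses by default, but the tokenizer and the function names here are assumptions, and this is not tuned for speed.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words of two or more characters


def tokenize(text):
    return TOKEN.findall(text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document, L2-normalized."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for count in counts:
        doc_freq.update(count.keys())
    # Smoothed inverse document frequency; this dict could be cached.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for count in counts:
        vec = {t: tf * idf[t] for t, tf in count.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    # Vectors are already unit length, so the dot product is the cosine.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


docs = [
    "cosine similarity for search",
    "search with cosine similarity scores",
    "grocery list for tuesday",
]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]) > cosine(vectors[0], vectors[2]))
```

The per-note vectors and the `idf` table are the pieces that could be stored between runs so only changed notes need recomputing.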
bin/tfidf_search.py
I currently utilize sirupsen's search tool to use this (https://github.com/sirupsen/zk/blob/master/bin/zks).
I have a second search function for when I want to try TF-IDF that puts 'python --filename "zk filename.md" |' in
front of the fzf. This will give you a view to scroll through related files with the same opening and linking commands
as zk-search.
How about just making this part of zks by adding another --bind, e.g. Alt-S will show similar notes?
Good point! Again, I'm not super experienced with this, but it looks straightforward based on the binds you already have.
Another thing I was thinking is that this tool could take into consideration the entire link hierarchy. E.g.
Then My script for finding related notes did it based on tags only. There are lots of things to use here in the long-term. It's a fascinating search problem :) I know this is more about finding notes that may not be tagged right, just floating some ideas.
Yeah, I have a little data science experience, so I know where to find some of those library functions that make things super easy. I'm fairly new to the bash/terminal stuff, so I'm looking to use projects like this to learn! I haven't used Lucene but would be willing to get into it if you decide to switch over. Really liking zk so far; I'm going to be doing all my personal note-taking in it. I'm with you on the startup time: the fact that it is so snappy makes it fun to use. Even that 1 second where the similarity search is calculated gives me a little sigh every time... lol
I like the link hierarchy idea! I have been thinking about some graph-related analysis like that. Maybe a list of notes in order of "steps away" from the note in question. I want to do that and also some unsupervised clustering of notes at some point. Lots of interesting ways we can go!
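The "steps away" idea can be sketched as a breadth-first search over the note link graph. This assumes the links have already been extracted into a dict mapping each note to the notes it links to; the `steps_away` name and the sample graph are hypothetical.

```python
from collections import deque


def steps_away(links, start):
    """Return (distance, note) pairs for every note reachable from start."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        note = queue.popleft()
        for neighbor in links.get(note, ()):
            if neighbor not in distances:
                # First time we reach this note, so this is its shortest path.
                distances[neighbor] = distances[note] + 1
                queue.append(neighbor)
    # Closest notes first; the starting note itself is left out.
    return sorted((d, n) for n, d in distances.items() if n != start)


# Made-up link graph: a links to b and c, b links to d, d links back to a.
links = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["a"]}
print(steps_away(links, "a"))
```

This follows links in one direction only; treating links as undirected would need backlinks added to the dict first.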
Let me know when you're ready for a final look :) Make sure to remove the
bin/zksim.py
for module in ("pandas", "sklearn"):
    if not util.find_spec(module) is not None:
        raise ModuleNotFoundError('Missing {0}! Please run pip3 install scikit-learn pandas'.format(module))
a more python-esque way to do this (similar to the rescue import) is:
try:
    from sklearn.metrics.pairwise import linear_kernel
    from sklearn.feature_extraction.text import TfidfVectorizer
    import pandas as pd
except ImportError as e:
    # e.name holds the name of the module that could not be imported
    print(f"Missing {e.name}! Please run pip3 install scikit-learn pandas")
    raise SystemExit(1)  # exit with a non-zero status without needing the site-provided exit()
the ImportError is usually a ModuleNotFoundError like you were doing manually
Thanks! I was looking for what the "pythonic" way to do this would be... I had a feeling what I did was a little awkward.
Hey @sirupsen check it out - think I have covered the "must-haves". I also took a shot at adding a --bind to zks. I have an issue: if I type a query to find a note and then hit alt-s, it does run the similarity search, but the preview will not show text unless the .md file has the keyword in its body. Not sure if there is a way to erase that query when you do the alt-s similarity search. All I did was add the last bind here:
fzf --ansi --height 100% --preview 'zk-fts-search -f {} {q} | bat --language md --style=plain --color always' \
--bind "ctrl-o:execute-silent@tmux send-keys -t \{left\} Escape :read Space ! Space echo Space && \
tmux send-keys -t \{left\} -l '\"'[[{}]]'\"' && \
tmux send-keys -t \{left\} Enter@" \
--bind "enter:execute-silent[ \
tmux send-keys -t \{left\} Escape :e Space && \
tmux send-keys -t \{left\} -l {} && \
tmux send-keys -t \{left\} Enter \
]" \
--bind "change:reload:zk-fts-search '{q}'" \
--bind "alt-s:reload:zksim {}" \
--phony --preview-window=top:65% --no-info --no-multi
Let me know if you see anything else that should be changed!
I think
This is awesome. Will add the bind and a README entry.
Wrote this script to do a TF-IDF based search. I think it's useful for finding notes that could be related to a recently written note but didn't come up through tag/keyword searches. I wrote a little blurb on how I use it in the file.