TF-IDF based search for ZK init #1

Merged
merged 5 commits into from
May 18, 2020

@jcd13d jcd13d commented May 10, 2020

Wrote this script to do a TF-IDF based search. I think it's useful for finding notes related to one you've recently written that didn't come up through tag/keyword searches. I wrote a short blurb in the file on how I use it.

@sirupsen
Owner

Nice, let me check this out :) Funny, I was just reading about cosine similarity 30 min ago (reading up on search). 👍

Owner

@sirupsen sirupsen left a comment

This is so simple. Did not know you could do semantic similarity this easily. You must have some data science experience since you were able to write it this concisely?

I've been contemplating moving zk to Lucene instead of SQLite. It's so powerful and does things like this out of the box... I just worry about the startup time for that. Would ruin some of the beautiful simplicity.

I ran this against a few notes, and was delighted with the results. I'll move my Vim script over to use this when it's in master.

print(file)

def run(self):
    parser = argparse.ArgumentParser(description='Process some integers.')
Owner

Helpful if it prints --help when no args are passed!

Author

Will do!
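
One way to get the behaviour requested above is to check for an empty argument list before parsing. This is a minimal sketch, not the script's actual code: `build_parser` and `parse_or_help` are hypothetical names, and the description string is illustrative.

```python
import argparse


def build_parser():
    # Description and help text are illustrative, not the script's wording.
    parser = argparse.ArgumentParser(description="Find notes similar to a given note.")
    parser.add_argument("--filename", nargs="+", help="filename to search for similarity")
    return parser


def parse_or_help(argv):
    """Print the full help text and return None when no arguments are given."""
    parser = build_parser()
    if not argv:
        parser.print_help()
        return None
    return parser.parse_args(argv)
```

In a script this would be called as `parse_or_help(sys.argv[1:])`, exiting when it returns `None`.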

@@ -0,0 +1,97 @@
import argparse
Owner

Let's rename this to zksim (zettelkasten-similar) or zkadj (zettelkasten-adjacent) or something of that nature, to make it more consistent. You can then also add it to the README. FWIW, I'm fine with multiple languages (Ruby, Python, ..)

import sqlite3
import pandas as pd
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer
Owner

Not sure if you can do this in Python, but in Ruby you can rescue on an import and print a message, like: Please run pip3 install scikit-learn pandas before using this script.

Author

Will look into this... It definitely would be helpful!

self.num_files_to_show = 20

def application_logic(self, filename):
    df = pd.read_sql("select * from zettelkasten", con=self.conn)
Owner

There's support for importing raw highlights (may get renamed to "literature notes" or "litnotes" later; you'll see this in fts-search). Can you add NOT LIKE 'highlights/%' here so it doesn't match that directory? 🆗

Author

Is the NOT LIKE operator operating on the title column for that?

Owner

Yep, sorry!
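
The filter agreed on above can be sketched against an in-memory database. The table here is a stand-in: only the `zettelkasten` table name and the `title` column come from this thread, the `body` column name and the sample rows are assumptions, and plain `sqlite3` is used instead of `pd.read_sql` to keep the example dependency-free.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in schema: the `body` column name is assumed for illustration.
conn.execute("CREATE TABLE zettelkasten (title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO zettelkasten VALUES (?, ?)",
    [
        ("202005101200 tf-idf.md", "notes on tf-idf"),
        ("highlights/some-book.md", "raw highlights"),
    ],
)

# NOT LIKE on the title column drops everything under highlights/.
rows = conn.execute(
    "SELECT title FROM zettelkasten WHERE title NOT LIKE 'highlights/%'"
).fetchall()
titles = [title for (title,) in rows]
print(titles)
```

The same `WHERE` clause can be appended to the query string passed to `pd.read_sql`.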

def run(self):
    parser = argparse.ArgumentParser(description='Process some integers.')
    parser.add_argument('--filename', dest='filename', action=CustomAction, type=str, nargs='+',
                        help='filename to search for similarity')
Owner

Can you make it just take a filename as the default argument, instead of as an option?
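
A positional argument could look roughly like the sketch below. It assumes the `nargs='+'` in the diff exists so that a filename containing spaces can be passed unquoted, and it joins the words back together instead of using the diff's `CustomAction`; `parse_filename` is a hypothetical helper name.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Find notes similar to a given note.")
    # nargs="+" lets an unquoted filename with spaces arrive as several words.
    parser.add_argument("filename", nargs="+", help="note to find similar notes for")
    return parser


def parse_filename(argv):
    """Return the filename as one string, re-joining words split by the shell."""
    args = build_parser().parse_args(argv)
    return " ".join(args.filename)
```

Because the positional is required, argparse itself prints a usage message and exits when it is missing.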

vectors = vectorize_text(df[text_col])
searching_index = index_from_title(df[title_col], title)
sim_index = similarity_index(searching_index, vectors)
return df.iloc[sim_index][title_col].values
Owner

Exposing the cosine similarity score would be sweet, similar to what zkt does with the count of tags.

Author

That would be cool! I don't have a ton of experience with bash and the tools you're using, just dug in enough to implement what you wrote and re-wire a few things. I'll take a look at your zkt-raw and see if I can apply that somehow.

Owner

Not a requirement to get this in I'd say :)
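
Exposing the score mostly means keeping it next to the title while sorting. A minimal sketch, assuming the similarities are already available as a plain list of floats; `rank_with_scores` is a hypothetical name and the printed format is a guess, not what zkt does.

```python
def rank_with_scores(scores, titles, query_index, top_n=20):
    """Return (score, title) pairs, best first, skipping the query note itself."""
    pairs = [
        (score, title)
        for i, (score, title) in enumerate(zip(scores, titles))
        if i != query_index
    ]
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs[:top_n]


scores = [1.0, 0.12, 0.57]  # made-up similarities to note 0
titles = ["a.md", "b.md", "c.md"]
for score, title in rank_with_scores(scores, titles, query_index=0):
    print(f"{score:.2f} {title}")
```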

:param series: Pandas Series object
:return: matrix of tf-idf features
"""
return TfidfVectorizer().fit_transform(series.values)
Owner

Presumably this is the most expensive operation. It takes about 1s on my machine with roughly 1,000 notes. That's OK for me, but if you wanted to nerd out, I think you can use fts4aux to get the raw tokenization from SQLite's FTS to avoid computing it again.

Author

Yeah, making this a bit faster was definitely a next step for me. There are a few things we can store so they don't have to be recomputed. I'll probably need to get lower-level and program the TF-IDF preprocessing myself instead of using the sklearn vectorizer (which is fine, it will just take a bit).

Owner

Ya, not a requirement to get this in
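
For reference, the hand-rolled version mentioned above amounts to roughly the following. This is a sketch in plain Python, not the PR's code: the tokenizer is deliberately simplistic and differs from sklearn's, the idf uses the smoothed form that scikit-learn documents for its defaults, and all function names are hypothetical. The document frequencies are the piece that could be stored between runs.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplified tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    # Number of documents each term appears in.
    doc_freq = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        vec = {
            t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


def cosine(a, b):
    # Vectors are unit length, so the dot product is the cosine similarity.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())
```

The unit-length normalisation is also why a plain dot product (`linear_kernel` in the script) gives cosine similarity on the vectorizer's default output.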

I currently utilize sirupsen's search tool to use this (https://github.com/sirupsen/zk/blob/master/bin/zks).
I have a second search function for when I want to try TF-IDF that puts 'python --filename "zk filename.md" |' in
front of the fzf. This will give you a view to scroll through related files with the same opening and linking commands
as zk-search.
Owner

How about just making this part of zks by adding another --bind, e.g. Alt-S will show similar notes?

Author

Good point! Again, not super experienced with this but looks straightforward to go based on the binds you have already.

@sirupsen
Owner

sirupsen commented May 10, 2020

Another thing I was thinking is that this tool could take into consideration the entire link hierarchy. E.g.

# Note A

[[B]]
[[C]]

# Note

[[D]]
[[A]]

Then D should probably show up for A, even if they have low similarity.

My script for finding related notes did it based on tags only. There's lots of things to use here in the long-term. It's a fascinating search problem :)

I know this is more about finding notes that may not be tagged right, just floating some ideas.
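
One reading of the link idea above is "notes that are linked from the same place are related". The sketch below implements only that reading; the `[[...]]` link syntax is taken from the example, while `co_linked` and the note name `X` for the second note are placeholders.

```python
import re
from collections import defaultdict

LINK = re.compile(r"\[\[([^\]]+)\]\]")


def co_linked(notes):
    """Map each link target to the other targets that share a linking note."""
    related = defaultdict(set)
    for body in notes.values():
        targets = set(LINK.findall(body))
        for target in targets:
            related[target] |= targets - {target}
    return related


notes = {
    "A": "[[B]]\n[[C]]",
    "X": "[[D]]\n[[A]]",  # "X" is a placeholder name for the second note
}
related = co_linked(notes)
print(sorted(related["A"]))
```

With these two notes, D is reported for A because both are linked from the same note, regardless of their text similarity.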

@jcd13d
Author

jcd13d commented May 12, 2020

Yeah, I have a little data science experience so know where to find some of those library functions that make things super easy. Fairly new to the bash/terminal stuff so looking to use projects like this to learn!

I haven't used Lucene but would be willing to get into it if you decide to switch over. Really liking zk so far; I'm going to be doing all my personal note-taking in it. I'm with you on the startup time, the fact that it is so snappy makes it fun to use. Even that 1 second where the similarity search is calculated gives me a little sigh every time... lol

I like the link hierarchy idea! I have been thinking about some graph-related analysis like that. Maybe a list of notes in order of "steps away" from the note in question. I want to do that and also some unsupervised clustering of notes at some point. Lots of interesting ways we can go!
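
The "steps away" ordering is a breadth-first search over the link graph. A minimal sketch, assuming links are treated as undirected and are already extracted into a dict of note name to linked names; `steps_away` and the sample graph are hypothetical.

```python
from collections import deque


def steps_away(links, start):
    """Breadth-first distances from start over an undirected link graph."""
    graph = {}
    for src, targets in links.items():
        for dst in targets:
            graph.setdefault(src, set()).add(dst)
            graph.setdefault(dst, set()).add(src)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


links = {"A": ["B", "C"], "X": ["D", "A"]}
distances = steps_away(links, "A")
print(sorted(distances.items(), key=lambda kv: kv[1]))
```

Notes that are unreachable from the starting note simply do not appear in the result.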

@sirupsen
Owner

sirupsen commented May 13, 2020

Let me know when you're ready for a final look :)

Make sure to remove the py extension too. You can replace it with a Python 3 shebang, probably #!/usr/bin/env python3

bin/zksim.py Outdated
Comment on lines 7 to 9
for module in ("pandas", "sklearn"):
    if not util.find_spec(module) is not None:
        raise ModuleNotFoundError('Missing {0}! Please run pip3 install scikit-learn pandas'.format(module))

a more python-esque way to do this (similar to the rescue import) is:

try:
    from sklearn.metrics.pairwise import linear_kernel
    from sklearn.feature_extraction.text import TfidfVectorizer
    import pandas as pd
except ImportError as e:
    print(f"Missing {e.name}! Please run pip3 install scikit-learn pandas")
    exit()

the ImportError is usually a ModuleNotFoundError like you were doing manually
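
A small variation on the snippet above, in case the script is called from another tool: write the message to stderr and let the caller exit with a non-zero status. This is a sketch only; `require` is a hypothetical helper, and it imports by name with `importlib` rather than using the `from ... import` lines.

```python
import importlib
import sys


def require(modules, hint):
    """Import each named module, or print the hint to stderr and return False."""
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError as e:
            # ModuleNotFoundError is a subclass of ImportError, so it lands here too.
            print(f"Missing {e.name or name}! {hint}", file=sys.stderr)
            return False
    return True


# In a script: if not require([...], hint): sys.exit(1)
ok = require(["json"], "Please run pip3 install scikit-learn pandas")
```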

Author

Thanks! I was looking for what the "pythonic" way to do this would be... I had a feeling what I did was a little awkward.

@jcd13d
Author

jcd13d commented May 17, 2020

Hey @sirupsen check it out - think I have covered the "must-haves".

I also took a shot at adding a --bind to zks. I have an issue: if I type a query to find a note and then hit alt-s, it does do the similarity search, but the preview will not show text unless the .md file has the keyword in its body.

Not sure if there is a way to erase that query when you do the alt-s similarity search. All I did was add the last bind here:

fzf --ansi --height 100% --preview 'zk-fts-search -f {} {q} | bat --language md --style=plain --color always' \
  --bind "ctrl-o:execute-silent@tmux send-keys -t \{left\} Escape :read Space ! Space echo Space && \
          tmux send-keys -t \{left\} -l '\"'[[{}]]'\"' && \
          tmux send-keys -t \{left\} Enter@" \
  --bind "enter:execute-silent[ \
    tmux send-keys -t \{left\} Escape :e Space && \
    tmux send-keys -t \{left\} -l {} && \
    tmux send-keys -t \{left\} Enter \
  ]" \
  --bind "change:reload:zk-fts-search '{q}'" \
  --bind "alt-s:reload:zksim {}" \
  --phony --preview-window=top:65% --no-info --no-multi

Let me know if you see anything else that should be changed!

@sirupsen
Owner

> Not sure if there is a way to erase that query when you do the alt-s similarity search. All I did was add the last bind here:

I think alt-s:execute is the right one, I'll do that in master :)

@sirupsen sirupsen merged commit f5a6d3b into sirupsen:master May 18, 2020
@sirupsen
Owner

This is awesome. Will add the bind and a README entry.
