TF-IDF based search for ZK init #1

jcd13d · 2020-05-10T19:22:38Z

Wrote this script to do a TF-IDF based search. Think it's interesting to use to find notes that could be related to a note recently written but didn't come up through tag/keyword searches. I wrote a little blurb on how I use it in the file.

sirupsen · 2020-05-10T19:24:05Z

Nice, let me check this out :) Funny, I was just reading about cosine similarity 30 min ago (reading up on search). 👍

sirupsen

This is so simple. Did not know you could do semantic similarity this easily. You must have some data science experience since you were able to write it this concisely?

I've been contemplating moving zk to Lucene instead of SQLite. It's so powerful and does things like this out of the box... I just worry about the startup time for that. Would ruin some of the beautiful simplicity.

I ran this against a few notes and was delighted with the results. I'll move my Vim script over to use this when it's in master.

sirupsen · 2020-05-10T19:47:26Z

bin/tfidf_search.py

+            print(file)
+
+    def run(self):
+        parser = argparse.ArgumentParser(description='Process some integers.')


Helpful if it prints --help when no args are passed!
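A minimal sketch of that behavior, assuming the script keeps its `--filename` option: if no arguments were given, print the argparse help text and return a non-zero status. The `build_parser` and `main` names and the description string are placeholders for illustration, not the script's actual structure.

```python
import argparse
import sys


def build_parser():
    # Hypothetical parser; the real script's options may differ.
    parser = argparse.ArgumentParser(
        description="Find notes similar to a given note using TF-IDF."
    )
    parser.add_argument("--filename", help="note to search for similar notes")
    return parser


def main(argv=None):
    argv = sys.argv[1:] if argv is None else argv
    parser = build_parser()
    if not argv:
        # No arguments at all: show usage instead of doing nothing.
        parser.print_help()
        return 1
    args = parser.parse_args(argv)
    print(args.filename)
    return 0
```

Calling `main([])` prints the same text as `--help`, while `main(["--filename", "note.md"])` proceeds as usual.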

sirupsen · 2020-05-10T19:48:23Z

bin/tfidf_search.py

@@ -0,0 +1,97 @@
+import argparse


Let's rename this to zksim (zettelkasten-similar) or zkadj (zettelkasten-adjacent) or something of that nature, to make it more consistent. You can then also add it to the README. FWIW, I'm fine with multiple languages (Ruby, Python, ..)

sirupsen · 2020-05-10T19:49:09Z

bin/tfidf_search.py

+import sqlite3
+import pandas as pd
+from sklearn.metrics.pairwise import linear_kernel
+from sklearn.feature_extraction.text import TfidfVectorizer


Not sure if you can do this in Python, but in Ruby you can rescue on an import and print a message, like: Please run pip3 install scikit-learn pandas before using this script.

Will look into this... It definitely would be helpful!

sirupsen · 2020-05-10T19:49:54Z

bin/tfidf_search.py

+        self.num_files_to_show = 20
+
+    def application_logic(self, filename):
+        df = pd.read_sql("select * from zettelkasten", con=self.conn)


There's support for importing raw highlights (may get renamed to "literature notes" or "litnotes" later; you'll see this in fts-search). Can you add NOT LIKE 'highlights/%' here so it doesn't match that directory? 🆗

Is the NOT LIKE operator operating on the title column for that?

Yep, sorry!
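As a rough illustration of the filter, here is a self-contained sketch against an in-memory SQLite database. Only the `zettelkasten` table name and the `title` column come from this thread; the `body` column and the sample rows are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema for illustration: only `title` is confirmed by the thread.
conn.execute("CREATE TABLE zettelkasten (title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO zettelkasten VALUES (?, ?)",
    [
        ("202005101922 tf-idf.md", "notes on tf-idf"),
        ("highlights/some-book.md", "raw highlights"),
    ],
)

# Skip anything whose title starts with the highlights/ directory prefix.
rows = conn.execute(
    "SELECT title FROM zettelkasten WHERE title NOT LIKE 'highlights/%'"
).fetchall()
titles = [title for (title,) in rows]
```

The same `WHERE` clause could be appended to the `select * from zettelkasten` query passed to `pd.read_sql`.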

sirupsen · 2020-05-10T19:50:26Z

bin/tfidf_search.py

+    def run(self):
+        parser = argparse.ArgumentParser(description='Process some integers.')
+        parser.add_argument('--filename', dest='filename', action=CustomAction, type=str, nargs='+',
+                            help='filename to search for similarity')


Can you make it just take a filename as the default argument, instead of as an option?
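With argparse this usually means declaring a positional argument, sketched below. The description and help strings are placeholders, and the sample filename is invented for the example.

```python
import argparse

parser = argparse.ArgumentParser(description="Find notes similar to a given note.")
# A positional argument: the filename is passed directly, without a --filename flag.
parser.add_argument("filename", help="note to compare against")

# Equivalent to running the script with one argument: "my note.md"
args = parser.parse_args(["my note.md"])
```

Because a positional argument is required by default, argparse itself reports a usage error when the filename is missing.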

sirupsen · 2020-05-10T19:51:31Z

bin/tfidf_search.py

+    vectors = vectorize_text(df[text_col])
+    searching_index = index_from_title(df[title_col], title)
+    sim_index = similarity_index(searching_index, vectors)
+    return df.iloc[sim_index][title_col].values


Exposing the cosine similarity score would be sweet, similar to what zkt does with the count of tags.

That would be cool! I don't have a ton of experience with bash and the tools you're using, just dug in enough to implement what you wrote and re-wire a few things. I'll take a look at your zkt-raw and see if I can apply that somehow.

Not a requirement to get this in I'd say :)
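To make the idea of exposing the score concrete, here is a standard-library sketch of TF-IDF ranking that returns `(title, score)` pairs. It is not the PR's implementation, which relies on scikit-learn's `TfidfVectorizer` and `linear_kernel`; the `tokenize`, `tfidf_vectors`, and `similar_notes` names, the simple tokenizer, and the dict-of-notes input are all assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplified tokenizer (assumption): lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one L2-normalised {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term frequency times a smoothed inverse document frequency.
        vec = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def similar_notes(notes, title):
    """Rank the other notes by cosine similarity to `title`."""
    titles = list(notes)
    vectors = dict(zip(titles, tfidf_vectors([notes[t] for t in titles])))
    query = vectors[title]
    scored = []
    for other in titles:
        if other == title:
            continue
        # Vectors are unit length, so the dot product is the cosine.
        score = sum(w * vectors[other].get(term, 0.0) for term, w in query.items())
        scored.append((other, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Printing each pair as `score<TAB>title` would give fzf a column to display, much like a tag count.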

sirupsen · 2020-05-10T19:52:47Z

bin/tfidf_search.py

+    :param series: Pandas Series object
+    :return: matrix of tf-idf features
+    """
+    return TfidfVectorizer().fit_transform(series.values)


Presumably this is the most expensive operation. It takes about 1s on my machine with roughly 1,000 notes. That's OK for me, but if you wanted to nerd out, I think you can use fts4aux to get the raw tokenization from SQLite's FTS to avoid computing it again.

Yeah, making this a bit faster was definitely a next step for me. There are a few things we can store so they don't have to be recomputed. I'll probably need to get lower-level and program the TF-IDF preprocessing myself instead of using the sklearn vectorizer (which is fine, it will just take a bit).

Ya, not a requirement to get this in
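One possible shape for "storing things so they don't have to be recomputed", sketched with the standard library only: key a pickled result on a digest of the notes, and recompute only when a note changes. The `corpus_digest` and `cached` helpers and the cache file are hypothetical; nothing like this is in the PR, and it does not use fts4aux.

```python
import hashlib
import pickle
from pathlib import Path


def corpus_digest(notes):
    """Hash titles and bodies so any edit invalidates the cache."""
    digest = hashlib.sha256()
    for title in sorted(notes):
        digest.update(title.encode("utf-8"))
        digest.update(b"\0")
        digest.update(notes[title].encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()


def cached(notes, compute, cache_path):
    """Return compute(notes), reusing a pickled result if the notes are unchanged."""
    cache_path = Path(cache_path)
    key = corpus_digest(notes)
    if cache_path.exists():
        stored_key, value = pickle.loads(cache_path.read_bytes())
        if stored_key == key:
            return value
    # Cache miss: do the expensive work once and store it with its key.
    value = compute(notes)
    cache_path.write_bytes(pickle.dumps((key, value)))
    return value
```

Here `compute` would be whatever builds the TF-IDF matrix; hashing the notes is cheap compared with re-vectorizing them.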

sirupsen · 2020-05-10T19:53:14Z

bin/tfidf_search.py

+I currently utilize sirupsen's search tool to use this (https://github.com/sirupsen/zk/blob/master/bin/zks). 
+I have a second search function for when I want to try TF-IDF that puts 'python --filename "zk filename.md" |' in
+front of the fzf. This will give you a view to scroll through related files with the same opening and linking commands
+as zk-search. 


How about just making this part of zks by adding another --bind, e.g. Alt-S will show similar notes?

Good point! Again, not super experienced with this but looks straightforward to go based on the binds you have already.

sirupsen · 2020-05-10T20:13:08Z

Another thing I was thinking is that this tool could take into consideration the entire link hierarchy. E.g.

# Note A

[[B]]
[[C]]

# Note

[[D]]
[[A]]

Then D should probably show up for A, even if they have low similarity.

My script for finding related notes did it based on tags only. There's lots of things to use here in the long-term. It's a fascinating search problem :)

I know this is more about finding notes that may not be tagged right, just floating some ideas.
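A small sketch of that idea, assuming notes are available as a `{title: body}` dict and links use the `[[Title]]` syntax from the example: a note counts as related to the target if some third note links to both. The `co_linked` function is hypothetical, and `"X"` stands in for the unnamed second note above.

```python
import re
from collections import defaultdict

LINK = re.compile(r"\[\[([^\]]+)\]\]")


def co_linked(notes, target):
    """Count how often each note is linked alongside `target` in another note."""
    related = defaultdict(int)
    for title, body in notes.items():
        links = set(LINK.findall(body))
        if target in links:
            for other in links - {target}:
                related[other] += 1
    return dict(related)


notes = {
    "A": "[[B]]\n[[C]]",
    "X": "[[D]]\n[[A]]",  # "X" is a placeholder name for the second note
}
related_to_a = co_linked(notes, "A")
```

These counts could be blended with the TF-IDF score so that D surfaces for A even when their text has little in common.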

jcd13d · 2020-05-12T22:15:33Z

Yeah, I have a little data science experience so know where to find some of those library functions that make things super easy. Fairly new to the bash/terminal stuff so looking to use projects like this to learn!

I haven't used Lucene but would be willing to get into it if you decide to switch over. Really liking zk so far; I'm going to be doing all my personal note-taking in it. I'm with you on the startup time: the fact that it is so snappy makes it fun to use. Even that 1 second where the similarity search is calculated gives me a little sigh every time... lol

I like the link hierarchy idea! I have been thinking about some graph-related analysis like that. Maybe a list of notes in order of "steps away" from the note in question. I want to do that and also some unsupervised clustering of notes at some point. Lots of interesting ways we can go!
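The "steps away" ordering could be sketched as a breadth-first search over the link graph. This assumes notes come as a `{title: body}` dict with `[[Title]]` links and treats links as undirected; the `steps_away` name and the sample notes are made up for the example.

```python
import re
from collections import defaultdict, deque

LINK = re.compile(r"\[\[([^\]]+)\]\]")


def steps_away(notes, start):
    """Breadth-first distances from `start`, treating links as undirected."""
    graph = defaultdict(set)
    for title, body in notes.items():
        for linked in LINK.findall(body):
            graph[title].add(linked)
            graph[linked].add(title)
    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    del distances[start]
    # Closest notes first; ties broken alphabetically for stable output.
    return sorted(distances.items(), key=lambda pair: (pair[1], pair[0]))


notes = {"A": "[[B]]\n[[C]]", "X": "[[D]]\n[[A]]"}
ordering = steps_away(notes, "A")
```

Notes that are unreachable from the starting note simply do not appear in the result.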

sirupsen · 2020-05-13T10:54:39Z

Let me know when you're ready for a final look :)

Make sure to remove the py extension too. You can replace it with a Python 3 shebang, probably #!/usr/bin/env python3

jimgraham · 2020-05-15T20:01:17Z

bin/zksim.py

+for module in ("pandas", "sklearn"):
+    if not util.find_spec(module) is not None:
+        raise ModuleNotFoundError('Missing {0}! Please run pip3 install scikit-learn pandas'.format(module))


a more python-esque way to do this (similar to the rescue import) is:

try:
    from sklearn.metrics.pairwise import linear_kernel
    from sklearn.feature_extraction.text import TfidfVectorizer
    import pandas as pd
except ImportError as e:
    print(f"Missing {e.name}! Please run pip3 install scikit-learn pandas")
    exit()

the ImportError is usually a ModuleNotFoundError like you were doing manually

Thanks! I was looking for what the "pythonic" way to do this would be... I had a feeling what I did was a little awkward.

jcd13d · 2020-05-17T00:33:03Z

Hey @sirupsen check it out - think I have covered the "must-haves".

I also took a shot at adding a --bind to zks. I have one issue: if I type a query to find a note and then hit alt-s, it does do the similarity search, but the preview will not show text unless the .md file has the keyword in its body.

Not sure if there is a way to erase that query when you do the alt-s similarity search. All I did was add the last bind here:

fzf --ansi --height 100% --preview 'zk-fts-search -f {} {q} | bat --language md --style=plain --color always' \
  --bind "ctrl-o:execute-silent@tmux send-keys -t \{left\} Escape :read Space ! Space echo Space && \
          tmux send-keys -t \{left\} -l '\"'[[{}]]'\"' && \
          tmux send-keys -t \{left\} Enter@" \
  --bind "enter:execute-silent[ \
    tmux send-keys -t \{left\} Escape :e Space && \
    tmux send-keys -t \{left\} -l {} && \
    tmux send-keys -t \{left\} Enter \
  ]" \
  --bind "change:reload:zk-fts-search '{q}'" \
  --bind "alt-s:reload:zksim {}" \
  --phony --preview-window=top:65% --no-info --no-multi

Let me know if you see anything else that should be changed!

sirupsen · 2020-05-18T13:20:08Z

Not sure if there is a way to erase that query when you do the alt-s similarity search. All I did was add the last bind here:

I think alt-s:execute is the right one, I'll do that in master :)

sirupsen · 2020-05-18T13:20:23Z

This is awesome. Will add the bind and a README entry.

TF-IDF based search for ZK init

4d8704b

sirupsen reviewed May 10, 2020

Justin DiEmmanuele added 2 commits May 12, 2020 18:34

help when no args, change name, take default arg

d7f3bdb

error handling for imports and ZK_PATH variable

ae1d950

jimgraham reviewed May 15, 2020

Justin DiEmmanuele added 2 commits May 16, 2020 18:19

added shebang, dropped .py, added hightlights filtering

30517fd

updated import error handling

422a95e

sirupsen merged commit f5a6d3b into sirupsen:master May 18, 2020
