Push Files to Index from Obsidian, Emacs & Desktop Clients using Multi-Part Forms Method #499

Merged on Oct 17, 2023 (24 commits). Showing changes from 1 commit.

Commits
6aa69da
Put indexer API endpoint under /api path segment
debanjum Oct 10, 2023
9ba173b
Improve emoji, message on content index updated via logger
debanjum Oct 12, 2023
60e9a61
Use multi-part form to receive files to index on server
debanjum Oct 12, 2023
68018ef
Use multi-part form to send files to index on desktop client
debanjum Oct 12, 2023
fc99431
Send files to index on server from the khoj.el emacs client
debanjum Oct 12, 2023
bed3aff
Update tests to test multi-part/form method of pushing files to index
debanjum Oct 12, 2023
292f042
Send content for indexing on server at a regular interval from khoj.el
debanjum Oct 13, 2023
bea196a
Explicitly make GET request to /config/data from khoj.el:khoj-server-…
debanjum Oct 13, 2023
b669aa2
Clean and fix the content indexing code in the Emacs client
debanjum Oct 14, 2023
f64fa06
Initialize the Khoj Transient menu on first run instead of load
debanjum Oct 14, 2023
79b3f82
Make khoj.el send files to be deleted from index to server
debanjum Oct 17, 2023
6baaaaf
Test request body of multi-part form to update content index from kho…
debanjum Oct 17, 2023
f2e293a
Push Vault files to index to Khoj server using Khoj Obsidian plugin
debanjum Oct 17, 2023
8e627a5
Pass any files to be deleted to indexer API via Khoj Obsidian plugin
debanjum Oct 17, 2023
d27dc71
Use encoding of each file set in indexer request to read file
debanjum Oct 17, 2023
541cd59
Let fs_syncer pass PDF files directly as binary before indexing
debanjum Oct 17, 2023
99a2c93
Add CORS policy to allow requests from khoj apps, obsidian & localhost
debanjum Oct 17, 2023
13a3122
Stop configuring server to pull files to index from Obsidian client
debanjum Oct 17, 2023
05be6bd
Clicking Update Index in Obsidian settings should push files to index
debanjum Oct 17, 2023
e347823
Log telemetry for index updates via push to API endpoint
debanjum Oct 17, 2023
84654ff
Update indexer API endpoint URL to index/update from indexer/batch
debanjum Oct 17, 2023
5efae1a
Update indexer API endpoint query params for force, content type
debanjum Oct 17, 2023
6a4f1b2
Add more client, request details in logs by index/update API endpoint
debanjum Oct 17, 2023
7b1c62b
Mark test_get_configured_types_via_api unit test as flaky
debanjum Oct 17, 2023
Use encoding of each file set in indexer request to read file
Get encoding type from multi-part/form-request body for each file
Read text files as utf-8 and pdfs, images as binary
debanjum committed Oct 17, 2023
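
The behaviour described in this commit message can be sketched outside the Khoj codebase as follows. This is a minimal, standalone illustration and not the code in this PR: the names `split_content_type` and `read_payload` are hypothetical, and the sketch assumes the standard library's `email.message.Message` is an acceptable way to parse a Content-Type value, instead of the string splitting used in the diff.

```python
from email.message import Message
from typing import Optional, Tuple, Union


def split_content_type(content_type: str) -> Tuple[str, Optional[str]]:
    """Split a Content-Type value into (mime type, charset or None).

    Hypothetical helper for illustration; not part of Khoj.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")  # None when no charset parameter is present
    mime_type = msg.get_content_type()  # lower-cased "maintype/subtype"
    return mime_type, charset.lower() if isinstance(charset, str) else None


def read_payload(raw: bytes, content_type: str) -> Union[str, bytes]:
    """Decode payloads declared as UTF-8; pass everything else through as bytes."""
    _, charset = split_content_type(content_type)
    return raw.decode("utf-8") if charset == "utf-8" else raw
```

With this sketch, a part sent as `text/markdown; charset=utf-8` comes back as `str`, while a part sent as `application/pdf` with no charset stays as `bytes`, which matches the text versus binary split the commit message describes.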

commit d27dc71dfecf3f395a7200e7622ed6b7054543fc
2 changes: 1 addition & 1 deletion src/interface/desktop/main.js
@@ -93,9 +93,9 @@ function filenameToMimeType (filename) {
case 'png':
return 'image/png';
case 'jpg':
return 'image/jpeg';
case 'jpeg':
return 'image/jpeg';
case 'md':
case 'markdown':
return 'text/markdown';
case 'org':
6 changes: 4 additions & 2 deletions src/khoj/routers/indexer.py
@@ -73,7 +73,7 @@ async def index_batch(
     plaintext_files: Dict[str, str] = {}

     for file in files:
-        file_type = get_file_type(file.content_type)
+        file_type, encoding = get_file_type(file.content_type)
         dict_to_update = None
         if file_type == "org":
             dict_to_update = org_files
@@ -85,7 +85,9 @@ async def index_batch(
             dict_to_update = plaintext_files

         if dict_to_update is not None:
-            dict_to_update[file.filename] = file.file.read().decode("utf-8")
+            dict_to_update[file.filename] = (
+                file.file.read().decode("utf-8") if encoding == "utf-8" else file.file.read()
+            )
         else:
             logger.warning(f"Skipped indexing unsupported file type sent by client: {file.filename}")

17 changes: 9 additions & 8 deletions src/khoj/utils/helpers.py
@@ -66,24 +66,25 @@ def merge_dicts(priority_dict: dict, default_dict: dict):
     return merged_dict


-def get_file_type(file_type: str) -> str:
+def get_file_type(file_type: str) -> tuple[str, str]:
     "Get file type from file mime type"

+    encoding = file_type.split("=")[1].strip().lower() if ";" in file_type else None
     file_type = file_type.split(";")[0].strip() if ";" in file_type else file_type
     if file_type in ["text/markdown"]:
-        return "markdown"
+        return "markdown", encoding
     elif file_type in ["text/org"]:
-        return "org"
+        return "org", encoding
     elif file_type in ["application/pdf"]:
-        return "pdf"
+        return "pdf", encoding
     elif file_type in ["image/jpeg"]:
-        return "jpeg"
+        return "jpeg", encoding
     elif file_type in ["image/png"]:
-        return "png"
+        return "png", encoding
     elif file_type in ["text/plain", "text/html", "application/xml", "text/x-rst"]:
-        return "plaintext"
+        return "plaintext", encoding
     else:
-        return "other"
+        return "other", encoding


def load_model(
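
The MIME type to file type mapping in `get_file_type` could also be expressed as a lookup table. The sketch below is one possible alternative under stated assumptions, not what this commit does: `MIME_TO_FILE_TYPE` and `lookup_file_type` are hypothetical names, the table simply mirrors the branches in the diff above, and the function returns only the file type (it does not extract the encoding).

```python
# Hypothetical lookup table mirroring the if/elif branches in the diff.
MIME_TO_FILE_TYPE = {
    "text/markdown": "markdown",
    "text/org": "org",
    "application/pdf": "pdf",
    "image/jpeg": "jpeg",
    "image/png": "png",
    "text/plain": "plaintext",
    "text/html": "plaintext",
    "application/xml": "plaintext",
    "text/x-rst": "plaintext",
}


def lookup_file_type(content_type: str) -> str:
    """Map a Content-Type value to a file type, ignoring any parameters."""
    # Drop everything after the first ";" (e.g. "; charset=utf-8").
    mime_type, _, _params = content_type.partition(";")
    # Unknown MIME types fall back to "other", as in the diff's final branch.
    return MIME_TO_FILE_TYPE.get(mime_type.strip().lower(), "other")
```

A table keeps the supported types in one place, so adding a new MIME type is a one-line change. Note that this sketch lower-cases the MIME type before the lookup, which the function in the diff does not do.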