Added Python MarkItDown docs #1977
Walkthrough
A new guide focused on converting documents to Markdown using Python and the Microsoft MarkItDown library has been added to the documentation. The navigation structure is updated to include this guide under the Python guides section. The introduction page for guides now references the new document conversion guide, and a comprehensive instructional document is introduced, detailing setup, implementation, and usage of the conversion process within a Trigger.dev project. Additionally, some sidebar titles in existing Python guides were updated for clarity.
Sequence Diagram(s)
sequenceDiagram
participant User
participant TriggerDev
participant PythonScript
participant MarkItDown
User->>TriggerDev: Initiate document conversion task (provide URL)
TriggerDev->>TriggerDev: Download document to temp file
TriggerDev->>PythonScript: Run Python script with temp file path
PythonScript->>MarkItDown: Convert document to markdown
MarkItDown-->>PythonScript: Return markdown or error
PythonScript->>TriggerDev: Return conversion result (success/error, markdown)
TriggerDev->>TriggerDev: Clean up temp files
TriggerDev->>User: Return conversion result (markdown, error, success)
Actionable comments posted: 1
🧹 Nitpick comments (3)
docs/guides/python/python-doc-to-markdown.mdx (3)
1-5: Validate frontmatter metadata
Frontmatter fields (`title`, `sidebarTitle`, `description`) follow the established pattern. Consider verifying that the capitalization of “Markdown” here aligns with other guide titles or branding guidelines.
41-62: Clarify OS-specific instructions in build snippet
The `devPythonBinaryPath` setting (`venv/bin/python`) works on Unix-like systems but not on Windows. You may want to add a note or show a Windows equivalent (e.g., `venv\Scripts\python.exe`) for Windows users.
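One way to act on this note is to choose the interpreter path by platform. The following is a minimal sketch under stated assumptions: `devPythonBinaryPathFor` is a hypothetical helper that does not appear in the guide, and the `venv` directory name is taken from the snippet under review.

```typescript
// Hypothetical helper: choose the virtualenv interpreter path for a platform.
// Platform strings follow Node.js `process.platform` values ("win32", "linux", "darwin").
function devPythonBinaryPathFor(platform: string, venvDir: string = "venv"): string {
  return platform === "win32"
    ? `${venvDir}\\Scripts\\python.exe` // Windows virtualenv layout
    : `${venvDir}/bin/python`; // Unix-like virtualenv layout
}

console.log(devPythonBinaryPathFor("linux"));
console.log(devPythonBinaryPathFor("win32"));
```

In `trigger.config.ts` the result could then be passed as `devPythonBinaryPath: devPythonBinaryPathFor(process.platform)`, assuming the option accepts any string path.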
123-129: Improve temporary file cleanup
The `fs.unlink` call is non-blocking and swallows errors. To ensure no leftover files and better error visibility, consider awaiting the unlink or adding an error callback, for example: `await fs.promises.unlink(tempFilePath);`
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- docs/docs.json (1 hunks)
- docs/guides/introduction.mdx (1 hunks)
- docs/guides/python/python-doc-to-markdown.mdx (1 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/guides/python/python-doc-to-markdown.mdx
[uncategorized] ~15-~15: A punctuation mark might be missing here.
Context: ... A project with Trigger.dev initialized - Python install...
(AI_EN_LECTOR_MISSING_PUNCTUATION)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (5)
docs/docs.json (1)
314-316: Add new Python document-to-markdown guide to navigation
The entry `"guides/python/python-doc-to-markdown"` was correctly inserted under the “Python guides” group. Please verify that the `.mdx` file at `docs/guides/python/python-doc-to-markdown.mdx` exists and matches this path.
docs/guides/introduction.mdx (1)
31-33: Include new guide in the introduction table
The new table row linking to `/guides/python/python-doc-to-markdown` accurately reflects the added guide. Table alignment and link text look consistent with surrounding entries.
docs/guides/python/python-doc-to-markdown.mdx (3)
7-7: Verify snippet import path
Make sure the `/snippets/python-learn-more.mdx` component is present and renders correctly. Otherwise, the `<PythonLearnMore />` include at the end of the file will fail.
231-234: Handle script exit codes consistently
When `markdown-converter.py` exits with a non-zero status on missing args, it prints JSON then calls `sys.exit(1)`. Verify that `python.runScript` doesn’t throw on non-zero codes (which would bypass your stdout/stderr logic), or adjust the Python script to return a 0 exit code with an error object.
209-211: Confirm MarkItDown return API
The code uses `result = md.convert(file_path)` and returns `result.text_content`. Please verify that the MarkItDown API exposes a `text_content` property. If the returned object uses a different field (e.g., `markdown`), update accordingly.
Actionable comments posted: 1
🧹 Nitpick comments (5)
docs/guides/python/python-doc-to-markdown.mdx (5)
9-13: Consistent casing for "Markdown"
Consider capitalizing "Markdown" (e.g., "Convert documents to Markdown") to match the conventional naming of the format across the title and content.
13-17: Add missing punctuation for consistency
The first bullet under Prerequisites lacks a trailing period, while the second bullet includes one. For uniformity, either end all bullets with a period or omit them consistently.
🧰 Tools
🪛 LanguageTool
[uncategorized] ~15-~15: A punctuation mark might be missing here.
Context: ... A project with Trigger.dev initialized - Python install...(AI_EN_LECTOR_MISSING_PUNCTUATION)
87-92: Improve temp file naming randomness
Using `Date.now()` and `Math.random()` reduces collisions, but consider using a UUID (e.g., `crypto.randomUUID()`) for stronger uniqueness guarantees.
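The UUID suggestion can be sketched as follows. This is an illustration only: `makeTempFilePath` and the `document-` prefix are hypothetical and are not taken from the guide, while `randomUUID`, `tmpdir`, and `join` are standard Node.js functions.

```typescript
import { randomUUID } from "node:crypto";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: build a collision-resistant temp file path.
// The "document-" prefix is illustrative only.
function makeTempFilePath(prefix: string = "document-"): string {
  return join(tmpdir(), `${prefix}${randomUUID()}`);
}

console.log(makeTempFilePath());
```

Because the name no longer depends on the clock or on `Math.random()`, two runs started in the same millisecond still get distinct paths.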
113-114: Handle cleanup errors or use promises
The `fs.unlink` callback is empty, silently ignoring errors. Consider using `fs.promises.unlink` or handling the callback’s error parameter for improved reliability.
116-125: Robust parsing of Python output
`JSON.parse(pythonResult.stdout)` assumes valid JSON. Wrap parsing in a `try/catch`, and handle `pythonResult.stderr` to surface conversion errors instead of falling back to the generic "No output" case.
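A defensive parser along these lines might look like the sketch below. The `ScriptResult` shape is an assumption about what `python.runScript` returns, and `parseConversionOutput` is a hypothetical name; the `success` / `markdown` / `error` fields mirror the result described in the walkthrough.

```typescript
// Assumed result shape; the actual return type of `python.runScript` may differ.
interface ScriptResult {
  stdout: string;
  stderr: string;
}

interface ConversionResult {
  success: boolean;
  markdown?: string;
  error?: string;
}

// Hypothetical helper: never throws, always returns a structured result.
function parseConversionOutput(result: ScriptResult): ConversionResult {
  if (!result.stdout.trim()) {
    // Prefer stderr over a generic message when stdout is empty.
    return { success: false, error: result.stderr.trim() || "No output from Python script" };
  }
  try {
    return JSON.parse(result.stdout) as ConversionResult;
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { success: false, error: `Invalid JSON from Python script: ${reason}` };
  }
}
```

Returning an error object instead of throwing keeps the task's return type uniform, so callers only need to check `success`.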
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- docs/guides/python/python-doc-to-markdown.mdx (1 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/guides/python/python-doc-to-markdown.mdx
[uncategorized] ~15-~15: A punctuation mark might be missing here.
Context: ... A project with Trigger.dev initialized - Python install...
(AI_EN_LECTOR_MISSING_PUNCTUATION)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (16)
docs/guides/python/python-doc-to-markdown.mdx (16)
1-5: Frontmatter metadata is well-defined
The `title`, `sidebarTitle`, and `description` fields are clear and descriptive.
7-7: Confirm snippet import path
Ensure that the `/snippets/python-learn-more.mdx` file exists and correctly exports the `PythonLearnMore` component for use in this guide.
18-23: Features section is clear and concise
The bullet points accurately summarize the guide's core features and are easy to scan.
26-33: Verify the example repository link
Please confirm that the GitHub URL (https://github.com/triggerdotdev/examples/tree/main/python-doc-to-markdown-converter) points to the intended example project and is kept up to date.
41-62: Validate code block metadata
The fence syntax (```ts trigger.config.ts) may not be recognized by all MDX parsers. Confirm that the code block metadata renders the filename correctly in the documentation site.
64-67: Info component usage is appropriate
The `<Info>` callout directs users to additional docs and is well-placed after the build snippet.
73-81: Imports are relevant and utilized
All imported modules (`fs`, `path`, `os`, `https`, `http`) are used in the task implementation.
108-112: Confirm script path resolution
Ensure that `python.runScript("./src/python/markdown-converter.py", …)` resolves correctly at runtime within the build environment.
127-132: Fallback return is clear
The default return structure for missing `stdout` is straightforward and informative for debugging.
141-143: Python dependency specification is correct
Listing `markitdown[all]` in `requirements.txt` aligns with the guide’s instructions for installing all optional extras.
155-168: Conversion function is well-structured
The `convert_to_markdown` function correctly checks for file existence, initializes the library, and raises exceptions on failure.
171-183: Trigger task processing logic is solid
`process_trigger_task` cleanly differentiates between success and error states and returns a consistent JSON structure.
185-203: Main execution block handles arguments
The `__main__` section validates input, prints structured JSON on both success and failure, and sets the exit code appropriately.
208-215: Testing instructions are comprehensive
The step-by-step guide covers environment setup, dependency installation, configuration, local testing, and deployment.
218-229: Verify nested list rendering
The nested list under “Convert various file formats to Markdown” uses two-space indentation. Confirm that this renders correctly on the site as a sublist.
230-230: Final snippet call is correct
The `<PythonLearnMore />` component matches the import and is placed appropriately at the end of the guide.
await new Promise<void>((resolve, reject) => {
  const protocol = url.startsWith("https") ? https : http;
  const file = fs.createWriteStream(tempFilePath);

  protocol.get(url, (response) => {
    response.pipe(file);
    file.on("finish", () => {
      file.close();
      resolve();
    });
  });
});
@D-K-P I don't think it's needed to save the file like this. It can be useful to do streams when dealing with large files, but for a doc I think it'd be better to download it using `fetch` and then save the response to the file. `http` is a super old school Node.js thing that I wouldn't recommend anymore.
Fair point, have just updated.
Let's change the way the file is downloaded to just use `fetch`. Other than that this is really cool.
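The requested change could be sketched roughly as below. This is not the code that was merged: `downloadDocument` and `saveDocument` are hypothetical names, the injectable `fetchImpl` parameter exists only so the function can be exercised without network access, and the `response.ok` check is an extra safeguard not requested in this thread. The sketch assumes a Node.js version with a global `fetch` (18 or later).

```typescript
import { writeFile } from "node:fs/promises";

// Download a document fully into memory. `fetchImpl` defaults to the global
// fetch and can be replaced with a stub in tests.
async function downloadDocument(
  url: string,
  fetchImpl: typeof fetch = fetch
): Promise<Uint8Array> {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`Failed to download document: ${response.status} ${response.statusText}`);
  }
  return new Uint8Array(await response.arrayBuffer());
}

// Write the downloaded bytes to the temp file in one call instead of piping a stream.
async function saveDocument(url: string, tempFilePath: string): Promise<void> {
  await writeFile(tempFilePath, await downloadDocument(url));
}
```

Buffering the whole response is simpler than streaming and is reasonable for typical documents; for very large files a streamed write would use less memory.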
Actionable comments posted: 1
🧹 Nitpick comments (1)
docs/guides/python/python-doc-to-markdown.mdx (1)
107-109: Optional: Improve temp-file cleanup
Using `fs.unlink` with an empty callback may hide cleanup errors. Consider using the promise-based API for better consistency:
- fs.unlink(tempFilePath, () => {});
+ try {
+   await fs.promises.unlink(tempFilePath);
+ } catch {
+   // ignore any errors during cleanup
+ }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- docs/guides/python/python-crawl4ai.mdx (1 hunks)
- docs/guides/python/python-doc-to-markdown.mdx (1 hunks)
- docs/guides/python/python-image-processing.mdx (1 hunks)
- docs/guides/python/python-pdf-form-extractor.mdx (1 hunks)
✅ Files skipped from review due to trivial changes (3)
- docs/guides/python/python-image-processing.mdx
- docs/guides/python/python-pdf-form-extractor.mdx
- docs/guides/python/python-crawl4ai.mdx
🧰 Additional context used
🪛 LanguageTool
docs/guides/python/python-doc-to-markdown.mdx
[uncategorized] ~20-~20: A punctuation mark might be missing here.
Context: ... A project with Trigger.dev initialized - Python install...
(AI_EN_LECTOR_MISSING_PUNCTUATION)
🪛 GitHub Actions: 📚 Docs Checks
docs/guides/python/python-doc-to-markdown.mdx
[error] 1-1: Broken link found: /docs/upgrade-to-v4. Run link checker to fix broken links.
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (14)
docs/guides/python/python-doc-to-markdown.mdx (14)
1-5: Frontmatter metadata looks good
The title, sidebarTitle, and description are clear and follow existing conventions.
🧰 Tools
🪛 GitHub Actions: 📚 Docs Checks
[error] 1-1: Broken link found: /docs/upgrade-to-v4. Run link checker to fix broken links.
7-7: Verify snippet import path
Please confirm that the file `/snippets/python-learn-more.mdx` exists at this location and resolves correctly in the built site.
14-17: Overview section is clear
The overview succinctly explains the purpose and value of using MarkItDown with Trigger.dev.
18-23: Verify internal quick-start link
Ensure that the `/quick-start` link correctly points to your project initialization guide. If your quick-start page lives under `/docs/quick-start`, please adjust accordingly.
🧰 Tools
🪛 LanguageTool
[uncategorized] ~20-~20: A punctuation mark might be missing here.
Context: ... A project with Trigger.dev initialized - Python install...
(AI_EN_LECTOR_MISSING_PUNCTUATION)
24-28: Features list is well-structured
The bullet list clearly outlines what this guide provides.
29-38: GitHub repo Card is correct
The `Card` component usage and link to the examples repository look accurate.
46-67: Build configuration snippet is accurate
The code block for `trigger.config.ts` includes the right imports and options for the Python extension.
69-72: Info callout is helpful
The information about the Python build extension link is clear and correctly formatted.
78-129: Task implementation snippet is clear
The TypeScript example for `convertToMarkdown` follows our SDK patterns and provides a full end-to-end demonstration.
135-137: Requirements snippet is correct
The `requirements.txt` code block and dependency declaration are straightforward.
143-198: Python script example is solid
The script covers error handling, JSON I/O, and demonstrates MarkItDown usage clearly.
200-209: Testing instructions are complete
The step-by-step guide to creating a venv and running the task locally is clear and actionable.
212-223: Conversion capabilities list is comprehensive
The bulleted capabilities cover all supported formats and features.
224-224: Embedded learn-more snippet
The `<PythonLearnMore />` component at the end reinforces further reading.
Actionable comments posted: 2
♻️ Duplicate comments (1)
docs/guides/python/python-doc-to-markdown.mdx (1)
97-101: 🛠️ Refactor suggestion
Add HTTP status check when downloading the document.
The code writes whatever comes back from `fetch` directly to disk, even on 404/500 responses. To catch HTTP errors early, you can do:
- const response = await fetch(url);
+ const response = await fetch(url);
+ if (!response.ok) {
+   throw new Error(`Failed to download document: ${response.status} ${response.statusText}`);
+ }
This ensures you don’t pass an HTML error page into your converter.
🧹 Nitpick comments (2)
docs/guides/python/python-doc-to-markdown.mdx (2)
20-21: Consistency: add punctuation to bullet list.
For uniformity, consider adding a trailing period to the first prerequisite (`- A project with [Trigger.dev initialized](/quick-start).`) so it matches the style of the subsequent bullet.
🧰 Tools
🪛 LanguageTool
[uncategorized] ~20-~20: A punctuation mark might be missing here.
Context: ... A project with Trigger.dev initialized - Python install...
(AI_EN_LECTOR_MISSING_PUNCTUATION)
103-109: Improve error handling for the Python script and file cleanup.
Currently, failures from `python.runScript` aren’t surfaced, and `fs.unlink` swallows errors. Consider wrapping the call in `try/catch/finally`, checking `pythonResult.exitCode` or `stderr`, and using `fs.promises.unlink`:
- const pythonResult = await python.runScript("./src/python/markdown-converter.py", [
-   JSON.stringify({ file_path: tempFilePath }),
- ]);
- fs.unlink(tempFilePath, () => {});
+ let pythonResult;
+ try {
+   pythonResult = await python.runScript(
+     "./src/python/markdown-converter.py",
+     [JSON.stringify({ file_path: tempFilePath })]
+   );
+   if (pythonResult.exitCode !== 0) {
+     throw new Error(
+       `Python script failed (${pythonResult.exitCode}): ${pythonResult.stderr}`
+     );
+   }
+ } finally {
+   try {
+     await fs.promises.unlink(tempFilePath);
+   } catch (cleanupErr) {
+     console.warn("Cleanup error:", cleanupErr);
+   }
+ }
This pattern ensures robust error propagation and reliable cleanup.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- docs/docs.json (1 hunks)
- docs/guides/python/python-doc-to-markdown.mdx (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- docs/docs.json
🧰 Additional context used
🪛 LanguageTool
docs/guides/python/python-doc-to-markdown.mdx
[uncategorized] ~20-~20: A punctuation mark might be missing here.
Context: ... A project with Trigger.dev initialized - Python install...
(AI_EN_LECTOR_MISSING_PUNCTUATION)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (2)
docs/guides/python/python-doc-to-markdown.mdx (2)
7-8: Verify snippet import path for PythonLearnMore.
The MDX import uses an absolute path (`"/snippets/python-learn-more.mdx"`). Please confirm that this file exists at that location or change to a relative import (e.g., `../snippets/python-learn-more.mdx`) to avoid broken imports at build time.
143-198: Python script implementation is clear and robust.
The `markdown-converter.py` script handles missing files, wraps exceptions in JSON, and matches the TypeScript consumer’s expectations. Well structured, commented, and easy to follow.
<Note>
  This project uses Trigger.dev v4 (which is currently in beta as of 28 April 2025). If you want to
  run this project you will need to [upgrade to v4](/upgrade-to-v4).
</Note>
Broken “upgrade to v4” link in Note.
The callout points at `/upgrade-to-v4`, but our v4 upgrade guide lives under `/docs/guides/upgrade-to-v4` (or similar). Update the link to the correct path to prevent a 404 in the docs.
```ts trigger.config.ts
import { pythonExtension } from "@trigger.dev/python/extension";
import { defineConfig } from "@trigger.dev/sdk/v3";

export default defineConfig({
  runtime: "node",
  project: "<your-project-ref>",
  // Your other config settings...
  build: {
    extensions: [
      pythonExtension({
        // The path to your requirements.txt file
        requirementsFile: "./requirements.txt",
        // The path to your Python binary
        devPythonBinaryPath: `venv/bin/python`,
        // The paths to your Python scripts to run
        scripts: ["src/python/**/*.py"],
      }),
    ],
  },
});
```
Inconsistent SDK version vs. Note v4 reference.
The guide introduction mentions Trigger.dev v4 (beta), but this snippet imports from `@trigger.dev/sdk/v3`. Please either:
- Update the snippet to use the v4 SDK package (e.g., `@trigger.dev/sdk/v4`),
- Or adjust the Note to indicate that this example targets v3.
Aligning versions avoids confusion.