It's not parsing LaTex syntax correctly, even with plugins #785

Closed
4 tasks done
lyzy0906 opened this issue Oct 20, 2023 · 23 comments
Labels
🙋 no/question This does not need any changes 👎 phase/no Post cannot or will not be acted on

Comments

@lyzy0906

lyzy0906 commented Oct 20, 2023

Initial checklist

Affected packages and versions

9.0.0

Link to runnable example

No response

Steps to reproduce

#784

Actually, I'm already using these plugins: remarkGfm, remarkMath, rehypeKatex!
But I found that the plugins CANNOT identify LaTeX syntax in this format: \[ ... \]
They only recognize LaTeX in the $$ ... $$ format.

So I am trying to parse the \[ ... \] syntax myself, and found that the text is modified by the component: the backslashes are gone after going through the component!
Please look at my screenshot. The backslashes in the middle are kept. Only the ones at the start and end are gone. Why is this happening?

image

\[ \int_{0}^{\infty} e^{-x^2} dx = \frac{\sqrt{\pi}}{2} \]

Expected behavior

The backslashes should be kept.

Actual behavior

The backslashes are gone.

Runtime

No response

Package manager

No response

OS

No response

Build and bundle tools

No response

@github-actions github-actions bot added 👋 phase/new Post is being triaged automatically 🤞 phase/open Post is being triaged manually and removed 👋 phase/new Post is being triaged automatically labels Oct 20, 2023
@wooorm
Member

wooorm commented Oct 20, 2023

You don’t have to open new issues. Closed ones can still be commented on.
Also, questions go to discussions, these aren’t issues. See the support docs.

That’s not how math works with remark-math. See the examples in the docs. The one I mentioned earlier and https://github.com/remarkjs/remark-math#examples.

@wooorm wooorm closed this as completed Oct 20, 2023
@wooorm wooorm added the 🙋 no/question This does not need any changes label Oct 20, 2023

@github-actions github-actions bot added 👎 phase/no Post cannot or will not be acted on and removed 🤞 phase/open Post is being triaged manually labels Oct 20, 2023
@lyzy0906
Author

lyzy0906 commented Oct 20, 2023

> You don’t have to open new issues. Closed ones can still be commented on. Also, questions go to discussions, these aren’t issues. See the support docs.
>
> That’s not how math works with remark-math. See the examples in the docs. The one I mentioned earlier and https://github.com/remarkjs/remark-math#examples.

Sorry, I thought I could not comment after the issue was closed.
And I think this is a bug, not a question...

So, setting formulas aside: any regular string that starts with "\[" is rendered with only "[". Here my input is '\[ something ]', and the output is:
image

What I mean is, the leading backslash is dropped by the Markdown component.

@wooorm
Member

wooorm commented Oct 20, 2023

You probably need to provide more info. Please read the support guide. Don’t post a screenshot. Post code. Post what versions you are using.

You use \[. That is not supported. See the examples. Use dollars. Read the syntax section.

@lyzy0906
Author

lyzy0906 commented Oct 20, 2023

> You probably need to provide more info. Please read the support guide. Don’t post a screenshot. Post code. Post what versions you are using.
>
> You use \[. That is not supported. See the examples. Use dollars. Read the syntax section.

OK. I am using v9.0.0. My code is something like:
<ReactMarkdown remarkPlugins={[remarkMath, texPlugin]} rehypePlugins={[rehypeKatex]}> {'\\[ something ]'} </ReactMarkdown>

And I expect it to show \[ something ], but it is showing [ something ].

@lyzy0906
Author

lyzy0906 commented Oct 20, 2023

BTW, I cannot use dollars, because the LaTeX input is generated by ChatGPT automatically, in \[ ... \] format. So I am trying to parse it myself. During this process, I found that the backslashes are gone in the remark tree:

import { visit } from 'unist-util-visit';

function customPlugin() {
  return (tree) => {
    visit(tree, (node, index) => {
      console.log(node, index);
      // Mark paragraphs whose only child is text starting with "\[" as <tex>
      if (
        node.type === 'paragraph' &&
        node.children &&
        node.children.length === 1 &&
        node.children[0].type === 'text' &&
        node.children[0].value.startsWith('\\[')
      ) {
        const data = node.data || (node.data = {});
        data.hName = 'tex';
        data.hProperties = {
          value: node.children[0].value,
        };
      }
    });
  };
}

The condition node.children[0].value.startsWith('\\[') is not working.

@wooorm
Member

wooorm commented Oct 20, 2023

Then ask ChatGPT to solve it.

Dollars are what is used here, not \\[ and such.

Escapes working as escapes is how markdown works. If you put &copy; in HTML, you see ©, not those literal characters. It’s the same here.
And it’s the same as you do with JS: \\ turns into \.

@lyzy0906
Author

lyzy0906 commented Oct 20, 2023

> Then ask ChatGPT to solve it.
>
> Dollars are what is used here, not \\[ and such.
>
> Escapes working as escapes is how markdown works. If you put &copy; in HTML, you see ©, not those literal characters. It’s the same here. And it’s the same as you do with JS: \\ turns into \.

?
But the component only omits the first backslash. Let's say my input is:
<ReactMarkdown remarkPlugins={[remarkMath, texPlugin]} rehypePlugins={[rehypeKatex]}> {'\\[ \\something ]'} </ReactMarkdown>

Now the output is [ \something ]. The middle backslash is there. Only the first backslash is gone...

@wooorm
Member

wooorm commented Oct 20, 2023

right, that’s how markdown works. Escapes work on punctuation. [ is punctuation, so \[ turns into [. s is not punctuation so \s remains as \s. See https://spec.commonmark.org/0.30/#backslash-escapes
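
The escape rule described above can be modeled in a few lines. This is a simplified sketch, not the parser that react-markdown actually uses: `applyBackslashEscapes` and `ASCII_PUNCTUATION` are names made up for this illustration, and the model ignores contexts such as code spans, where CommonMark does not apply backslash escapes.

```typescript
// Simplified model of CommonMark backslash escapes (illustration only).
// The characters CommonMark treats as ASCII punctuation:
const ASCII_PUNCTUATION = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";

function applyBackslashEscapes(markdown: string): string {
  let out = "";
  for (let i = 0; i < markdown.length; i++) {
    const ch = markdown.charAt(i);
    const next = markdown.charAt(i + 1); // "" at the end of the input
    if (ch === "\\" && next !== "" && ASCII_PUNCTUATION.indexOf(next) !== -1) {
      out += next; // drop the backslash, keep the punctuation character
      i++; // skip the escaped character
    } else {
      out += ch; // a backslash before a non-punctuation character stays literal
    }
  }
  return out;
}

// JS string literals add their own escape layer: "\\[" is the two characters \[
console.log(applyBackslashEscapes("\\[ \\something ]")); // [ \something ]
// To keep a literal backslash, escape the backslash itself (\\[ in the Markdown source)
console.log(applyBackslashEscapes("\\\\[ something ]")); // \[ something ]
```

Under this model, a plugin that inspects text nodes never sees the leading `\[`, which matches the behavior reported earlier in the thread.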

@lyzy0906
Author

> right, that’s how markdown works. Escapes work on punctuation. [ is punctuation, so \[ turns into [. s is not punctuation so \s remains as \s. See https://spec.commonmark.org/0.30/#backslash-escapes

I see. Thank you for your detailed explanation. I was not aware that [ is punctuation.

@wooorm
Member

wooorm commented Oct 20, 2023

No problem! Good luck! :)

@mandeep511

> No problem! Good luck! :)

So how do you recommend we solve it? In ChatGPT's web UI they are somehow parsing \[ ... \] properly, so there must be a simple way to take care of all the corner cases when rendering LaTeX in React.

@wooorm
Member

wooorm commented Nov 9, 2023

Regexp? Build your own plugins?

That they parse these things does not mean that it is simple.

@prashantbhudwal

For processing LaTeX from ChatGPT or OpenAI:

export const preprocessLaTeX = (content: string) => {
  // Replace block-level LaTeX delimiters \[ \] with $$ $$
  const blockProcessedContent = content.replace(
    /\\\[(.*?)\\\]/gs,
    (_, equation) => `$$${equation}$$`,
  );
  // Replace inline LaTeX delimiters \( \) with $ $
  const inlineProcessedContent = blockProcessedContent.replace(
    /\\\((.*?)\\\)/gs,
    (_, equation) => `$${equation}$`,
  );
  return inlineProcessedContent;
};
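
One detail in the function above is worth spelling out: it passes a function to `replace` rather than a replacement string. In a replacement string, `$$` is itself an escape for a single `$`, so building `$$ ... $$` delimiters that way silently loses dollar signs. The sketch below illustrates the difference; the sample input and variable names are made up for this example.

```typescript
// Sample input (hypothetical): the characters are  Area: \[ \pi r^2 \]
const source = "Area: \\[ \\pi r^2 \\]";
const blockPattern = /\\\[([\s\S]*?)\\\]/g;

// Replacement STRING: "$$" means one literal "$", and "$1" is the first capture,
// so "$$$1$$" produces single-dollar delimiters.
const viaString = source.replace(blockPattern, "$$$1$$");

// Replacement FUNCTION: the returned string is inserted literally.
const viaFunction = source.replace(
  blockPattern,
  (_match: string, equation: string) => "$$" + equation + "$$",
);

console.log(viaString); // Area: $ \pi r^2 $
console.log(viaFunction); // Area: $$ \pi r^2 $$
```

With a replacement string, the block form would have to be written as `"$$$$$1$$$$"`, which is easy to get wrong, so the function form is the safer choice here.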

@shubh675

@prashantbhudwal that's a good idea

@pavloko

pavloko commented Jun 5, 2024

@mandeep511 @shubh675 @prashantbhudwal I'm investigating the same issue and wondering how chatgpt.com does it, because they seem to be using react-markdown and the supplied children still have the \[ delimiter.

I'm sorry this discussion is obviously not related to react-markdown, but I guess many people will come here looking for this answer.

Screenshot 2024-06-05 at 18 04 29

@prashantbhudwal

@pavloko this is just additional preprocessing to handle LaTeX edge cases. Everything else is done with react-markdown and its plugins.

You can give examples of how to format LaTeX, and most of the time the KaTeX plugin will render the result. If it doesn't, this might help.

@pavloko

pavloko commented Jun 5, 2024

@prashantbhudwal the formula you supplied has worked - big thank you!

I'm just wondering how it works on their website since from the screenshot, the supplied markdown still contains \[ as delimiters.

@prashantbhudwal

@pavloko They are using this property for the math plugin - [remarkMath, { singleDollarTextMath: false }],

That could change stuff. Please feel free to try.

Screenshot 2024-06-05 at 9 21 33 PM

@danny-avila

Just want to add to the discussion.
[remarkMath, { singleDollarTextMath: false }],

It makes sense for them to do this because enabling it (default behavior, I believe) will cause issues when users don't expect LaTeX at all, i.e.:

I have $50 in my wallet and $100 in the bank.

I'm not sure what their pre-processing involves. Since the setting drops native inline rendering with single dollars, it may detect single-line LaTeX more robustly and perhaps convert it to multi-line formatting, or the setting may be switched on by specific LaTeX identifiers. Who knows.

Before @prashantbhudwal suggested his function, I'd implemented my own pre-processing function. Revisiting this problem, I was inspired to tackle some edge cases that LibreChat users have experienced with LaTeX rendering, taking some lessons from this thread. I'm here to share the result back, since that solution and others, including my previous implementation, exhibit those edge cases.

We can set singleDollarTextMath to true but we need to escape several uses of $ as suggested, otherwise, users will see weird rendering for the simple "wallet" statement above, to name one of the edge cases.
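
To make the "wallet" case concrete: with single-dollar math enabled, the text between the two dollar signs in that sentence would be treated as inline math. The sketch below isolates the currency-escaping regex from Step 3 of the function that follows; `escapeCurrency` is a name made up for this illustration, and it runs the regex without the protection steps around it.

```typescript
// Escape "$" only when it is directly followed by a digit (likely currency).
const escapeCurrency = (text: string): string => text.replace(/\$(?=\d)/g, "\\$");

console.log(escapeCurrency("I have $50 in my wallet and $100 in the bank."));
// I have \$50 in my wallet and \$100 in the bank.

console.log(escapeCurrency("Inline math $x^2$ is untouched, but $2x$ is not."));
// Inline math $x^2$ is untouched, but \$2x$ is not.
```

The second call shows a limitation of the heuristic: single-dollar math that starts with a digit looks the same as a price. The function below protects `$$ ... $$`, `\[ ... \]` and `\( ... \)` before this step, but not `$ ... $`, so that case appears to remain an edge case.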

I was inspired by the implementation here: lobehub/lobe-ui#168 but it was not complete.

Here's what I came up with after several hours of trial and error:

/**
 * Preprocesses LaTeX content by replacing delimiters and escaping certain characters.
 *
 * @param content The input string containing LaTeX expressions.
 * @returns The processed string with replaced delimiters and escaped characters.
 */
export function preprocessLaTeX(content: string): string {
  // Step 1: Protect code blocks
  const codeBlocks: string[] = [];
  content = content.replace(/(```[\s\S]*?```|`[^`\n]+`)/g, (match, code) => {
    codeBlocks.push(code);
    return `<<CODE_BLOCK_${codeBlocks.length - 1}>>`;
  });

  // Step 2: Protect existing LaTeX expressions
  const latexExpressions: string[] = [];
  content = content.replace(/(\$\$[\s\S]*?\$\$|\\\[[\s\S]*?\\\]|\\\(.*?\\\))/g, (match) => {
    latexExpressions.push(match);
    return `<<LATEX_${latexExpressions.length - 1}>>`;
  });

  // Step 3: Escape dollar signs that are likely currency indicators
  content = content.replace(/\$(?=\d)/g, '\\$');

  // Step 4: Restore LaTeX expressions
  content = content.replace(/<<LATEX_(\d+)>>/g, (_, index) => latexExpressions[parseInt(index)]);

  // Step 5: Restore code blocks
  content = content.replace(/<<CODE_BLOCK_(\d+)>>/g, (_, index) => codeBlocks[parseInt(index)]);

  // Step 6: Apply additional escaping functions
  content = escapeBrackets(content);
  content = escapeMhchem(content);

  return content;
}
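
The function above calls `escapeBrackets` and `escapeMhchem`, which are not shown in this thread. Going only by the test name "converts LaTeX delimiters", a helper of the first kind might look like the sketch below. This is a guess, not the author's code: `convertBracketDelimiters` is a hypothetical name, and the regex is an assumption. No sketch is given for `escapeMhchem`, since the thread does not describe what it escapes.

```typescript
// HYPOTHETICAL stand-in for a helper like escapeBrackets (not the author's code).
// Converts \[ ... \] to $$ ... $$ and \( ... \) to $ ... $, leaving code untouched.
function convertBracketDelimiters(text: string): string {
  // Alternative 1: fenced or inline code (returned as-is)
  // Alternative 2: \[ ... \]   Alternative 3: \( ... \)
  const pattern = /(```[\s\S]*?```|`[^`\n]+`)|\\\[([\s\S]*?)\\\]|\\\(([\s\S]*?)\\\)/g;
  return text.replace(
    pattern,
    (match: string, code?: string, block?: string, inline?: string): string => {
      if (code !== undefined) return code; // never rewrite inside code
      if (block !== undefined) return "$$" + block + "$$";
      if (inline !== undefined) return "$" + inline + "$";
      return match;
    },
  );
}

console.log(convertBracketDelimiters("a \\[x\\] b \\(y\\) `\\[z\\]`"));
// a $$x$$ b $y$ `\[z\]`
```

Matching code in the same pattern as the delimiters means a `\[` inside backticks is consumed by the first alternative and returned unchanged, so no separate placeholder pass is needed for this step.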

I also wrote some tests for this:

  preprocessLaTeX
    ✓ returns the same string if no LaTeX patterns are found (1 ms)
    ✓ escapes dollar signs followed by digits
    ✓ does not escape dollar signs not followed by digits
    ✓ preserves existing LaTeX expressions (1 ms)
    ✓ handles mixed LaTeX and currency
    ✓ converts LaTeX delimiters
    ✓ escapes mhchem commands
    ✓ handles complex mixed content
    ✓ handles empty string
    ✓ preserves code blocks
    ✓ handles multiple currency values in a sentence
    ✓ preserves LaTeX expressions with numbers
    ✓ handles currency values with commas
    ✓ preserves LaTeX expressions with special characters

I'm sure it's not perfect, nor the most optimal, but it handles both users who don't expect LaTeX at all and those expecting LaTeX to render correctly with most AI model providers (OpenAI, Anthropic, Llama3.1), whose LaTeX formatting largely depends on their training.

I'm happy to share and collect feedback, and be corrected on this approach, maybe the way OpenAI does with ChatGPT is the better route, but I chose this one as it was more apparent to me.

Example rendering:
image

@pavloko

pavloko commented Aug 23, 2024

@danny-avila This is great. Thank you very much! I'll try to use the shared solution and get back to you.

@supuwoerc

@danny-avila Your persistence in digging into this helps everyone, thank you very much. I had been researching for two hours and you saved me.

@joseph-mccombs

@danny-avila this algo is amazing. i was having similar issues with some of the other algorithms, but yours took care of every single issue. 🐐
