-
Notifications
You must be signed in to change notification settings - Fork 761
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
change youtube/index.mjs to not confuse chatgpt or run out of tokens #205
Conversation
function replaceHtmlEntities(htmlString) { // This function was written by ChatGPT and modified by me (iamsirsammy) | ||
const doc = new DOMParser().parseFromString(htmlString.replace("&", "&"), 'text/html'); | ||
return doc.documentElement.innerText; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The string here has been processed and does not contain HTML tag content. I don't think there is any difference here compared to directly using htmlString.replaceAll("&", "&")
. Have I missed something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the tone I've been talking to customer support all day
This code is designed to remove all HTML Entities from code, that's what the DOMParser is for. For example, generally, ' is the html entity for an apostrophe ('). However, your code or YouTube subtitles or whatever has ' for an apostrophe (at least for some videos), meaning the & has to be replaced with &, then domparser has to be used to deal with the actual apostrophe. This should also apply for <> maybe ,/ and probably other characters, why domparser is used instead of many find and replaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(We could also probably remove amp replacement and just run replace HtmlEntities twice but that feels silly)
@@ -44,9 +49,11 @@ export default { | |||
subtitleContent += subtitleData.substring(0, subtitleData.indexOf('<')) + ',' | |||
} | |||
|
|||
subtitleContent = replaceHtmlEntities(subtitleContent.replace(",", " ")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have to replace commas here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi subtitles are usually, given to the AI like, this, with a random comma every few words. This is fine I guess but I think it uses more tokens and it might confuse the AI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for probably misunderstanding, I moved the comma replacement to just adding spaces instead of commas on line 49
I understand now, thank you for explaining and for all the contributions you have made to all users |
I made some slight adjustments 34cad26
The following is a comparison of the final text content before and after the PR (including modifications I made on your basis). I think this result should meet our expectations |
Thanks for catching replace instead of replaceAll. I'm not quite sure how I missed that. I'm not sure why the commas are necessary, but it's fine. |
Adding commas is because YouTube's captions don't necessarily include commas. For some particularly long videos, if all the captions are concatenated without any separation, it will not be conducive to cutting to control the token length. Therefore, I proactively added commas for each segment and then cut them based on the segments to ensure that each remaining segment is a complete sentence, which can achieve a better summarizing effect. |
…205 @iamsirsammy) * change youtube/index.mjs to not confuse chatgpt or run out of tokens * Update index.mjs
This PR fixes formatting for subtitles to not confuse chatgpt, and slightly improves the prompt. This should make the result better. I've tested the added lines in console, but I haven't fully tested the extension like this yet as I don't have a good build environment for extensions yet. It should work fine.
Not related, but I've made a prototype for adding Replit support to this extension and it should be finished soon.