Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

change youtube/index.mjs to not confuse chatgpt or run out of tokens #205

Merged
merged 2 commits into from
Apr 14, 2023

Conversation

iamsirsammy
Copy link
Contributor

This PR fixes formatting for subtitles to not confuse chatgpt, and slightly improves the prompt. This should make the result better. I've tested the added lines in console, but I haven't fully tested the extension like this yet as I don't have a good build environment for extensions yet. It should work fine.

Not related, but I've made a prototype for adding Replit support to this extension and it should be finished soon.

Comment on lines +4 to +8
function replaceHtmlEntities(htmlString) { // This function was written by ChatGPT and modified by me (iamsirsammy)
const doc = new DOMParser().parseFromString(htmlString.replace("&", "&"), 'text/html');
return doc.documentElement.innerText;
}

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string here has been processed and does not contain HTML tag content. I don't think there is any difference here compared to directly using htmlString.replaceAll("&", "&"). Have I missed something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the tone I've been talking to customer support all day
This code is designed to remove all HTML Entities from code, that's what the DOMParser is for. For example, generally, ' is the html entity for an apostrophe ('). However, your code or YouTube subtitles or whatever has &#39; for an apostrophe (at least for some videos), meaning the & has to be replaced with &, then domparser has to be used to deal with the actual apostrophe. This should also apply for <> maybe ,/ and probably other characters, why domparser is used instead of many find and replaces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(We could also probably remove amp replacement and just run replace HtmlEntities twice but that feels silly)

@@ -44,9 +49,11 @@ export default {
subtitleContent += subtitleData.substring(0, subtitleData.indexOf('<')) + ','
}

subtitleContent = replaceHtmlEntities(subtitleContent.replace(",", " "))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have to replace commas here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi subtitles are usually, given to the AI like, this, with a random comma every few words. This is fine I guess but I think it uses more tokens and it might confuse the AI

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for probably misunderstanding, I moved the comma replacement to just adding spaces instead of commas on line 49

@josStorer
Copy link
Owner

I understand now, thank you for explaining and for all the contributions you have made to all users

@josStorer josStorer merged commit a04bc46 into josStorer:master Apr 14, 2023
josStorer added a commit that referenced this pull request Apr 14, 2023
josStorer added a commit that referenced this pull request Apr 14, 2023
@josStorer
Copy link
Owner

I made some slight adjustments 34cad26

  • replacing replace with replaceAll

  • restoring the comma as the end symbol when concatenating subtitleContent, so that the cropText function can correctly control the number of tokens according to a certain ratio to avoid exceeding the limit

The following is a comparison of the final text content before and after the PR (including modifications I made on your basis). I think this result should meet our expectations

image

@iamsirsammy
Copy link
Contributor Author

Thanks for catching replace instead of replaceAll. I'm not quite sure how I missed that. I'm not sure why the commas are necessary, but it's fine.

@josStorer
Copy link
Owner

Adding commas is because YouTube's captions don't necessarily include commas. For some particularly long videos, if all the captions are concatenated without any separation, it will not be conducive to cutting to control the token length. Therefore, I proactively added commas for each segment and then cut them based on the segments to ensure that each remaining segment is a complete sentence, which can achieve a better summarizing effect.

josStorer pushed a commit that referenced this pull request Apr 27, 2023
…205 @iamsirsammy)

* change youtube/index.mjs to not confuse chatgpt or run out of tokens

* Update index.mjs
josStorer added a commit that referenced this pull request Apr 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants