change youtube/index.mjs to not confuse chatgpt or run out of tokens #205

iamsirsammy · 2023-04-12T22:39:54Z

This PR fixes the formatting of subtitles so they don't confuse ChatGPT, and slightly improves the prompt. This should make the results better. I've tested the added lines in the console, but I haven't fully tested the extension with these changes, as I don't have a good build environment for extensions yet. It should work fine.

Not related, but I've made a prototype for adding Replit support to this extension and it should be finished soon.

josStorer · 2023-04-13T12:14:55Z

src/content-script/site-adapters/youtube/index.mjs

+function replaceHtmlEntities(htmlString) { // This function was written by ChatGPT and modified by me (iamsirsammy)
+  const doc = new DOMParser().parseFromString(htmlString.replace("&amp;", "&"), 'text/html');
+  return doc.documentElement.innerText;
+}
+


The string here has already been processed and does not contain HTML tag content. I don't think there is any difference here compared to directly using htmlString.replaceAll("&amp;", "&"). Have I missed something?

iamsirsammy · 2023-04-14

Sorry for the tone, I've been talking to customer support all day.

This code is designed to decode all HTML entities in the text; that's what the DOMParser is for. For example, &#39; is normally the HTML entity for an apostrophe ('). However, your code or the YouTube subtitles (or whatever it is) have &amp;#39; for an apostrophe, at least for some videos. That means &amp; has to be replaced with & first, and then DOMParser has to be used to turn the result into the actual apostrophe. The same should apply to < and >, maybe , and /, and probably other characters, which is why DOMParser is used instead of many find-and-replace calls.

(We could also probably remove the &amp; replacement and just run replaceHtmlEntities twice, but that feels silly.)
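
To illustrate the double-encoding problem described above, here is a minimal sketch that runs outside a browser. It does not use DOMParser; `decodeEntitiesOnce` and `decodeSubtitleText` are hypothetical helpers written for this example, and the entity table covers only a handful of named entities, so this is an approximation of the behaviour rather than the code in the PR.

```javascript
// Hypothetical stand-in for DOMParser: decodes numeric entities and a few
// named ones. Unknown named entities are left untouched.
const NAMED_ENTITIES = new Map([
  ['amp', '&'],
  ['lt', '<'],
  ['gt', '>'],
  ['quot', '"'],
  ['apos', "'"],
]);

function decodeEntitiesOnce(text) {
  return text.replace(/&(#\d+|#x[0-9a-fA-F]+|[a-zA-Z]+);/g, (match, body) => {
    if (body.startsWith('#x')) {
      return String.fromCodePoint(parseInt(body.slice(2), 16));
    }
    if (body.startsWith('#')) {
      return String.fromCodePoint(parseInt(body.slice(1), 10));
    }
    return NAMED_ENTITIES.has(body) ? NAMED_ENTITIES.get(body) : match;
  });
}

// Same two steps as discussed in the thread: undo the extra "&amp;" layer
// first, then decode whatever entities remain.
function decodeSubtitleText(raw) {
  return decodeEntitiesOnce(raw.replaceAll('&amp;', '&'));
}

console.log(decodeEntitiesOnce('don&amp;#39;t')); // one pass leaves "don&#39;t"
console.log(decodeSubtitleText('don&amp;#39;t')); // both steps give "don't"
```

A single decoding pass only strips the outer `&amp;` layer, which is why either the explicit `&amp;` replacement or a second pass is needed.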

josStorer · 2023-04-13T12:20:02Z

src/content-script/site-adapters/youtube/index.mjs

@@ -44,9 +49,11 @@ export default {
        subtitleContent += subtitleData.substring(0, subtitleData.indexOf('<')) + ','
      }

+      subtitleContent = replaceHtmlEntities(subtitleContent.replace(",", " "))


Why do we have to replace commas here?

iamsirsammy · 2023-04-14

Hi, subtitles are usually, given to the AI like, this, with a random comma every few words. This is fine, I guess, but I think it uses more tokens and it might confuse the AI.

Sorry for probably misunderstanding. I moved the comma replacement to just adding spaces instead of commas on line 49.

josStorer · 2023-04-14T11:27:23Z

I understand now. Thank you for explaining, and for all the contributions you have made for all users.

josStorer · 2023-04-14T11:47:09Z

I made some slight adjustments in 34cad26:

- replacing replace with replaceAll
- restoring the comma as the end symbol when concatenating subtitleContent, so that the cropText function can correctly control the number of tokens according to a certain ratio and avoid exceeding the limit

The following is a comparison of the final text content before and after the PR (including the modifications I made on top of yours). I think this result should meet our expectations.
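
For reference, the difference between replace and replaceAll with a string pattern can be seen in a small standalone sketch. The sample string is made up for illustration and is not taken from real subtitle data.

```javascript
// Made-up sample; real subtitle text would be much longer.
const sample = 'first,second,third';

// With a string pattern, replace() only touches the first occurrence.
const firstOnly = sample.replace(',', ' ');

// replaceAll() touches every occurrence.
const everyComma = sample.replaceAll(',', ' ');

// A global regular expression with replace() behaves like replaceAll().
const viaRegex = sample.replace(/,/g, ' ');

console.log(firstOnly);  // first second,third
console.log(everyComma); // first second third
console.log(viaRegex);   // first second third
```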

iamsirsammy · 2023-04-23T21:40:37Z

Thanks for catching replace instead of replaceAll. I'm not quite sure how I missed that. I'm not sure why the commas are necessary, but it's fine.

josStorer · 2023-04-24T11:59:22Z

The commas are added because YouTube's captions don't necessarily include commas. For some particularly long videos, if all the captions are concatenated without any separation, the text is hard to cut when controlling the token length. Therefore, I proactively add a comma after each segment and then cut based on the segments, so that each remaining segment is a complete sentence, which gives a better summary.
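
The idea of cutting on segment boundaries can be sketched roughly as follows. This is not the extension's cropText function, whose implementation is not shown in this thread; `cropBySegments` is a hypothetical helper, and it uses character count as a crude stand-in for token count.

```javascript
// Hypothetical helper: keep whole comma-separated segments until the
// character budget would be exceeded, so no segment is cut in half.
function cropBySegments(text, maxChars, separator = ',') {
  const segments = text.split(separator).filter((s) => s.length > 0);
  const kept = [];
  let used = 0;
  for (const segment of segments) {
    const cost = segment.length + separator.length;
    if (used + cost > maxChars) break;
    kept.push(segment);
    used += cost;
  }
  return kept.join(separator);
}

// Made-up caption segments joined with commas, as described above.
const captions = 'hello world,this is a test,final segment here';
console.log(cropBySegments(captions, 30)); // hello world,this is a test
```

Without the separators there would be no safe place to cut, and a plain length-based truncation could stop in the middle of a caption segment.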


change youtube/index.mjs to not confuse chatgpt or run out of tokens

efffac5

josStorer reviewed Apr 13, 2023

Update index.mjs

342e5dd

josStorer merged commit a04bc46 into josStorer:master Apr 14, 2023

josStorer added a commit that referenced this pull request Apr 14, 2023

patch for #205

edef4e2

josStorer added a commit that referenced this pull request Apr 14, 2023

patch for #205

34cad26

josStorer pushed a commit that referenced this pull request Apr 27, 2023

change youtube/index.mjs to not confuse chatgpt or run out of tokens (#…

b2cd54e

…205 @iamsirsammy) * change youtube/index.mjs to not confuse chatgpt or run out of tokens * Update index.mjs

josStorer added a commit that referenced this pull request Apr 27, 2023

patch for #205

793380a
