Merge #158: Sanitize torrent description in markdown

77cbdad refactor: extract functions for sanitizing torrent description (Jose Celano)
7d65d3b feat: [#105] allow only some html tags in sanitized markdown torrent description (Jose Celano)
9143736 feat: sanitize torrent description in markdown (Jose Celano)
34b9fb6 feat: add dependency: dompurify (Jose Celano)

Pull request description:

- Purify HTML to avoid potential XSS attacks.
- Remove external URLs to protect users' privacy.

### Details

The sanitize function:

- Removes harmful code like the one in the example.
- Removes external links. It removes the "href" attribute.
- Replaces image sources with the direct base64 image data from the backend image proxy, only for valid image formats. Images with an invalid format are removed from the DOM.
- Valid images are images whose source URL ends with a valid extension: `["png", "PNG", "jpg", "JPG", "jpeg", "JPEG"]`. The backend is supposed to support only PNG and JPG images.
- Removes other HTML tags like `<embed>`.

The sanitize function is being tested with this sample torrent description:

```
Harmful script (the button should not show the alert):

<p>Click the button to display an alert box.</p>
<button onclick="alert('Hello! I am an alert box!')">Try it</button>

Valid PNG image in markdown format (it should show the image using the proxy)

![Torrust Logo](https://raw.githubusercontent.com/torrust/torrust-index-backend/develop/docs/media/torrust_logo.png)

Another valid (JPG) image in markdown format (it should show the image using the proxy)

![Mandelbrot Set](https://upload.wikimedia.org/wikipedia/commons/2/21/Mandel_zoom_00_mandelbrot_set.jpg)

Valid image (PNG) in html format (it should show the image using the proxy):

<img src="https://raw.githubusercontent.com/torrust/torrust-index-backend/develop/docs/media/torrust_logo.png">

Invalid image (TIF) in html format (it should remove the image):

<img src="https://commons.wikimedia.org/wiki/Category:TIFF_files#/media/File:Arkansas_Constitution_of_1836,_page_1.tif">

Invalid html link (it should remove the href attribute value):

<a href="https://commons.wikimedia.org/wiki/Category:TIFF_files#/media/File:Arkansas_Constitution_of_1836,_page_1.tif">Arkansas Constitution of 1836. Page 1</a>

<embed type="video/webm" src="/media/cc0-videos/flower.mp4" width="250" height="200">
```

### Todo

- [x] Basic sanitize function
- [x] Remove other HTML tags that can contain external sources like `<video>`
- [ ] Add tests: unit and E2E. This could be hard to test as base64 encoded images are big strings. Maybe with a very small image.
- [x] Refactor: extract sanitize function from component to a service

I think these are the HTML tags that can contain sources linking to external resources:

- `<a>`: The href attribute can contain an external URL.
- `<img>`: The src attribute can point to an external image URL.
- `<script>`: The src attribute can point to an external JavaScript file.
- `<link>`: Used for linking CSS files, favicons, etc., the href attribute can contain an external URL.
- `<iframe>`: The src attribute can contain an external URL to embed content from another site.
- `<object>` and `<embed>`: These tags are used to embed multimedia content like Flash or PDFs, and their data and src attributes respectively can contain external URLs.
- `<audio>` and `<video>`: The src attribute can contain an external URL to a media file.
- `<source>`: This tag, used inside `<audio>`, `<video>`, or `<picture>` elements, has a src or srcset attribute that can contain an external URL.
- `<form>`: The action attribute can contain an external URL where the form data is sent when submitted.
- `<meta http-equiv="refresh" content="0; url=http://example.com/" />`: This meta tag can be used to redirect to an external URL.

DOMPurify only removes unsafe code, but we also want to [remove external links to avoid tracking the users](#67).

Top commit has no ACKs.

Tree-SHA512: 97bfdcf4fcf07d5940654872b8759308a12f701ac2023248878fa0bce96278f0560b6b3a38c1474da44aa885b9bb99dac4afa87c515e0d14eacd469715782ff3
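
The link and image rules from the PR description can be sketched as two pure predicates that are easy to unit test without a DOM. This is only an illustration: `isInPageAnchor`, `imageExtension`, `hasAllowedImageExtension` and `ALLOWED_EXTENSIONS` are hypothetical names, not part of this PR. The sketch also assumes that the extension check should be case-insensitive and should ignore any `?query` or `#fragment` suffix, which goes slightly beyond the "URL ends with a valid extension" rule stated above. The default list follows the PNG/JPG rule in the description; the `allowedImageExtensions` constant in the diff additionally lists `gif`.

```typescript
// Hypothetical helpers; only the rules themselves come from the PR description.
const ALLOWED_EXTENSIONS: readonly string[] = ["png", "jpg", "jpeg"];

// True when the href targets an anchor inside the current page.
// The sanitizer in this PR keeps such links and strips the href of all others.
function isInPageAnchor (href: string): boolean {
  return href.startsWith("#");
}

// Returns the text before the first occurrence of `marker`, or the whole text.
function stripAfter (text: string, marker: string): string {
  const index = text.indexOf(marker);
  return index === -1 ? text : text.slice(0, index);
}

// Lowercase extension of the last path segment, or "" when there is none.
// Assumption: "?query" and "#fragment" suffixes are not part of the file name.
function imageExtension (url: string): string {
  const path = stripAfter(stripAfter(url, "#"), "?");
  const lastSegment = path.split("/").pop() ?? "";
  const dotIndex = lastSegment.lastIndexOf(".");
  return dotIndex === -1 ? "" : lastSegment.slice(dotIndex + 1).toLowerCase();
}

// True when the URL's file extension is in the allowed list.
function hasAllowedImageExtension (
  url: string,
  allowed: readonly string[] = ALLOWED_EXTENSIONS
): boolean {
  return allowed.includes(imageExtension(url));
}
```

Applied to the sample description above, the two image URLs ending in `.png` and `.jpg` pass the check, while the `.tif` page URL and the `.mp4` source do not.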
torrust · Jul 5, 2023 · d3ae471
2 parents 02268c1 + 77cbdad
commit d3ae471

Showing 5 changed files with 126 additions and 75 deletions.
diff --git a/components/Markdown.vue b/components/Markdown.vue
@@ -6,6 +6,7 @@
 <script setup lang="ts">
 import { computed } from "vue";
 import { marked } from "marked";
+import { sanitize } from "~/src/domain/services/sanitizer";
 import { onMounted, ref, useRestApi, watch } from "#imports";
 
 const props = defineProps({
@@ -15,8 +16,6 @@ const props = defineProps({
   }
 });
 
-const rest = useRestApi().value;
-
 const sanitizedDescription = ref("");
 
 const options = {
@@ -34,83 +33,13 @@ onMounted(() => {
   sanitizeDescription();
 });
 
-function markdown (src: string) {
+function convert_markdown_to_html (src: string) {
   return marked(src, options);
 }
 
 async function sanitizeDescription () {
-  // Get the original not sanitized markdown string.
-  const description = markdown(props.source);
-
-  // Replace the img src's with a random id and return a map
-  // of these ids mapped to the original url.
-  const [filteredDescriptionWithImageIds, imageIdUrlMap] = filterDescriptionImagesWithRandomIds(description);
-
-  // Get the image data using the backend's image proxy.
-  const imageIdDataUrlMap = await getImageDataUrlsFromUrls(imageIdUrlMap);
-
-  // Replace the img id's with the proxied sources.
-  sanitizedDescription.value = replaceDescriptionImageIdsWithDataUrls(filteredDescriptionWithImageIds, imageIdDataUrlMap);
-}
-
-function filterDescriptionImagesWithRandomIds (description: string): [string, Map<string, string>] {
-  const filteredImageMap = new Map();
-
-  // Replace all image urls with a random id.
-  description = description.replace(/img src="(.*?)"/gi, (match, url): string => {
-    const imageId = randomId(32);
-
-    filteredImageMap.set(imageId, url);
-
-    return `img src="${imageId}"`;
-  });
-
-  return [description, filteredImageMap];
-}
-
-async function getImageDataUrlsFromUrls (imageMap: Map<string, string>): Promise<Map<string, string>> {
-  const imageDataMap: Map<string, string> = new Map();
-
-  for (const [id, url] of imageMap) {
-    const imageBlob = await rest.torrent.proxiedImage(url);
-    const imageDataUrl = await blobToDataURL(imageBlob);
-
-    imageDataMap.set(id, imageDataUrl);
-  }
-
-  return imageDataMap;
-}
-
-function blobToDataURL (blob: Blob): Promise<string> {
-  return new Promise<string>((resolve, reject) => {
-    const reader = new FileReader();
-    reader.onload = _e => resolve(reader.result as string);
-    reader.onerror = _e => reject(reader.error);
-    reader.onabort = _e => reject(new Error("Read aborted"));
-    reader.readAsDataURL(blob);
-  });
-}
-
-function replaceDescriptionImageIdsWithDataUrls (description: string, imageIdDataUrlMap: Map<string, string>): string {
-  imageIdDataUrlMap.forEach((dataUrl, id) => {
-    description = description.replace(id, dataUrl);
-  });
-
-  return description;
-}
-
-function randomId (length: number) {
-  let result = "";
-  const characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
-  const charactersLength = characters.length;
-  let counter = 0;
-
-  while (counter < length) {
-    result += characters.charAt(Math.floor(Math.random() * charactersLength));
-    counter += 1;
-  }
-
-  return result;
+  const html = convert_markdown_to_html(props.source);
+  sanitizedDescription.value = await sanitize(html);
 }
 </script>
 

diff --git a/package-lock.json b/package-lock.json
diff --git a/package.json b/package.json
@@ -16,6 +16,7 @@
     "@nuxtjs/eslint-config-typescript": "^12.0.0",
     "@nuxtjs/tailwindcss": "^6.8.0",
     "@tailwindcss/typography": "^0.5.9",
+    "@types/dompurify": "^3.0.2",
     "@types/marked": "^5.0.0",
     "@types/node": "^20.3.2",
     "@typescript-eslint/eslint-plugin": "^5.60.0",
@@ -32,6 +33,7 @@
   "dependencies": {
     "@heroicons/vue": "^2.0.18",
     "daisyui": "^3.1.7",
+    "dompurify": "^3.0.4",
     "marked": "^5.1.0",
     "notiwind-ts": "^2.0.2",
     "torrust-index-api-lib": "^0.2.0",

diff --git a/project-words.txt b/project-words.txt
@@ -1,4 +1,5 @@
 composables
+dompurify
 heroicons
 notiwind
 Nuxt

diff --git a/src/domain/services/sanitizer.ts b/src/domain/services/sanitizer.ts
@@ -0,0 +1,97 @@
+import DOMPurify from "dompurify";
+import { useRestApi } from "#imports";
+
+const rest = useRestApi().value;
+
+const allowedTags = ["h1", "h2", "h3", "h4", "h5", "h6", "em", "strong", "del", "a", "img", "ul", "ol", "li", "hr"];
+const allowedImageExtensions = ["png", "PNG", "jpg", "JPG", "jpeg", "JPEG", "gif", "GIF"];
+
+export async function sanitize (html: string) {
+  const safeHtml = remove_harmful_code(html);
+  const htmlWithNoUserTracking = await remove_user_tracking(safeHtml);
+  return htmlWithNoUserTracking;
+}
+
+function remove_harmful_code (html: string) {
+  return DOMPurify.sanitize(html, { ALLOWED_TAGS: allowedTags });
+}
+
+async function remove_user_tracking (html: string) {
+  // Parse the description as HTML to easily manipulate it.
+  const parser = new DOMParser();
+
+  const htmlDoc = parser.parseFromString(html, "text/html");
+
+  remove_all_external_links(htmlDoc);
+  await replace_images_with_proxied_images(htmlDoc);
+
+  return document_to_html(htmlDoc);
+}
+
+function remove_all_external_links (htmlDoc: Document) {
+  const links = htmlDoc.querySelectorAll("a");
+  links.forEach((link) => {
+    const href = link.getAttribute("href");
+    if (href && !href.startsWith("#")) {
+      link.removeAttribute("href");
+    }
+  });
+}
+
+async function replace_images_with_proxied_images (htmlDoc: Document) {
+  const images = htmlDoc.querySelectorAll("img");
+  for (let i = 0; i < images.length; i++) {
+    const img = images[i];
+    const src = img.getAttribute("src");
+
+    if (src) {
+      if (isAllowedImage(src)) {
+        const imageDataSrc = await getImageDataUrl(src);
+        img.setAttribute("src", imageDataSrc);
+      } else {
+        img.remove();
+      }
+    }
+  }
+}
+
+function document_to_html (descriptionHtml: Document) {
+  const body = descriptionHtml.querySelector("body");
+  const serializer = new XMLSerializer();
+  let html = "";
+  if (body) {
+    html = serializer.serializeToString(body);
+    html = html
+      .replace("<body xmlns=\"http://www.w3.org/1999/xhtml\">", "")
+      .replace("<body>", "")
+      .replace("</body>", "");
+  }
+  return html;
+}
+
+// Returns true if the image is allowed to be displayed.
+function isAllowedImage (href: string): boolean {
+  const extension = href.split(".").pop().trim();
+  return allowedImageExtensions.includes(extension);
+}
+
+// Returns a base64 string ready to be used in the "src" attribute of an "img" HTML tag,
+// like this `<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAA+gA…IiIiIiIiIiIiIiHyO/P85XT/jxW1glg5Erk==">`.
+async function getImageDataUrl (url: string): Promise<string> {
+  const imageBlob = await rest.torrent.proxiedImage(url);
+  const data = await blobToDataURL(imageBlob);
+  return data;
+}
+
+// Converts binary data into a base64-encoded string ready to be used in the "src"
+// attribute of an "img" HTML tag, like the following:
+// `<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAA+gA…IiIiIiIiIiIiIiHyO/P85XT/jxW1glg5Erk==">`.
+function blobToDataURL (blob: Blob): Promise<string> {
+  return new Promise<string>((resolve, reject) => {
+    const reader = new FileReader();
+    reader.onload = _e => resolve(reader.result as string);
+    reader.onerror = _e => reject(reader.error);
+    reader.onabort = _e => reject(new Error("Read aborted"));
+    reader.readAsDataURL(blob);
+  });
+}
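
The open Todo item notes that tests are awkward because base64-encoded images are big strings. One possible approach, sketched here under assumptions, is to build a tiny data URL from a handful of bytes and use it as a stand-in for a proxied image. `base64Encode` and `bytesToDataUrl` are hypothetical helpers that are not part of this PR; they avoid `FileReader` so they can run outside a browser, and they produce the same `data:<mime>;base64,<data>` shape that `blobToDataURL` resolves to. The eight bytes used below are only the PNG file signature, not a complete image.

```typescript
const BASE64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical helper: standard base64 encoding with "=" padding.
function base64Encode (bytes: Uint8Array): string {
  let output = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const hasB1 = i + 1 < bytes.length;
    const hasB2 = i + 2 < bytes.length;
    const b0 = bytes[i];
    const b1 = hasB1 ? bytes[i + 1] : 0;
    const b2 = hasB2 ? bytes[i + 2] : 0;

    // Split three 8-bit bytes into four 6-bit groups.
    output += BASE64_ALPHABET.charAt(b0 >> 2);
    output += BASE64_ALPHABET.charAt(((b0 & 0x03) << 4) | (b1 >> 4));
    output += hasB1 ? BASE64_ALPHABET.charAt(((b1 & 0x0f) << 2) | (b2 >> 6)) : "=";
    output += hasB2 ? BASE64_ALPHABET.charAt(b2 & 0x3f) : "=";
  }
  return output;
}

// Hypothetical helper: builds a data URL usable as an <img> "src" value.
function bytesToDataUrl (bytes: Uint8Array, mimeType: string): string {
  return `data:${mimeType};base64,${base64Encode(bytes)}`;
}

// The 8-byte PNG signature: short enough to read in a test assertion.
const pngSignature = new Uint8Array([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
const tinyDataUrl = bytesToDataUrl(pngSignature, "image/png");
```

A test could then stub the image proxy call to return these bytes and assert that the sanitized `src` starts with `data:image/png;base64,iVBORw0KGgo`, the same prefix that appears in the comments above `getImageDataUrl` and `blobToDataURL`.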