Merge #158: Sanitize torrent description in markdown
77cbdad refactor: extract functions for sanitizing torrent description (Jose Celano)
7d65d3b feat: [#105] allow only some html tags in sanitized markdown torrent description (Jose Celano)
9143736 feat: sanitize torrent description in markdown (Jose Celano)
34b9fb6 feat: add dependency: dompurify (Jose Celano)

Pull request description:

  - Purify HTML to avoid potential XSS attacks.
  - Remove external URLs to protect users' privacy.

  ### Details

  The sanitize function:

  - Removes harmful code, like the `onclick` handler in the sample below.
  - Removes external links by stripping the `href` attribute. In-page anchors (an `href` starting with `#`) are kept.
  - Replaces image sources with the base64 image data returned by the backend image proxy, but only for valid image formats. Images with an invalid format are removed from the DOM.
  - Valid images are those whose source URL ends with a valid extension: `["png", "PNG", "jpg", "JPG", "jpeg", "JPEG"]`. The backend is supposed to support only PNG and JPG images.
  - Removes other HTML tags, like `<embed>`.
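
The extension rule above can be sketched as a pure function. This is an illustrative variant, not the code merged here: `hasAllowedImageExtension` and `ALLOWED_IMAGE_EXTENSIONS` are hypothetical names, and unlike a plain "ends with" check it lowercases the extension and ignores any `?query` or `#fragment` part of the URL.

```typescript
// Hypothetical helper; an assumption for illustration, not part of this PR.
const ALLOWED_IMAGE_EXTENSIONS: string[] = ["png", "jpg", "jpeg"];

function hasAllowedImageExtension (src: string): boolean {
  // Drop "?query" and "#fragment" so only the path is inspected.
  const [path = ""] = src.split(/[?#]/);
  // Look at the last path segment only, e.g. "torrust_logo.png".
  const lastSegment = path.split("/").pop() ?? "";
  const dotIndex = lastSegment.lastIndexOf(".");
  if (dotIndex === -1) {
    return false; // no extension at all
  }
  const extension = lastSegment.slice(dotIndex + 1).toLowerCase();
  return ALLOWED_IMAGE_EXTENSIONS.indexOf(extension) !== -1;
}
```

Under this rule the TIF URL in the sample description below is rejected, because the part before its `#` has no file extension, while the `torrust_logo.png` URLs are accepted.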

  The sanitize function is being tested with this sample torrent description:

  ```
  Harmful script (the button should not show the alert):

  <p>Click the button to display an alert box.</p>

  <button onclick="alert('Hello! I am an alert box!')">Try it</button>

  Valid PNG image in markdown format (it should show the image using the proxy)

  ![Torrust Logo](https://raw.githubusercontent.com/torrust/torrust-index-backend/develop/docs/media/torrust_logo.png)

  Another valid (JPG) image in markdown format (it should show the image using the proxy)

  ![Mandelbrot Set](https://upload.wikimedia.org/wikipedia/commons/2/21/Mandel_zoom_00_mandelbrot_set.jpg)

  Valid image (PNG) in html format (it should show the image using the proxy):

  <img src="https://raw.githubusercontent.com/torrust/torrust-index-backend/develop/docs/media/torrust_logo.png">

  Invalid image (TIF) in html format (it should remove the image):

  <img src="https://commons.wikimedia.org/wiki/Category:TIFF_files#/media/File:Arkansas_Constitution_of_1836,_page_1.tif">

  Invalid html link (it should remove the href attribute value):

  <a href="https://commons.wikimedia.org/wiki/Category:TIFF_files#/media/File:Arkansas_Constitution_of_1836,_page_1.tif">Arkansas Constitution of 1836. Page 1</a>

  <embed type="video/webm"
         src="/media/cc0-videos/flower.mp4"
         width="250"
         height="200">
  ```
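
The link case in the sample above follows a simple rule: an `href` is dropped unless it is an in-page anchor. A minimal sketch of that rule as a predicate, where `shouldRemoveHref` is a hypothetical name used only for illustration:

```typescript
// Hypothetical predicate; it mirrors the rule "remove href unless it starts with '#'".
function shouldRemoveHref (href: string | null): boolean {
  if (href === null || href === "") {
    return false; // nothing to remove
  }
  // In-page anchors such as "#details" are kept; everything else is removed.
  return href.charAt(0) !== "#";
}
```

Note that under this rule a site-relative link such as `/torrents` also loses its `href`, since only values starting with `#` are kept.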

  ### Todo

  - [x] Basic sanitize function
  - [x] Remove other HTML tags that can contain external sources like `<video>`
  - [ ] Add tests: unit and E2E. This could be hard to test because base64-encoded images are long strings. Using a very small image might make it practical.
  - [x] Refactor: extract sanitize function from component to a service

  I think these are the HTML tags that can contain sources linking to external resources:

  - `<a>`: The href attribute can contain an external URL.
  - `<img>`: The src attribute can point to an external image URL.
  - `<script>`: The src attribute can point to an external JavaScript file.
  - `<link>`: Used for linking CSS files, favicons, etc., the href attribute can contain an external URL.
  - `<iframe>`: The src attribute can contain an external URL to embed content from another site.
  - `<object>` and `<embed>`: These tags are used to embed multimedia content like Flash or PDFs, and their data and src attributes respectively can contain external URLs.
  - `<audio>` and `<video>`: The src attribute can contain an external URL to a media file.
  - `<source>`: This tag, used inside `<audio>`, `<video>`, or `<picture>` elements, has a src or srcset attribute that can contain an external URL.
  - `<form>`: The action attribute can contain an external URL where the form data is sent when submitted.
  - `<meta http-equiv="refresh" content="0; url=http://example.com/" />`: This meta tag can be used to redirect to an external URL.

  DOMPurify only removes unsafe code, but we also want to [remove external links to avoid tracking users](#67).
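
The tag list above can be captured as a lookup table. This is only a sketch: the entries are transcribed from the list, while `URL_ATTRIBUTES_BY_TAG` and `urlAttributesFor` are hypothetical names that do not exist in this PR.

```typescript
// Attributes that can reference an external resource, keyed by lowercase tag name.
// Transcribed from the list above; names are assumptions for illustration.
const URL_ATTRIBUTES_BY_TAG: Record<string, string[]> = {
  a: ["href"],
  img: ["src"],
  script: ["src"],
  link: ["href"],
  iframe: ["src"],
  object: ["data"],
  embed: ["src"],
  audio: ["src"],
  video: ["src"],
  source: ["src", "srcset"],
  form: ["action"],
  meta: ["content"] // only relevant when http-equiv="refresh"
};

// Returns the URL-carrying attributes for a tag, or an empty list for other tags.
function urlAttributesFor (tagName: string): string[] {
  const key = tagName.toLowerCase();
  if (!Object.prototype.hasOwnProperty.call(URL_ATTRIBUTES_BY_TAG, key)) {
    return [];
  }
  return URL_ATTRIBUTES_BY_TAG[key] ?? [];
}
```

In the merged code most of these tags never reach the second step, because the `ALLOWED_TAGS` list passed to DOMPurify only keeps `a` and `img` from this table. A table like this would matter mainly if the allowed list grew.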

Top commit has no ACKs.

Tree-SHA512: 97bfdcf4fcf07d5940654872b8759308a12f701ac2023248878fa0bce96278f0560b6b3a38c1474da44aa885b9bb99dac4afa87c515e0d14eacd469715782ff3
josecelano committed Jul 5, 2023
2 parents 02268c1 + 77cbdad commit d3ae471
Showing 5 changed files with 126 additions and 75 deletions.
79 changes: 4 additions & 75 deletions components/Markdown.vue
@@ -6,6 +6,7 @@
 <script setup lang="ts">
 import { computed } from "vue";
 import { marked } from "marked";
+import { sanitize } from "~/src/domain/services/sanitizer";
 import { onMounted, ref, useRestApi, watch } from "#imports";
 const props = defineProps({
@@ -15,8 +16,6 @@ const props = defineProps({
   }
 });
-const rest = useRestApi().value;
 const sanitizedDescription = ref("");
 const options = {
@@ -34,83 +33,13 @@ onMounted(() => {
   sanitizeDescription();
 });
-function markdown (src: string) {
+function convert_markdown_to_html (src: string) {
   return marked(src, options);
 }
 async function sanitizeDescription () {
-  // Get the original not sanitized markdown string.
-  const description = markdown(props.source);
-  // Replace the img src's with a random id and return a map
-  // of these ids mapped to the original url.
-  const [filteredDescriptionWithImageIds, imageIdUrlMap] = filterDescriptionImagesWithRandomIds(description);
-  // Get the image data using the backend's image proxy.
-  const imageIdDataUrlMap = await getImageDataUrlsFromUrls(imageIdUrlMap);
-  // Replace the img id's with the proxied sources.
-  sanitizedDescription.value = replaceDescriptionImageIdsWithDataUrls(filteredDescriptionWithImageIds, imageIdDataUrlMap);
-}
-function filterDescriptionImagesWithRandomIds (description: string): [string, Map<string, string>] {
-  const filteredImageMap = new Map();
-  // Replace all image urls with a random id.
-  description = description.replace(/img src="(.*?)"/gi, (match, url): string => {
-    const imageId = randomId(32);
-    filteredImageMap.set(imageId, url);
-    return `img src="${imageId}"`;
-  });
-  return [description, filteredImageMap];
-}
-async function getImageDataUrlsFromUrls (imageMap: Map<string, string>): Promise<Map<string, string>> {
-  const imageDataMap: Map<string, string> = new Map();
-  for (const [id, url] of imageMap) {
-    const imageBlob = await rest.torrent.proxiedImage(url);
-    const imageDataUrl = await blobToDataURL(imageBlob);
-    imageDataMap.set(id, imageDataUrl);
-  }
-  return imageDataMap;
-}
-function blobToDataURL (blob: Blob): Promise<string> {
-  return new Promise<string>((resolve, reject) => {
-    const reader = new FileReader();
-    reader.onload = _e => resolve(reader.result as string);
-    reader.onerror = _e => reject(reader.error);
-    reader.onabort = _e => reject(new Error("Read aborted"));
-    reader.readAsDataURL(blob);
-  });
-}
-function replaceDescriptionImageIdsWithDataUrls (description: string, imageIdDataUrlMap: Map<string, string>): string {
-  imageIdDataUrlMap.forEach((dataUrl, id) => {
-    description = description.replace(id, dataUrl);
-  });
-  return description;
-}
-function randomId (length: number) {
-  let result = "";
-  const characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
-  const charactersLength = characters.length;
-  let counter = 0;
-  while (counter < length) {
-    result += characters.charAt(Math.floor(Math.random() * charactersLength));
-    counter += 1;
-  }
-  return result;
+  const html = convert_markdown_to_html(props.source);
+  sanitizedDescription.value = await sanitize(html);
 }
 </script>

22 changes: 22 additions & 0 deletions package-lock.json

2 changes: 2 additions & 0 deletions package.json
@@ -16,6 +16,7 @@
     "@nuxtjs/eslint-config-typescript": "^12.0.0",
     "@nuxtjs/tailwindcss": "^6.8.0",
     "@tailwindcss/typography": "^0.5.9",
+    "@types/dompurify": "^3.0.2",
     "@types/marked": "^5.0.0",
     "@types/node": "^20.3.2",
     "@typescript-eslint/eslint-plugin": "^5.60.0",
@@ -32,6 +33,7 @@
   "dependencies": {
     "@heroicons/vue": "^2.0.18",
     "daisyui": "^3.1.7",
+    "dompurify": "^3.0.4",
     "marked": "^5.1.0",
     "notiwind-ts": "^2.0.2",
     "torrust-index-api-lib": "^0.2.0",
1 change: 1 addition & 0 deletions project-words.txt
@@ -1,4 +1,5 @@
 composables
+dompurify
 heroicons
 notiwind
 Nuxt
97 changes: 97 additions & 0 deletions src/domain/services/sanitizer.ts
@@ -0,0 +1,97 @@
+import DOMPurify from "dompurify";
+import { useRestApi } from "#imports";
+
+const rest = useRestApi().value;
+
+const allowedTags = ["h1", "h2", "h3", "h4", "h5", "h6", "em", "strong", "del", "a", "img", "ul", "ol", "li", "hr"];
+const allowedImageExtensions = ["png", "PNG", "jpg", "JPG", "jpeg", "JPEG", "gif", "GIF"];
+
+export async function sanitize (html: string) {
+  const safeHtml = remove_harmful_code(html);
+  const htmlWithNoUserTracking = await remove_user_tracking(safeHtml);
+  return htmlWithNoUserTracking;
+}
+
+function remove_harmful_code (html: string) {
+  return DOMPurify.sanitize(html, { ALLOWED_TAGS: allowedTags });
+}
+
+async function remove_user_tracking (html: string) {
+  // Parse the description as HTML to easily manipulate it.
+  const parser = new DOMParser();
+
+  const htmlDoc = parser.parseFromString(html, "text/html");
+
+  remove_all_external_links(htmlDoc);
+  await replace_images_with_proxied_images(htmlDoc);
+
+  return document_to_html(htmlDoc);
+}
+
+function remove_all_external_links (htmlDoc: Document) {
+  const links = htmlDoc.querySelectorAll("a");
+  links.forEach((link) => {
+    const href = link.getAttribute("href");
+    if (href && !href.startsWith("#")) {
+      link.removeAttribute("href");
+    }
+  });
+}
+
+async function replace_images_with_proxied_images (htmlDoc: Document) {
+  const images = htmlDoc.querySelectorAll("img");
+  for (let i = 0; i < images.length; i++) {
+    const img = images[i];
+    const src = img.getAttribute("src");
+
+    if (src) {
+      if (isAllowedImage(src)) {
+        const imageDataSrc = await getImageDataUrl(src);
+        img.setAttribute("src", imageDataSrc);
+      } else {
+        img.remove();
+      }
+    }
+  }
+}
+
+function document_to_html (descriptionHtml: Document) {
+  const body = descriptionHtml.querySelector("body");
+  const serializer = new XMLSerializer();
+  let html = "";
+  if (body) {
+    html = serializer.serializeToString(body);
+    html = html
+      .replace("<body xmlns=\"http://www.w3.org/1999/xhtml\">", "")
+      .replace("<body>", "")
+      .replace("</body>", "");
+  }
+  return html;
+}
+
+// Returns true if the image is allowed to be displayed.
+function isAllowedImage (href: string): boolean {
+  const extension = href.split(".").pop().trim();
+  return allowedImageExtensions.includes(extension);
+}
+
+// Returns a base64 string ready to be used in the "src" attribute of an "img" HTML tag,
+// like this `<img src="…IiIiIiIiIiIiIiHyO/P85XT/jxW1glg5Erk==">`.
+async function getImageDataUrl (url: string): Promise<string> {
+  const imageBlob = await rest.torrent.proxiedImage(url);
+  const data = await blobToDataURL(imageBlob);
+  return data;
+}
+
+// Converts binary data into a base64-encoded string ready to be used in the "src"
+// attribute of an "img" HTML tag, like the following:
+// `<img src="…IiIiIiIiIiIiIiHyO/P85XT/jxW1glg5Erk==">`.
+function blobToDataURL (blob: Blob): Promise<string> {
+  return new Promise<string>((resolve, reject) => {
+    const reader = new FileReader();
+    reader.onload = _e => resolve(reader.result as string);
+    reader.onerror = _e => reject(reader.error);
+    reader.onabort = _e => reject(new Error("Read aborted"));
+    reader.readAsDataURL(blob);
+  });
+}
