This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

URL previewing support #688

Merged
merged 44 commits on Apr 11, 2016
Changes from 32 commits
44 commits
7dd0c17
initial WIP of a tentative preview_url endpoint - incomplete, unteste…
ara4n Jan 24, 2016
adafa24
typo
ara4n Mar 25, 2016
d9d48aa
Merge branch 'develop' into matthew/preview_urls
ara4n Mar 27, 2016
ec0cf99
typo
ara4n Mar 25, 2016
e0c2490
Merge branch 'develop' into matthew/preview_urls
ara4n Mar 29, 2016
dd4287c
make it build
ara4n Mar 29, 2016
64b4aea
make it work
ara4n Mar 29, 2016
1903858
debug
ara4n Mar 29, 2016
721b2bf
implement redirects
ara4n Mar 29, 2016
ae5831d
fix bugs
ara4n Mar 29, 2016
7178ab7
spell out more packages
ara4n Mar 30, 2016
a8a5dd3
handle requests with missing content-length headers (e.g. YouTube)
ara4n Mar 31, 2016
0d3d7de
sync in changes from matrixfederationclient
ara4n Mar 31, 2016
bb9a2ca
synthesise basic OG metadata from pages lacking it
ara4n Mar 31, 2016
72550c3
prevent choking on invalid utf-8, and handle image thumbnailing smarter
ara4n Mar 31, 2016
683e564
handle spidered relative images correctly
ara4n Mar 31, 2016
c60b751
fix assorted redirect, unicode and screenscraping bugs
ara4n Apr 1, 2016
5fd07da
refactor calc_og; spider image URLs; fix xpath; add a (broken) expiri…
ara4n Apr 1, 2016
b26e860
make meta comparisons case insensitive
ara4n Apr 2, 2016
5037ee0
handle missing dimensions without crashing
ara4n Apr 2, 2016
2c838f6
pass back SVGs as their own thumbnails
ara4n Apr 2, 2016
9377157
how was _respond_default_thumbnail ever meant to work?
ara4n Apr 2, 2016
d1b154a
support gzip compression, and don't pass through error msgs
ara4n Apr 2, 2016
7426c86
add a persistent cache of URL lookups, and fix up the in-memory one t…
ara4n Apr 2, 2016
b09e29a
Ensure only one download for a given URL is active at a time
ara4n Apr 2, 2016
110780b
remove stale todo
ara4n Apr 2, 2016
c391646
rebase all image URLs
ara4n Apr 3, 2016
eab4d46
fix etag typing error. fix timestamp typing error
ara4n Apr 3, 2016
8b98a7e
pep8
ara4n Apr 3, 2016
0834b15
char encoding
ara4n Apr 3, 2016
cf51c41
report image size (bytewise) in OG meta
ara4n Apr 3, 2016
9f7dc2b
Merge branch 'develop' into matthew/preview_urls
ara4n Apr 3, 2016
d6e7333
Merge branch 'develop' into matthew/preview_urls
ara4n Apr 7, 2016
dafef5a
Add url_preview_enabled config option to turn on/off preview_url endp…
ara4n Apr 8, 2016
ec9331f
Add doc
ara4n Apr 8, 2016
b04f812
Add more doc
ara4n Apr 8, 2016
fb83f6a
fix SQL based on PR feedback
ara4n Apr 8, 2016
1ccabe2
more PR feedback
ara4n Apr 8, 2016
2460d90
fix error checking for new SQL
ara4n Apr 8, 2016
af582b6
fix typo
ara4n Apr 8, 2016
6ff7a79
move local_media_repository_url_cache.sql to schema v31
ara4n Apr 8, 2016
b36270b
Fix pep8 warning
Apr 8, 2016
83b2f83
actually throw meaningful errors
ara4n Apr 8, 2016
5ffacc5
fix typos and needless try/except from PR review
ara4n Apr 11, 2016
74 changes: 74 additions & 0 deletions docs/url_previews.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
URL Previews
============

Design notes on a URL previewing service for Matrix:

Options are:

1. Have an AS which listens for URLs, downloads them, and inserts an event that describes their metadata.
* Pros:
* Decouples the implementation entirely from Synapse.
* Uses existing Matrix events & content repo to store the metadata.
* Cons:
* Which AS should provide this service for a room, and why should you trust it?
* Doesn't work well with E2E; you'd have to cut the AS into every room
* the AS would end up subscribing to every room anyway.

2. Have a generic preview API (nothing to do with Matrix) that provides a previewing service:
* Pros:
* Simple and flexible; can be used by any clients at any point
* Cons:
* If each HS provides one of these independently, all the HSes in a room may needlessly DoS the target URI
* We need somewhere to store the URL metadata rather than just using Matrix itself
* We can't piggyback on Matrix to distribute the metadata between HSes.

3. Make the synapse of the sending user responsible for spidering the URL and inserting an event asynchronously which describes the metadata.
* Pros:
* Works transparently for all clients
* Piggy-backs nicely on using Matrix for distributing the metadata.
* No confusion as to which AS is responsible
* Cons:
* Doesn't work with E2E
* We might want to decouple the implementation of the spider from the HS, given spider behaviour can be quite complicated and evolve much more rapidly than the HS. It's more like a bot than a core part of the server.

4. Make the sending client use the preview API and insert the event itself when successful.
* Pros:
* Works well with E2E
* No custom server functionality
* Lets the client customise the preview that they send (like on FB)
* Cons:
* Entirely specific to the sending client, whereas it'd be nice if /any/ URL was correctly previewed if clients support it.

5. Have the option of specifying a shared (centralised) previewing service used by a room, to avoid all the different HSes in the room DoSing the target.

The best solution is probably a combination of options 2 and 4.
* Sending clients do their best to create and send a preview at the point of sending the message, perhaps delaying the message until the preview is computed? (This also lets the user validate the preview before sending)
* Receiving clients have the option of going and creating their own preview if one doesn't arrive soon enough (or if the original sender didn't create one)

This is a bit magical though in that the preview could come from two entirely different sources - the sending HS or your local one. However, this can always be exposed to users: "Generate your own URL previews if none are available?"

This is also tantamount to senders calculating their own thumbnails in advance of the main content - we are trusting the sender not to lie about the content in the thumbnail, whereas currently thumbnails are calculated by the receiving homeserver to avoid this attack.

However, this kind of phishing attack does exist whether we let senders pick their thumbnails or not, in that a malicious sender can send normal text messages around the attachment claiming it to be legitimate. We could rely on (future) reputation/abuse management to punish users who phish (be it with bogus metadata or bogus descriptions). Bogus metadata is particularly bad though, especially if it's avoidable.

As a first cut, let's do #2 and have the receiver hit the API to calculate its own previews (as it does currently for image thumbnails). We can then extend/optimise this to option 4 as a special extra if needed.

API
---

GET /_matrix/media/r0/preview_url?url=http://wherever.com
200 OK
{
"og:type" : "article",
"og:url" : "https://twitter.com/matrixdotorg/status/684074366691356672",
"og:title" : "Matrix on Twitter",
"og:image" : "https://pbs.twimg.com/profile_images/500400952029888512/yI0qtFi7_400x400.png",
"og:description" : "“Synapse 0.12 is out! Lots of polishing, performance & bugfixes: /sync API, /r0 prefix, fulltext search, 3PID invites https://t.co/5alhXLLEGP”",
"og:site_name" : "Twitter"
}

* Downloads the URL
* If HTML, just stores it in RAM and parses it for OG meta tags
* Downloads any media referenced in OG meta tags to the media repo, and refers to them in the OG via mxc:// URIs.
* If it's a media filetype we know we can thumbnail: stores it on disk, and hands it to the thumbnailer. Generates OG meta tags from the thumbnailer results.
* Otherwise, doesn't bother downloading further.
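The "parses it for OG meta tags" step is implemented with lxml in this PR; purely as an illustration of the idea (and of the case-insensitive meta comparisons mentioned in the commit log), here is a minimal stdlib sketch. The `extract_og` helper name is hypothetical and is not the PR's actual code:

```python
from html.parser import HTMLParser


class OGTagParser(HTMLParser):
    """Collect Open Graph <meta property="og:*" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Compare the property name case-insensitively.
        prop = (attrs.get("property") or "").lower()
        if prop.startswith("og:") and "content" in attrs:
            # First occurrence wins, matching typical OG consumers.
            self.og.setdefault(prop, attrs["content"])


def extract_og(html):
    parser = OGTagParser()
    parser.feed(html)
    return parser.og
```

A real spider additionally has to cope with relative image URLs, missing OG tags (synthesising them from `<title>` and the first images), and invalid UTF-8, as the commit history above shows.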
6 changes: 5 additions & 1 deletion synapse/config/repository.py
@@ -53,6 +53,7 @@ class ContentRepositoryConfig(Config):
def read_config(self, config):
self.max_upload_size = self.parse_size(config["max_upload_size"])
self.max_image_pixels = self.parse_size(config["max_image_pixels"])
self.max_spider_size = self.parse_size(config["max_spider_size"])
self.media_store_path = self.ensure_directory(config["media_store_path"])
self.uploads_path = self.ensure_directory(config["uploads_path"])
self.dynamic_thumbnails = config["dynamic_thumbnails"]
@@ -73,14 +74,17 @@ def default_config(self, **kwargs):
# The largest allowed upload size in bytes
max_upload_size: "10M"

# The largest allowed URL preview spidering size in bytes
max_spider_size: "10M"

# Maximum number of pixels that will be thumbnailed
max_image_pixels: "32M"

# Whether to generate new thumbnails on the fly to precisely match
# the resolution requested by the client. If true then whenever
# a new resolution is requested by the client the server will
# generate a new thumbnail. If false the server will pick a thumbnail
# from a precalcualted list.
# from a precalculated list.
dynamic_thumbnails: false

# List of thumbnail to precalculate when an image is uploaded.
122 changes: 119 additions & 3 deletions synapse/http/client.py
@@ -15,17 +15,22 @@
from OpenSSL import SSL
from OpenSSL.SSL import VERIFY_NONE

from synapse.api.errors import CodeMessageException
from synapse.api.errors import (
CodeMessageException, SynapseError, Codes,
)
from synapse.util.logcontext import preserve_context_over_fn
import synapse.metrics

from canonicaljson import encode_canonical_json

from twisted.internet import defer, reactor, ssl
from twisted.internet import defer, reactor, ssl, protocol
from twisted.web.client import (
Agent, readBody, FileBodyProducer, PartialDownloadError,
BrowserLikeRedirectAgent, ContentDecoderAgent, GzipDecoder, Agent,
readBody, FileBodyProducer, PartialDownloadError,
)
from twisted.web.http import PotentialDataLoss
from twisted.web.http_headers import Headers
from twisted.web._newclient import ResponseDone

from StringIO import StringIO

@@ -238,6 +243,96 @@ def get_raw(self, uri, args={}):
else:
raise CodeMessageException(response.code, body)

# XXX: FIXME: This is horribly copy-pasted from matrixfederationclient.
# The two should be factored out.

@defer.inlineCallbacks
def get_file(self, url, output_stream, max_size=None):
"""GETs a file from a given URL
Args:
url (str): The URL to GET
output_stream (file): File to write the response body to.
Returns:
A (int,dict,string,int) tuple of the file length, dict of the response
headers, absolute URI of the response and HTTP response code.
"""

response = yield self.request(
"GET",
url.encode("ascii"),
headers=Headers({
b"User-Agent": [self.user_agent],
})
)

headers = dict(response.headers.getAllRawHeaders())

if 'Content-Length' in headers and int(headers['Content-Length'][0]) > max_size:
logger.warn("Requested URL is too large > %r bytes" % (max_size,))
# XXX: do we want to explicitly drop the connection here somehow? if so, how?
raise SynapseError(502, "Requested file is too large", Codes.TOO_LARGE)

if response.code > 299:
logger.warn("Got %d when downloading %s" % (response.code, url))
raise SynapseError(502, "Got error %d downloading %s" % (response.code, url), Codes.UNKNOWN)

# TODO: if our Content-Type is HTML or something, just read the first
# N bytes into RAM rather than saving it all to disk only to read it
# straight back in again

try:
length = yield preserve_context_over_fn(
_readBodyToFile,
response, output_stream, max_size
)
except:
logger.exception("Failed to download body")
raise

defer.returnValue((length, headers, response.request.absoluteURI, response.code))


# XXX: FIXME: This is horribly copy-pasted from matrixfederationclient.
# The two should be factored out.

class _ReadBodyToFileProtocol(protocol.Protocol):
def __init__(self, stream, deferred, max_size):
self.stream = stream
self.deferred = deferred
self.length = 0
self.max_size = max_size

def dataReceived(self, data):
self.stream.write(data)
self.length += len(data)
if self.max_size is not None and self.length >= self.max_size:
self.deferred.errback(SynapseError(
502,
"Requested file is too large > %r bytes" % (self.max_size,),
Codes.TOO_LARGE,
))
self.deferred = defer.Deferred()
self.transport.loseConnection()

def connectionLost(self, reason):
if reason.check(ResponseDone):
self.deferred.callback(self.length)
elif reason.check(PotentialDataLoss):
# stolen from https://github.com/twisted/treq/pull/49/files
# http://twistedmatrix.com/trac/ticket/4840
self.deferred.callback(self.length)
else:
self.deferred.errback(reason)


# XXX: FIXME: This is horribly copy-pasted from matrixfederationclient.
# The two should be factored out.

def _readBodyToFile(response, stream, max_size):
d = defer.Deferred()
response.deliverBody(_ReadBodyToFileProtocol(stream, d, max_size))
return d


class CaptchaServerHttpClient(SimpleHttpClient):
"""
@@ -269,6 +364,27 @@ def post_urlencoded_get_raw(self, url, args={}):
defer.returnValue(e.response)


class SpiderHttpClient(SimpleHttpClient):
"""
Separate HTTP client for spidering arbitrary URLs.
Special in that it follows redirects and has a UA that looks
like a browser.

Used by the preview_url endpoint in the content repo.
"""
def __init__(self, hs):
SimpleHttpClient.__init__(self, hs)
# clobber the base class's agent and UA:
self.agent = ContentDecoderAgent(BrowserLikeRedirectAgent(Agent(
reactor,
connectTimeout=15,
contextFactory=hs.get_http_client_context_factory()
)), [('gzip', GzipDecoder)])
# We could look like Chrome:
# self.user_agent = ("Mozilla/5.0 (%s) (KHTML, like Gecko)
# Chrome Safari" % hs.version_string)


def encode_urlencode_args(args):
return {k: encode_urlencode_arg(v) for k, v in args.items()}

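`_ReadBodyToFileProtocol` above streams the body to a file while enforcing `max_size` after each chunk. The same check can be sketched without Twisted; the names here (`read_body_to_stream`, `FileTooLargeError`) are hypothetical, not the PR's code:

```python
import io


class FileTooLargeError(Exception):
    pass


def read_body_to_stream(chunks, output_stream, max_size=None):
    """Write an iterable of byte chunks to output_stream, aborting once
    max_size bytes have been reached (mirroring the protocol's length
    check after each dataReceived call)."""
    length = 0
    for chunk in chunks:
        output_stream.write(chunk)
        length += len(chunk)
        if max_size is not None and length >= max_size:
            raise FileTooLargeError(
                "Requested file is too large > %r bytes" % (max_size,)
            )
    return length
```

Note that, like the Twisted version, this only aborts after a chunk has already been written; the early Content-Length check in `get_file` exists so well-behaved servers are rejected before any bytes are transferred.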
1 change: 1 addition & 0 deletions synapse/python_dependencies.py
@@ -36,6 +36,7 @@
"blist": ["blist"],
"pysaml2>=3.0.0,<4.0.0": ["saml2>=3.0.0,<4.0.0"],
"pymacaroons-pynacl": ["pymacaroons"],
"lxml>=3.6.0": ["lxml"],
"pyjwt": ["jwt"],
}
CONDITIONAL_REQUIREMENTS = {
1 change: 1 addition & 0 deletions synapse/rest/media/v1/base_resource.py
@@ -72,6 +72,7 @@ def __init__(self, hs, filepaths):
self.store = hs.get_datastore()
self.max_upload_size = hs.config.max_upload_size
self.max_image_pixels = hs.config.max_image_pixels
self.max_spider_size = hs.config.max_spider_size
self.filepaths = filepaths
self.version_string = hs.version_string
self.downloads = {}
2 changes: 2 additions & 0 deletions synapse/rest/media/v1/media_repository.py
@@ -17,6 +17,7 @@
from .download_resource import DownloadResource
from .thumbnail_resource import ThumbnailResource
from .identicon_resource import IdenticonResource
from .preview_url_resource import PreviewUrlResource
from .filepath import MediaFilePaths

from twisted.web.resource import Resource
@@ -78,3 +79,4 @@ def __init__(self, hs):
self.putChild("download", DownloadResource(hs, filepaths))
self.putChild("thumbnail", ThumbnailResource(hs, filepaths))
self.putChild("identicon", IdenticonResource())
self.putChild("preview_url", PreviewUrlResource(hs, filepaths))