-
-
Notifications
You must be signed in to change notification settings - Fork 679
WIP: Implement failing media API tests #1775
WIP: Implement failing media API tests #1775
Conversation
Signed-off-by: Rudraksh Pareek <54525605+DelusionalOptimist@users.noreply.github.com>
Signed-off-by: Rudraksh Pareek <54525605+DelusionalOptimist@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
URL previews aren't an easy thing to tackle due to the amount of paranoia you need to apply when downloading arbitrary URLs from clients. I didn't really think about this when tagging this issue as good-first-issue
so that's my bad. At present, there is still a lot of work which needs to be done before this is suitable to be merged.
request := &PreviewUrlRequest{ | ||
url: *parsedUrl, | ||
// Convert timestamp to ms | ||
ts: types.UnixMs(ts / 1000000), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tsStr
will be in milliseconds already looking at https://github.com/matrix-org/synapse/blob/4b965c862dc66c0da5d3240add70e9b5f0aa720b/synapse/rest/media/v1/preview_url_resource.py#L198 so it should just be fine to do ts: types.UnixMs(ts)
.
urlString := previewReq.url.String() | ||
|
||
// Get the URL | ||
response, err := http.Get(urlString) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't use the default client as it has no timeout, so malicious preview URLs could just not return anything and consume a file descriptor on the server. Use a client with a sensible timeout please e.g &http.Client{Timeout:30 * time.Second}
} | ||
|
||
// parse the url | ||
parsedUrl, err := url.Parse(urlToPreview) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to check this URL against a blacklist of URLs otherwise this is just an arbitrary proxy which is 😨
See https://github.com/matrix-org/synapse/blob/4b965c862dc66c0da5d3240add70e9b5f0aa720b/synapse/rest/media/v1/preview_url_resource.py#L202 and https://github.com/matrix-org/synapse/blob/240b3ce253b2b36d79a0a6cc18aefd70ad066de3/synapse/config/repository.py#L323
|
||
// Save the response body to the temporary dir | ||
// TODO: should be a new directory like url_cache maybe and | ||
hash, bytesWritten, tmpDir, err := fileutils.WriteTempFile(ctx, response.Body, cfg.AbsBasePath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We definitely want to set a cap on the number of bytes we're willing to write to file, otherwise malicious urls can return a 10GB file and consume lots of disk space. 1MB should be plenty.
Origin: cfg.Matrix.ServerName, | ||
FileSizeBytes: bytesWritten, | ||
Base64Hash: hash, | ||
ContentType: types.ContentType(response.Header["Content-Type"][0]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use response.Header.Get("Content-Type")
instead please. This otherwise will panic if the response has no Content-Type header.
return urlMetadata, nil | ||
} | ||
|
||
func (m *mediaInfo) generateMediaID(ctx context.Context, db storage.Database) (types.MediaID, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function will race with multiple uploads. If 2 people upload a file at the same time it's possible that they will be allocated the same media ID which will then be checked in the database, see it's free and then overwrite each other later on. This is however unlikely given the 32 bytes of entropy used here so maybe that's okay.
if err != nil { | ||
return nil, &util.JSONResponse{ | ||
Code: http.StatusInternalServerError, | ||
JSON: jsonerror.Unknown("Couldn't get the path to the stored file " + err.Error()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorrect error message.
|
||
// Iterating the *html.Node, looking for og data | ||
var f func(*html.Node) | ||
f = func(n *html.Node) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just do f := func(n *html.Node)
to avoid the declaration on :371
?
} | ||
} | ||
for c := n.FirstChild; c != nil; c = c.NextSibling { | ||
f(c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to limit how much we're willing to explore on the HTML to avoid pathological cases which consume lots of CPU.
@@ -0,0 +1,400 @@ | |||
// Copyright 2017 Vector Creations Ltd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copyrights are wrong.
Hey @kegsay! Thanks for taking a look at this. |
The biggest thing standing between us and efficient docker cached builds is the `RELEASE_VERSION` arg. Since this argument is always changing, and it's one of the first things used in the Dockerfile, our docker build cache is always invalidated. This PR addresses that problem via: 1) Remove the -ldflags argument to the go build 2) Move the RELEASE_VERSION arg to the bottom 3) Make dendrite use the environment variable, instead of the build-time arg 4) Make dockerfile pass the build-time version ARG as an environment variable into the image --------- Co-authored-by: Brian Meek <brian@hntlabs.com>
Fixes #1303
× Can fetch images in room
× Test URL preview
×
Can read configuration endpointPreviews just work as of now, pretty tentative implementation
Currently storing the response body for the url to be previewed is heavily inspired by the
/upload
endpoint. Not sure but probably gotta make aurl_cache
inmedia_store
(similar to synapse).Would be grateful to receive suggestions and comments. 😁
Pull Request Checklist
sytest-whitelist
as specified in docs/sytest.mdSigned-off-by:
Rudraksh Pareek <54525605+DelusionalOptimist@users.noreply.github.com>