Link parsing #120

Merged
merged 4 commits into from
Jul 28, 2021
1 change: 1 addition & 0 deletions README.md
@@ -129,6 +129,7 @@ The following libraries and frameworks are used by GoToSocial, with gratitude
* [gorilla/websocket](https://github.com/gorilla/websocket); Websocket connectivity. [BSD-2-Clause License](https://spdx.org/licenses/BSD-2-Clause.html).
* [h2non/filetype](https://github.com/h2non/filetype); filetype checking. [MIT License](https://spdx.org/licenses/MIT.html).
* [microcosm-cc/bluemonday](https://github.com/microcosm-cc/bluemonday); HTML user-input sanitization. [BSD-3-Clause License](https://spdx.org/licenses/BSD-3-Clause.html).
* [mvdan/xurls](https://github.com/mvdan/xurls); URL parsing regular expressions. [BSD-3-Clause License](https://spdx.org/licenses/BSD-3-Clause.html).
* [nfnt/resize](https://github.com/nfnt/resize); convenient image resizing. [ISC License](https://spdx.org/licenses/ISC.html).
* [oklog/ulid](https://github.com/oklog/ulid); sequential, database-friendly ID generation. [Apache-2.0 License](https://spdx.org/licenses/Apache-2.0.html).
* [sirupsen/logrus](https://github.com/sirupsen/logrus); logging. [MIT License](https://spdx.org/licenses/MIT.html).
149 changes: 149 additions & 0 deletions docs/user_guide/posts.md
@@ -0,0 +1,149 @@
# Posts

## Input Types

GoToSocial currently accepts two different types of input for posts. These are:

* `plain`
* `markdown`

Plain is the default method of posting: GtS accepts some plain-looking text and converts it into some nice HTML by parsing links, mentions, and so on. If you're used to Mastodon, Twitter, or most other social media platforms, this way of writing posts will be immediately familiar.

Markdown is a more complex way of organizing text, which gives you more control over how your text is parsed and formatted.

For more information on markdown, see [The Markdown Guide](https://www.markdownguide.org/).

## Formatting

When a post is submitted in `plain` format, GoToSocial automatically does some tidying up and formatting of the post in order to convert it to HTML, as described below.

### Whitespace

Any leading or trailing whitespace, including newlines, is removed from the post. So for example:

```text


this post starts with some newlines
```

will become:

```text
this post starts with some newlines
```
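
As a rough illustration of the trimming step, here is a minimal Go sketch. The function name `trimPost` is hypothetical, and GoToSocial's own implementation may differ; the sketch only shows that the standard library's `strings.TrimSpace` gives the behaviour described above.

```go
package main

import (
	"fmt"
	"strings"
)

// trimPost is a hypothetical helper: it removes leading and trailing
// whitespace (spaces, tabs, newlines) and leaves the middle untouched.
func trimPost(in string) string {
	return strings.TrimSpace(in)
}

func main() {
	// The two leading newlines are dropped; the text itself is unchanged.
	fmt.Printf("%q\n", trimPost("\n\nthis post starts with some newlines"))
}
```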

### Wrapping

The whole post will be wrapped in `<p></p>`.

So the following text:

```text
Hi here's a little post!
```

will become:

```html
<p>Hi here's a little post!</p>
```

### Linebreaks

Any newlines will be replaced with `<br />`.

So to continue the above example:

```text
Hi here's a little post!

And here's another line.
```

will become:

```html
<p>Hi here's a little post!<br /><br />And here's another line.</p>
```
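
The whitespace, wrapping, and linebreak steps can be sketched together in Go. This is only an illustration under stated assumptions: `formatPlain` is a hypothetical name, the sketch handles `\n` newlines only, and it skips link and mention parsing entirely.

```go
package main

import (
	"fmt"
	"strings"
)

// formatPlain is a hypothetical helper that applies three of the steps
// described above, in order: trim, replace newlines, wrap in <p></p>.
func formatPlain(in string) string {
	trimmed := strings.TrimSpace(in)
	// Each newline becomes its own <br />, so a blank line yields two.
	withBreaks := strings.ReplaceAll(trimmed, "\n", "<br />")
	return "<p>" + withBreaks + "</p>"
}

func main() {
	fmt.Println(formatPlain("Hi here's a little post!\n\nAnd here's another line."))
}
```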

### Links

Any recognizable links in the text will be shortened and turned into proper hyperlinks, and have some additional attributes added to them.

For example:

```text
Here's a link to something: https://example.org/some/link/address
```

will become:

```html
Here's a link to something: <a href="https://example.org/some/link/address" rel="nofollow noreferrer noopener">example.org/some/link/address</a>
```

which will be rendered as:

> Here's a link to something: [example.org/some/link/address](https://example.org/some/link/address)

Note that this will only work for `http` and `https` links; other schemes are not supported.
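
To make the shortening and the scheme restriction concrete, here is a small Go sketch using only the standard library. The name `displayText` is hypothetical, and the sketch keeps just the host and path; how GoToSocial treats query strings and fragments is not covered here.

```go
package main

import (
	"fmt"
	"net/url"
)

// displayText is a hypothetical helper. It returns the shortened text for a
// link (host plus path, without the scheme), and false if the link cannot be
// parsed or does not use http or https.
func displayText(raw string) (string, bool) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", false
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", false
	}
	return u.Hostname() + u.Path, true
}

func main() {
	for _, raw := range []string{
		"https://example.org/some/link/address",
		"ftp://example.org/some/file",
	} {
		text, ok := displayText(raw)
		fmt.Printf("%s -> %q (supported: %v)\n", raw, text, ok)
	}
}
```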

### Mentions

You can 'mention' another account by referring to the account in the following way:

> @some_account@example.org

In this example, `some_account` is the username of the account you want to mention, and `example.org` is the domain that hosts their account.

The mentioned account will get a notification that you've mentioned them, and be able to see the post in which they were mentioned.

Mentions are formatted in a similar way to links, so:

```text
hi @some_account@example.org how's it going?
```

will become:

```html
hi <span class="h-card"><a href="https://example.org/@some_account" class="u-url mention">@<span>some_account</span></a></span> how's it going?
```

which will be rendered as:

> hi <span class="h-card"><a href="https://example.org/@some_account" class="u-url mention">@<span>some_account</span></a></span> how's it going?

When mentioning local accounts (i.e., accounts on your instance), the second part of the mention is not necessary. If there's an account with username `local_account_person` on your instance, you can mention them just by writing:

```text
hey @local_account_person you're my neighbour
```

This will become:

```html
hey <span class="h-card"><a href="https://my.instance.org/@local_account_person" class="u-url mention">@<span>local_account_person</span></a></span> you're my neighbour
```

which will be rendered as:

> hey <span class="h-card"><a href="https://my.instance.org/@local_account_person" class="u-url mention">@<span>local_account_person</span></a></span> you're my neighbour
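
The two mention forms can be told apart with a small amount of parsing. The Go sketch below is an assumption-laden illustration: the `mention` type, the `findMentions` function, and the regular expression are all hypothetical and much simpler than anything a real server would need (for instance, the pattern would also match part of an email address).

```go
package main

import (
	"fmt"
	"regexp"
)

// mentionRegex is a simplified, hypothetical pattern: "@username",
// optionally followed by "@domain".
var mentionRegex = regexp.MustCompile(`@(\w+)(?:@([\w.-]+\w))?`)

// mention holds the parts of one mention. Domain is empty for local accounts.
type mention struct {
	Username string
	Domain   string
}

// findMentions returns every mention found in the text, in order.
func findMentions(in string) []mention {
	var out []mention
	for _, m := range mentionRegex.FindAllStringSubmatch(in, -1) {
		out = append(out, mention{Username: m[1], Domain: m[2]})
	}
	return out
}

func main() {
	fmt.Printf("%+v\n", findMentions("hi @some_account@example.org how's it going?"))
	fmt.Printf("%+v\n", findMentions("hey @local_account_person you're my neighbour"))
}
```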

## Input Sanitization

In order not to spread scripts, vulnerabilities, and glitchy HTML all over the place, GoToSocial performs the following types of input sanitization:

`plain` input type:

* Before parsing, any existing HTML is completely removed from the post body and content-warning fields.
* After parsing, all generated HTML is run through a sanitizer to remove harmful elements.

`markdown` input type:

* Before parsing, any existing HTML is completely removed from the content-warning field.
* Before parsing, any existing HTML in the post body is run through a sanitizer to remove harmful elements.
* After parsing, all generated HTML is run through a sanitizer to remove harmful elements.

GoToSocial uses [bluemonday](https://github.com/microcosm-cc/bluemonday) for HTML sanitization.
48 changes: 0 additions & 48 deletions docs/user_guide/writing_posts.md

This file was deleted.

1 change: 1 addition & 0 deletions go.mod
@@ -55,4 +55,5 @@ require (
gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c // indirect
gopkg.in/yaml.v2 v2.4.0
gopkg.in/yaml.v3 v3.0.0-20210107192922-496545a6307b // indirect
mvdan.cc/xurls/v2 v2.3.0
)
4 changes: 4 additions & 0 deletions go.sum
@@ -301,13 +301,15 @@ github.com/onsi/gomega v1.10.1/go.mod h1:iN09h71vgCQne3DLsj+A5owkum+a2tYe+TOCB1y
github.com/onsi/gomega v1.10.3/go.mod h1:V9xEwhxec5O8UDM77eCW8vLymOMltsqPVYWrpDsH8xc=
github.com/onsi/gomega v1.14.0 h1:ep6kpPVwmr/nTbklSx2nrLNSIO62DoYAhnPNIMhK8gI=
github.com/onsi/gomega v1.14.0/go.mod h1:cIuvLEne0aoVhAgh/O6ac0Op8WWw9H6eYCriF+tEHG0=
github.com/pkg/diff v0.0.0-20210226163009-20ebb0f2a09e/go.mod h1:pJLUxLENpZxwdsKMEsNbx1VGcRFpLqf3715MtcvvzbA=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/prometheus/client_model v0.0.0-20190812154241-14fe0d1b01d4/go.mod h1:xMI15A0UPsDsEKsMN9yxemIoYk6Tm2C1GtYGdfGttqA=
github.com/quasoft/memstore v0.0.0-20180925164028-84a050167438/go.mod h1:wTPjTepVu7uJBYgZ0SdWHQlIas582j6cn2jgk4DDdlg=
github.com/quasoft/memstore v0.0.0-20191010062613-2bce066d2b0b h1:aUNXCGgukb4gtY99imuIeoh8Vr0GSwAlYxPAhqZrpFc=
github.com/quasoft/memstore v0.0.0-20191010062613-2bce066d2b0b/go.mod h1:wTPjTepVu7uJBYgZ0SdWHQlIas582j6cn2jgk4DDdlg=
github.com/rogpeppe/go-internal v1.3.0/go.mod h1:M8bDsm7K2OlrFYOpmOWEs/qY81heoFRclV5y23lUDJ4=
github.com/rogpeppe/go-internal v1.8.0/go.mod h1:WmiCO8CzOY8rg0OYDC4/i/2WRWAB6poM+XZ2dLUbcbE=
github.com/russross/blackfriday/v2 v2.0.1/go.mod h1:+Rmxgy9KzJVeS9/2gXHxylqXiyQDYRxCVz55jmeOWTM=
github.com/russross/blackfriday/v2 v2.1.0 h1:JIOH55/0cWyOuilr9/qlrm0BSXldqnqwMsf35Ld67mk=
github.com/russross/blackfriday/v2 v2.1.0/go.mod h1:+Rmxgy9KzJVeS9/2gXHxylqXiyQDYRxCVz55jmeOWTM=
@@ -729,6 +731,8 @@ honnef.co/go/tools v0.0.1-2020.1.3/go.mod h1:X/FiERA/W4tHapMX5mGpAtMSVEeEUOyHaw9
honnef.co/go/tools v0.0.1-2020.1.4/go.mod h1:X/FiERA/W4tHapMX5mGpAtMSVEeEUOyHaw9vFzvIQ3k=
mellium.im/sasl v0.2.1 h1:nspKSRg7/SyO0cRGY71OkfHab8tf9kCts6a6oTDut0w=
mellium.im/sasl v0.2.1/go.mod h1:ROaEDLQNuf9vjKqE1SrAfnsobm2YKXT1gnN1uDp1PjQ=
mvdan.cc/xurls/v2 v2.3.0 h1:59Olnbt67UKpxF1EwVBopJvkSUBmgtb468E4GVWIZ1I=
mvdan.cc/xurls/v2 v2.3.0/go.mod h1:AjuTy7gEiUArFMjgBBDU4SMxlfUYsRokpJQgNWOt3e4=
rsc.io/binaryregexp v0.2.0/go.mod h1:qTv7/COck+e2FymRvadv62gMdZztPaShugOCi3I+8D8=
rsc.io/quote/v3 v3.1.0/go.mod h1:yEA65RcK8LyAZtP9Kv3t0HmxON59tX3rD+tICJqUlj0=
rsc.io/sampler v1.3.0/go.mod h1:T1hPZKmBbMNahiBKFy5HrXp6adAjACjK9JXDnKaTXpA=
9 changes: 6 additions & 3 deletions internal/processing/status/util.go
@@ -8,6 +8,7 @@ import (
 	"github.com/superseriousbusiness/gotosocial/internal/db"
 	"github.com/superseriousbusiness/gotosocial/internal/gtsmodel"
 	"github.com/superseriousbusiness/gotosocial/internal/id"
+	"github.com/superseriousbusiness/gotosocial/internal/text"
 	"github.com/superseriousbusiness/gotosocial/internal/util"
 )

@@ -248,13 +249,15 @@ func (p *processor) processContent(form *apimodel.AdvancedStatusCreateForm, acco
 		form.Format = apimodel.StatusFormatDefault
 	}

+	// remove any existing html from the status
+	content := text.RemoveHTML(form.Status)
+
 	// parse content out of the status depending on what format has been submitted
-	var content string
 	switch form.Format {
 	case apimodel.StatusFormatPlain:
-		content = p.formatter.FromPlain(form.Status, status.GTSMentions, status.GTSTags)
+		content = p.formatter.FromPlain(content, status.GTSMentions, status.GTSTags)
 	case apimodel.StatusFormatMarkdown:
-		content = p.formatter.FromMarkdown(form.Status, status.GTSMentions, status.GTSTags)
+		content = p.formatter.FromMarkdown(content, status.GTSMentions, status.GTSTags)
 	default:
 		return fmt.Errorf("format %s not recognised as a valid status format", form.Format)
 	}
115 changes: 115 additions & 0 deletions internal/text/link.go
@@ -0,0 +1,115 @@
/*
GoToSocial
Copyright (C) 2021 GoToSocial Authors admin@gotosocial.org

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
*/

package text

import (
	"fmt"
	"net/url"

	"mvdan.cc/xurls/v2"
)

// schemes is the regex for schemes we accept when looking for links.
// Basically, we accept https or http.
var schemes = `(((http|https))://)`

// FindLinks parses the given string looking for recognizable URLs (including scheme).
// It returns a list of those URLs, without changing the string, or an error if something goes wrong.
// If no URLs are found within the given string, an empty slice and nil will be returned.
func FindLinks(in string) ([]*url.URL, error) {
	rxStrict, err := xurls.StrictMatchingScheme(schemes)
	if err != nil {
		return nil, err
	}

	urls := []*url.URL{}

	// bail already if we don't find anything
	found := rxStrict.FindAllString(in, -1)
	if len(found) == 0 {
		return urls, nil
	}

	// for each string we find, we want to parse it into a URL if we can
	// if we fail to parse it, just ignore this match and continue
	for _, f := range found {
		u, err := url.Parse(f)
		if err != nil {
			continue
		}
		urls = append(urls, u)
	}

	// deduplicate the URLs
	urlsDeduped := []*url.URL{}

	for _, u := range urls {
		if !contains(urlsDeduped, u) {
			urlsDeduped = append(urlsDeduped, u)
		}
	}

	return urlsDeduped, nil
}

// contains checks if the given url is already within a slice of URLs
func contains(urls []*url.URL, url *url.URL) bool {
	for _, u := range urls {
		if u.String() == url.String() {
			return true
		}
	}
	return false
}

// ReplaceLinks replaces all detected links in a piece of text with their HTML (href) equivalents.
// Note: because Go doesn't allow negative lookbehinds in regex, it's possible that an already-formatted
// href will end up double-formatted, if the text you pass here contains one or more hrefs already.
// To avoid this, you should sanitize any HTML out of text before you pass it into this function.
func ReplaceLinks(in string) string {
	rxStrict, err := xurls.StrictMatchingScheme(schemes)
	if err != nil {
		panic(err)
	}

	replaced := rxStrict.ReplaceAllStringFunc(in, func(urlString string) string {
		thisURL, err := url.Parse(urlString)
		if err != nil {
			return urlString // we can't parse it as a URL so don't replace it
		}

		shortString := thisURL.Hostname()

		if thisURL.Path != "" {
			shortString = shortString + thisURL.Path
		}

		// in a URL the query comes before the fragment, so append it first
		if thisURL.RawQuery != "" {
			shortString = shortString + "?" + thisURL.RawQuery
		}

		if thisURL.Fragment != "" {
			shortString = shortString + "#" + thisURL.Fragment
		}

		replacement := fmt.Sprintf(`<a href="%s" rel="noopener">%s</a>`, urlString, shortString)
		return replacement
	})
	return replaced
}
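
The deduplication in `FindLinks` compares every URL against every URL already kept. One possible alternative, sketched below under stated assumptions, is to track seen URLs in a map keyed on `u.String()`, which keeps first-seen order and avoids the nested loop. The names `dedupeURLs` and `parseAll` are hypothetical and not part of this PR.

```go
package main

import (
	"fmt"
	"net/url"
)

// dedupeURLs is a hypothetical alternative to the contains-based loop: it
// returns the URLs with duplicates removed, preserving first-seen order.
func dedupeURLs(urls []*url.URL) []*url.URL {
	seen := make(map[string]struct{}, len(urls))
	deduped := make([]*url.URL, 0, len(urls))
	for _, u := range urls {
		key := u.String()
		if _, ok := seen[key]; ok {
			continue // already kept an identical URL
		}
		seen[key] = struct{}{}
		deduped = append(deduped, u)
	}
	return deduped
}

// parseAll is a hypothetical helper for the example: it parses each string
// and skips any that fail to parse, as FindLinks does.
func parseAll(raws []string) []*url.URL {
	urls := make([]*url.URL, 0, len(raws))
	for _, raw := range raws {
		u, err := url.Parse(raw)
		if err != nil {
			continue
		}
		urls = append(urls, u)
	}
	return urls
}

func main() {
	urls := parseAll([]string{
		"https://example.org/a",
		"https://example.org/b",
		"https://example.org/a",
	})
	for _, u := range dedupeURLs(urls) {
		fmt.Println(u)
	}
}
```

For the handful of links in a typical post the difference is negligible, so this is a matter of taste rather than a needed fix.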