
Improve performance when using findIfExists #2131

Closed
marijnvandevoorde opened this issue May 21, 2024 · 13 comments


@marijnvandevoorde
Contributor

Shlink version

4.1.0

PHP version

8.2

How do you serve Shlink

Self-hosted RoadRunner

Database engine

Microsoft SQL Server

Database version

2022

Current behavior

We did performance tests with this param set to false and with it set to true.

          /\      |‾‾| /‾‾/   /‾‾/   
     /\  /  \     |  |/  /   /  /    
    /  \/    \    |     (   /   ‾‾\  
   /          \   |  |\  \ |  (‾)  | 
  / __________ \  |__| \__\ \_____/ .io

     execution: local
        script: script.js
        output: -

     scenarios: (100.00%) 1 scenario, 10 max VUs, 40s max duration (incl. graceful stop):
              * default: 10 looping VUs for 10s (gracefulStop: 30s)


     data_received..................: 341 kB 34 kB/s
     data_sent......................: 265 kB 26 kB/s
     http_req_blocked...............: avg=16.33µs  min=2µs     med=5µs      max=911µs    p(90)=7µs      p(95)=10µs    
     http_req_connecting............: avg=7.75µs   min=0s      med=0s       max=572µs    p(90)=0s       p(95)=0s      
     http_req_duration..............: avg=159.74ms min=96.56ms med=149.79ms max=528.73ms p(90)=191.99ms p(95)=210.66ms
       { expected_response:true }...: avg=159.74ms min=96.56ms med=149.79ms max=528.73ms p(90)=191.99ms p(95)=210.66ms
     http_req_failed................: 0.00%  ✓ 0        ✗ 630 
     http_req_receiving.............: avg=90.68µs  min=31µs    med=80µs     max=751µs    p(90)=128.1µs  p(95)=168.74µs
     http_req_sending...............: avg=40.05µs  min=13µs    med=37µs     max=381µs    p(90)=52µs     p(95)=62.54µs 
     http_req_tls_handshaking.......: avg=0s       min=0s      med=0s       max=0s       p(90)=0s       p(95)=0s      
     http_req_waiting...............: avg=159.61ms min=96.46ms med=149.7ms  max=528.6ms  p(90)=191.85ms p(95)=210.53ms
     http_reqs......................: 630    62.13345/s
     iteration_duration.............: avg=160ms    min=96.78ms med=150.01ms max=529.87ms p(90)=192.15ms p(95)=210.85ms
     iterations.....................: 630    62.13345/s
     vus............................: 10     min=10     max=10
     vus_max........................: 10     min=10     max=10

So 62 per second. At times reaching 80, so solid!

When we set it to true...


          /\      |‾‾| /‾‾/   /‾‾/   
     /\  /  \     |  |/  /   /  /    
    /  \/    \    |     (   /   ‾‾\  
   /          \   |  |\  \ |  (‾)  | 
  / __________ \  |__| \__\ \_____/ .io

     execution: local
        script: script.js
        output: -

     scenarios: (100.00%) 1 scenario, 10 max VUs, 40s max duration (incl. graceful stop):
              * default: 10 looping VUs for 10s (gracefulStop: 30s)


     data_received..................: 13 kB 928 B/s
     data_sent......................: 10 kB 718 B/s
     http_req_blocked...............: avg=327.7µs  min=3µs   med=9µs     max=810µs p(90)=794.1µs p(95)=805.2µs 
     http_req_connecting............: avg=194.95µs min=0s    med=0s      max=522µs p(90)=490.8µs p(95)=497.1µs 
     http_req_duration..............: avg=5.78s    min=4.29s med=5.74s   max=7.33s p(90)=7.1s    p(95)=7.13s   
       { expected_response:true }...: avg=5.78s    min=4.29s med=5.74s   max=7.33s p(90)=7.1s    p(95)=7.13s   
     http_req_failed................: 0.00% ✓ 0        ✗ 24  
     http_req_receiving.............: avg=147.45µs min=53µs  med=121.5µs max=381µs p(90)=231.4µs p(95)=294.49µs
     http_req_sending...............: avg=45.7µs   min=18µs  med=44µs    max=86µs  p(90)=65µs    p(95)=68.4µs  
     http_req_tls_handshaking.......: avg=0s       min=0s    med=0s      max=0s    p(90)=0s      p(95)=0s      
     http_req_waiting...............: avg=5.78s    min=4.29s med=5.74s   max=7.33s p(90)=7.1s    p(95)=7.13s   
     http_reqs......................: 24    1.714676/s
     iteration_duration.............: avg=5.78s    min=4.29s med=5.74s   max=7.33s p(90)=7.09s   p(95)=7.13s   
     iterations.....................: 24    1.714676/s
     vus............................: 10    min=10     max=10
     vus_max........................: 10    min=10     max=10

Expected behavior

The expectation is that the param only marginally influences performance. Right now, it's terrible.

Minimum steps to reproduce

  • Run Shlink with an MS SQL database behind it.
  • Fill it with 1M unique long URLs.
  • Run a performance test with the flag turned on.
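To make the seeding step above concrete, here is a minimal sketch for generating the 1M unique long URLs. The helper name and counter-based suffix are illustrative (a counter guarantees uniqueness, unlike a short random string); the base URL mirrors the pattern from the k6 script later in this thread. How you load them, via the REST API or direct inserts, is left out.

```python
# Hypothetical seeding helper: yields n guaranteed-unique long URLs.
# The zero-padded counter suffix makes collisions impossible, which a
# short random string would not.
def unique_long_urls(n):
    for i in range(n):
        yield f"https://www.google.com?u{i:07d}"

# Use 1_000_000 for the actual seed; 1000 here keeps the example quick.
urls = list(unique_long_urls(1000))
```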
@acelaya
Member

acelaya commented May 22, 2024

Thanks for #2133. However, I think this needs a bit more thinking.

That PR covers the case in which only the longUrl is provided, but findIfExists takes more things into account (see this method).

Of course, it's better to improve performance for at least one use case, but I would like to think what would be needed to have a more general solution.

For example, I don't know if it's enough to create one index for all "indexable fields", or we would need indexes for all the possible combinations of 2 or more fields, which is not realistic.

Also, there are things that cannot be indexed so easily, like when you try to find a short URL with a specific set of tags.

@acelaya
Member

acelaya commented May 22, 2024

Also, it would be great if you could share your k6 load testing set-up, as I think it's something useful to have.

I tried to create one a couple of years ago to cover common cases, but it ended up "dying" 😅. It would be useful to detect performance regressions.

@marijnvandevoorde
Contributor Author

> Thanks for #2133. However, I think this needs a bit more thinking.
>
> That PR covers the case in which only the longUrl is provided, but findIfExists takes more things into account (see this method).
>
> Of course, it's better to improve performance for at least one use case, but I would like to think what would be needed to have a more general solution.
>
> For example, I don't know if it's enough to create one index for all "indexable fields", or we would need indexes for all the possible combinations of 2 or more fields, which is not realistic.
>
> Also, there are things that cannot be indexed so easily, like when you try to find a short URL with a specific set of tags.

I saw the method, and that's why I reordered the index to check for the long URL first, and then the domain. It would be ideal for our use case, although having one on just the long URL alone would already be nice. At least it would also be used in all other scenarios in that method, but I understand if you want to limit it to just the long URL.

For us, it could even be a unique index, but since Shlink offers the flexibility to either recreate or reuse, this is not an option, and it's not something we would ask you to change. Worst case, we just add the index ourselves and get it over with :-)

For k6s, I'm just running it locally. The script is super simple in fact:

import http from 'k6/http';
import { randomString } from 'https://jslib.k6.io/k6-utils/1.4.0/index.js';

export const options = {
    // A number specifying the number of VUs to run concurrently.
    vus: 10,
    // A string specifying the total duration of the test run.
    duration: '10s',
};

export default function () {
    const randomURL = randomString(8);

    const payload = JSON.stringify({
        "longUrl": "https://www.google.com?" + randomURL,
        "validUntil": "2024-06-25T23:27:00+02:00",
        "crawlable": false,
        "forwardQuery": false,
        "findIfExists": true,
        "domain": "localhost:8000"
    });

    const url = 'http://localhost:8000/rest/v2/short-urls';

    const params = {
        headers: {
            'Content-Type': 'application/json',
            'Accept': 'application/json',
            'X-Api-Key': '[API KEY HERE]'
        },
    };

    http.post(url, payload, params);
}

That is the essence of it.

Haven't really looked into running it in the cloud for now, though; we basically needed to know whether it would manage the load we anticipate for the next 2 years or so.

@marijnvandevoorde
Contributor Author

As for the indexes, not sure whether I'm telling you things you already know, but:
Database engines will use parts of indexes where they can. So an index on [original_url,domain_id] will come in handy even when a query does not contain the domain_id in the where clause. However, an index on [domain_id, original_url] will only be useful in queries that have the domain_id in the where clause.

This is why it's important to think about the order of your composite indexes. It's also why there is no use in having 2 indexes:

  • one on just the original_url
  • one composite index on [original_url, other_column]

The reason is that any query that would benefit from the first index could also use the second. It would make no difference.

So I'd say either have an index on just the original_url, or have a combined index with original_url + the most common other columns used in this method?
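The leftmost-prefix behavior described above can be demonstrated in a few lines. This sketch uses SQLite (not the SQL Server/MySQL engines discussed in the thread, but the prefix principle is the same), and the table/column names mirror the ones in this discussion; they are not verified against Shlink's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE short_urls (id INTEGER, original_url TEXT, domain_id INTEGER)"
)
# Composite index with original_url as the leading column.
conn.execute(
    "CREATE INDEX idx_url_domain ON short_urls (original_url, domain_id)"
)

def plan(query):
    """Return the query plan details as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    return " ".join(str(row[-1]) for row in rows)

# Filtering on the leading column alone can use the composite index.
p1 = plan("SELECT * FROM short_urls WHERE original_url = 'x'")

# Filtering on the trailing column alone cannot, and falls back to a scan.
p2 = plan("SELECT * FROM short_urls WHERE domain_id = 1")

# Filtering on both columns uses the full index.
p3 = plan("SELECT * FROM short_urls WHERE original_url = 'x' AND domain_id = 1")
```

Printing the three plans shows `p1` and `p3` searching via `idx_url_domain` while `p2` scans the table, which is exactly why the column order matters.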

@acelaya
Member

acelaya commented May 22, 2024

> For k6s, I'm just running it locally. The script is super simple in fact:
>
> import http from 'k6/http';
> import { randomString } from 'https://jslib.k6.io/k6-utils/1.4.0/index.js';
>
> export const options = {
>     // A number specifying the number of VUs to run concurrently.
>     vus: 10,
>     // A string specifying the total duration of the test run.
>     duration: '10s',
> };
>
> export default function () {
>     const randomURL = randomString(8);
>
>     const payload = JSON.stringify({
>         "longUrl": "https://www.google.com?" + randomURL,
>         "validUntil": "2024-06-25T23:27:00+02:00",
>         "crawlable": false,
>         "forwardQuery": false,
>         "findIfExists": true,
>         "domain": "localhost:8000"
>     });
>
>     const url = 'http://localhost:8000/rest/v2/short-urls';
>
>     const params = {
>         headers: {
>             'Content-Type': 'application/json',
>             'Accept': 'application/json',
>             'X-Api-Key': '[API KEY HERE]'
>         },
>     };
>
>     http.post(url, payload, params);
> }
>
> That is the essence of it.
>
> Haven't really looked into running it in the cloud for now, though; we basically needed to know whether it would manage the load we anticipate for the next 2 years or so.

This is what I was looking for. Thanks!

I created some similar scripts a few years ago. I would swear I pushed them to a repository under Shlink's org, but I couldn't find them 🤷🏼‍♂️

@acelaya
Member

acelaya commented May 22, 2024

As per the indexes, I wasn't trying to say that we should cover the long URL only, or that the order of domain+url vs url+domain was wrong.

What I meant is that there are plenty of other conditions that will end up being used while attempting to match an existing short URL when findIfExists is true, like the tags, custom slug, max visits, etc. That's why I was referencing the method; see all the if statements there that would affect the query conditions if provided.

So, my point is that, while it would be possible to just create an index on long_url+domain, and that would probably improve performance in some cases, I would like to understand the requirements to cover all the other fields as well, as my understanding is that an index on long_url+domain would not make a difference if the WHERE clause includes other fields which are not part of any index.

And at the same time, I don't know if adding an index which includes a lot of the other fields will have other side effects in terms of disk space, inserts/updates performance, and ultimately, if it will still improve performance when conditions only include some of the fields which compose the index.

Those are things I definitely don't know 😅. I would need to test a few of those scenarios in order to determine the right course of action, for short and long term.

@acelaya acelaya added enhancement and removed bug labels May 22, 2024
@marijnvandevoorde
Contributor Author

marijnvandevoorde commented May 22, 2024

> So, my point is that, while it would be possible to just create an index on long_url+domain, and that would probably improve performance in some cases, I would like to understand the requirements to cover all the other fields as well, as my understanding is that an index on long_url+domain would not make a difference if the WHERE clause includes other fields which are not part of any index.

That's where the order of your index becomes important. As long as the original_url is the first thing in the index, it will use the index and benefit from it. This column is always part of the where clause. Realistically this will be enough. Worst case you still have a few records left at that point (let's say you have different domains and the same long URL has been shortened on all of them), but that sure beats scanning a couple of thousand records.

> And at the same time, I don't know if adding an index which includes a lot of the other fields will have other side effects in terms of disk space, inserts/updates performance, and ultimately, if it will still improve performance when conditions only include some of the fields which compose the index.
Less familiar with those. It's true that having more indexes will slow down inserts and increase memory and disk space usage, etc.

> Those are things I definitely don't know 😅. I would need to test a few of those scenarios in order to determine the right course of action, for short and long term.

Sure thing. It's not a blocker for us anyway since we're fine with starting with the findIfExists = false. It's more finetuning for later :-)

@acelaya
Member

acelaya commented May 22, 2024

> That's where the order of your index becomes important. As long as the original_url is the first thing in the index, it will use the index and benefit from it. This column is always part of the where clause. Realistically this will be enough. Worst case you still have a few records left at that point (let's say you have different domains and the same long URL has been shortened on all of them), but that sure beats scanning a couple of thousand records.

Gotcha.

@marijnvandevoorde
Contributor Author

Soooo, today I learned:
Microsoft SQL Server does not support indexes on TEXT fields, so the migration will fail for those.
So in order to do this, we'd have to change the field type to something like a varchar. I'm not sure why you decided on having TEXT fields? It seems overkill for something like a URL, at least for our use case, but maybe I'm overlooking something?

@acelaya
Member

acelaya commented May 23, 2024

I'm not sure why you decided on having TEXT fields?

This #1674

This was implemented after having to increment the length a few times, and still having someone who needed yet a bigger field.

@marijnvandevoorde
Contributor Author

That's what I was afraid of, haha. Anyway, given it's at the database level, we'll probably run some separate migrations to change it to a varchar and add an index to it. I understand that you need the flexibility for Shlink, so no need to follow up on this one for us. However, be aware that performance drops drastically for us when reaching a decent amount of URLs and using findIfExists. The MR will work for MySQL by setting a key length. I think Postgres has an operator that will make it work for text fields. But MS SQL offers no options, it seems.
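For reference, the per-engine options mentioned above look roughly like this. This is an illustrative sketch, not Shlink's actual migrations: the table/index names are assumptions, and the SQL Server length is chosen only to stay within its 1700-byte nonclustered index key limit.

```sql
-- MySQL/MariaDB: a TEXT column can be indexed by giving a key (prefix) length.
CREATE INDEX idx_original_url ON short_urls (original_url(191));

-- PostgreSQL: text columns can be indexed directly (subject to the btree
-- row-size limit); an expression index over a hash is a common fallback
-- for very long values.
CREATE INDEX idx_original_url_md5 ON short_urls (md5(original_url));

-- SQL Server: TEXT / VARCHAR(MAX) cannot be index key columns, so the
-- column must first become a bounded type within the key-size limit.
ALTER TABLE short_urls ALTER COLUMN original_url VARCHAR(1700);
CREATE INDEX idx_original_url ON short_urls (original_url);
```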

@acelaya
Member

acelaya commented May 24, 2024

Adding an index for all engines but MS SQL would be okay-ish, as it would benefit most people. I know your team is using MS SQL specifically; I'm thinking more from a broad product perspective.

That said, I tried to push back a few times on incrementing the long URL field type/size. It was a 2k varchar just before transitioning it to text, and I couldn't think of a reasonable use case needing anything bigger than that. However, I couldn't come up with good arguments against it, except gut feeling.

@acelaya acelaya added this to the 4.2.0 milestone May 30, 2024
@acelaya acelaya moved this to Todo in Shlink Jul 3, 2024
@acelaya acelaya modified the milestones: 4.2.0, 4.3.0 Aug 11, 2024
@acelaya acelaya removed the status in Shlink Aug 11, 2024
@acelaya acelaya moved this to Todo in Shlink Oct 8, 2024
@acelaya acelaya modified the milestones: 4.3.0, 4.4.0 Nov 18, 2024
@acelaya acelaya removed this from Shlink Nov 18, 2024
@acelaya acelaya added this to Shlink Nov 30, 2024
@acelaya acelaya moved this to Todo in Shlink Nov 30, 2024
@acelaya
Member

acelaya commented Dec 2, 2024

I've been investigating ways to improve performance here, and unfortunately there are so many combinations that it is not easy to do it consistently, which adds up to the other things already discussed here.

I'm afraid I will have to close this for now.

@acelaya acelaya closed this as not planned Dec 2, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in Shlink Dec 2, 2024
@acelaya acelaya removed this from Shlink Dec 2, 2024
@acelaya acelaya removed this from the 4.4.0 milestone Dec 2, 2024