-
-
Notifications
You must be signed in to change notification settings - Fork 277
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance when using findIfExists #2131
Comments
Thanks for #2133. However, I think this needs a bit more thinking. That PR is covering the case in which only the longUrl is provided, but Of course, it's better to improve performance for at least one use case, but I would like to think what would be needed to have a more general solution. For example, I don't know if it's enough to create one index for all "indexable fields", or we would need indexes for all the possible combinations of 2 or more fields, which is not realistic. Also, there are things that cannot be indexed so easily, like when you try to find a short URLs with a specific set of tags. |
Also, it would be great if you could share your k6 load testing set-up, as I think it's something useful to have. I tried to create one a couple of years ago to cover common cases, but it ended up "dying" 😅. It would be useful to detect performance regressions. |
I saw the method and that's why I reverted the index to check for the long url first, and then the domain. It would be ideal for our usecase, although having one on just the long url alone would already be nice. At least it would also be used in all other scenarios in that method, but I understand if you want to limit it to just the long url. For us, it could even be a unique, but since Shlink offers the flexibility to either recreate or reuse, this is not an option and it's not something we would ask you to change. Worst case, we just add the index for ourselves and get it over with :-) For k6s, I'm just running it locally. The script is super simple in fact:
That is the essence of it. Haven't really looked into running it in the cloud for now though, we basically needed to know if it would manage the load we anticipate for the next 2 years or so. |
As for the indexes, not sure whether I'm telling you things you already know, but: This is why it's important to think about the order of your composite indexes. It's also why there is no use in having 2 indexes:
So I'd say either have an index on just the original_url, or have a combined index with orginal_url + most common other columns to be used in this method? |
This is what I was looking for. Thanks! I created some similar scripts a few years ago. I would swear I pushed them to a repository under Shlink's org, but I couldn't find them 🤷🏼♂️ |
As per the indexes, I wasn't trying to say that we should cover the long URL only, or that the order of domain+url vs url+domain was wrong. What I meant is that there are plenty of other conditions that will end up being used while attyempting to match an existing short URL when So, my point is that, while it would be possible to just create an index on long_url+domain, and that would probably improve performance in some cases, I would like to understand the requirements to cover all the other fields as well, as my understanding is that an index on long_url+domain would not make a difference if the WHERE clause includes other fields which are not part of any index. And at the same time, I don't know if adding an index which includes a lot of the other fields will have other side effects in terms of disk space, inserts/updates performance, and ultimately, if it will still improve performance when conditions only include some of the fields which compose the index. Those are things I definitely don't know 😅. I would need to test a few of those scenarios in order to determine the right course of action, for short and long term. |
That's where the order of your index becomes important. As long as the original_url is the first thing in the index, it will use the index and benefit from it. This column is always part of the where clause. Realistically this will be enough. Worst case you still have a few records left at that point (let's say you have different domains and the same long url has been shortened on all of them), but that sure beats scanning a coulple of thousand records.
Sure thing. It's not a blocker for us anyway since we're fine with starting with the findIfExists = false. It's more finetuning for later :-) |
Gotcha. |
Soooo, today I learned: |
This #1674 This was implemented after having to increment the length a few times, and still having someone who needed yet a bigger field. |
That's what I was afraid of, haha. Anyway, given it's at database level, we'll probably run some separate migrations to change it to a varchar and add an index to it. I understand that you need the flexibility for shlink. So no need to follow up on this one for us. However, be aware that the performance drops drastically for us when reaching a decent amount of urls and using findIfExists. The MR will work for MySQL with setting a key length. I think postgres has an operator that will make it work for text fields. But MS SQL offers no options it seems. |
Adding an index for all engines but MS SQL would be okeyish, as it would benefit most of the people. I know your team is precisely using MS SQL, I'm thinking more from a broad product perspective. That said, I tried to push back a few times on incrementing the long URL field type/size. It was a 2k varchar just before transitioning it to text, and I couldn't think on a reasonable use case to need anything bigger than that. However, I couldn't come with good arguments against it, except gut feeling. |
I've been investigating ways to improve performance here, and unfortunately there are so many combinations that it is not easy to do it consistently, which adds up to the other things already discussed here. I'm afraid I will have to close this for now. |
Shlink version
4.1.0
PHP version
8.2
How do you serve Shlink
Self-hosted RoadRunner
Database engine
MicrosoftSQL
Database version
2022
Current behavior
``We did performance tests without and with this param turned to true.
So 62 per second. At times reaching 80, so solid!
When we set it to true...
Expected behavior
The expecation is that the param only marginally influences the performance. Now, it's terrible.
Minimum steps to reproduce
The text was updated successfully, but these errors were encountered: