-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ability to maintain quotas and file logging #9
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m not entirely sure what the aim is. Given the lack of code for user-specific things, I assume that this is about a global quota for now.
This is not ready to me merged yet, and it has quite a way to go before it can:
- The global quota should not be managed by the XMPP server, but via configuration.
- The global quota is a delicate issue in and by itself: it allows a single user with lots of bandwidth to cause lots of churn on the entire thing and make files of other users be deleted.
- The quota implementation as it is is not at all multitasking safe (while the upload implementation is). This is a hard requirement, because the duration of an upload is determined by the speed of the client, and thus xhu must be able to run in multiple threads, coroutine tasks (possibly in the future) and/or processes concurrently. Simply decorating the thing with a few locks won’t do (due to multiprocessing), this needs to be solved on the file-system level.
- The quota implementation is O(n) in the number of uploaded files.
In general, I think that quota data needs to be stored to reduce the complexity of the quota check (from O(n) to O(1)). For this, a file which stores the current quota use per scope (scopes are global
and one for each user) and which is read and updated using a file-level lock (flock
; reading can use a shared lock, updating needs an exclusive lock). The update must be rolled back properly when the upload aborts etc.
(flock
s don’t work properly with threads though.)
total_size = 0 | ||
# We assume a file structure whereby files are are stored inside | ||
# uuid() directories inside the root dir and that there aren't any files in | ||
# the root dir itself. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not make assumptions about the directory structure. While there is currently only an implementation of this in Prosody, I don’t see why we should rely on only prosody being used on the other side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t see why we should rely on only prosody being used on the other side
I'm using Prosody and I'm trying to get something working and running sooner rather than later.
I understand that you'd rather prefer less coupled and more generic solutions.
@@ -135,6 +179,11 @@ def put_file(path): | |||
) | |||
|
|||
dest_path.parent.mkdir(parents=True, exist_ok=True, mode=0o770) | |||
|
|||
quota = flask.request.args.get("q", "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allows a client to provide an arbitrary quota, or leave out the q
entirely, or set it to 0 to delete all files uploaded so far. The quota needs to be part of the HMAC instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, very good point.
set it to 0 to delete all files uploaded so far
A quota of 0 means that no files get deleted. But that's also a potential avenue of attack.
|
||
quota = flask.request.args.get("q", "") | ||
if (quota): | ||
apply_quota(dest_path.parent.parent, int(quota)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don’t make assumptions about the directory structure.
This should also take into account the size of the uploaded file. The amount of data used must still be below the quota even after the upload.
It also needs to handle the case where the file is larger than the quota allows for: in that case, the upload should be refused without any files being deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also take into account the size of the uploaded file. The amount of data used must still be below the quota even after the upload.
It also needs to handle the case where the file is larger than the quota allows for: in that case, the upload should be refused without any files being deleted.
Depends on whether you're enforcing a hard quota or a soft quota. I'm OK with usage slightly going over the quota. Prosody already enforces file size via http_upload_external_file_size_limit
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don’t make assumptions about the directory structure.
I don't yet see how I can do it otherwise. Putting storage usage in a file like you suggested provides some decoupling, but you still need to know what files to remove.
total_size += size | ||
modified = os.stat(fp)[stat.ST_MTIME] | ||
file_list.append((modified, path, name, size)) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my reading, this sums the files of all users.
Also this is O(n) in the number of stored files, which I don’t like at all. This must be condensed into an O(1) operation.
See the main review comment for details.
filemode='a', | ||
level=logging.INFO | ||
) | ||
logger = logging.getLogger(__name__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to log to standard output and have the service manager deal with it. Alternatively, support systemd journal (optionally) directly. Configuring and managing log directories is a PITA, and always opens up the question of logrotate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to log to standard output and have the service manager deal with it.
What service manager are you talking about here? I use Circus to run the processes, but I don't see my logs ending up in its log, so I'm not sure what you mean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stuff like systemd. Even sending to syslog is better than having to work with logfiles.
Also, unrelated to the review, please ensure that you talk to @mwild1, because I think he has some plans for a v3 of the upload protocol. |
Hi @horazont Thank you for taking the time to look into the PR and for providing feedback.
That depends on the directory structure within which the files are stored. It can be global or per user. I want to use it as a per-user quota. I've made changes to
I first had it as local configuration, but I want eventually to have more finegrained quotas. For example I'd like to allow friends and family or people who pay a subscription to have larger quotas than complete strangers. I think this information will more likely need to be managed and stored by the XMPP server rather than the file upload service.
Yes, so it comes with trade-offs and people need to decide whether they want to use a global quota or not. My intention is to implement a per-user quota.
This is interesting and not something I considered. I'm not sure however what you mean by "this needs to be solved on the file-system level". This makes me think of the ZODB Python object database. It's transactional (atomic) and allows rollbacks. You can then use ZEO to allow multiple processes to access to a single database and then put a load-balancer in front of the different processes. Something like that would pretty much solve the concurrency and thread-safety issue, but it's a big departure from the current design and I wouldn't be surprised if you're not interested in going in such a direction.
Yes, or rather the number of files per user for the case I have in mind.
Yes, you're right in pointing out that looping over all files to calculate storage usage is not something that'll scale well. I like the idea of using a file to store the quota. |
Alright, now I get the big picture here. To summarise, so that we’re on the same page: You modified (or plan to modify) mod_http_upload_external to send the quota (as query argument) and to prefix the URL with an HMAC of the JID and a secret key. Thus, the directory structure ends up being something like this:
This allows you to only look at the files in Still, there are unsolved issues. Continuing the discussion.
The file-system is in the end the shared resource the tasks (be it threads, processes or coroutines) operate on. Thus, the synchronisation is best handled by the very same thing. For this, file-system level locks such as some lock based on
Yes, it’s a huge deviation from the current design, which is extremely slim. This will be a nasty trade-off: either we use OS specific tools (such as I am really not sure what would be the right way here. On the one hand, I’d like to avoid sqlite or anything like that, because I perceive it as overkill, even though it probably isn’t. On the other hand, stuff like Actually I think that this would be an excellent use-case of LMDB (which is an embeddable ordered key-value store database with support for transactions), but never having used LMDB in a project yet, I’m not confident in that either. However, since we might want to use LMDB in JabberCat, this might be a good playground to explore it. It would also allow to reduce the complexity of various operations by storing basic file metadata in the LMDB too.
The file to store the quota makes everything much trickier though. First the aforementioned multitasking issues. Second, where would you put the file? It needs to be on the level of the user. In your current design, you cannot really be sure whether you’re working with multiple users in prosody or not. And you cannot put the file in the user directory, because it is possible for the XMPP server to create a URL which matches that file name, and then all bets are off – unless characters which aren’t URL safe are used in the filename… such as \t or a space. Takes a bit of consideration though. And it’s still a hack. |
The README diff contains more info on the changes made.
I'm planning to update Prosody's mod_http_upload_external to prepend the hashed/salted JID in the PUT URLs, so that admins can identify the files uploaded for a particular JID.
This can help with GDPR compliance (e.g. removal of particular user's data upon request) and also makes per-user quotas possible.
If the hashed/salted JID is not in the URL, then the quota is enforced globally across all users.
Since files get removed to enforce the quotas, I figured it would be good to log that to a file.