
Per user filtering #30

Closed
ech3 opened this issue Jun 26, 2014 · 18 comments

@ech3

ech3 commented Jun 26, 2014

At work I have to monitor files that differ on a per-user basis, so it would be nice to produce graphs that are limited to a single user and/or use the data to produce a graph showing the percentage of the disk space used by a specific user. This way I could send users graphs of the disk usage limited to their user.

The basic use case is I want to have a cron job that sends out an e-mail to each user telling them how much disk they are using, as a percentage in a pie chart with all users listed, and below that another graph that shows where their specific user's disk usage is concentrated, which is a normal duc graph filtered for their user. The user could tell from the first graph whether they needed to take action, then from the second graph they could figure out where the big files were if they wanted to delete them.

@l8gravely
Collaborator

ech3> At work I have to monitor files that differ on a per-user basis,
ech3> so it would be nice to produce graphs that are limited to a
ech3> single user and / or use the data to produce a graph that showed
ech3> the percentage of the disk space used by a specific user. This
ech3> way I could send users graphs of the disk usage limited to their
ech3> user.

I have a similar issue with my users as well. This is an obvious
enhancement request, but the size of the DB will get much bigger
when using this.

ech3> The basic use case is I want to have a cron job that sends out
ech3> an e-mail to each user telling them how much disk they are using
ech3> as a percentage in a pie chart with all users listed, and below
ech3> that another graph that shows them where their specific user's
ech3> disk usage is concentrated in, which is a normal duc graph
ech3> filtered for their user. The user could tell from the first
ech3> graph whether they needed to take action, then from the second
ech3> graph they could figure out where big files were if they wanted
ech3> to delete them.

For my users, I use Netapps, so I turn on unlimited quotas, which gets
me per-user/per-volume disk usage numbers. Then I use philesight
(haven't quite moved to duc yet in production) to generate the graphs
they can drill down into.

I also use yet another script which looks for the top N largest files
in a directory tree, and emails the users automatically. In my
experience, this is one of the most effective ways to get back disk
space, by showing all the users who is using the largest files.
Public shaming!
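A script of that kind can be sketched in a few lines of shell. The tree path, cutoff N, and report format below are illustrative stand-ins, not John's actual script, and `-printf` is a GNU find extension:

```shell
#!/bin/sh
# Sketch: report the N largest files under a tree, with their owners.
# TREE and N are illustrative defaults; -printf requires GNU find.
TREE="${1:-/home}"
N="${2:-20}"

# %s = size in bytes, %u = owning user, %p = path
find "$TREE" -type f -printf '%s\t%u\t%p\n' 2>/dev/null |
    sort -rn |
    head -n "$N"
```

Grouping that output per owner and piping it into mail would give exactly the kind of automated nag-mail described above.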

Unfortunately, keeping all this kind of data is expensive. Both in
terms of filesystem overhead, file system scanning overhead, etc.

I think for now, this suggestion will have to go into the nice to have
camp for me, at least until I get more of the CGI stuff working well
for multiple DBs at the same time. Can't speak for Zevv though...

John

@ech3
Author

ech3 commented Jun 26, 2014

We use NetApps as well and have an internal script which produces per-user statistics, probably similar to what you are using. I saw philesight, but when I actually went to implement a test machine, I found duc and have a bias towards speed since the IT folks around here don't like us scanning the disk. When I started at my present job, I was assigned as the maintainer of our script which goes and looks for big files, so we have a script that does something similar in that regard; it's just that presenting this visually makes it easier for the user to take action. It would be nice if I could pull the big files from the scanned database as well, but that would probably be another request, and it seems you would want that as well.

I am fine with this getting put on the way-back burner if that's where it needs to go, but I get none of the things I don't ask for, and thought it would be a good feature. I am not concerned with the DB size personally, but I may be once I find out just how much bigger it is ;). Thanks for looking this issue over, and thanks for this program.

@l8gravely
Collaborator

ech3> We use NetApps as well and have an internal script which
ech3> produces per-user statistics, probably similar to what you are
ech3> using. I saw philesight, but when I actually went to implement a
ech3> test machine, I found duc and have a bias towards speed since
ech3> the IT folks around here don't like us scanning the disk. When I
ech3> started at my present job, I was assigned as the maintainer
ech3> of our script which goes and looks for big files, so
ech3> we have a script that does something similar in that regard,
ech3> it's just that presenting this visually makes it easier for the
ech3> user to take action. It would be nice if I could pull from the
ech3> scanned database the bigfiles as well, but that would probably
ech3> be another request, and it seems you would want that as well.

Well, duc will be faster than philesight, and the DBs are much
smaller. But the impact of scanning the entire disk is still there no
matter what, hard to get around that.

I'm happy to share my perl script which pulls out quota reports and
also does a search for large files. And maybe that's all duc needs:
a way to ask for the largest N files in each directory to be stored
in the DB.

Now how to present that visually will be hard to do clearly and
concisely.

One of the simpler options we've been thinking about is adding a
per-directory file count to the DB, since we know that number, and it
shouldn't increase DB size too badly.

ech3> I am fine with this getting put on the way-back burner if that's
ech3> where it needs to go, but I get none of the things I don't ask
ech3> for, and thought it would be a good feature. I am not concerned
ech3> with the DB size personally, but I may be once I find out just
ech3> how much bigger it is ;). Thanks for looking this issue over,
ech3> and thanks for this program.

Let's see, my largest philesight DB is 24G in size. I have a 3.6G
philesight DB which went down to 320MB when done in duc, so there's a
huge potential for space savings. Especially since I have remote sites
where I run the queries, and then copy the DBs to the presentation site.

Speedwise, I find that duc is faster by about 50% or so, but I don't
have any hard and fast numbers yet. Personally I'm still working on
the CGI code to allow multiple DBs to be viewed easily.

John

@ech3
Author

ech3 commented Jun 26, 2014

I'm not in your league yet in terms of DB size; the largest duc database I have lying around is 72MB at present. We mostly have a bunch of source code and a few really big files on our drives.

If that perl script extracted the data from the Tokyo Cabinet DB, it would be interesting, but otherwise I will go with our legacy system due to institutional momentum. All I am looking to do is consolidate my drive scanning into one scan and get all the data I need, so I don't have to do multiple passes over the drive. If it can't be done at this point, it can't be done at this point. And if you consider any of this gold plating you don't want to do, I wouldn't blame you. Right now my plan is to only run duc on drives that are getting pretty full, so that we get pretty graphs when bad stuff is going on.

I would agree with you that getting the CGI working is a higher priority. I have not messed with that, but I can see how it would be much more valuable than what I am requesting. I know that just e-mailing a link to the CGI would probably be a lot less headache than getting the MIME handling working for attaching the duc PNG, since I have done that before. Plus interactivity, if you have it, is really nice and blows a static picture out of the water.

@zevv
Owner

zevv commented Apr 20, 2015

Hi ech3, is this request still relevant for you? I'm considering adding an option to add user information, although the implementation will probably get quite hairy.

@ech3
Author

ech3 commented Apr 20, 2015

I would still like to have this. For my case, I would like to have it even at the cost of a super-huge database, a very long run time, and/or having to use non-standard switches. My situation is I have a NetApp filer which tells me which users are the major users of the files on a particular partition, but it gives me no idea of where those files are located. Nine times out of ten it's easy to figure out where the files are thanks to a normal duc graph, but every so often it's a royal pain.

My use case is to send an individually tailored e-mail with a graph limited to only their user files so they know where their overages are located. I am fine with individually making each graph manually by specifying the user via command line.

If you need me to provide any additional info let me know.

@zevv
Owner

zevv commented Apr 20, 2015

  • On 2015-04-20 21:55:24 +0200, ech3 wrote:

My use case is to send an individually tailored e-mail with a graph
limited to only their user files so they know where their overages are
located. I am fine with individually making each graph manually by
specifying the user via command line.

Well, probably far from your ideal solution, but would it be acceptable
for you to index the database individually for each user? This change
would be trivial compared to storing the user info for each directory
entry, and would also remove all of the hassle involved with handling
user names and uids. The only change needed would be to have the index
process ignore all files not owned by a given user, no other changes
would be necessary.

:wq
^X^Cy^K^X^C^C^C^C

@ech3
Author

ech3 commented Apr 21, 2015

Well, my understanding of the UNIX file structure was that the real problem with this feature was that you had to do an additional request for inode information for each file to get file ownership info. Sorry, I would have to pull out my APUE to remember the exact way things are organized, and I don't have it handy. The problem is that if I have to re-index for each user, I would probably have to beat on the filers with more traffic, which is something that my IT department will not be particularly overjoyed about. I will admit that most of the time I am doing this I care mostly about one user who is particularly offensive in terms of disk usage.

Since I don't grok the code the way you do, don't know how easy or difficult it is to add this, and don't have the ability to invest the time you do, I won't criticize you if you decide to implement this whatever way you need to.

@zevv
Owner

zevv commented Apr 21, 2015

  • On 2015-04-21 02:31:47 +0200, ech3 wrote:

[...] I would probably have to beat on the filers with more traffic,
which is something that my IT department will not be particularly
overjoyed about.

Understandable; having to run multiple indexes kind of defies the whole
purpose of making an index at all. I was just trying to get away with
the solution with the smallest impact, but I agree it makes the feature
almost unusable.

I do like the feature myself, and I think it would make a valuable
addition to Duc.

Since I don't grok the code the way you do, don't know how easy
or difficult it is to add this, and don't have the ability to invest
the time you do,

The problem is not so much the gathering or storing of the user data per
se, since the uid and gid are already made available at index time by
the stat() call which is done to retrieve the file size. My concern is
more about the complexity that comes with proper uid/gid mapping to/from
their names.

Having only uids/gids is not practical, since these will likely not map
between systems which don't have their accounts synced.
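The two-way mapping zevv describes can be seen from the shell: getent resolves uids through the same NSS sources (local files, NIS, LDAP) in either direction. This is just an illustration of the lookup, not duc's implementation:

```shell
#!/bin/sh
# Illustration: resolve uid -> name and name -> uid via NSS.
# Uid 0 / root is used as an example present on any Unix system.
uid=0
name=$(getent passwd "$uid" | cut -d: -f1)   # uid  -> name
back=$(getent passwd "$name" | cut -d: -f3)  # name -> uid
echo "uid $uid maps to '$name', which maps back to uid $back"
```

A database that stores both directions of this table alongside the index can still render meaningful names on a machine where the raw uids mean something different.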

I think the following would make for the best solution:

  • A new option is added to 'duc index' which requests recording of user
    and/or group metadata
  • The indexed directories in the database get an extra field which
    indicates what kind of metadata is stored in the directory. This flag
    will make sure the code that reads the database knows what to expect.
    This flag will add one single byte per directory for indexes without this
    info, which I think is acceptable.
  • All uids and gids that are found during indexing are also stored in
    the database with their corresponding user- and group names. This
    will need to be a two-way index, mapping uid to usernames and
    usernames to uids.
  • At display time the code detects if any user and/or group info is
    stored in the directory, which can then be displayed in the user
    interface.
  • The various query/display tools get an extra option for filtering
    on user and group names.

The extra complexity is mostly in the handling of the names.

Any thoughts?

I'll give it a go and see if I can come up with a nice implementation.

Thanks for the feedback!

:wq
^X^Cy^K^X^C^C^C^C

@zevv
Owner

zevv commented Apr 21, 2015

I've been experimenting a bit with this feature, the results so far: I
have adjusted the db format to allow for optional extra metadata and
added options to 'duc index' to enable recording of uid and/or gid. The
resulting database is only 1 or 2% bigger, but this might vary on your
system.

That was the easy part.

Now I've run into an interesting problem with generating the graph
though: for each directory encountered, duc stores the total size of
that directory. Now consider a directory which contains files and
subdirectories from user A and user B. When drawing the graph section
for the directory, duc does not know yet how much of the dir is used by
user A or user B, so it does not know how large to draw this section if
it needs to filter on a specific user.

If it wants to know this size, it has to calculate the graph from the
outside in, and first traverse the entire tree structure to re-calculate
each directory size to contain only the data of user A or B. That's
kind of nasty, and I'm not sure I know how to solve this at the
moment.

A pragmatic solution would be to leave the graph shape intact, and
simply leave out the files which are not owned by the user you are
filtering for. In this case we still draw all the directories, but
only files owned by the given user are drawn.

Example output: http://i.imgur.com/rlbBsXZ.png

That's a directory structure with files owned by 3 different users. It's
quite hard to see where exactly the data goes, since the directories are
always drawn.

Not sure if this is usable at all in this state though?

Another way of handling this would be to change the indexing and
database so that the size totals for each directory are stored multiple
times, once for each individual user. This makes the graph drawing
trivial, but the change to the indexer and database would be
considerable. I'm not sure how to implement this; it will probably take
some time to figure out.

:wq
^X^Cy^K^X^C^C^C^C

@ech3
Author

ech3 commented Apr 22, 2015

Zevv:

Having only uids/gids is not practical, since these will likely not map
between systems which don't have their accounts synced.

Mostly I am on NIS managed systems, so I hadn't thought that far ahead.

Zevv:

A pragmatic solution would be to simply leave the graph shape intact,
and simply leave out the files which are not owned by the user you are
filtering for. In this case we need to draw all the directories, but
only files owned by the given user are drawn.

How hard would it be to create a "shadow" database that inserts the files one by one, doing "appropriate" calculations for the directory sizes, then calculates the graph based on the "shadow", and finally deletes it?

@ech3
Author

ech3 commented Apr 22, 2015

Side note, Zevv, your sig reminds me of this page: https://www.gnu.org/fun/jokes/ed-msg.html
The interaction with ed on that page looks eerily similar to my initial interaction with it.

@zevv
Owner

zevv commented Apr 22, 2015

  • On 2015-04-22 12:13:40 +0200, ech3 wrote:

A pragmatic solution would be to simply leave the graph shape
intact, and simply leave out the files which are not owned by the
user you are filtering for. In this case we need to draw all the
directories, but only files owned by the given user are drawn.

How hard would it be to create a "shadow" database that inserts the
files one by one, doing "appropriate" calculations for the directory
sizes, then calculates the graph based on the "shadow", and finally
deletes it?

That would be the kind of workflow indeed; the whole tree needs to be
traversed to create a per-user subtree, and all totals need to be
recalculated.

I'm planning to create import functionality for reading the output of
tools like find; this would make it possible to create databases without
having to run Duc on the machines. With an appropriate matching export
function this could be used to create the functionality you need:

  • create a multi user duc database
  • export the database, filtering per user
  • import each exported dump into a new database
  • create a per-user index and graph for each database

This way we keep duc simple, and with a small wrapper shell script
complex functionality can be created.
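The four steps above could then be wrapped in a short script. To be clear, the export/import subcommands and the --user flag below do not exist in duc; they are invented here purely to sketch the proposed interface, and the paths and user list are placeholders:

```shell
#!/bin/sh
# Hypothetical wrapper around the proposed export/import workflow.
# 'duc export --user' and 'duc import' are invented to illustrate
# the idea; TREE and USERS are site-specific placeholders.
TREE=/srv/data
USERS="john mary mike"

duc index -d /tmp/all.db "$TREE"      # one scan of the filesystem

for u in $USERS; do
    duc export -d /tmp/all.db --user "$u" > "/tmp/$u.dump"
    duc import -d "/tmp/$u.db" < "/tmp/$u.dump"
    duc graph -d "/tmp/$u.db" -o "/tmp/$u.png" "$TREE"
done
```

Each user would get a graph computed from a database containing only their files, at the cost of one export/import pass per user.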

:wq
^X^Cy^K^X^C^C^C^C

@zevv
Owner

zevv commented Apr 22, 2015

  • On 2015-04-22 12:21:32 +0200, ech3 wrote:

Side note, Zevv, your sig reminds me of this page:
https://www.gnu.org/fun/jokes/ed-msg.html The interaction with ed on
that page looks eerily similar to my initial interaction with it.

You saw that just right, and I did not make the signature up. One day I
was sending mail on a system I was not familiar with. The editor was not
vi, not emacs, not joe. What was it then? It was indeed Ed!

:wq
^X^Cy^K^X^C^C^C^C

@foutoucour

Hello Zevv,

Thanks for your work on duc. It is actually really helpful for us.

We would also be interested in having this per-user filter.
We would like to be able to quickly filter files by user and act on them accordingly.

Is there any ETA on the feature?

Regards,
Jordi

@zevv
Owner

zevv commented May 30, 2015

  • On 2015-05-29 18:57:04 +0200, Jordi Riera wrote:

We would also be interested in having this per-user filter. We would
like to be able to quickly filter files by user and act on them
accordingly.

Is there any ETA on the feature?

Not yet; I'm still not quite sure how to implement this. It seems that
the most sane way to do this is by post-processing an indexed database
to split it up into a separate database per user; this is because all
the totals need to be recalculated to include only the files owned by a
given user.

This could be combined with an export/import function with filter
options, so that the workflow would be something like

index db1 --+-> export db1 (user=john) --> import db2 --> graph db2
            +-> export db1 (user=mary) --> import db3 --> graph db3
            `-> export db1 (user=mike) --> import db4 --> graph db4

It's quite cumbersome though because an export and import needs to be
done for every single user.

:wq
^X^Cy^K^X^C^C^C^C

@zevv
Owner

zevv commented Nov 7, 2016

Closing this ticket for now. The request has been added to TODO.md

@zevv zevv closed this as completed Nov 7, 2016
@groutr

groutr commented Jan 6, 2017

Adding user information to the index is a big +10 from me. Our use case is similar in that we want to know which users are using the most disk space, so we can notify them.

I agree that per-user graphs can get complicated. Maybe a more immediate solution would be to color the files in the graph by the user that owns them. It could be a toggle between the default graph color scheme and the user color scheme. This would be for admin use, to get an idea of which users own what. This still needs the mapping between uids and user names though.
