Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[document_repository] Fixed multiple files with same name when uploading to document repository #3700

Conversation

maltheism
Copy link
Member

This pull request allow files with duplicate name to exist and when user uploads a file to the document repository.

I made the mysql column "uuid" in the table "document_repository" for files to be stored in unique directories.

The uuid and file_name are used for the system directory structure when storing the files.
System Directory Structure Examples:

Note: {uuid} is unique per upload.
user_uploads/admin/text.png/{uuid}/text.png
user_uploads/admin/text.txt/{uuid}/text.txt

user_uploads/admin/text.png/{uuid}/text.png
user_uploads/admin/text.txt/{uuid}/text.txt

See also: Bug #10847 and Bug #11243

@maltheism maltheism added Category: Bug PR or issue that aims to report or fix a bug Language: SQL PR or issue that update SQL code [branch] minor labels Jun 5, 2018
@maltheism maltheism changed the title Fixed multiple files with same name for document repository [document_repository] Fixed multiple files with same name when uploading to document repository Jun 5, 2018
um4r12
um4r12 previously requested changes Jun 11, 2018
// Setup Database object.
$config =& \NDB_Config::singleton();
$db_config = $config->getSetting('database');
$db = \Database::singleton(
Copy link
Contributor

@um4r12 um4r12 Jun 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there a initialization to the db class with parameters? Can't it be done just with
$db =& \Database::singleton();?
Tested it out on my instance, and I was getting an error due to that line, and when replaced with the above, the added feature worked fine.

$base_path = __DIR__ . "/../user_uploads/";
$fileBase = $puser . "/" . $fileName;
$base_path = __DIR__ . '/../user_uploads/';
$fileBase = $puser . '/docs_repo/'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason why you are nesting into an additional subdirectory?
Would it not a simpler to leave it as /../user_uploads/$puser/$fileName/$uuid/$fileName, instead of
/../user_uploads/$puser/docs_repo/$fileName/$uuid/$fileName. From my understanding, the user_uploads folder is already intended for the docs_repo, so I don't think there is a need to nest it further. Would like to hear your thoughts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user_uploads is used elsewhere besides the docs_repo. So the extra nest is beneficial.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @um4r12
the user_uploads folder is used in other modules but the folder is a sub-directory of document_repository

Copy link
Contributor

@johnsaigle johnsaigle Jun 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@intralizee There are two user_upload folders: one a subdir of doc_repo and the other a subdir of data_release (which was based on doc_repo)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, every other module (that stores data) stores data under /data/.../...

Copy link
Member Author

@maltheism maltheism Jun 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnsaigle @ZainVirani @um4r12 I'll get back to guys with a more concise response but one of the modules I tested before working on document_repository had uploaded a file or folder to the user_upload directory and resulted in a conflict when uploading a file from document_repository with the same name of the folder or file previously created. I may be wrong as it has been awhile.

Copy link
Member Author

@maltheism maltheism Jun 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I must have been mistaken previously. Next commit removes the docs_repo as unnecessary. @johnsaigle @ZainVirani @um4r12

@maltheism maltheism dismissed um4r12’s stale review June 11, 2018 18:47

update Database initialization

@um4r12 um4r12 added the Passed manual tests PR has been successfully tested by at least one peer label Jun 11, 2018
@kongtiaowang
Copy link
Contributor

@intralizee If you update 0000-00-schema.sql , you need to update the testing data also.
Adding a "uuid" column for the document_repository table.

@kongtiaowang
Copy link
Contributor

kongtiaowang commented Jun 12, 2018

@intralizee Please updating your new data here "test/RBdata.sql" with the new document_repository table line 1676.

$target_path = $base_path . $fileBase;
// Create filename directory /base_path/user/fileName
if (!file_exists($base_path . $puser . '/' . $fileName)) {
mkdir($base_path . $puser . '/' . $fileName, 0777);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be better with a 0770 permission. Same goes for the folder below.

The discussion seems to have disappeared but @ridz1208 and I discussed upload permissions for the document repository in #3603

*/
function uuid4()
{
if (function_exists('com_create_guid') === true) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't support Windows so I think this is a potentially confusing comment.

*
* @return string $version
*/
function uuid4()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're trying to include return type declarations for new code.

Could you add String as a return type here?

@colezweber
Copy link
Contributor

Hit a snag with “There was an error uploading the file” but it's probably an issue with my permissions RE: move_uploaded_file method as discussed. Will update.

@colezweber colezweber self-requested a review July 9, 2018 14:57
colezweber
colezweber previously approved these changes Jul 9, 2018
Copy link
Contributor

@colezweber colezweber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After updating permissions on user_uploads/ I was able to test this PR.
Was able to upload duplicate files, and confirmed unique UUID's for each upload in the table.

@@ -864,6 +864,7 @@ CREATE TABLE `document_repository` (
`EARLI` tinyint(1) DEFAULT '0',
`hide_video` tinyint(1) DEFAULT '0',
`File_category` int(3) unsigned DEFAULT NULL,
`uuid` varchar(36) DEFAULT '',
Copy link
Contributor

@zaliqarosli zaliqarosli Jul 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PapillonMcGill recently set up a standard for all new additions to our SQL database (https://github.com/aces/Loris/blob/19.0-dev/docs/SQLModelingStandard.md). I think uuid should maybe be UUID? melanie, what do you think?

Copy link
Member Author

@maltheism maltheism Jul 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zaliqarosli @PapillonMcGill I think the capitalized UUID is great. uuid is abbreviation for universally unique identifier. We could also decide on Universally_Unique_Identifier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zaliqarosli @intralizee
The standards say that The acronym should only contain upper case letter. So capitalized is the way to go. UUID might not be fully known by neuroimaging community (yet) but software developer within knows it well.

Copy link
Contributor

@zaliqarosli zaliqarosli Jul 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@intralizee i had to look up what UUID meant 😝but i think it works here

@maltheism maltheism dismissed zaliqarosli’s stale review July 9, 2018 15:33

updated sql column uuid to UUID for updated standard specs

@zaliqarosli zaliqarosli removed the Passed manual tests PR has been successfully tested by at least one peer label Jul 9, 2018
@@ -0,0 +1 @@
ALTER TABLE document_repository ADD `UUID` varchar(36) DEFAULT '';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this patch should go into just the Archive directory? we're now on the 20 release. also, patches usually get moved to the version folders after a release if on minor

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we are at this, when would you submit the patches into the New_patches dir?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@um4r12 when you are on a branch that has these new directories. i'm going to bring this up in tomorrow's meeting :)

@zaliqarosli zaliqarosli added the Passed manual tests PR has been successfully tested by at least one peer label Jul 9, 2018
$fileBase = $puser . '/'
. $fileName
. '/' . $uuid
. '/' . $fileName;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$filename is used twice. Is it supposed to be like that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the the first $fileName is referring to the directory, whereas the second one is referring to the actual file.
ex (from above). user_uploads/admin/text.png/{uuid}/text.png

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PapillonMcGill um4r12 is correct and so it's supposed to be like that.

$base_path = realpath(__DIR__ . '/..') . '/user_uploads/';
$fileBase = $puser . '/'
. $fileName
. '/' . $uuid
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put all the front-slashes at the front of the line or the back of the line for consistency.

// Cleanup empty directories
set_error_handler(
function () {
// Silence the E_WARNING when files exist in the directory.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this is the best way to do this. Shouldn't we delete the files first in this case? I'm a little worried about this silencing useful warnings.

}
);
$rm_directory = __DIR__ . '/../user_uploads/'
. substr($dataDir, 0, strlen($dataDir)-(strlen($fileName)+1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you trying to chop off the filename from $dataDir here? If so, just use dirname();


$fileSize = $_FILES["file"]["size"];
$fileName = $_FILES["file"]["name"];
$fileType = end((explode(".", $fileName)));
$fileType = '';
// Handle retrieving the file type.
Copy link
Contributor

@johnsaigle johnsaigle Jul 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is better done using pathinfo()

It's also more accurate to say that you are grabbing the uploaded file's path extension rather than its actual type. Properly getting the file type would involve looking at the file signatures.


// __DIR__ is the document_repository ajax directory
// when this script is executing. Go up a level to the
// document_repository module directory, and use a
// user_uploads directory as a base for user uploads
$base_path = __DIR__ . "/../user_uploads/";
$fileBase = $puser . "/" . $fileName;
$base_path = realpath(__DIR__ . '/..') . '/user_uploads/';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind changing this to $upload_path? The "base path" in LORIS typically refers to the Config setting corresponding to the web root.

@johnsaigle
Copy link
Contributor

Note: this code should not go live before bugfix is merged into minor during the next release cycle. The reason is that it does not contain the security fix issued in #3678.

$puser is a user-controllable value and its expanded usage in this PR allows for arbitrary file upload attacks.

I just wanted to leave a note here to the effect that these branches need to be synchronized.

@johnsaigle
Copy link
Contributor

johnsaigle commented Jul 9, 2018

This PR is tagged as a Bug Fix yet issued to minor, but in my opinion constitutes a new feature. @intralizee can you weigh in here as to where you think this PR should go?

@zaliqarosli
Copy link
Contributor

zaliqarosli commented Jul 9, 2018

Note: this code should not go live before bugfix is merged into minor during the next release cycle. The reason is that it does not contain the security fix issued in #3678.

should we maybe add a 'Blocked' label to this PR until the needed merging happens? @intralizee

@driusan
Copy link
Collaborator

driusan commented Jul 9, 2018

Sorry to be jumping into this so late but.. why are we even using UUIDs for this? A few bytes of random data encoded in some way (hex, base64, whatever) should be enough to fix the issue that this is trying to solve.

@driusan
Copy link
Collaborator

driusan commented Jul 9, 2018

(Another--probably more robust solution-- would be to use a content addressable naming convention on the backend and include a Content-Disposition header to associate it to a friendly filename when a user tries to download it from the frontend.)

@johnsaigle
Copy link
Contributor

johnsaigle commented Jul 9, 2018

@driusan I +1 the idea of using another approach. uuid feels like overkill. My personal preference is to use hashing.

content addressable naming convention

If I understand correctly, you mean something like storing the hash of the file contents mapped to a friendly name in the back-end, right? E.g.


The Database:

Friendly Name Actual Filename
my_meme.png bb1a5fd034191b49f094161e53e91f675430743b

Frontend:

Welcome to the Document Repository


Backend

  • /var/www/loris/modules/document_repository/user_uploads/john/bb1a5fd034191b49f094161e53e91f675430743b.png

@ZainVirani ZainVirani added the State: Discussion required PR or issue that requires the resolution of a discussion with the relevant parties to proceed label Jul 9, 2018
@ZainVirani
Copy link
Contributor

ZainVirani commented Jul 9, 2018

It seems like some discussion is still required on this PR, so I also want to bring the comments on the previous [Closed] PR into the mix

Even @driusan's idea seems a bit much to me. IMO the smoothest solution is a standard file directory solution of appending _(x) to the file, where x would be a simple counter of how many of those files already exist. This makes it easily identifiable on the front end as well. If for whatever reason the user needs a specific file name when they download it, they can rename it themselves.

@johnsaigle
Copy link
Contributor

@ZainVirani (we talked this over i person but I wanted to reply here so others could see).

I don't like the counter solution because it makes it very easy for users to upload duplicates and waste space. The hashing method has a built-in ease of detecting duplicate files (if the hash is the same, it's a duplicate, reject the upload)

With hashing, the users can still upload files of the same name as their contents will differ. So long as they use the categorization feature, files of different types will be separated into the right trees so they won't get mixed up.

If users need multiple revisions of the same document and the same name, they can use data_release which has more robust versioning. I feel like the counter scheme by contrast leads to messiness.

@maltheism
Copy link
Member Author

maltheism commented Jul 10, 2018

@driusan @johnsaigle @ZainVirani
Okay I feel like this PR needs to be discussed in person. I spent a decent amount of time planning how I would handle the problem (solved by the PR) before writing the code. I don't see my method of doing this as overkill and there are overlooked "possibilities" & "benefits" of multiple ways of storing files. One reason I don't see it as beneficial for not allowing multiple files (with same content & name) is when there are multiple sites and users with permissions factoring into the equation of reality. I don't believe this needs to be handled like google drive or dropbox but the edge cases are fun to think about solving.

@johnsaigle
Copy link
Contributor

@intralizee Let's discuss it at a LORIS meeting, either today or next week.

@driusan
Copy link
Collaborator

driusan commented Jul 10, 2018

@johnsaigle yes, that's exactly what I meant by a content addressable naming convention.

@intralizee It's not the approach of using a random part of the path that I think is overkill, it's just the version 4 UUIDs (which comes with the overhead of having to explain what UUIDs are to anyone who looks at the code, ensure that the generation/formatting/etc is compliant with the standard, decide what to do if someone uses a different kind of UUID, etc) when any meaningless random number would do (and I doubt we need a whole 16 bytes to avoid collisions.)

@driusan
Copy link
Collaborator

driusan commented Jul 10, 2018

@ZainVirani Your approach of having a counter would result in race conditions, while @intralizee's random approach doesn't have that problem (if the randomness source has enough entropy.)

@maltheism maltheism force-pushed the minor_document_repository_duplicate_name_fix branch from 85a416e to bab69e7 Compare July 30, 2018 14:27
@HenriRabalais
Copy link
Collaborator

Since changes need to be made to this PR, I am removing the passed manual test label.

@HenriRabalais HenriRabalais removed the Passed manual tests PR has been successfully tested by at least one peer label Jul 30, 2018
0770
);
}
$target_path = $base_path . $fileBase;

if (move_uploaded_file($_FILES["file"]["tmp_name"], $target_path)) {
$success = $DB->insert(
Copy link
Contributor

@Jkat Jkat Aug 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$db (didn't check out the code, just noticing you changed from $DB to $db)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and same with line 174 & 176)

@zaliqarosli zaliqarosli added the State: Needs rebase PR that needs to be rebased to proceed (conflicts, wrong branch...) label Aug 13, 2018
@johnsaigle johnsaigle added the State: Needs work PR awaiting additional work by the author to proceed label Oct 2, 2018
@driusan driusan added the State: Needs adoption PR whose author is no longer involved in, awaiting someone else to pick it up to proceed label Oct 16, 2018
@johnsaigle
Copy link
Contributor

Note that this will not be able to be merged when #3971 goes through. However we should still have this feature at some point.

@cmadjar cmadjar removed the State: Needs adoption PR whose author is no longer involved in, awaiting someone else to pick it up to proceed label Feb 22, 2019
@maltheism
Copy link
Member Author

I think I'll close this PR for now because of the PR involving the react rewrite. That PR has made notable changes and where this PR is obsolete. I'll see what can be done after that PR is merged into LORIS.

@maltheism maltheism closed this Feb 25, 2019
@maltheism maltheism deleted the minor_document_repository_duplicate_name_fix branch May 24, 2020 02:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Category: Bug PR or issue that aims to report or fix a bug Language: SQL PR or issue that update SQL code State: Discussion required PR or issue that requires the resolution of a discussion with the relevant parties to proceed State: Needs rebase PR that needs to be rebased to proceed (conflicts, wrong branch...) State: Needs work PR awaiting additional work by the author to proceed
Projects
None yet
Development

Successfully merging this pull request may close these issues.