Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRESOLVER-372] Rework the FileUtils collocated temp file #364

Merged
merged 2 commits into from
Nov 17, 2023

Conversation

cstamas
Copy link
Member

@cstamas cstamas commented Nov 15, 2023

Fixes:

  • move() call should NOT perform the move, as writer stream to tmp file may still be open
  • move the file move logic to close, make it happen only when closing collocated temp file
  • perform fsync before atomic move to ensure there is no OS dirty buffers related to newly written file
  • on windows go with old code that for some reason works (avoid NIO2)
  • on non-Win OS fsync the parent directory as well.

So, to recap:

  • original issue is not related to locking at all, is about change done in 1.9 where file processor was altered to use nio2 atomic move (to prevent possibility of partial download being read by other process and prevent incomplete files in case of crash). Maven LRM uses layout, so randomized temp files does not matter (if they remain after crash), as they would merely just clutter the local repository but not corrupt it.
  • nuking local repo is part of healthy hygiene anyway
  • seems windows (or java on windows) have issues with atomic fs ops
  • resolver have no deps on p-u or any commons, api and util modules are zero dependency jars and should remain as such
  • one thing bugs me still: why does the old code work? Or in reverse, the new why does not work? As opposed to "high frequency writes", this is not that case (fixed by @olamy ... Um, sorry @michael-o )

To explain this last bullet: there was https://issues.apache.org/jira/browse/MRESOLVER-325 with solution #259 that "seems similar", as originally this code was used by both, and in that case "high frequency atomic moves" was applicable. But this now happens seemingly on "install" and "cache" (download, place it in local repo) that has way less frequency that tracking file write....


https://issues.apache.org/jira/browse/MRESOLVER-372

@cstamas cstamas self-assigned this Nov 15, 2023
@cstamas cstamas added this to the 2.0.0 milestone Nov 15, 2023
Copy link

@laeubi laeubi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this helps, ATOMIC_MOVE will replace an existing file already as far as I know, the problem described in MRESOLVER-372 is that the file is opened by another process so you can't write to that file, whatever you try under windows (linux is different!)

So what happen is this:

  1. Resolver/Maven/... has downloaded this file previously (or maven install)
  2. Now some IDE, Tool, Mojo, ... open that file for reading
  3. while the file is open another build request that file and there is a new SNAPSHOT (from snapshot repo) and resolver tries to download and replace the SNAPSHOT file --> AccessDeniedException

So there are some ways to circumvent this:

  1. When resolver is asked for file and can't move it return the temp file before the move -> maybe results in multiple downloads but better than nothing
  2. Use time-stamped filenames instead of -SNAPSHOT filename
  3. Use some kind of "link-file" named -SNAPSHOT that points to the "real" file

@cstamas
Copy link
Member Author

cstamas commented Nov 16, 2023

I really hope "atomic on windows" will NOT replace the file and then throw 😄 otherwise, am really unsure what that OS is able to guarantee for at all.

@cstamas
Copy link
Member Author

cstamas commented Nov 16, 2023

If you check older resolvers (and let's assume "they worked"), they used pre-nio2 copy+delete, so they also did this but with a "window of possibility" to allow other process to read partially written file. That was implemented using pre-nio2 Java. All that resolver 1.9.x did, is to migrate the code to nio2 AND use atomic. My guts are telling me that the "atomic" is the problem here...

Otherwise, same problems would be reported with resolver 1.8, as it also:

  • normalized snapshots
  • overwrote files (if existed) in case of install/download in very same way

Really, 1.9 did change NOTHING in this respect, it merely rewrote FileProcessor to use NIO2 and Atomic FS Operations

@laeubi
Copy link

laeubi commented Nov 16, 2023

I really hope "atomic on windows" will NOT replace the file and then throw 😄 otherwise, am really unsure what that OS is able to guarantee for at all.

You can just look into the Widnows File system implementation of the JDK to see whats going on.

If you check older resolvers (and let's assume "they worked"), they used pre-nio2 copy+delete

"Old" code simply ignored when replace goes wrong, so you have the problem that afterwards you maybe use the old file. I don't say that new way is "wrong" just that not using atomic move does not solve the problem reported in the issue that is some other process (or the current) is currently opend that file and in such case you cam't overwrite/delete/replace/move ... to that file under windows.

@cstamas
Copy link
Member Author

cstamas commented Nov 16, 2023

@slawekjaranowski @laeubi reworked fully...

After a LOT of reading, seemingly we had several issues:

  • move() call literally MOVED the file, but the writer of that file may kept the file still opened for write
  • due that above and also due lack of fsync() Windows MAY detect "dirty buffers", hence AccessDeniedEx
  • on non-Windows is recommended to fsync() the containing directory as well

Fixes:
* move() call should NOT perform the move, as writer stream to tmp file may still be open
* move the file move logic to close, make it happen only when closing collocated temp file
* perform fsync before atomic move to ensure there is no OS dirty buffers related to newly written file
* on non-Win OS fsync the parent directory as well.

---

https://issues.apache.org/jira/browse/MRESOLVER-372
@cstamas
Copy link
Member Author

cstamas commented Nov 16, 2023

Frankly, I'd remove the fsync on directory, as unlike Lucene (that depends on readability of new commit marker), Maven uses layout, so any other process will calculate the path and go directly for artifact jar.... it does not need to "list" the directory to discover new artifacts...

@cstamas cstamas changed the title [MRESOLVER-372] Fallback to non-atomic move on AccessDeniedEx [MRESOLVER-372] Rework the FileUtils collocated temp file Nov 16, 2023
Comment on lines 128 to 130
if (!IS_WINDOWS) {
fsyncParent(tempFile);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why here ... delete tempFile can also change directory content ... so maybe should be sync after delete

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the fsync is not important, then this commit is nearly wrong. If should simply be a matter of making sure the calls are ordered correctly ?


The call to move should be performed after the commit which actually writes to the file ?

@rdicroce
Copy link

I tested this on Windows and it does not fix the problem. The way I tested it was:

  1. Build the [MRESOLVER-372] Rework the FileUtils collocated temp file #365 branch.
  2. Replace all of the maven-resolver JARs in Eclipse's plugins\org.eclipse.m2e.maven.runtime_3.9.400.20230826-0755\jars directory by renaming the originals to have .orig on the end, and then changing the names of the built JARs to match the 1.9.14 filenames. I'm using Eclipse 2023-09.
  3. Open a workspace containing project A. This project has a SNAPSHOT dependency on project B.
  4. Open a second workspace containing project B.
  5. Run a 'clean install' build of project B.

I agree with @laeubi's assessment. If a file is open for any reason, it can't be replaced. At least, I've never heard of any way of doing it on Windows. Linux is a different story.

}

@Override
public void close() throws IOException {
if (wantsMove.get() && Files.isReadable(tempFile)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not safe. The CollocatedTempFile should provide a way to create an OutputStream so that it can be closed before the CollocatedTempFile. This is needed so that buffered streams are fully written to disk before moving the file.
See

try (InputStream in = new BufferedInputStream(Files.newInputStream(source.toPath()));
FileUtils.CollocatedTempFile tempTarget = FileUtils.newTempFile(target.toPath());
OutputStream out = new BufferedOutputStream(Files.newOutputStream(tempTarget.getPath()))) {
long result = copy(out, in, listener);
tempTarget.move();
return result;
}

The code is using a BufferedOutputStream so that it may not be completely written to the disk. I'm not sure if there's any guarantee on the ordering of the close() methods in the try/finally block.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so order is specified as being the reversed order of declaration, which makes sense of course, so this should work because the BufferedOutputStream will be closed just before the CollocatedTempFile.

Copy link
Contributor

@gnodet gnodet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The case for output stream being buffered should be handled correctly imho.

@gnodet gnodet self-requested a review November 16, 2023 17:10
Copy link
Contributor

@gnodet gnodet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a bit more in depth look at the code, I think this is a good improvement and closes some holes where the writer could write to the file after the move. If this is not sufficient, it will surely help.

@cstamas
Copy link
Member Author

cstamas commented Nov 16, 2023

There is one use case I have no idea what to do with, but I do know what I don't want to do:

If a file is open for any reason, it can't be replaced. (on Windows)

In this case, if resolver would "silently give up" (for example an installation attempt), it would mean it is intentionally leaving your local repository in an inconsistent state. And this is unacceptable for me. Hence, "ignoring IOExceptions" i consider as a non-option: no software should intentionally make transition from "consistent" to "inconsistent" state.

Least I can do is implement similar trick as for directory fsync, and make resolver broken on windows. But this would imply, that we have to state somewhere (release notes? site?) that "by design, resolver is not supported on Windows"?

Who will support then errors happening on windows, JIRAs like "I installed something with mvn install and IDE/other mvn process does not see the change?"

@cstamas
Copy link
Member Author

cstamas commented Nov 16, 2023

Similarly, @laeubi i don't see pre1.9 resolver code as "forgiving" and swallowing IOEx, here is the same method from 1.6.0:
https://github.com/apache/maven-resolver/blob/maven-resolver-1.6.0/maven-resolver-impl/src/main/java/org/eclipse/aether/internal/impl/DefaultFileProcessor.java#L153

If target is open by something else (ie IDE), line 166 would fail, no?

So to me it seems "when file is opened by something else" failed Resolver 1.6.0 as well. So, imho, this use case is red herring: we cannot do anything here.

@rdicroce ping also ^

@rdicroce
Copy link

So for shits and giggles, I decided to paste the 1.6.0 file copy code into the 1.9.x branch. Here's what it looks like:

    public static CollocatedTempFile newTempFile(Path file) throws IOException {
        Path parent = requireNonNull(file.getParent(), "file must have parent");
        Files.createDirectories(parent);
        Path tempFile = parent.resolve(file.getFileName() + "."
                + Long.toUnsignedString(ThreadLocalRandom.current().nextLong()) + ".tmp");
        return new CollocatedTempFile() {
            @Override
            public Path getPath() {
                return tempFile;
            }

            @Override
            public void move() throws IOException {
            	try (InputStream in = Files.newInputStream(tempFile); OutputStream out = Files.newOutputStream(file)) {
            		copy(out, in);
            	}
            }

            @Override
            public void close() throws IOException {
                Files.deleteIfExists(tempFile);
            }
        };
    }
    
    private static long copy( OutputStream os, InputStream is )
            throws IOException
        {
            long total = 0L;

            ByteBuffer buffer = ByteBuffer.allocate( 1024 * 32 );
            byte[] array = buffer.array();

            while ( true )
            {
                int bytes = is.read( array );
                if ( bytes < 0 )
                {
                    break;
                }

                os.write( array, 0, bytes );

                total += bytes;
            }

            return total;
        }

Then I tried the same test as before. This code did NOT throw an exception. And it appears to have actually worked. The file on disk has different timestamps inside the JAR. Eclipse even seems to have noticed that it changed, although that's difficult to say for sure.

Why does this work and the other code doesn't? I have no idea. Probably has something to do with the fact that this code actually copies the content of the file, byte by byte. Whereas the MoveFile approach probably only tries to change the actual location on disk that the file points to.

@cstamas
Copy link
Member Author

cstamas commented Nov 16, 2023

Wow, @rdicroce Thanks for testing this!

So, this for me proves even more, that problem lies within WindowsFS....

As I said, my intent with use of "atomic moves" was actually to prevent other processes to end up reading partially written file content... And this code @rdicroce tested, while does work, does not prevent this....

So, maybe IF win old_code ELSE new_code? Hm

@cstamas
Copy link
Member Author

cstamas commented Nov 16, 2023

@rdicroce updated this but also the backport PR #365

Please try!

@rdicroce
Copy link

@cstamas #365 works on my machine.

@cstamas
Copy link
Member Author

cstamas commented Nov 16, 2023

I propose this last change as final PR, will ask for re-review from everyone adding review so far.

@laeubi
Copy link

laeubi commented Nov 17, 2023

@cstamas I'm a bit late to the party but as @rdicroce already tested showed I cna only tell that Maven 3.8.x ("old"? resolver) has worked under Windows why upgrade to 3.9.x ("new"? resolver) shows reproducible problems in this regard.

I'm not a windows expert enough to tell but think @olamy has done already some analysis in the past about the Move/Sync problem.

In general I think it would really be beneficial if resolver would simply use time-stamped SNAPSHOTs instead of "normalized", maybe one can even for a while simply write the SNAPSHOT as an additional file (on windows) and symlink under Linux but resolver use/returns the timestamped file to start a migration?

@olamy
Copy link
Member

olamy commented Nov 17, 2023

@laeubi me having done some analysis about file locking on windows. uhmmm LOL you may confuse with someone else ;)

@laeubi
Copy link

laeubi commented Nov 17, 2023

@laeubi me having done some analysis about file locking on windows. uhmmm LOL you may confuse with someone else ;)

Hm... it should have been @michael-o sorry for the confusion no idea why github has suggested your name and I didn't notice :-D

Copy link
Member

@michael-o michael-o left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

// Logic borrowed from Commons-Lang3: we really need only this, to decide do we fsync on directories or not
private static final boolean IS_WINDOWS =
System.getProperty("os.name", "unknown").startsWith("Windows");

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we have plexus utils code for this, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and SystemUtils in commons-lang

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but if external library is not used I would not like add next dependency for one simple function

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but if external library is not used I would not like add next dependency for one simple function

With this, I agree, but if we use it anyway, I wouldn't write the code myself.

Files.deleteIfExists(tempFile);
}
};
}

/**
* On Windows we use pre-NIO2 way to copy files, as for some reason it works. Beat me why.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beat, why?

@cstamas
Copy link
Member Author

cstamas commented Nov 17, 2023

So, to recap:

  • original issue is not related to locking at all
  • is about change done in 1.9 where file processor was altered to use nio2 atomic move, inspiration was to prevent possibility of partial download being read by other process (and prevent incomplete files in case of crash). Maven lrm uses layout, so randomized tmp files does not matter (if remain after crash) it would merely just clutter local repo.
  • nuking local repo is part of healthy hygiene
  • seems windows (or java on windows) have issues with atomic fs ops
  • resolver have no deps on p-u or any commons, api and util modules are zero dependency jars and should remain as such
  • one thing bugs me still: why does the old code work? Or in reverse, the new why does not work? As opposed to "high frequency writes", this is not that case (fixed by @olamy ... Um, sorry @michael-o )

To explain this last bullet: there was https://issues.apache.org/jira/browse/MRESOLVER-325 with solution #259 that "seems similar", as originally this code was used by both, and in that case "high frequency atomic moves" was applicable. But this now happens seemingly on "install" and "cache" (download, place it in local repo) that has way less frequency that tracking file write....

@cstamas cstamas merged commit dfdb947 into apache:master Nov 17, 2023
4 checks passed
@cstamas cstamas deleted the MRESOLVER-372 branch November 17, 2023 12:26
cstamas added a commit that referenced this pull request Nov 17, 2023
Fixes:
* move() call should NOT perform the move, as writer stream to tmp file may still be open
* move the file move logic to close, make it happen only when closing collocated temp file
* perform fsync before atomic move to ensure there is no OS dirty buffers related to newly written file
* on windows go with old code that for some reason works (avoid NIO2)
* on non-Win OS fsync the parent directory as well.

---

https://issues.apache.org/jira/browse/MRESOLVER-372

Backport to 1.9.x branch of the #364
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants