Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

json_data: Autorepair corrupted project files #3259

Merged
merged 2 commits into from
Feb 27, 2020

Conversation

ferdnyc
Copy link
Contributor

@ferdnyc ferdnyc commented Feb 27, 2020

If a file is detected as having corrupted unicode escapes, we make
a backup copy as filename.osp.bak (or filename.osp.bak.1...), then
regexp-replace all of the broken escapes with corrected ones. Then
we run the data through json.loads() and json.dumps() again to remove
the escaped characters, and write the results back to the original
file location as UTF-8.

There's a slight danger of false positives on the corrupted-file detection, which is why a backup is made of the original data before it's rewritten.

Also, as an added safety check, the repair will only run on project files matching "openshot_qt".*"2.5.0 (meaning, they're tagged as created by 2.5.0). That way, we don't accidentally edit users' OpenShot 2.4.4 projects that import files from their $HOME/Music/ub404eva/ directory.

If a file is detected as having corrupted unicode escapes, we make
a backup copy as filename.osp.bak (or filename.osp.bak.1...), then
regexp-replace all of the broken escapes with corrected ones. Then
we run the file through json.loads() and json.dumps() again to remove
the escaped characters, and write the results back to the original
file location as UTF-8.
@jonoomph
Copy link
Member

@ferdnyc This looks good to me! Perhaps we can remove this code at a future point. But for sure, our next release should include it. 👍 Thanks for the help.

@jonoomph jonoomph merged commit 4d46dde into OpenShot:develop Feb 27, 2020
@SuslikV
Copy link
Contributor

SuslikV commented Feb 27, 2020

There is case where full name (start of it) was missing from the project file (bottom of the message): #3258 (comment)

@SuslikV
Copy link
Contributor

SuslikV commented Feb 27, 2020

Side Note: the auto-saving (enabled by default) can convert any just opened (!) v2.4.4 file into release version 2.5.0 with all consequences of escaping and shortening if something not fully loaded. And user will be unaware that files were overwritten.

@jonoomph
Copy link
Member

@SuslikV This PR only auto-converts project files with "2.5.0" as the version. It would ignore v2.4.4 files. Maybe I'm misunderstanding you though.

@ferdnyc
Copy link
Contributor Author

ferdnyc commented Feb 27, 2020

@jonoomph I think @SuslikV is saying that the current 2.5.0 will create (corrupted) 2.5.0-type files out of good 2.4.4 project files, if autosave is enabled. Nothing to do with this PR, but it does potentially multiply the impact vector for the exporting bug.

When 2.5.0 autosaves a formerly-2.4.4 project file, it'll update the version in its JSON data to "2.5.0", though, right? If so, then this PR will cover those autosaved files, so we're still OK. (If not, I may have to extend it to cover "2.4.4" files as well.)

@ferdnyc
Copy link
Contributor Author

ferdnyc commented Feb 27, 2020

import files from their $HOME/Music/ub404eva/ directory.

At the risk of being self-indulgent, I'd just like some recognition for the fact that I was totally able to come up with a plausible example of a file path that would trigger a false positive in the corruption-detection code. 😉

@jonoomph
Copy link
Member

Screenshot from 2020-02-27 17-07-02

@jonoomph
Copy link
Member

Seems like we clobber the "version" attribute on save, meaning any previous version data always gets updated, and your logic stands true!

@SuslikV
Copy link
Contributor

SuslikV commented Feb 28, 2020

The file sample from the #3258
has many entries of ./ at path - this is remainder of the v2.4.4 (where no project_named_assets are in the projects folder)
This makes
é.png into ./\u00e9.png(v2.4.4) then into ./u00e9.png(v2.4.4 file auto saved in v2.5.0) and it becomes .é.png after repair, which is expected, but not correct for the user.

Also, after repair (on Windows) file has no CRLF endings in it. Just straight line with the spaces. Is it intended?

@ferdnyc
Copy link
Contributor Author

ferdnyc commented Feb 28, 2020

The file sample from the #3258
has many entries of ./ at path - this is remainder of the v2.4.4 (where no project_named_assets are in the projects folder)
This makes
é.png into ./\u00e9.png(v2.4.4) then into ./u00e9.png(v2.4.4 file auto saved in v2.5.0) and it becomes .é.png after repair, which is expected, but not correct for the user.

Hmmm... ./, not ../? That could be a problem, yeah. I don't know why there would be ./ paths in the project file, that's odd.

../paths make sense, I realized (eventually) that it's because the paths were being processed after they were escaped. So if you have a project file at this path:

  • /var/tmp/Jöerç/Charsetᛀᛊᛇᛅ.osp
    and you ask it for a relative path to this file:
  • /var/tmp/J\u00f6er\u00e7/ForagerN\u00e5k\u0259d.mp4
    it's going to say that the relative path is .../J\u00f6er\u00e7/ForagerN\u00e5k\u0259d.mp4 because the two paths don't appear to be in the same directory.

But I'm not sure why relative path transforms would create ./ paths. I'll take a look at the file from #3258, see if I can figure out a fix.

Also, after repair (on Windows) file has no CRLF endings in it. Just straight line with the spaces. Is it intended?

It's expected that it wouldn't have CRLF endings, yeah. It does contain CR line endings, so WordPad should display it properly, along with any advanced editor like Notepad++, etc.

@SuslikV
Copy link
Contributor

SuslikV commented Feb 29, 2020

It does contain CR line endings

It don't. What daily build surely has the fix? Maybe some daily builds error?
The .bak files is created but no CR or CRLF in original project - just straight line of symbols on Windows (file modified, all paths is fixed, it works, but this is single line of 10000 - 18000 of symbols usually, it may cut somewhere in the future and some data will be lost again).

Hmmm... ./, not ../?

Yes. The v2.4.4 (where launch.exe) - it creates such paths for files that lies at the same directory as the project file. I switched number of times between the v2.5.0 and v2.4.4 (and fixed v2.5.0-dev1) to see how the start of the file was lost, but was unable to reproduce the issue. All I got - is simple example with é.png that places additional point at the start of the file after repair.

@SuslikV
Copy link
Contributor

SuslikV commented Feb 29, 2020

I did tests. The next code removes CRLF endings from the file (on Windows):

temp_data = json.loads(contents)
contents = json.dumps(temp_data, ensure_ascii=False)

Edit:
, indent=1 is missing
The code should be:

temp_data = json.loads(contents)
contents = json.dumps(temp_data, ensure_ascii=False, indent=1)

@SuslikV
Copy link
Contributor

SuslikV commented Feb 29, 2020

mystery behind ./ is simple:
os.path.relpath("c:/abc", "c:/abc")
returns .
os.path.join(".", "m.txt")
returns ./m.txt if m needs to be escaped then ./\u construction will appear...

There is os.path.normpath in Python it can get rid of point before the path. On Windows it also converts slashes. Maybe where path applies it can be used instead and then just start to store / from v2.5.x in all files?

@ferdnyc
Copy link
Contributor Author

ferdnyc commented Mar 1, 2020

@SuslikV

Yeah, sorry, I read your original message too fast the first time.

This makes é.png into ./\u00e9.png(v2.4.4)

Which is ./é.png, so far so good...

then into ./u00e9.png(v2.4.4 file auto saved in v2.5.0)

...THAT'S where things break down. ./anything would be fine, but when it's escaped it should be ./\uXXXX, or .//uXXXX if it gets corrupted. Unfortunately, there are plenty of chances for double slashes to get collapsed... and then we end up in trouble. Because it's not the ./ path that's the problem, it's the loss of the second forward slash.

and it becomes .é.png after repair,

Yeah, which is wrong. It needs to go back to the original ./é.png. But the repair algorithm will have to have some way of detecting that, because the information that there's supposed be a second slash has already been lost before the repair code ever got hold of the data.

I can't just assume that any ./uXXXX (with a literal period) should be treated as ./\uXXXX, because if someone has a filename like estar.en.éxtasis.mp4, that will get corrupted-escaped as estar.en./u00e9xtasis.mp4 and changing it to estar.en./\u00e9xtasis.mp4 would also corrupt the data.

I can make a couple of assumptions, though, to make the recovery smarter:

  • Any path string that starts with periods is a relative directory reference.
    Meaning, "./uXXXX and "../uXXXX should be replaced with "./\uXXXX and "../\uXXXX, respectively, during repair.
  • Any period(s) immediately following a slash are also relative directory references.
    So, something/./uXXXX and something/../uXXXX should also be replaced with something/./\uxXXX and something/../\uXXXX during repair

I'll add those rules to the repair logic, and hopefully that will cover most or all of these cases.

There is os.path.normpath in Python it can get rid of point before the path. On Windows it also converts slashes.

It does, but the wrong way (for our purposes).

On Windows, it converts forward slashes to backward slashes.

Normpath normalizes paths to the current OS, whereas we (for some reason) want to force the paths to one specific format regardless of OS. Python has no tools for doing that. (Which in and of itself feels like a really good reason to think twice about whether it's the right approach.)

Regardless, the issue right now isn't whether we should be doing it. It's already been done, so now we have to correct for it regardless.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants