feat(translations/differences): visualize how many commits behind translations are (#8338) #8338
server/translations.ts
Outdated
@@ -141,22 +142,45 @@ function getDocument(filePath) {
  };
}

function getCommitBehindFromLatest(filename, commitHash) {
  const commitHashes = execSync(
    `cd ${CONTENT_ROOT} && git log --pretty=format:%H -- ${filename}`
I fetch the file's list of commit hashes like this and look up the index of the source commit to see how far behind the translation is. However, this causes significant delays: each file that has a source-commit seems to add about half a second.
I wrote it this way for the PoC, but we need another method. Or do we need a pre-processing step that writes the counts to a separate file?
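To make the index-based idea above concrete, here is a minimal sketch. It assumes the raw output of `git log --pretty=format:%H -- <file>` is already available as a string (newest commit first); the function name `commitsBehindFromLog` is hypothetical and is not the PR's code.

```javascript
// Hypothetical helper, not taken from the PR.
// `gitLogOutput` is assumed to be the stdout of
// `git log --pretty=format:%H -- <file>`, one hash per line, newest first.
function commitsBehindFromLog(gitLogOutput, sourceCommit) {
  const hashes = gitLogOutput.split("\n").filter(Boolean);
  // Index 0 is the latest commit, so the index of the source commit
  // equals the number of newer commits that touched the file.
  // indexOf returns -1 when the hash never touched this file.
  return hashes.indexOf(sourceCommit);
}
```

The lookup itself is cheap; the half-second cost described above comes from running git once per file, not from the index search.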
Thanks a ton @hochan222 <3 I've wanted this for ages but never had the time to implement it.
I tried using workers for a pre-processing step that saves commitHashCache to a file and loads it in the getCommitBehindFromLatest function. As a result, I realized that storing hashes for every file in the content repo is a bad idea: there are over 11000 markdown files in the content repo. The code worked fine, but my computer exploded and shut down automatically. (We could solve that by splitting the file, but I'm not sure it's a good approach.)
The PR's current solution is also problematic. For the Japanese locale, the sourceCommit meta tag exists in about 1800 files. In 36e4628 I stopped the git log command from running again on reload, but the initial load still takes 930 seconds (15m 30s). At that rate, the roughly 10000 files in the content repo would take about 1.3 hours. Any other good ideas?
The idea of storing hashes for all files in content repo
The games, glossary, learn, mdn, mozilla, related, and webassembly folders are fine, but the web folder explodes.
// worker.js
import { parentPort } from "node:worker_threads";
import { execSync } from "node:child_process";

const CONTENT_ROOT = "/path/to/content/files/en-us/webassembly";
// games
// glossary
// learn
// mdn
// mozilla
// related
// web
// webassembly

// Receive a file path, collect every commit hash that touched it, and reply.
parentPort.on("message", (filepath) => {
  console.log(`worker ${filepath}...`);
  const commitHashes = execSync(`git log --pretty=format:%H -- ${filepath}`, {
    cwd: CONTENT_ROOT,
  })
    .toString()
    .split("\n");
  parentPort.postMessage({ filepath, commitHashes, done: true });
  parentPort.close();
});
// mainThread.js
import fs from "node:fs";
import path from "node:path";
import { Worker } from "node:worker_threads";

const CONTENT_ROOT = "/path/to/content/files/en-us/webassembly";
const CACHE_FILE_PATH = "./commitHashCache.json";
let commitHashCache = {};

// Check if the cache file exists, and if so, load the cache data from the file
if (fs.existsSync(CACHE_FILE_PATH)) {
  const cacheData = fs.readFileSync(CACHE_FILE_PATH, "utf8");
  commitHashCache = JSON.parse(cacheData);
}

function saveCommitHashCacheToFile() {
  const cacheData = JSON.stringify(commitHashCache);
  fs.writeFileSync(CACHE_FILE_PATH, cacheData, "utf8");
}

// Walk the folder recursively and start one worker per uncached .md file.
async function cacheAllFiles(folder) {
  const files = fs.readdirSync(folder);
  const promises = [];
  for (const file of files) {
    const filepath = path.join(folder, file);
    const stats = fs.statSync(filepath);
    if (stats.isDirectory()) {
      promises.push(cacheAllFiles(filepath));
    }
    if (stats.isFile()) {
      if (path.extname(file) !== ".md") continue;
      if (!commitHashCache[filepath]) {
        const promise = new Promise((resolve, reject) => {
          const worker = new Worker("./worker.js", { workerData: filepath });
          worker.once("message", ({ filepath, commitHashes }) => {
            commitHashCache[filepath] = commitHashes;
            resolve();
          });
          worker.once("error", reject);
          worker.postMessage(filepath);
        });
        promises.push(promise);
      }
    }
  }
  await Promise.all(promises);
}

async function main() {
  await cacheAllFiles(CONTENT_ROOT);
  saveCommitHashCacheToFile();
}

main().catch(console.error);
Some ideas about the source commit hash cache:
But I'm not really sure about the second one.
server/translations.ts
Outdated
if (commitHashCache[filename] === undefined) {
  try {
    commitHashCache[filename] = execSync(
      `git rev-list --count ${commitHash}..HEAD -- ${filename}`,
I was able to reduce 930 seconds to 214 seconds for the ja locale files with the git rev-list --count ${commitHash}..HEAD -- ${filename} command.
- AS IS
  - 6840 files (1825 of them with a source commit): takes 930 seconds. (15m 30s)
- TO BE
  - 6840 files (1825 of them with a source commit): takes 214 seconds. (3m 34s)
But we still have to plan for up to 10000 files.
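As a rough illustration of the `git rev-list --count` approach with a per-file cache, here is a sketch under stated assumptions. `runGit` is a hypothetical injected callback that takes an argument array and returns git's stdout as a string; `countCommitsBehind` and `createCachedCounter` are illustrative names, not the functions in server/translations.ts.

```javascript
// Sketch only. `runGit(args)` is an assumed callback that runs
// `git <args>` inside the content repo and returns stdout as a string.
function countCommitsBehind(runGit, commitHash, filename) {
  const stdout = runGit([
    "rev-list",
    "--count",
    `${commitHash}..HEAD`,
    "--",
    filename,
  ]);
  const count = Number.parseInt(stdout.trim(), 10);
  if (Number.isNaN(count)) {
    throw new Error(`Unexpected git output: ${stdout}`);
  }
  return count;
}

// Memoize per (hash, file) pair so that a reload does not call git again.
function createCachedCounter(runGit) {
  const cache = new Map();
  return (commitHash, filename) => {
    const key = `${commitHash}:${filename}`;
    if (!cache.has(key)) {
      cache.set(key, countCommitsBehind(runGit, commitHash, filename));
    }
    return cache.get(key);
  };
}
```

In Node.js, `runGit` could wrap `execFileSync("git", args, { cwd: CONTENT_ROOT, encoding: "utf8" })` from `node:child_process`. Passing the arguments as an array avoids interpolating file names into a shell string.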
Generate source commit error report
This pull request has merge conflicts that must be resolved before it can be merged.
🎉🎉 Now we just create the source-commit.json file the first time, and then we can render.
It seems ready now. The first step takes a long time, but after that, just having the file should be enough.
source-commit.json file
// source-commit.json
{
"ko/glossary/accessibility": 1,
"ko/glossary/style_origin": 1,
"ko/web/javascript": 0,
"ko/web/security": 3,
"ko/web/http/csp": 1,
"ko/web/css/gap": 1,
...
}
source-commit-report.json file
Entries in this file are not deduplicated; they are appended to the end of the file each time the server is restarted. An entry is recorded when the hash stored in a document's metadata is invalid (typically because the hash does not exist in the history of the content file).
// source-commit-report.txt
ja/web/api/pointerevent/tangentialpressure: 708baf34eabb75789bcd3314a6879da3702024d1
ja/web/api/workerglobalscope/languagechange_event: 0fe2d3ee23b1b23be971d42c7c56729bd23a3f83
ja/web/api/pointerevent/getcoalescedevents: 708baf34eabb75789bcd3314a6879da3702024d1
...
Visualization
further work
I hope the source-commit data from this PR can be shown in the card in the image below. Even without details, a translated page can gain a lot of trust from people who read MDN just by showing an approximate status, like red, green (or blue), and orange traffic lights. Alternatively, it can serve as a funnel that leads potential contributors to page contributions. https://github.com/orgs/mdn/discussions/333 Let me know what you think.
To do that, I think I need a way to provide the source-commit.json file to AWS in the same way as the popularities.json file. Is there anyone who can help me?
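As one possible reading of the traffic-light idea, the sketch below maps a commits-behind count (as stored in source-commit.json) to a color. The function name `sourceCommitStatus` and the thresholds are purely hypothetical; nothing in this thread fixes them.

```javascript
// Hypothetical mapping; the thresholds are placeholders, not agreed values.
function sourceCommitStatus(commitsBehind, thresholds = { orange: 1, red: 10 }) {
  // No entry in source-commit.json, e.g. the page has no l10n.sourceCommit.
  if (commitsBehind === undefined || commitsBehind === null) return "unknown";
  if (commitsBehind >= thresholds.red) return "red";
  if (commitsBehind >= thresholds.orange) return "orange";
  return "green";
}

// Usage with keys in the same shape as the source-commit.json sample.
const counts = { "ko/web/javascript": 0, "ko/web/security": 3 };
const status = sourceCommitStatus(counts["ko/web/security"]);
```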
@caugner Hello. The PR is ready but has been sitting open, so I'm mentioning you. Could you please assign a reviewer to it? Thank you :)
This is nice! My review of some specific changes to make is incoming, but I've come up with a proof of concept for a much faster way of calculating these numbers:
On my machine, with the current approach:
The key problem with the current approach is that, by running a git command for each file, we keep traversing the same section of the commit graph, but git has to load it from disk each time. If we instead invert the process a little and load the files changed in each commit from git, storing that in memory, we can do that repeated graph traversal very quickly, because we read it from memory each time. We obviously don't want to load the entire commit graph into memory, but we can keep expanding the part loaded into memory whenever we hit a commit that isn't in it.
Have a look at my proof of concept code in: LeoMcA@a0f258c
It needs a bit of cleanup: for a start, it fully skips the cache so development was easier. With the cache invalidation I suggest in my review comments, we probably want to add that back. And it could possibly do with some better variable names, so feel free to modify it as much as you want :)
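The inverted approach described above can be sketched as follows. This is not the code in LeoMcA@a0f258c: the data shape (an array of `{ hash, files }` objects, newest first) and the names `buildCommitStore` and `commitsBehind` are assumptions for illustration, and the sketch treats history as linear.

```javascript
// Assumed input: commits newest first, each { hash, files: [...] },
// loaded once from git instead of once per translated file.
function buildCommitStore(commits) {
  return commits.map(({ hash, files }) => ({ hash, files: new Set(files) }));
}

// Count the commits newer than `sourceCommit` that touched `filename`.
// Returns undefined when `sourceCommit` is not in the loaded part of the
// history, which signals that more history would need to be loaded.
function commitsBehind(store, sourceCommit, filename) {
  let count = 0;
  for (const { hash, files } of store) {
    if (hash === sourceCommit) return count;
    if (files.has(filename)) count++;
  }
  return undefined;
}
```

Because the store lives in memory, every translated file reuses the same loaded history rather than asking git to walk it again.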
Co-authored-by: LeoMcA <LeoMcA@users.noreply.github.com>
dde37bb to 2d2dba7
86ce7b4 to 032db13
@LeoMcA Thank you for your review.
That's really good. Thank you so much for the POC.
One thing I'm worried about: it may be fine now, but as a sourceCommit gets older, more commit and file path information will be stored in memory (memCommitStore).
In my attempt in the comment above, I stored all the hashes for every file in memory, and it exploded. That happened while trying to store everything, but the current approach could also run out of memory. Also, the content repo has about 17000 commits now, and the rate at which commits are added should be taken into account.
I'm wondering whether to set a proper "circuit breaker" (meaning a criterion for stopping the expansion) now and think about the rest later. (Maybe limit the number of keys? It is difficult to pick a threshold, though.)
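One way to picture such a "circuit breaker" is a cap on how many commits are scanned before giving up. The sketch below is only an illustration: `loadOlderCommits(n)` is a hypothetical callback that returns the next `n` older commits (an empty array when history is exhausted), and the default limits are arbitrary.

```javascript
// Sketch of a bounded history scan. All names and limits are hypothetical.
function commitsBehindWithLimit(
  loadOlderCommits,
  sourceCommit,
  filename,
  { batchSize = 1000, maxCommits = 20000 } = {}
) {
  let scanned = 0;
  let count = 0;
  while (scanned < maxCommits) {
    const batch = loadOlderCommits(Math.min(batchSize, maxCommits - scanned));
    if (batch.length === 0) return { status: "not-found", count };
    for (const { hash, files } of batch) {
      if (hash === sourceCommit) return { status: "ok", count };
      if (files.includes(filename)) count++;
      scanned++;
    }
  }
  // The breaker tripped: `count` is only a lower bound at this point.
  return { status: "limit-reached", count };
}
```

Returning a status next to the count lets the caller distinguish an exact result from a lower bound.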
Another problem is that even when the commit hash for a path is correct, execSync sometimes fails and the file is recorded in source-commit-invalid-report.txt. The failing cases work fine when I run the git command in bash. This seems to need more time to fix.
Thank you :)
Really nice: I had wondered about memory usage, so thanks for prompting about it. I had a look into it, and it's not anything to worry about, but it did help to track down a few bugs!
First, I've moved away from using execSync for the git log command to spawning a child process. execSync would have been the cause of those incorrect errors reported in source-commit-invalid.txt: the git log output for two commits a long distance from each other was exceeding the execSync buffer size, so the command was killed and that error was caught by the try/catch block.
Then, with the ability to handle huge git log output, I forced the command to just load the entire content repo's hash/filename log, all the way back to the original commit. And it's not that big: comparing memory usage from before the Map is filled to after shows a ~14MB increase, so this isn't something to worry about at all.
This exposed another bug where the commit after an empty commit wouldn't be recorded in the Map correctly, which is why you'll see a switch to using NULL chars as the delimiter between separate commits and filenames.
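To show why NUL delimiters keep empty commits unambiguous, here is a small parsing sketch. The exact layout is an assumption made for this example (commit records separated by two NUL characters, with the hash and file names inside a record separated by one NUL); the actual git flags and layout used in LeoMcA@a2a0759 are not reproduced here, and `parseCommitLog` is a hypothetical name.

```javascript
// Assumed layout, for illustration only:
//   <hash>\0<file>\0<file>\0\0<hash>\0\0<hash>\0<file>
// File names cannot contain NUL, so the delimiters are unambiguous.
function parseCommitLog(raw) {
  const store = new Map(); // hash -> array of file names, in log order
  for (const record of raw.split("\0\0")) {
    if (record === "") continue;
    const [hash, ...files] = record.split("\0");
    // An empty commit yields a hash with an empty file list.
    store.set(hash, files.filter(Boolean));
  }
  return store;
}
```

With newline-separated output, a commit that changes no files can blur into the next record; an explicit record separator avoids that.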
Have a look at my commit here: LeoMcA@a2a0759
As before, have a review: one thing you'll want to remove (but I've left in so you can look at memory usage on your machine) is lines 216-219 and lines 256-259.
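For readers without the commit at hand, a before/after memory check of the kind mentioned above might look roughly like this. It is a guess at the shape, not the code at the referenced lines; `measureRss` is a hypothetical helper, while `process.memoryUsage().rss` is the standard Node.js resident set size in bytes.

```javascript
// Hypothetical helper: logs the resident set size around a synchronous step.
function measureRss(fn) {
  const before = process.memoryUsage().rss;
  const result = fn();
  const after = process.memoryUsage().rss;
  console.log(`rss before: ${before} bytes`);
  console.log(`rss after: ${after} bytes`);
  console.log(`rss diff: ${after - before} bytes`);
  return { result, diff: after - before };
}
```

The diff can be negative when garbage collection runs between the two readings, so individual values are best read as rough indications.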
Co-authored-by: LeoMcA <LeoMcA@users.noreply.github.com>
@LeoMcA I'm really sorry for the late reply; I've been busy with work for two weeks. I feel bad that I couldn't respond as quickly as your care deserved. I think the approach you suggested is great. I took log measurements on three laptops; details are at the bottom. I was worried about memory, but looking at the logs, it seems to be OK.
On a low-performance laptop (Surface Laptop 2), the ja locale took about 1m 10s longer when the source-commit.json file did not exist yet; the results depend on each laptop's performance.
I measured the following three laptops.
(1) The MacBook Pro is a company laptop, so I ran the measurements with a few other programs open.
rss log
MacBook Pro (M1, 16GB, 2021 model)
### server start (source-commit.json file not exist)
(1)
오후 9:32:33 server.1 | rss before: 262586368 bytes
오후 9:32:37 server.1 | rss after: 256000000 bytes
오후 9:32:37 server.1 | rss diff: -6586368 bytes
오후 9:32:39 server.1 | rss before: 260685824 bytes
오후 9:32:39 server.1 | rss after: 260784128 bytes
오후 9:32:39 server.1 | rss diff: 98304 bytes
오후 9:32:48 server.1 | rss before: 278544384 bytes
오후 9:32:48 server.1 | rss after: 244842496 bytes
오후 9:32:48 server.1 | rss diff: -33701888 bytes
오후 9:32:52 server.1 | Find all translated documents (ja): 20.461s
오후 9:33:03 server.1 | Find all translated documents (ko): 6.784s
(2)
오후 9:43:41 server.1 | rss before: 262602752 bytes
오후 9:43:44 server.1 | rss after: 284033024 bytes
오후 9:43:44 server.1 | rss diff: 21430272 bytes
오후 9:43:46 server.1 | rss before: 285048832 bytes
오후 9:43:46 server.1 | rss after: 285065216 bytes
오후 9:43:46 server.1 | rss diff: 16384 bytes
오후 9:43:53 server.1 | rss before: 308019200 bytes
오후 9:43:53 server.1 | rss after: 266960896 bytes
오후 9:43:53 server.1 | rss diff: -41058304 bytes
오후 9:43:59 server.1 | Find all translated documents (ja): 19.008s
오후 9:44:51 server.1 | Find all translated documents (ko): 6.601s
(3)
오후 9:48:41 server.1 | rss before: 258359296 bytes
오후 9:48:44 server.1 | rss after: 289980416 bytes
오후 9:48:44 server.1 | rss diff: 31621120 bytes
오후 9:48:46 server.1 | rss before: 283033600 bytes
오후 9:48:46 server.1 | rss after: 283082752 bytes
오후 9:48:46 server.1 | rss diff: 49152 bytes
오후 9:48:53 server.1 | rss before: 305774592 bytes
오후 9:48:53 server.1 | rss after: 264060928 bytes
오후 9:48:53 server.1 | rss diff: -41713664 bytes
오후 9:48:58 server.1 | Find all translated documents (ja): 18.781s
오후 9:49:29 server.1 | Find all translated documents (ko): 6.671s
### server restart (source-commit.json file exist)
오전 1:43:41 server.1 | Find all translated documents (ja): 14.861s
오전 1:43:49 server.1 | Find all translated documents (ko): 7.179s
### reload
오후 10:09:03 server.1 | Find all translated documents (ja): 1.389s
오후 10:08:35 server.1 | Find all translated documents (ko): 1.274s
MacBook Air (M2, 8GB, 2022 model)
### server start (source-commit.json file not exist)
(1)
오후 9:16:15 server.1 | rss before: 230408192 bytes
오후 9:16:18 server.1 | rss after: 251691008 bytes
오후 9:16:18 server.1 | rss diff: 21282816 bytes
오후 9:16:19 server.1 | rss before: 259538944 bytes
오후 9:16:19 server.1 | rss after: 259637248 bytes
오후 9:16:19 server.1 | rss diff: 98304 bytes
오후 9:16:23 server.1 | rss before: 281657344 bytes
오후 9:16:23 server.1 | rss after: 281690112 bytes
오후 9:16:23 server.1 | rss diff: 32768 bytes
오후 9:16:26 server.1 | Find all translated documents (ja): 12.284s
오후 9:17:23 server.1 | Find all translated documents (ko): 4.172s
(2)
오후 9:43:25 server.1 | rss before: 228950016 bytes
오후 9:43:28 server.1 | rss after: 248283136 bytes
오후 9:43:28 server.1 | rss diff: 19333120 bytes
오후 9:43:29 server.1 | rss before: 254115840 bytes
오후 9:43:29 server.1 | rss after: 254197760 bytes
오후 9:43:29 server.1 | rss diff: 81920 bytes
오후 9:43:34 server.1 | rss before: 277807104 bytes
오후 9:43:34 server.1 | rss after: 277839872 bytes
오후 9:43:34 server.1 | rss diff: 32768 bytes
오후 9:43:36 server.1 | Find all translated documents (ja): 12.200s
오후 9:44:54 server.1 | Find all translated documents (ko): 4.227s
(3)
오후 9:48:37 server.1 | rss before: 227262464 bytes
오후 9:48:39 server.1 | rss after: 256802816 bytes
오후 9:48:39 server.1 | rss diff: 29540352 bytes
오후 9:48:40 server.1 | rss before: 258621440 bytes
오후 9:48:40 server.1 | rss after: 258686976 bytes
오후 9:48:40 server.1 | rss diff: 65536 bytes
오후 9:48:45 server.1 | rss before: 280870912 bytes
오후 9:48:45 server.1 | rss after: 280887296 bytes
오후 9:48:45 server.1 | rss diff: 16384 bytes
오후 9:48:48 server.1 | Find all translated documents (ja): 11.466s
오후 9:49:21 server.1 | Find all translated documents (ko): 3.920s
### server restart (source-commit.json file exist)
오후 10:06:29 server.1 | Find all translated documents (ja): 12.128s
오후 10:06:39 server.1 | Find all translated documents (ko): 3.867s
### reload
오후 10:06:31 server.1 | Find all translated documents (ja): 272.352ms
오후 10:06:54 server.1 | Find all translated documents (ko): 242.033ms
Surface Laptop2 (Intel® Core™ 8-i7, 8GB)
### server start (source-commit.json file not exist)
(1)
오후 9:24:53 server.1 | rss before: 261705728 bytes
오후 9:25:12 server.1 | rss after: 274558976 bytes
오후 9:25:12 server.1 | rss diff: 12853248 bytes
오후 9:25:21 server.1 | rss before: 273227776 bytes
오후 9:25:21 server.1 | rss after: 272945152 bytes
오후 9:25:21 server.1 | rss diff: -282624 bytes
오후 9:25:56 server.1 | rss before: 303996928 bytes
오후 9:25:56 server.1 | rss after: 303996928 bytes
오후 9:25:56 server.1 | rss diff: 0 bytes
오후 9:26:19 server.1 | Find all translated documents (ja): 1:31.642 (m:ss.mmm)
오후 9:28:21 server.1 | Find all translated documents (ko): 31.898s
(2)
오후 9:50:56 server.1 | rss before: 265187328 bytes
오후 9:51:16 server.1 | rss after: 277553152 bytes
오후 9:51:16 server.1 | rss diff: 12365824 bytes
오후 9:51:24 server.1 | rss before: 284254208 bytes
오후 9:51:24 server.1 | rss after: 284266496 bytes
오후 9:51:24 server.1 | rss diff: 12288 bytes
오후 9:52:07 server.1 | rss before: 309460992 bytes
오후 9:52:07 server.1 | rss after: 309469184 bytes
오후 9:52:07 server.1 | rss diff: 8192 bytes
오후 9:52:43 server.1 | Find all translated documents (ja): 1:52.605 (m:ss.mmm)
오후 9:53:18 server.1 | Find all translated documents (ko): 30.816s
(3)
오후 10:01:01 server.1 | rss before: 252461056 bytes
오후 10:01:22 server.1 | rss after: 283996160 bytes
오후 10:01:22 server.1 | rss diff: 31535104 bytes
오후 10:01:30 server.1 | rss before: 288923648 bytes
오후 10:01:30 server.1 | rss after: 275550208 bytes
오후 10:01:30 server.1 | rss diff: -13373440 bytes
오후 10:02:12 server.1 | rss before: 300892160 bytes
오후 10:02:12 server.1 | rss after: 299499520 bytes
오후 10:02:12 server.1 | rss diff: -1392640 bytes
오후 10:02:38 server.1 | Find all translated documents (ja): 1:41.551 (m:ss.mmm)
오후 10:03:25 server.1 | Find all translated documents (ko): 30.716s
### server restart (source-commit.json file exist)
오전 1:34:51 server.1 | Find all translated documents (ja): 31.946s
오전 1:34:19 server.1 | Find all translated documents (ko): 18.694s
### reload
오후 10:04:09 server.1 | Find all translated documents (ja): 3.377s
오후 10:04:21 server.1 | Find all translated documents (ko): 2.752s
@LeoMcA is there anything blocking this from getting merged? (cc @queengooborg working on similar topics)
@mdn/mdn-community-engagement
Any progress on this PR?
@LeoMcA Hello. I've verified that the page loads swiftly enough and is functioning properly. I believe the current pull request is ready to merge. If anything else is required or missing for the merge, please let me know🙇🙇
Thank you for your PR and your patience!
We (@argl @fiji-flo @caugner) took a look and approve the changes with the following constraint: We don't actively maintain this part of the translation dashboard, and we're working on a front-end rewrite this year, so this feature might break or be removed and if that happens we cannot promise that it will come back. However, we understand the immediate value this feature brings to translators, so we merge the changes for now. 🙂
Summary
Visualization of how far the translated-content commit is behind the latest commit of content.
@mdn/localization-team-leads @queengooborg I mentioned that I need your help on the following two things. Let me know your opinions.
execSync is used to run the git command. This means there will be delays. Any good workaround? feat(translations/differences): visualize how many commits behind translations are (#8338) #8338 (comment)
Problem
In translated-content, l10n.sourceCommit is recorded as metadata, but there is nowhere in the dashboard to check it.
TODO
6840 files (1825 of them with a source commit): within 120 seconds.
Solution
Add a source commit element to the _translations/differences page.
On the dashboard page, the source commit element shows how far the translation is behind the latest commit of the content page.
Screenshots
l10n.sourceCommit exists
l10n.sourceCommit does not exist
Before
No source commit element.
After
Added source commit element.
Test
Optimization
git rev-list --count ${commitHash}..HEAD -- ${filename}