more updates for 15.1 data workflow
markusicu committed Feb 7, 2023
1 parent 3ab01ec commit 5f2992b
Showing 4 changed files with 50 additions and 44 deletions.
25 changes: 14 additions & 11 deletions docs/build.md
@@ -217,9 +217,7 @@ See the top level `pom.xml` under `<properties>`.

The input data files for the Unicode Tools have been checked into the repo since
2012-dec-21:

* <https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd>
* <https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd>
* https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/

This is inside the unicodetools file tree, and the Java code has been updated to
assume that. Any old Eclipse setup needs its path variables checked.
@@ -242,7 +240,9 @@ Starting with Unicode 15, we are developing most of the Unicode data files
in this Unicode Tools project, and publish them to the Public folder
only for alpha/beta/final releases.
That is, we are reversing the flow of files.
(See [issue #144](https://github.com/unicode-org/unicodetools/issues/144).)

See [data workflow](data-workflow.md). (Based on
[issue #144](https://github.com/unicode-org/unicodetools/issues/144).)

We are also no longer generating and posting files with version suffixes.
(We now generate files into an output folder with the Unicode version number.)
@@ -255,7 +255,7 @@ unversioned "dev" folders in this repo.

#### Unicode 15.1+ workflow

See data-workflow.md .
See [data workflow](data-workflow.md).

### Unicode 15.0.0 changes

@@ -374,10 +374,10 @@ to generate new files). For all the new ones:
Make a pull request to incorporate these updates, and upload the generated files
in a way that can be shared with ucd-dev.

Unicode 15 TODO:
We plan to
Unicode 15+:
- make a commit for changes in input data files
- copy the output files back into the input folders, review, and commit again

... instead of posting draft files elsewhere and re-ingesting them later.
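
The copy-back step in the Unicode 15+ list above can be illustrated with a small helper. This is only a sketch under assumptions, not a script from this repo: the function name `copy_changed_outputs` and both folder arguments are hypothetical. It copies each generated file that is new or differs from the input copy, and returns the list of copied files so they can be reviewed before the second commit.

```python
import filecmp
import shutil
from pathlib import Path


def copy_changed_outputs(output_dir, input_dir):
    """Copy generated files that are new or differ from the input copy.

    Returns the sorted relative paths that were copied, so they can be
    reviewed (for example with `git diff`) before committing again.
    Both folder arguments are hypothetical; pass whatever folders you use.
    """
    output_dir, input_dir = Path(output_dir), Path(input_dir)
    copied = []
    for src in sorted(output_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(output_dir)
        dst = input_dir / rel
        # shallow=False compares file contents, not just size and mtime.
        if dst.exists() and filecmp.cmp(src, dst, shallow=False):
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dst)
        copied.append(rel.as_posix())
    return copied
```

Running it a second time on unchanged folders returns an empty list, which is a quick check that the input folders already match the generated output.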

Ideally, diff the files to check for any discrepancies. The script will do this
@@ -530,13 +530,16 @@ If there are new break rules (or changes), see
Unicode.
4. On Windows you can run these BATs to compare files: TODO??
### Upload for Ken Whistler & editorial committee
### Upload for Ken Whistler & other reviewers
Unicode 15 TODO: See above; commit new input data, run tools, review output, copy back to input, commit, pull request...
Unicode 15+: See above; commit new input data, run tools, review output, copy back to input, commit, pull request...
1. Check diffs for problems
2. First drop for a version: Upload **all** files
3. Subsequent drop for a version: Upload *only modified* files
2. Ask for reviews on the pull request.
3. During alpha and beta, we publish whole snapshots of multiple repo data folders
using publication scripts: See [data workflow](data-workflow.md).
We no longer post files to FTP folders, nor do we publish individual files without the corresponding changes in other files.
### Invariant Checking
52 changes: 32 additions & 20 deletions docs/inputdata.md
@@ -6,7 +6,9 @@ Starting with Unicode 15, we are developing most of the Unicode data files
in this Unicode Tools project, and publish them to the Public folder
only for alpha/beta/final releases.
That is, we are reversing the flow of files.
(See [issue #144](https://github.com/unicode-org/unicodetools/issues/144).)

See [data workflow](data-workflow.md). (Based on
[issue #144](https://github.com/unicode-org/unicodetools/issues/144).)

We are also no longer generating and posting files with version suffixes.

@@ -15,6 +17,34 @@ and we continue to ingest them as before.

## Source Files

*Starting with Unicode 15.1, the “source of truth” for most data files is in the repo,
and most of this section is obsolete. See [data workflow](data-workflow.md).
The biggest exception is Unihan.zip, which we don't track in the repo; see the Unihan section below.
Also, it's still useful to delete the BIN files/folders after changing data files.*

### Unihan

You may need to manually change the "Unihan-8.0.0d2 Folder" to "Unihan".

Unzip the Unihan.zip file into a "Unihan" subfolder.

Starting with Unicode 13, we split the Unihan data into single-property files
and parse those.

Run the script that is checked in at
[py/splitunihan.py](../py/splitunihan.py)
with one argument, the path to the Unihan folder.

Ignore or delete the Unihan\*.txt files now. Do not check them into the tools
any more.

Check for new and no-longer-present files (Unihan properties).
`git add` and `git rm` as necessary.
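
As a rough aid for the last step, the set of single-property files before and after the split can be compared to see which names need `git add` and which need `git rm`. This is a minimal sketch, not part of `splitunihan.py`: the helper names are hypothetical, and the file names in the usage note are made up for illustration.

```python
from pathlib import Path


def list_txt_files(folder):
    """List the names of the .txt files directly inside folder."""
    return sorted(p.name for p in Path(folder).glob("*.txt"))


def unihan_file_changes(old_names, new_names):
    """Return (to_add, to_remove) as sorted lists of file names.

    old_names: file names present before re-running the split.
    new_names: file names present afterwards.
    """
    old_names, new_names = set(old_names), set(new_names)
    return sorted(new_names - old_names), sorted(old_names - new_names)
```

For example, `unihan_file_changes(["kOld.txt", "kKept.txt"], ["kKept.txt", "kNew.txt"])` reports `kNew.txt` as a file to add and `kOld.txt` as a file to remove.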

### Fetching files from Public

Only for Unicode 15.0 and earlier:

The source files that you will need for a release such as 8.0.0 are in:

* [ftp://unicode.org/Public/8.0.0/ucd](ftp://unicode.org/Public/8.0.0/ucd)
@@ -68,6 +98,7 @@ files have the version suffix.
### Removing Suffixes

Only for Unicode 14 and earlier:

For the ucd and uca files, you will have to remove the suffixes.
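
To illustrate what removing a suffix means for a single file name, here is a minimal sketch. It is not the logic of the checked-in `desuffixucd.py`; the function name and the assumed suffix shape (a hyphen, a three-field version, and an optional draft marker such as `d2`, directly before the extension) are assumptions made for this example.

```python
import re

# Assumed suffix shape: "-<major>.<minor>.<update>" plus an optional draft
# marker such as "d2", directly before the file extension.
_VERSION_SUFFIX = re.compile(r"-\d+\.\d+\.\d+(?:d\d+)?(?=\.[A-Za-z0-9]+$)")


def strip_version_suffix(filename):
    """Return filename without its version suffix, or unchanged if none."""
    return _VERSION_SUFFIX.sub("", filename)
```

Under that assumption, a name like `UnicodeData-8.0.0d2.txt` becomes `UnicodeData.txt`, and a name without a suffix is returned unchanged.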

Tip: On Linux, you can remove version suffixes on the command line like this:
@@ -134,25 +165,6 @@ $ cd {workspace}/unicodetools/data/ucd/staging
$ ../../desuffixucd.py .
```

### Unihan

You may need to manually change the "Unihan-8.0.0d2 Folder" to "Unihan".

Unzip the Unihan.zip file into a "Unihan" subfolder.

Starting with Unicode 13, we split the Unihan data into single-property files
and parse those.

Run the script that is checked in at
[py/splitunihan.py](../py/splitunihan.py)
with one argument, the path to the Unihan folder.

Ignore or delete the Unihan\*.txt files now. Do not check them into the tools
any more.

Check for new and no-longer-present files (Unihan properties).
`git add` and `git rm` as necessary.

## Original data file setup instructions

### 2. Download all of the UnicodeData files for each version into UCD_DIR.
7 changes: 0 additions & 7 deletions docs/security.md
@@ -2,13 +2,6 @@

## Modifying

Create new revision directory, such as .../unicodetools/data/security/6.3.0. The
folder will match the version of the UCD used (perhaps with an incrementing 3rd
field).

* As usual, use `git cp` to copy the previous directory to the new one. Do not
just "mkdir" and copy the files!

To add or fix xidmodifications, look at source/removals.txt.

To add or fix confusables, there are multiple source files. Many were
10 changes: 4 additions & 6 deletions docs/uca/index.md
@@ -7,11 +7,12 @@ the character properties are pretty stable (coming up on the beta),
Ken inserts all of the new characters into the default sort order.

For a few releases, he has documented his incremental progress with valuable notes
sent to the ucd-dev mailing list.
sent to the properties mailing list (formerly the ucd-dev list).
Markus has been taking the incremental file changes, and the notes, into this repo.

See the history of commits that changed decomps.txt and allkeys.txt.
(We lost some of that history in the Unicode server crash of 2020.)
- For UCA 15.1 see https://github.com/unicode-org/unicodetools/pull/403
- For UCA 15 see https://github.com/unicode-org/unicodetools/pull/246
- For UCA 14 see https://github.com/unicode-org/unicodetools/pull/71
- For the collection of notes for UCA 10 see ducet.md.
@@ -34,12 +35,9 @@ for the CLDR/ICU FractionalUCA.txt data.
2. We also need the UCA/DUCET files in
https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/uca/dev
When they become first available for a new version, or when they are updated:
1. Note that the following steps are probably no longer necessary.
Instead, we get the updated files from Ken, or we run the sifter tool, and
1. We get the updated files from Ken, or we run the sifter tool, and
update the files in .../data/uca/dev.
1. Download UCA files (mostly allkeys.txt) from
`https://www.unicode.org/Public/UCA/{beta version}/`
1. Run `desuffixucd.py` (see the [inputdata](../inputdata.md) page)
1. Download Ken's UCA files (allkeys.txt & decomps.txt).
1. Update the input files for the UCA tools, at
{this repo}/unicodetools/data/uca/dev
3. You will use `org.unicode.text.UCA.Main` as your main class.
