[WIP] Support source and target features #2289

anderleich · 2023-01-07T21:08:42Z

ATTENTION! I'm opening a new PR because the v3.0 branch has already been merged into master.
The previous PR, #2227, is now closed.

This PR intends to add target features support to OpenNMT-py v3.0. All the code has been adapted for this new version.

Support for both source and target features has been refactored to simplify feature handling. Features are now appended to the actual textual data instead of being provided in a separate file. This also simplifies how features are passed during inference and to the server. The special character ￨ is used as the feature separator, as in previous versions of the OpenNMT framework. For instance:

 I￨1￨3 love￨0￨1 eating￨0￨1 pizza￨0￨1
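As an illustration, here is a minimal sketch of how a line in this format could be split into tokens and per-feature streams. The helper name `split_features` is hypothetical and is not part of OpenNMT-py; the only thing taken from the description above is the ￨ separator.

```python
FEAT_SEP = "￨"  # feature separator used in the example line above


def split_features(line, n_feats):
    """Split 'tok￨f1￨f2 ...' into a token list and n_feats feature lists.

    Hypothetical helper for illustration only, not OpenNMT-py code.
    """
    tokens = []
    feats = [[] for _ in range(n_feats)]
    for item in line.split():
        # The first field is the word, the remaining fields are its features.
        word, *values = item.split(FEAT_SEP)
        if len(values) != n_feats:
            raise ValueError(
                f"expected {n_feats} features, got {len(values)} in {item!r}"
            )
        tokens.append(word)
        for stream, value in zip(feats, values):
            stream.append(value)
    return tokens, feats


tokens, feats = split_features("I￨1￨3 love￨0￨1 eating￨0￨1 pizza￨0￨1", n_feats=2)
print(tokens)  # ['I', 'love', 'eating', 'pizza']
print(feats)   # [['1', '0', '0', '0'], ['3', '1', '1', '1']]
```

Each feature ends up as its own sequence, aligned token by token with the text.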

I've also added a way to provide default values for features. This can be really useful when mixing task-specific data (with features) with general data that has not been annotated. Additionally, the filterfeats transform is no longer required, since features are now checked during corpus loading.
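To make the idea of default values concrete, here is a rough sketch of how unannotated tokens could be completed. It assumes the defaults are given as a ￨-joined string, like the `src_feats_defaults` value in this PR's configuration, and that defaults are applied per token. The function `apply_defaults` is hypothetical and may not match how the PR actually implements this.

```python
FEAT_SEP = "￨"


def apply_defaults(line, defaults):
    """Append default feature values to tokens that carry no features.

    `defaults` is a ￨-joined string such as "0￨1". This is an assumption
    about the behaviour, not the PR's implementation.
    """
    n_feats = len(defaults.split(FEAT_SEP))
    out = []
    for item in line.split():
        n_found = item.count(FEAT_SEP)
        if n_found == 0:
            # Unannotated token: attach the default values.
            item = item + FEAT_SEP + defaults
        elif n_found != n_feats:
            raise ValueError(
                f"expected {n_feats} features, got {n_found} in {item!r}"
            )
        out.append(item)
    return " ".join(out)


annotated = apply_defaults("I love pizza", "0￨1")
print(annotated)  # I￨0￨1 love￨0￨1 pizza￨0￨1
```

Tokens that already carry the expected number of features are left untouched, so annotated and general data can share one pipeline.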

A YAML configuration file would look like this:

data:
    train:
        path_src: src_with_features.txt  #  I￨1￨3 love￨0￨1 eating￨0￨1 pizza￨0￨1
        path_tgt: tgt_with_features.txt  #  Me￨1 gusta￨0 comer￨0 pizza￨0
        transforms: [onmt_tokenize, inferfeats, filtertoolong]
    valid:
        path_src: src_with_features.txt
        path_tgt: tgt_with_features.txt
        transforms: [onmt_tokenize, inferfeats]

save_data: ./data
n_sample: -1

# # Vocab opts
src_vocab: data.vocab.src
tgt_vocab: data.vocab.tgt
n_src_feats: 2
n_tgt_feats: 1
src_feats_defaults: "0￨1"
tgt_feats_defaults: "1"
feat_merge: "sum"

For now, I've made the code changes needed for vocabulary generation, that is, to make onmt_build_vocab work.
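Conceptually, building vocabularies for data in this format means counting words and each feature separately. The sketch below shows that idea with one counter per feature. The function `count_vocab` is hypothetical; how onmt_build_vocab stores feature vocabularies internally is not shown in this thread.

```python
from collections import Counter

FEAT_SEP = "￨"


def count_vocab(lines, n_feats):
    """Count words and feature values in separate counters.

    Illustrative only; not the onmt_build_vocab implementation.
    """
    word_counts = Counter()
    feat_counts = [Counter() for _ in range(n_feats)]
    for line in lines:
        for item in line.split():
            word, *values = item.split(FEAT_SEP)
            if len(values) != n_feats:
                raise ValueError(
                    f"expected {n_feats} features, got {len(values)} in {item!r}"
                )
            word_counts[word] += 1
            for counter, value in zip(feat_counts, values):
                counter[value] += 1
    return word_counts, feat_counts


corpus = [
    "I￨1￨3 love￨0￨1 eating￨0￨1 pizza￨0￨1",
    "I￨1￨3 love￨0￨1 pasta￨0￨1",
]
words, feats = count_vocab(corpus, n_feats=2)
print(words["love"])  # 2
print(feats[0]["0"])  # 5
```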

vince62s · 2023-01-27T14:23:43Z

@anderleich is there anything (at least source features) that we can test so far? Are you still working on this?

anderleich · 2023-01-28T11:05:09Z

Hi @vince62s ,

Yes, I'm still working on this. I haven't had much time for it lately, but I'm on the right path. I've already managed to modify the vocabulary building and training scripts. Both work for source and target features. I need to ensure nothing is broken for other use cases (some debugging to make the tests pass). After that, the only thing left would be to modify inference (onmt_translate and the server).

anderleich · 2023-02-02T13:04:59Z

Hi @vince62s ,

I've made code changes to make other configurations pass the tests. I still have some issues with beam search translation and LM generation in the tests. Maybe you can shed some light on this? However, I think we are at a stage where we can begin to discuss and test the changes I've made so far.

Note: I took the changes @francoishernandez made in this PR #1710 as a guide.

vince62s · 2023-02-02T14:28:02Z

OK, I have made a first pass. I need to clone your repo and test locally on my system. I will try to do it ASAP.
Just make sure you have rebased and added the few things.

anderleich · 2023-02-03T20:03:55Z

I think that having dictionary-like examples was the source of many inconsistencies with source or target features. Therefore, I've added a new class, onmt.inputters.example.Example, to store all the necessary information for the input data, as well as to numericalize and transform the examples. This new class handles all the possible combinations for the input data, that is, the presence or absence of the target sentence, source features, target features, alignments, the copy mechanism...
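To give a rough idea of what such a container could look like, here is a simplified sketch. The class name `ExampleSketch`, its fields, and the `is_consistent` method are all assumptions made for illustration; they are not the actual onmt.inputters.example.Example API, and numericalization and transforms are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExampleSketch:
    """Hypothetical container for one training example.

    Field names are assumptions, not the class added in this PR.
    """

    src: List[str]
    tgt: Optional[List[str]] = None  # absent at inference time
    src_feats: List[List[str]] = field(default_factory=list)
    tgt_feats: List[List[str]] = field(default_factory=list)
    align: Optional[str] = None  # optional alignment string

    def is_consistent(self) -> bool:
        """Check that every feature stream matches its token sequence length."""
        src_ok = all(len(f) == len(self.src) for f in self.src_feats)
        if self.tgt is None:
            # Without a target sentence there must be no target features.
            return src_ok and not self.tgt_feats
        tgt_ok = all(len(f) == len(self.tgt) for f in self.tgt_feats)
        return src_ok and tgt_ok


ex = ExampleSketch(
    src=["I", "love", "pizza"],
    src_feats=[["1", "0", "0"], ["3", "1", "1"]],
    tgt=["Me", "gusta", "pizza"],
    tgt_feats=[["1", "0", "0"]],
)
print(ex.is_consistent())  # True
```

Compared with a plain dict, optional parts such as the target or the alignments have explicit defaults, so code reading an example does not need to test for missing keys.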

I could not come up with a reason for keeping the dict-like examples. Can you?

anderleich · 2023-02-03T20:05:48Z

I've also made the necessary changes for beam search decoding. It works with my pipeline; however, some tests are failing... I'll keep working on it.

vince62s · 2023-02-03T20:15:34Z

@anderleich I don't mind discussing this new stuff, but I really think it is too many changes for a single PR, especially after a first review has already been done.
It might be good to settle on some of the changes first and then introduce new concepts step by step.

anderleich · 2023-02-03T20:21:45Z

I agree that adding the Example class resulted in many small changes in the code, especially in the transforms. However, I think that overall it helped solve some of the data inconsistencies I was dealing with, especially now that I've also added target features.

vince62s · 2023-02-03T20:47:44Z

I get it, but the issue is that we are missing unit tests in many places, so we'll never be sure that it does not break things. I would really prefer to do things in at least two steps.

anderleich · 2023-02-03T21:26:47Z

Do you know a quick method to split the changes into two steps? Quicker than typing everything again...
I think it is worth a try, if I manage to pass all the unit tests implemented so far.

vince62s · 2023-02-04T07:54:01Z

You need to:

1. create a new clean branch from master (don't forget to pull master too)
2. cherry-pick commits from your working branch (search for it, it is not difficult)

anderleich · 2023-02-06T09:23:54Z

Hi @vince62s ,

Here is what I've finally planned. I will submit 3 different PRs to keep the changes simpler and make reviews easier:

1. Restore source features, to have them back as soon as possible
2. Add the Example class to lay the groundwork for the more complex scenario with target features
3. Add target features support

What do you think?

vince62s · 2023-02-06T10:03:54Z

We can try. Let's do step 1 and we'll see how it goes.

anderleich · 2023-02-06T13:09:05Z

@vince62s I've created a new PR, #2308, to restore source features. All the tests are passing. I'm planning to carry out some more checks to ensure everything works as expected, but overall the source features functionality is back. You can start reviewing the code.

TODO: I need to update the docs

anderleich and others added 7 commits October 24, 2022 16:16

Added target features support to build_vocab

ff6a605

Merge branch 'v3.0' into support_target_features

0e5cc73

reinstate apex.amp (O1 O2) (OpenNMT#2220)

4cb2e0c

Merge remote-tracking branch 'upstream/master' into support_target_features

156f646

Update comment

8b5600f

Add target features support to training part

a8e6fe6

Merge branch 'v3.0' into support_target_features

471e12c

anderleich mentioned this pull request Jan 7, 2023

[WIP] Support target features #2227

Closed

anderleich added 4 commits February 1, 2023 15:46

Fixed flake8 errors

ae08ccb

Merge branch 'master' into support_target_features

252f81f

Code changes to make other configurations pass tests

fe06fe6

Merge branch 'support_target_features' of https://github.com/anderleich/OpenNMT-py into support_target_features

df2e3f8

anderleich changed the title ~~[WIP] Support target features~~ [WIP] Support source and target features Feb 2, 2023