[WIP] Support source and target features #2289

Closed
anderleich wants to merge 15 commits

Conversation

anderleich
Contributor

@anderleich anderleich commented Jan 7, 2023

ATTENTION! I'm opening a new PR because the v3.0 branch has already been merged into master.
The previous PR was #2227 (closed).

This PR intends to add target features support to OpenNMT-py v3.0. All the code has been adapted for this new version.

Support for both source and target features has been refactored to simplify how features are handled. The way features are passed to the system has changed: features are now appended to the actual textual data instead of being provided in a separate file. This also simplifies how features are passed during inference and to the server. The special character │ is used as the feature separator, as in previous versions of the OpenNMT framework. For instance:

 I│1│3 love│0│1 eating│0│1 pizza│0│1
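
As a rough illustration of this format (not the code in this PR), a line can be split into tokens and per-feature columns as sketched below. The function name `split_features` and the constant `FEAT_SEP` are made up for this example, and the separator is assumed to be exactly the character shown in the line above.

```python
FEAT_SEP = "│"  # assumption: the separator exactly as it appears in the example line


def split_features(line, n_feats):
    """Split 'tok│f1│f2 ...' into a token list and one value list per feature."""
    tokens = []
    feats = [[] for _ in range(n_feats)]
    for item in line.split():
        word, *values = item.split(FEAT_SEP)
        if len(values) != n_feats:
            raise ValueError(
                f"expected {n_feats} features, got {len(values)} in {item!r}"
            )
        tokens.append(word)
        for column, value in zip(feats, values):
            column.append(value)
    return tokens, feats


tokens, feats = split_features("I│1│3 love│0│1 eating│0│1 pizza│0│1", n_feats=2)
print(tokens)  # ['I', 'love', 'eating', 'pizza']
print(feats)   # [['1', '0', '0', '0'], ['3', '1', '1', '1']]
```

Each inner list in `feats` holds one feature across the whole sentence, so it stays aligned with `tokens` position by position.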

I've also added a way to provide default values for features. This can be really useful when mixing task-specific data (with features) with general data that has not been annotated. Additionally, the filterfeats transform is no longer required; features are now checked during corpus loading.
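
One way to picture the default values, as a minimal sketch only: tokens without any annotation receive the defaults, while annotated tokens are left untouched. `apply_defaults` is a hypothetical helper written for this illustration, and how the PR actually applies defaults internally may differ.

```python
FEAT_SEP = "│"  # assumption: same separator as in the example above


def apply_defaults(line, defaults):
    """Append default feature values to tokens that carry no features."""
    default_values = defaults.split(FEAT_SEP)
    out = []
    for item in line.split():
        word, *values = item.split(FEAT_SEP)
        if not values:
            # Unannotated token: fall back to the configured defaults.
            values = default_values
        elif len(values) != len(default_values):
            raise ValueError(f"wrong number of features in {item!r}")
        out.append(FEAT_SEP.join([word, *values]))
    return " ".join(out)


print(apply_defaults("I love pizza", "0│1"))  # I│0│1 love│0│1 pizza│0│1
```

With this, unannotated general data ends up in the same shape as the annotated task-specific data before it reaches the rest of the pipeline.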

A YAML configuration file would look like this:

data:
    train:
        path_src: src_with_features.txt  #  I│1│3 love│0│1 eating│0│1 pizza│0│1
        path_tgt: tgt_with_features.txt  #  Me│1 gusta│0 comer│0 pizza│0
        transforms: [onmt_tokenize, inferfeats, filtertoolong]
    valid:
        path_src: src_with_features.txt
        path_tgt: tgt_with_features.txt
        transforms: [onmt_tokenize, inferfeats]

save_data: ./data
n_sample: -1

# # Vocab opts
src_vocab: data.vocab.src
tgt_vocab: data.vocab.tgt
n_src_feats: 2
n_tgt_feats: 1
src_feats_defaults: "0│1"
tgt_feats_defaults: "1"
feat_merge: "sum"
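
A quick sanity check on these options, sketched under the assumption that each defaults string must contain exactly as many values as the matching `n_src_feats` / `n_tgt_feats`. The helper `check_feature_opts` is hypothetical and is not part of OpenNMT-py; the options are given as a plain dict to keep the example self-contained.

```python
FEAT_SEP = "│"  # assumption: same separator as in the config above


def check_feature_opts(opts):
    """Return a list of mismatches between n_*_feats and *_feats_defaults."""
    problems = []
    for side in ("src", "tgt"):
        n_feats = opts.get(f"n_{side}_feats", 0)
        defaults = opts.get(f"{side}_feats_defaults")
        if defaults is None:
            continue  # no defaults configured for this side
        n_defaults = len(defaults.split(FEAT_SEP))
        if n_defaults != n_feats:
            problems.append(
                f"{side}: {n_defaults} default value(s) for {n_feats} feature(s)"
            )
    return problems


opts = {
    "n_src_feats": 2,
    "n_tgt_feats": 1,
    "src_feats_defaults": "0│1",
    "tgt_feats_defaults": "1",
}
print(check_feature_opts(opts))  # []
```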

For now, I've made the necessary changes in the code for vocabulary generation, that is, to make onmt_build_vocab work.

@vince62s
Member

@anderleich is there anything (at least source features) that we can test so far? Are you still working on this?

@anderleich
Contributor Author

Hi @vince62s ,

Yes, I'm still working on this. I haven't had much time for it lately, but I'm on the right path. I've already managed to modify the vocabulary building and training scripts, and both work with source and target features. I still need to ensure nothing is broken for other use cases (some debugging to make the tests pass). After that, the only thing left would be to modify inference (onmt_translate and the server).

@anderleich
Contributor Author

Hi @vince62s ,

I've made code changes so that other configurations pass the tests. I still have some issues with beam search translation and LM generation in the tests. Maybe you can shed some light on this? However, I think we are at a stage where we can begin to discuss and test the changes I've made so far.

Note: I took the changes @francoishernandez made in this PR #1710 as a guide.

@anderleich anderleich changed the title from "[WIP] Support target features" to "[WIP] Support source and target features" on Feb 2, 2023
@vince62s
Member

vince62s commented Feb 2, 2023

OK, I have made a first pass. I need to clone your repo and test locally on my system. I will try to do it ASAP.
Just make sure you have rebased and added the few things.

@anderleich
Contributor Author

anderleich commented Feb 3, 2023

I think that having dictionary-like examples was the source of many inconsistencies with source or target features. Therefore, I've added a new class, onmt.inputters.example.Example, to store all the necessary information for the input data, as well as to numericalize and transform the examples. This new class handles all the possible combinations for the input data, that is, the presence or absence of the target sentence, source features, target features, alignments, the copy mechanism...
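
To give a rough idea of the shape such a class could take, here is a minimal sketch. It is not the actual onmt.inputters.example.Example from this PR: the class name `ExampleSketch`, its fields, and the `numericalize` signature are all assumptions made for illustration, and features are left out of the numericalization to keep it short.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ExampleSketch:
    """Hypothetical container replacing a dict-like example."""

    src: List[str]
    tgt: Optional[List[str]] = None  # absent at inference time
    src_feats: List[List[str]] = field(default_factory=list)
    tgt_feats: List[List[str]] = field(default_factory=list)
    align: Optional[str] = None

    def numericalize(
        self, src_vocab: Dict[str, int], tgt_vocab: Dict[str, int], unk_id: int = 0
    ) -> Tuple[List[int], Optional[List[int]]]:
        """Map tokens to ids; unknown tokens map to unk_id."""
        src_ids = [src_vocab.get(tok, unk_id) for tok in self.src]
        tgt_ids = None
        if self.tgt is not None:
            tgt_ids = [tgt_vocab.get(tok, unk_id) for tok in self.tgt]
        return src_ids, tgt_ids


ex = ExampleSketch(
    src=["I", "love", "pizza"],
    src_feats=[["1", "0", "0"], ["3", "1", "1"]],
)
print(ex.numericalize({"I": 1, "love": 2}, {}))  # ([1, 2, 0], None)
```

The point of explicit optional fields is that a missing target or missing features is represented the same way everywhere, instead of each transform checking for dictionary keys.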

I could not come up with a reason for keeping the dict-like examples. Can you?

@anderleich
Contributor Author

I've also made the necessary changes for beam search decoding. It works with my working pipeline; however, some tests are failing... I'll keep working on it.

@vince62s
Member

vince62s commented Feb 3, 2023

@anderleich I don't mind discussing this new stuff, but I really think it is too many changes for a single PR, especially after a first review has already been done.
It might be good to settle on some of the changes first and then introduce new concepts step by step.

@anderleich
Contributor Author

I agree that adding the Example class resulted in many small changes in the code, especially in the transforms. However, I think that overall it helped solve some of the inconsistencies with data I was dealing with, especially now that I've also added target features.

@vince62s
Member

vince62s commented Feb 3, 2023

I get it, but the issue is that unit tests are missing in many places, so we'll never be sure that it does not break things. I would really prefer to do things in at least two steps.

@anderleich
Contributor Author

Do you know a quick way to split the changes into two steps? Quicker than typing everything again...
I think it is worth a try, provided I can get all the unit tests implemented so far to pass.

@vince62s
Member

vince62s commented Feb 4, 2023

You need to

  1. create a new clean branch from master (don't forget to pull master too)
  2. then cherry-pick commits from your working branch (search online for how to do this; it's not difficult)
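
The two steps above can be sketched as the following sequence of git commands, built here as a small Python helper. The branch name and commit hashes are placeholders, and the functions `cherry_pick_plan` and `run_plan` are made up for this example; only standard git subcommands (`checkout`, `pull`, `cherry-pick`) are used.

```python
import subprocess


def cherry_pick_plan(new_branch, commits, base="master"):
    """Build the git commands for a clean branch plus cherry-picked commits."""
    return [
        ["git", "checkout", base],
        ["git", "pull"],  # make sure the base branch is up to date
        ["git", "checkout", "-b", new_branch],
        ["git", "cherry-pick", *commits],
    ]


def run_plan(plan):
    """Execute each command in order, stopping at the first failure."""
    for cmd in plan:
        subprocess.run(cmd, check=True)


# Only print the plan here; call run_plan(...) inside a real clone to execute it.
for cmd in cherry_pick_plan("restore-src-feats", ["<sha1>", "<sha2>"]):
    print(" ".join(cmd))
```

If a cherry-pick stops on a conflict, it has to be resolved by hand and resumed with `git cherry-pick --continue`.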

@anderleich
Contributor Author

anderleich commented Feb 6, 2023

Hi @vince62s ,

This is what I've finally planned. I will submit 3 different PRs to keep the changes simpler and make reviews easier:

  1. Restore source features so they are available again as soon as possible
  2. Add the Example class to lay the groundwork for the more complex scenario with target features
  3. Add target features support

What do you think?

@vince62s
Member

vince62s commented Feb 6, 2023

We can try, let's do step 1. and we'll see how it goes.

@anderleich
Contributor Author

@vince62s I've created a new PR, #2308, to restore source features. All the tests are passing. I'm planning to carry out some more checks to ensure everything is working as expected, but overall the source features functionality is back. You can start reviewing the code.

TODO: I need to update the docs


2 participants