Improve random state handling #801

Merged
merged 13 commits into from
May 20, 2019

ClemDoum
Collaborator

@ClemDoum ClemDoum commented May 13, 2019

Description:
Currently:

  • Due to a scikit-learn bug, the intent classification training was not deterministic
  • Some data augmentation code was also making the training non-deterministic

Done:

  • Integrated sklearn==0.21, which contains a fix that makes SGDClassifier training deterministic
  • Moved the NLU random state from the config to the shared resources
  • Fixed a couple of bugs in data augmentation that made the training non-deterministic
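
The move of the random state from the config to the shared resources can be illustrated with a minimal sketch. It is not the code from this PR: `ToyUnit`, `train_pipeline` and the `shared` dictionary are hypothetical names, and the standard library `random.Random` stands in for whatever random state object the project actually uses. The idea shown is that a single random state is created once and handed to every unit, instead of each unit config carrying its own seed.

```python
import random


class ToyUnit:
    """Hypothetical processing unit; not part of the actual code base."""

    def __init__(self, config=None, **shared):
        self.config = config or {}
        # The random state comes from the shared resources, not the config.
        self.random_state = shared["random_state"]

    def fit(self, utterances):
        # Stand-in for a training step that consumes randomness.
        order = list(utterances)
        self.random_state.shuffle(order)
        return order


def train_pipeline(seed, utterances):
    # One random state, created once and shared by every unit.
    shared = {"random_state": random.Random(seed)}
    classifier = ToyUnit(**shared)
    slot_filler = ToyUnit(**shared)
    return classifier.fit(utterances), slot_filler.fit(utterances)


utterances = ["make me a tea", "book a trip", "what is the weather"]
first_run = train_pipeline(42, utterances)
second_run = train_pipeline(42, utterances)
print(first_run == second_run)  # True: same seed, same sequence of draws
```

With this layout the unit configs no longer need a seed field, which is consistent with the review comments below about configs that are no longer needed in the tests.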

Checklist:

  • My PR is ready for code review
  • I have added some tests, if applicable, and run the whole test suite, including linting tests
  • I have updated the documentation, if applicable

@ClemDoum ClemDoum force-pushed the task/improve-random-seed branch from e29cc06 to 6ff0eea Compare May 13, 2019 14:07
@codecov-io

codecov-io commented May 16, 2019

Codecov Report

Merging #801 into develop will increase coverage by 0.04%.
The diff coverage is 100%.

@@             Coverage Diff             @@
##           develop     #801      +/-   ##
===========================================
+ Coverage    88.42%   88.47%   +0.04%     
===========================================
  Files           76       76              
  Lines         4571     4571              
  Branches       882      882              
===========================================
+ Hits          4042     4044       +2     
+ Misses         397      395       -2     
  Partials       132      132

@ClemDoum ClemDoum force-pushed the task/improve-random-seed branch from f956a78 to 9e43dfd Compare May 16, 2019 09:21
@ClemDoum ClemDoum requested a review from adrienball May 16, 2019 09:47
while True:
    noise_length = int(random_state.normal(mean_length, std_length))
    i += 1
Contributor

Unused variable i
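
For reference, a loop of this shape can be written without the counter. The following is only a sketch under assumptions: `noise_iterator` and `sample_noise` are hypothetical names, and `random.Random.gauss` from the standard library replaces the `random_state.normal` call seen in the diff, so that the example runs without third-party packages.

```python
import itertools
import random


def noise_iterator(noise_words, mean_length, std_length, random_state):
    """Yield noise utterances whose lengths are drawn from a normal law."""
    words = itertools.cycle(noise_words)
    while True:
        # A negative draw gives an empty range, hence an empty utterance.
        noise_length = int(random_state.gauss(mean_length, std_length))
        yield " ".join([next(words) for _ in range(noise_length)])


def sample_noise(seed, count=5):
    # The random state is passed in explicitly so the caller controls the seed.
    random_state = random.Random(seed)
    iterator = noise_iterator(
        ["a", "b", "c", "d"], mean_length=3, std_length=1,
        random_state=random_state)
    return list(itertools.islice(iterator, count))


print(sample_noise(7) == sample_noise(7))  # True: same seed, same noise
```

Because the generator takes the random state as an argument rather than reaching for a global one, two runs with the same seed produce the same noise utterances.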

@@ -35,8 +35,9 @@ def test_should_get_slots(self):
- make me [number_of_cups:snips/number](five) cups of tea
- please I want [number_of_cups](two) cups of tea""")
dataset = Dataset.from_yaml_files("en", [dataset_stream]).json
- config = CRFSlotFillerConfig(random_seed=42)
+ config = CRFSlotFillerConfig()
Contributor

The config is not needed anymore here.

@@ -101,10 +104,11 @@ def test_should_get_sub_builtin_slots(self):
- find an activity from [start](6pm) to [end](8pm)
- Book me a trip from [start](this friday) to [end](next tuesday)""")
dataset = Dataset.from_yaml_files("en", [dataset_stream]).json
- config = CRFSlotFillerConfig(random_seed=42)
+ config = CRFSlotFillerConfig()
Contributor

Same comment

@@ -65,9 +66,11 @@ def test_should_get_builtin_slots(self):
- Can you tell me the weather [datetime] please ?
- what is the weather forecast [datetime] in [location](paris)""")
dataset = Dataset.from_yaml_files("en", [dataset_stream]).json
- config = CRFSlotFillerConfig(random_seed=42)
+ config = CRFSlotFillerConfig()
Contributor

The config is not needed anymore here.

@@ -356,9 +360,10 @@ def test_should_get_slots_after_deserialization(self):
- i want [number_of_cups] cups of tea please
- can you prepare [number_of_cups] cups of tea ?""")
dataset = Dataset.from_yaml_files("en", [dataset_stream]).json
- config = CRFSlotFillerConfig(random_seed=42)
+ config = CRFSlotFillerConfig()
Contributor

Same comment

- classifier_config = LogRegIntentClassifierConfig(random_seed=42)
- slot_filler_config = CRFSlotFillerConfig(random_seed=42)
+ classifier_config = LogRegIntentClassifierConfig()
+ slot_filler_config = CRFSlotFillerConfig()
parser_config = ProbabilisticIntentParserConfig(
Contributor

Same comment

- classifier_config = LogRegIntentClassifierConfig(random_seed=42)
- slot_filler_config = CRFSlotFillerConfig(random_seed=42)
+ classifier_config = LogRegIntentClassifierConfig()
+ slot_filler_config = CRFSlotFillerConfig()
parser_config = ProbabilisticIntentParserConfig(
Contributor

Same comment

@@ -162,9 +169,12 @@ def test_should_get_intents(self):
utterances:
- yili yulu yele""")
dataset = Dataset.from_yaml_files("en", [dataset_stream]).json
- classifier_config = LogRegIntentClassifierConfig(random_seed=42)
+ classifier_config = LogRegIntentClassifierConfig()
parser_config = ProbabilisticIntentParserConfig(classifier_config)
Contributor

Same comment

- random_seed=seed1),
- slot_filler_config=CRFSlotFillerConfig(random_seed=seed2)
+ intent_classifier_config=LogRegIntentClassifierConfig(),
+ slot_filler_config=CRFSlotFillerConfig()
)
Contributor

Same comment

different outputs.

If you want to run training in a reproducible way you can pass a random seed to
your engine:
Contributor

I prefer using a more impersonal form in the documentation, but that's just a suggestion. That would be something like:

Reproducible training and testing can be achieved by passing a 
**random seed** to the engine:

@@ -174,6 +174,26 @@ the dataset we generated earlier:

engine.fit(dataset)

Note that by default, the training of the engine is non-deterministic: if you
train your NLU twice on the same data and test it on the same input, you'll get
different outputs.
Contributor

@adrienball adrienball May 16, 2019

I would be a bit more optimistic in the formulation:

Note that, by default, the training of the NLU engine is a non-deterministic process: 
training and testing multiple times on the same data may produce different outputs.
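
The behaviour described in this documentation change (non-deterministic by default, reproducible when a seed is passed) can be sketched as follows. This is an illustration under assumptions rather than the engine's implementation: `ToyEngine` and this `check_random_state` helper are hypothetical, and they are built on the standard library `random` module.

```python
import random


def check_random_state(seed=None):
    """Turn None, an int seed or a random.Random into a random.Random."""
    if seed is None:
        # No seed: seeded from system entropy, so runs are not reproducible.
        return random.Random()
    if isinstance(seed, random.Random):
        return seed
    if isinstance(seed, int):
        return random.Random(seed)
    raise ValueError("%r cannot be used to seed a random state" % (seed,))


class ToyEngine:
    """Hypothetical engine whose training consumes randomness."""

    def __init__(self, random_state=None):
        self.random_state = check_random_state(random_state)
        self.training_order = None

    def fit(self, utterances):
        # Stand-in for training: the result depends on the random draws.
        order = list(utterances)
        self.random_state.shuffle(order)
        self.training_order = order
        return self


seeded_a = ToyEngine(random_state=42).fit(range(20)).training_order
seeded_b = ToyEngine(random_state=42).fit(range(20)).training_order
print(seeded_a == seeded_b)  # True: the same seed reproduces the training
```

Leaving `random_state` at its default of `None` keeps the non-deterministic behaviour, so two unseeded engines are not guaranteed to agree.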

@ClemDoum ClemDoum force-pushed the task/improve-random-seed branch from ae0633e to 45f4fd4 Compare May 20, 2019 14:03
@ClemDoum ClemDoum requested a review from adrienball May 20, 2019 14:05
3 participants