Sync vocab in vectors and components sourced in configs #9335

@adrianeboyd adrianeboyd commented Sep 30, 2021

Description

Since a component may reference anything in the vocab, share the full vocab when loading source components and vectors (which will include `strings` as of #8909).
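
As a rough illustration of what sharing the vocab means, here is a minimal toy sketch. `ToyVocab`, `ToyComponent` and `load_source` are hypothetical stand-ins invented for this example and are not spaCy classes or functions; the sketch only shows the difference between loading a source into its own vocab and loading it into the target's vocab.

```python
class ToyVocab:
    """Hypothetical stand-in for a vocab: a string set plus a vectors table."""

    def __init__(self):
        self.strings = set()
        self.vectors = {}


class ToyComponent:
    """Hypothetical component that registers its labels in the vocab it is given."""

    def __init__(self, vocab, labels):
        self.vocab = vocab
        self.labels = list(labels)
        for label in self.labels:
            vocab.strings.add(label)


def load_source(labels, vocab=None):
    """Load a toy source component, optionally into an existing vocab."""
    return ToyComponent(vocab if vocab is not None else ToyVocab(), labels)


target_vocab = ToyVocab()

# Loaded on its own: the component's strings never reach the target vocab.
isolated = load_source(["PERSON", "ORG"])
print(isolated.vocab is target_vocab)  # False

# Loaded with the target's vocab: one shared object, strings included.
shared = load_source(["PERSON", "ORG"], vocab=target_vocab)
print(shared.vocab is target_vocab)    # True
print(sorted(target_vocab.strings))    # ['ORG', 'PERSON']
```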

When loading a source component from a config, save the vocab state before loading the source pipelines and restore it afterwards, in particular to preserve the original state without vectors, since `[initialize.vectors] = null` skips rather than resets the vectors.
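
The save-and-restore idea can be sketched with a small context manager. This is a toy model under stated assumptions, not the code in this PR: `ToyVocab`, `preserve_vectors` and `load_source_pipeline` are hypothetical names, and the choice to restore only the vectors while keeping newly added strings is an assumption made for the illustration.

```python
import copy
from contextlib import contextmanager


class ToyVocab:
    """Hypothetical stand-in for a vocab: a string set plus a vectors table."""

    def __init__(self):
        self.strings = set()
        self.vectors = {}


@contextmanager
def preserve_vectors(vocab):
    """Snapshot vocab.vectors on entry and put the snapshot back on exit."""
    saved = copy.deepcopy(vocab.vectors)
    try:
        yield vocab
    finally:
        # Assumption for this sketch: strings added meanwhile are kept.
        vocab.vectors = saved


def load_source_pipeline(vocab):
    """Hypothetical loader: sourcing adds both strings and vectors to the vocab."""
    vocab.strings.update({"PERSON", "ORG"})
    vocab.vectors["apple"] = [0.1, 0.2]


vocab = ToyVocab()
with preserve_vectors(vocab):
    load_source_pipeline(vocab)

print(vocab.vectors)          # {}
print(sorted(vocab.strings))  # ['ORG', 'PERSON']
```

Restoring explicitly matters here because a later step that merely skips vector loading would otherwise leave the source pipeline's vectors in place.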

The vocab references are not synced for components loaded with `Language.add_pipe(source=)` because the pipelines are already loaded and not necessarily with the same vocab. A warning could be added in `Language.create_pipe_from_source` that it may be necessary to save and reload before training, but it's a rare enough case that this kind of warning may be too noisy overall.
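
The unsynced case can be pictured with another toy model. Everything below is hypothetical (`ToyPipeline`, its `add_pipe`, and `rebuilt` are invented for this sketch and do not reproduce spaCy's behaviour in detail); `rebuilt` stands in for saving a pipeline and loading it again, on the assumption that a reload constructs every component against the new pipeline's vocab.

```python
class ToyVocab:
    """Hypothetical stand-in for a pipeline's vocab."""


class ToyComponent:
    """Hypothetical component that keeps a reference to one vocab."""

    def __init__(self, vocab):
        self.vocab = vocab


class ToyPipeline:
    """Hypothetical pipeline: one vocab plus named components."""

    def __init__(self):
        self.vocab = ToyVocab()
        self.components = {}

    def add_pipe(self, name, source=None):
        if source is None:
            # New component, built against this pipeline's vocab.
            self.components[name] = ToyComponent(self.vocab)
        else:
            # Already-loaded component, reused with whatever vocab it has.
            self.components[name] = source.components[name]
        return self.components[name]

    def rebuilt(self):
        """Stand-in for save and reload: rebuild all components on a fresh vocab."""
        fresh = ToyPipeline()
        for name in self.components:
            fresh.add_pipe(name)
        return fresh


source_nlp = ToyPipeline()
source_nlp.add_pipe("ner")

nlp = ToyPipeline()
ner = nlp.add_pipe("ner", source=source_nlp)
print(ner.vocab is nlp.vocab)         # False
print(ner.vocab is source_nlp.vocab)  # True

reloaded = nlp.rebuilt()
print(reloaded.components["ner"].vocab is reloaded.vocab)  # True
```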

Types of change

Bug fix.

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@adrianeboyd adrianeboyd added the bug (Bugs and behaviour differing from documentation) and feat / pipeline (Feature: Processing pipeline and components) labels Sep 30, 2021
@adrianeboyd adrianeboyd force-pushed the bugfix/strings-in-sourced-components branch 3 times, most recently from 48f1413 to c808c1f Compare September 30, 2021 12:12
@adrianeboyd adrianeboyd marked this pull request as draft September 30, 2021 17:19
@adrianeboyd adrianeboyd force-pushed the bugfix/strings-in-sourced-components branch from b9e77b5 to 9776ffd Compare October 1, 2021 12:16
@adrianeboyd adrianeboyd changed the title Sync string store in components sourced in configs Sync vocab in vectors and components sourced in configs Oct 1, 2021
@adrianeboyd adrianeboyd marked this pull request as ready for review October 1, 2021 17:25
@adrianeboyd adrianeboyd closed this Oct 2, 2021
@adrianeboyd adrianeboyd reopened this Oct 2, 2021
@adrianeboyd adrianeboyd merged commit 4192e71 into explosion:master Oct 4, 2021