runner.SimpleSampleDatasetsProvider

View source on GitHub

Builds a sampling tf.data.Dataset from multiple filenames.

Inherits From: DatasetProvider

runner.SimpleSampleDatasetsProvider(
    principal_file_pattern: Optional[str] = None,
    extra_file_patterns: Optional[Sequence[str]] = None,
    principal_weight: Optional[float] = None,
    extra_weights: Optional[Sequence[float]] = None,
    *,
    principal_filenames: Optional[Sequence[str]] = None,
    extra_filenames: Optional[Sequence[Sequence[str]]] = None,
    principal_cardinality: Optional[int] = None,
    fixed_cardinality: bool = False,
    shuffle_filenames: bool = False,
    interleave_fn: Callable[..., tf.data.Dataset],
    examples_shuffle_size: Optional[int] = None
)

For complete explanations regarding sampling, see _process_sampled_dataset().

This SimpleSampleDatasetsProvider builds a tf.data.Dataset as follows:

  • The object is initialized with lists of filenames specified by the principal_filenames and extra_filenames arguments. For convenience, the corresponding file patterns principal_file_pattern and extra_file_patterns can be specified instead; each is expanded into a sorted list of filenames.
  • The filenames are sharded between replicas according to the InputContext (order matters).
  • Filenames are shuffled per replica (if requested).
  • Examples from all file patterns are sampled according to principal_weight and extra_weights.
  • The files in each shard are interleaved after being read by the interleave_fn.
  • Examples are shuffled (if requested), auto-prefetched, and returned for use in one replica of the trainer.
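For concreteness, here is a minimal sketch of constructing a provider over TFRecord shards. The file patterns, weights, and shuffle size are hypothetical placeholders; tf.data.TFRecordDataset is one common choice of interleave_fn.

import tensorflow as tf
from tensorflow_gnn import runner

provider = runner.SimpleSampleDatasetsProvider(
    # Hypothetical shard patterns; expanded by tf.io.gfile.glob and sorted.
    principal_file_pattern="/data/positives/examples-*.tfrecord",
    extra_file_patterns=["/data/negatives/examples-*.tfrecord"],
    # Weights must be given together: principal_weight iff extra_weights.
    principal_weight=0.5,
    extra_weights=[0.5],
    # Each filename is read as a stream of records and interleaved.
    interleave_fn=tf.data.TFRecordDataset,
    examples_shuffle_size=10_000,
)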

Args

principal_file_pattern A principal file pattern for sampling, to be expanded by tf.io.gfile.glob and sorted into the list of principal_filenames.
extra_file_patterns File patterns, to be expanded by tf.io.gfile.glob and sorted into the list of extra_filenames.
principal_weight An optional weight for the dataset corresponding to principal_file_pattern. Required iff extra_weights are also provided.
extra_weights Optional weights corresponding to extra_file_patterns for sampling. Required iff principal_weight is also provided.
principal_filenames A list of principal filenames, specified explicitly. This argument is mutually exclusive with principal_file_pattern.
extra_filenames A list of extra filenames, specified explicitly. This argument is mutually exclusive with extra_file_patterns.
principal_cardinality Iff fixed_cardinality=True, the size of the returned dataset is computed as principal_cardinality // principal_weight (with a default of uniform weights).
fixed_cardinality Whether to take a fixed number of elements.
shuffle_filenames If enabled, filenames will be shuffled after sharding between replicas, before any file reads. Through interleaving, some files may be read in parallel: the details are auto-tuned for throughput.
interleave_fn A fn applied with tf.data.Dataset.interleave.
examples_shuffle_size An optional buffer size for example shuffling. If specified, the size is adjusted to examples_shuffle_size // (len(extra_file_patterns) + 1).
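Alternatively, filenames may be passed explicitly instead of as glob patterns (the two forms are mutually exclusive). A hedged sketch with hypothetical paths; note that extra_filenames is a sequence of filename lists, one inner list per extra dataset:

import tensorflow as tf
from tensorflow_gnn import runner

provider = runner.SimpleSampleDatasetsProvider(
    principal_filenames=[
        "/data/positives/part-00000",
        "/data/positives/part-00001",
    ],
    # One inner list of filenames per extra dataset.
    extra_filenames=[["/data/negatives/part-00000"]],
    interleave_fn=tf.data.TFRecordDataset,
)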

Methods

get_dataset

View source

get_dataset(
    context: tf.distribute.InputContext
) -> tf.data.Dataset

Creates a tf.data.Dataset by sampling.

The contents of the resulting tf.data.Dataset are sampled from several sources, each stored as a sharded dataset:

  • one principal input, whose size determines the size of the resulting tf.data.Dataset;
  • zero or more side inputs, which are repeated if necessary to preserve the requested sampling weights.

Each input dataset is sharded before interleaving. The result of interleaving is only shuffled if an examples_shuffle_size is provided.

Datasets are sampled with tf.data.Dataset.sample_from_datasets. For sampling details, please refer to the TensorFlow documentation at: https://www.tensorflow.org/api_docs/python/tf/data/Dataset#sample_from_datasets.
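For intuition, the underlying primitive behaves as in this small, self-contained sketch:

import tensorflow as tf

ds_a = tf.data.Dataset.from_tensor_slices([0, 1, 2])
ds_b = tf.data.Dataset.from_tensor_slices([100, 101, 102])
# Elements are drawn at random from the inputs according to the weights.
mixed = tf.data.Dataset.sample_from_datasets([ds_a, ds_b], weights=[0.5, 0.5])
print(list(mixed.as_numpy_iterator()))  # e.g. [0, 100, 101, 1, 2, 102]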

Two methods are supported to determine the end of the resulting tf.data.Dataset:

  1. fixed_cardinality=True: Returns a dataset with a fixed cardinality, set at principal_cardinality // principal_weight. principal_dataset and principal_cardinality are required for this method. principal_weight is required iff extra_weights are also provided.

  2. fixed_cardinality=False: Returns a dataset that ends after the principal input has been exhausted, subject to the random selection of samples. principal_dataset is required for this method. principal_weight is required iff extra_weights are also provided.
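A hedged sketch of the fixed-cardinality mode, with hypothetical paths and numbers: with principal_cardinality=100_000 and principal_weight=0.25, the returned dataset would contain 100_000 // 0.25 = 400,000 elements.

import tensorflow as tf
from tensorflow_gnn import runner

provider = runner.SimpleSampleDatasetsProvider(
    principal_file_pattern="/data/positives/examples-*.tfrecord",
    extra_file_patterns=["/data/negatives/examples-*.tfrecord"],
    principal_weight=0.25,
    extra_weights=[0.75],
    # Number of examples in the principal input, counted ahead of time;
    # the returned dataset has principal_cardinality // principal_weight
    # elements (400_000 here).
    principal_cardinality=100_000,
    fixed_cardinality=True,
    interleave_fn=tf.data.TFRecordDataset,
)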

The choice of principal_dataset is important and should, in most cases, be the largest of the underlying datasets as compared to extra_datasets. Consider datasets positives and negatives where len(negatives) >> len(positives): with positives as the principal_dataset, the desired behavior of epochs determined by the exhaustion of positives and the continued mixing of unique elements from negatives may not occur. On reiteration of the sampled dataset, positives will again be exhausted, but the elements drawn from negatives may be the same ones seen in the previous epoch (as they occur at the beginning of the same, reiterated underlying negatives dataset). In this case, the recommendations are to (see the sketch after this list):

  1. Reformulate the sampling in terms of the larger dataset (negatives): use fixed_cardinality=False if the exhaustion of negatives is desired, or fixed_cardinality=True when principal_cardinality can be used to specify the desired number of elements from negatives.
  2. Ensure that the underlying principal_dataset of negatives is well-sharded. In this way, the nondeterminism of interleaving will randomly access elements of negatives on reiteration.
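A hedged sketch of recommendation 1, swapping the roles so that the much larger negatives drive the epoch boundary (paths and weights are hypothetical):

import tensorflow as tf
from tensorflow_gnn import runner

provider = runner.SimpleSampleDatasetsProvider(
    # The larger dataset becomes the principal input...
    principal_file_pattern="/data/negatives/examples-*.tfrecord",
    # ...and the smaller positives are repeated as needed.
    extra_file_patterns=["/data/positives/examples-*.tfrecord"],
    principal_weight=0.75,
    extra_weights=[0.25],
    interleave_fn=tf.data.TFRecordDataset,
)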
Args

context A tf.distribute.InputContext for sharding.

Returns

A tf.data.Dataset.
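A usage sketch, reusing the provider constructed above: outside a distribution strategy, a default tf.distribute.InputContext() can be passed directly; under a strategy, distribute_datasets_from_function supplies a per-replica context.

import tensorflow as tf

# Single-replica use: the default InputContext implies no sharding.
dataset = provider.get_dataset(tf.distribute.InputContext())

# Distributed use: the strategy calls provider.get_dataset once per
# replica with an appropriately sharded InputContext.
strategy = tf.distribute.MirroredStrategy()
dist_dataset = strategy.distribute_datasets_from_function(provider.get_dataset)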