runner.SimpleSampleDatasetsProvider

View source on GitHub

Builds a sampling tf.data.Dataset from multiple filenames.

Inherits From: DatasetProvider

runner.SimpleSampleDatasetsProvider(
    principal_file_pattern: Optional[str] = None,
    extra_file_patterns: Optional[Sequence[str]] = None,
    principal_weight: Optional[float] = None,
    extra_weights: Optional[Sequence[float]] = None,
    *,
    principal_filenames: Optional[Sequence[str]] = None,
    extra_filenames: Optional[Sequence[Sequence[str]]] = None,
    principal_cardinality: Optional[int] = None,
    fixed_cardinality: bool = False,
    shuffle_filenames: bool = False,
    interleave_fn: Callable[..., tf.data.Dataset],
    examples_shuffle_size: Optional[int] = None
)

For complete explanations regarding sampling, see _process_sampled_dataset().

This SimpleSampleDatasetsProvider builds a tf.data.Dataset as follows:

  • The object is initialized with lists of filenames specified by the principal_filenames and extra_filenames arguments. For convenience, the corresponding file patterns principal_file_pattern and extra_file_patterns can be specified instead; each is expanded into a sorted list of filenames.
  • The filenames are sharded between replicas according to the InputContext (order matters).
  • Filenames are shuffled per replica (if requested).
  • Examples from all file patterns are sampled according to principal_weight and extra_weights.
  • The files in each shard are interleaved after being read by the interleave_fn.
  • Examples are shuffled (if requested), auto-prefetched, and returned for use in one replica of the trainer.
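For concreteness, here is a minimal sketch of constructing a provider over TFRecord shards. The file patterns, weights, and shuffle size are hypothetical placeholders; tf.data.TFRecordDataset is one common choice of interleave_fn.

import tensorflow as tf
from tensorflow_gnn import runner

provider = runner.SimpleSampleDatasetsProvider(
    # Hypothetical shard patterns; expanded by tf.io.gfile.glob and sorted.
    principal_file_pattern="/data/positives/examples-*.tfrecord",
    extra_file_patterns=["/data/negatives/examples-*.tfrecord"],
    # Weights must be given together: principal_weight iff extra_weights.
    principal_weight=0.5,
    extra_weights=[0.5],
    # Each filename is read as a stream of records and interleaved.
    interleave_fn=tf.data.TFRecordDataset,
    examples_shuffle_size=10_000,
)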

Args

principal_file_pattern A principal file pattern for sampling, to be expanded by tf.io.gfile.glob and sorted into the list of principal_filenames.
extra_file_patterns File patterns, to be expanded by tf.io.gfile.glob and sorted into the list of extra_filenames.
principal_weight An optional weight for the dataset corresponding to principal_file_pattern. Required iff extra_weights are also provided.
extra_weights Optional weights corresponding to extra_file_patterns for sampling. Required iff principal_weight is also provided.
principal_filenames A list of principal filenames, specified explicitly. This argument is mutually exclusive with principal_file_pattern.
extra_filenames A list of extra filenames, specified explicitly. This argument is mutually exclusive with extra_file_patterns.
principal_cardinality Iff fixed_cardinality=True, the size of the returned dataset is computed as principal_cardinality // principal_weight (with a default of uniform weights).
fixed_cardinality Whether to take a fixed number of elements.
shuffle_filenames If enabled, filenames will be shuffled after sharding between replicas, before any file reads. Through interleaving, some files may be read in parallel: the details are auto-tuned for throughput.
interleave_fn A fn applied with tf.data.Dataset.interleave.
examples_shuffle_size An optional buffer size for example shuffling. If specified, the size is adjusted to examples_shuffle_size // (len(extra_file_patterns) + 1).
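Alternatively, filenames may be passed explicitly instead of as glob patterns (the two forms are mutually exclusive). A hedged sketch with hypothetical paths; note that extra_filenames is a sequence of filename lists, one inner list per extra dataset:

import tensorflow as tf
from tensorflow_gnn import runner

provider = runner.SimpleSampleDatasetsProvider(
    principal_filenames=[
        "/data/positives/part-00000",
        "/data/positives/part-00001",
    ],
    # One inner list of filenames per extra dataset.
    extra_filenames=[["/data/negatives/part-00000"]],
    interleave_fn=tf.data.TFRecordDataset,
)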

Methods

get_dataset

View source

get_dataset(
    context: tf.distribute.InputContext
) -> tf.data.Dataset

Creates a tf.data.Dataset by sampling.

The contents of the resulting tf.data.Dataset are sampled from several sources, each stored as a sharded dataset:

  • one principal input, whose size determines the size of the resulting tf.data.Dataset;
  • zero or more side inputs, which are repeated if necessary to preserve the requested sampling weights.

Each input dataset is sharded before interleaving. The result of interleaving is only shuffled if an examples_shuffle_size is provided.

Datasets are sampled with tf.data.Dataset.sample_from_datasets. For sampling details, please refer to the TensorFlow documentation at: https://www.tensorflow.org/api_docs/python/tf/data/Dataset#sample_from_datasets.
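For intuition, the underlying primitive behaves as in this small, self-contained sketch:

import tensorflow as tf

ds_a = tf.data.Dataset.from_tensor_slices([0, 1, 2])
ds_b = tf.data.Dataset.from_tensor_slices([100, 101, 102])
# Elements are drawn at random from the inputs according to the weights.
mixed = tf.data.Dataset.sample_from_datasets([ds_a, ds_b], weights=[0.5, 0.5])
print(list(mixed.as_numpy_iterator()))  # e.g. [0, 100, 101, 1, 2, 102]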

Two methods are supported to determine the end of the resulting tf.data.Dataset:

  1. fixed_cardinality=True: Returns a dataset with a fixed cardinality, set at principal_cardinality // principal_weight. principal_dataset and principal_cardinality are required for this method. principal_weight is required iff extra_weights are also provided.

  2. fixed_cardinality=False: Returns a dataset that ends after the principal input has been exhausted, subject to the random selection of samples. principal_dataset is required for this method. principal_weight is required iff extra_weights are also provided.
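A hedged sketch of the fixed-cardinality mode, with hypothetical paths and numbers: with principal_cardinality=100_000 and principal_weight=0.25, the returned dataset would contain 100_000 // 0.25 = 400,000 elements.

import tensorflow as tf
from tensorflow_gnn import runner

provider = runner.SimpleSampleDatasetsProvider(
    principal_file_pattern="/data/positives/examples-*.tfrecord",
    extra_file_patterns=["/data/negatives/examples-*.tfrecord"],
    principal_weight=0.25,
    extra_weights=[0.75],
    # Number of examples in the principal input, counted ahead of time;
    # the returned dataset has principal_cardinality // principal_weight
    # elements (400_000 here).
    principal_cardinality=100_000,
    fixed_cardinality=True,
    interleave_fn=tf.data.TFRecordDataset,
)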

The choice of principal_dataset is important and should, in most cases, be the largest of the underlying datasets as compared to extra_datasets. Consider datasets positives and negatives where len(negatives) >> len(positives): with positives as the principal_dataset, the desired behavior of epochs determined by the exhaustion of positives and the continued mixing of unique elements from negatives may not occur. On reiteration of the sampled dataset, positives will again be exhausted, but the elements drawn from negatives may be the same ones seen in the previous epoch (as they occur at the beginning of the same, reiterated underlying negatives dataset). In this case, the recommendations are to (see the sketch after this list):

  1. Reformulate the sampling in terms of the larger dataset (negatives): use fixed_cardinality=False if the exhaustion of negatives is desired, or fixed_cardinality=True when principal_cardinality can be used to specify the desired number of elements from negatives.
  2. Ensure that the underlying principal_dataset of negatives is well-sharded. In this way, the nondeterminism of interleaving will randomly access elements of negatives on reiteration.
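A hedged sketch of recommendation 1, swapping the roles so that the much larger negatives drive the epoch boundary (paths and weights are hypothetical):

import tensorflow as tf
from tensorflow_gnn import runner

provider = runner.SimpleSampleDatasetsProvider(
    # The larger dataset becomes the principal input...
    principal_file_pattern="/data/negatives/examples-*.tfrecord",
    # ...and the smaller positives are repeated as needed.
    extra_file_patterns=["/data/positives/examples-*.tfrecord"],
    principal_weight=0.75,
    extra_weights=[0.25],
    interleave_fn=tf.data.TFRecordDataset,
)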
Args

context A tf.distribute.InputContext for sharding.

Returns

A tf.data.Dataset.
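A usage sketch, reusing the provider constructed above: outside a distribution strategy, a default tf.distribute.InputContext() can be passed directly; under a strategy, distribute_datasets_from_function supplies a per-replica context.

import tensorflow as tf

# Single-replica use: the default InputContext implies no sharding.
dataset = provider.get_dataset(tf.distribute.InputContext())

# Distributed use: the strategy calls provider.get_dataset once per
# replica with an appropriately sharded InputContext.
strategy = tf.distribute.MirroredStrategy()
dist_dataset = strategy.distribute_datasets_from_function(provider.get_dataset)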