Builds a sampling tf.data.Dataset
from multiple filenames.
Inherits From: DatasetProvider
runner.SimpleSampleDatasetsProvider(
principal_file_pattern: Optional[str] = None,
extra_file_patterns: Optional[Sequence[str]] = None,
principal_weight: Optional[float] = None,
extra_weights: Optional[Sequence[float]] = None,
*,
principal_filenames: Optional[Sequence[str]] = None,
extra_filenames: Optional[Sequence[Sequence[str]]] = None,
principal_cardinality: Optional[int] = None,
fixed_cardinality: bool = False,
shuffle_filenames: bool = False,
interleave_fn: Callable[..., tf.data.Dataset],
examples_shuffle_size: Optional[int] = None
)
For complete explanations regarding sampling see _process_sampled_dataset().
This SimpleSampleDatasetsProvider builds a tf.data.Dataset as follows:
- The object is initialized with lists of filenames specified by the principal_filenames and extra_filenames arguments. For convenience, the corresponding file patterns principal_file_pattern and extra_file_patterns can be specified instead, which will be expanded to sorted lists.
- The filenames are sharded between replicas according to the InputContext (order matters).
- Filenames are shuffled per replica (if requested).
- Examples from all file patterns are sampled according to principal_weight and extra_weights.
- The files in each shard are interleaved after being read by the interleave_fn.
- Examples are shuffled (if requested), auto-prefetched, and returned for use in one replica of the trainer.
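The sharding, filename shuffling, interleaving, and example shuffling steps above can be sketched with plain tf.data calls for a single input. This is an illustration under stated assumptions, not the provider's actual implementation: the filenames are made up, toy_interleave_fn is a hypothetical stand-in for a real reader such as tf.data.TFRecordDataset, and sampling across several inputs is left out.

```python
import tensorflow as tf

# Hypothetical filenames; a real run would pass paths to existing files.
filenames = sorted(f"data-{i:05d}-of-00004" for i in range(4))

# Assumed setup: two input pipelines, of which this is pipeline 0.
context = tf.distribute.InputContext(
    num_input_pipelines=2, input_pipeline_id=0, num_replicas_in_sync=2)


def toy_interleave_fn(filename):
    # Stand-in for a file reader: emits the filename three times
    # instead of reading records from disk.
    return tf.data.Dataset.from_tensors(filename).repeat(3)


ds = tf.data.Dataset.from_tensor_slices(filenames)
# Shard by position in the sorted list, which is why order matters.
ds = ds.shard(context.num_input_pipelines, context.input_pipeline_id)
ds = ds.shuffle(len(filenames))    # analogous to shuffle_filenames=True
ds = ds.interleave(toy_interleave_fn)
ds = ds.shuffle(8)                 # analogous to examples_shuffle_size=8
ds = ds.prefetch(tf.data.AUTOTUNE)

examples = sorted(x.decode() for x in ds.as_numpy_iterator())
```

With two pipelines, pipeline 0 keeps every second filename of the sorted list, so a different ordering of the same files would give each replica a different shard.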
get_dataset(
context: tf.distribute.InputContext
) -> tf.data.Dataset
Creates a tf.data.Dataset
by sampling.
The contents of the resulting tf.data.Dataset are sampled from several sources, each stored as a sharded dataset:
- one principal input, whose size determines the size of the resulting tf.data.Dataset;
- zero or more side inputs, which are repeated if necessary to preserve the requested sampling weights.
Each input dataset is sharded before interleaving. The result of interleaving is only shuffled if an examples_shuffle_size is provided.
Datasets are sampled with tf.data.Dataset.sample_from_datasets. For sampling details, please refer to the TensorFlow documentation at:
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#sample_from_datasets.
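As a rough illustration of weighted sampling from a finite principal input and a repeated side input, the sketch below calls tf.data.Dataset.sample_from_datasets directly. The datasets, weights, and seed are made up, and the use of stop_on_empty_dataset=True is an assumption about how "ends when the principal input is exhausted" could be realized, not a statement of what the provider does internally.

```python
import tensorflow as tf

# Stand-in inputs: a finite principal input and a small side input that
# is repeated so it cannot run out before the principal one.
principal = tf.data.Dataset.range(0, 100)
extra = tf.data.Dataset.range(1000, 1010).repeat()

sampled = tf.data.Dataset.sample_from_datasets(
    [principal, extra],
    weights=[0.8, 0.2],          # hypothetical principal and extra weights
    seed=42,
    stop_on_empty_dataset=True,  # stop once an exhausted input is selected
)

values = [int(v) for v in sampled.as_numpy_iterator()]
from_principal = [v for v in values if v < 100]
from_extra = [v for v in values if v >= 1000]
```

Because only the principal input is finite, iteration stops when it is selected after its last element, so every principal element appears exactly once while the number of side-input elements varies with the random draws.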
Two methods are supported to determine the end of the resulting tf.data.Dataset:
- fixed_cardinality=True: Returns a dataset with a fixed cardinality, set at principal_cardinality // principal_weight. principal_dataset and principal_cardinality are required for this method. principal_weight is required iff extra_weights are also provided.
- fixed_cardinality=False: Returns a dataset that ends after the principal input has been exhausted, subject to the random selection of samples. principal_dataset is required for this method. principal_weight is required iff extra_weights are also provided.
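The fixed-cardinality formula is easiest to read with numbers. The values below are hypothetical, and the weight is chosen to be exactly representable in binary floating point so that the floor division behaves as expected.

```python
# Hypothetical values, for illustration only.
principal_cardinality = 900
principal_weight = 0.75
extra_weights = [0.25]

# Size of the sampled dataset when fixed_cardinality=True.
total = int(principal_cardinality // principal_weight)

# On average, the remaining elements come from the extra inputs.
expected_extra = total - principal_cardinality

print(total, expected_extra)  # 1200 300
```

Under this reading, roughly three quarters of the 1200 sampled elements come from the principal input, which matches its weight.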
The choice of principal_dataset is important: in most cases, it should be the largest underlying dataset as compared to extra_datasets. For example, given positives and negatives where len(negatives) >> len(positives) and with positives corresponding to principal_dataset, the desired behavior of epochs determined by the exhaustion of positives and the continued mixing of unique elements from negatives may not occur: on reiteration of the sampled dataset, positives will again be exhausted, but the elements from negatives may be the same ones seen in the previous epoch (as they occur at the beginning of the same, reiterated underlying negatives dataset). In this case, the recommendations are to:
1) Reformulate the sampling in terms of the larger dataset (negatives), either with fixed_cardinality=False, if the exhaustion of negatives is desired, or with fixed_cardinality=True, when principal_cardinality can be used to specify the desired number of elements from negatives.
2) Ensure that the underlying principal_dataset of negatives is well-sharded. In this way, the nondeterminism of interleaving will randomly access elements of negatives on reiteration.
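The reiteration issue described above can be reproduced without TensorFlow. The following pure-Python sketch is a simplified model, not the tf.data implementation: sample_epoch is a hypothetical helper that restarts both inputs on every call and stops when positives is selected after being exhausted.

```python
import random

_END = object()  # sentinel marking an exhausted iterator


def sample_epoch(positives, negatives, positive_weight, rng):
    """Returns the negatives drawn during one pass over `positives`."""
    pos_it, neg_it = iter(positives), iter(negatives)
    seen_negatives = []
    while True:
        if rng.random() < positive_weight:
            if next(pos_it, _END) is _END:
                break  # the principal input is exhausted: the epoch ends
        else:
            item = next(neg_it, _END)
            if item is _END:
                break  # not expected here; negatives is much larger
            seen_negatives.append(item)
    return seen_negatives


positives = [f"p{i}" for i in range(5)]
negatives = list(range(1000))
rng = random.Random(0)

epoch_1 = sample_epoch(positives, negatives, 0.1, rng)
epoch_2 = sample_epoch(positives, negatives, 0.1, rng)
overlap = set(epoch_1) & set(epoch_2)
```

Both epochs only ever see a prefix of negatives, so the shorter epoch's negatives are entirely contained in the longer one's, and the tail of negatives is never reached.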
Args | |
---|---|
context | A tf.distribute.InputContext for sharding. |
Returns | |
---|---|
A tf.data.Dataset.
|