beginner_source/quickstart/data_quickstart_tutorial.py
@@ -7,20 +7,22 @@
# Getting Started With Data in PyTorch
# -----------------
#
# Before we start building models with PyTorch, let's first learn how to load and process data. Data can be sourced from local files, cloud datastores, and database queries. It comes in all sorts of forms and formats, from structured tables to image, audio, text, and video files, and more.
# Different data types require different Python libraries to load and process them, such as `OpenCV <https://opencv.org/>`_ and `PIL <https://pillow.readthedocs.io/en/stable/reference/Image.html>`_ for images, `NLTK <https://www.nltk.org/>`_ and `spaCy <https://spacy.io/>`_ for text, and `Librosa <https://librosa.org/doc/latest/index.html>`_ for audio.
#
# If not properly organized, code for processing data samples can quickly get messy and become hard to maintain. Since different model architectures can be applied to many data types, we ideally want our dataset code to be decoupled from our model training code. To this end, PyTorch provides a simple Datasets interface for managing collections of data.
#
# A whole set of example datasets, such as Fashion MNIST, that implement this interface are built into PyTorch extension libraries. They are subclasses of *torch.utils.data.Dataset* with parameters and functions specific to the type of data and the particular dataset; the actual data samples can be downloaded from the internet. These are useful for benchmarking and testing your models before training on your own custom datasets.
#
# Once we have a Dataset we can index it manually like a list `clothing[index]`.
#
# Here is an example of how to load the `Fashion-MNIST <https://research.zalando.com/welcome/mission/research-projects/fashion-mnist/>`_ dataset from torchvision. Fashion-MNIST is a dataset of Zalando's article images, consisting of 60,000 training examples and 10,000 test examples. Each example comprises a 28×28 grayscale image and an associated label from one of 10 classes. Read more `here <https://pytorch.org/docs/stable/torchvision/datasets.html#fashion-mnist>`_.
# To load the FashionMNIST Dataset we need to provide the following three parameters, as shown in the sketch after this list:
# - root is the path where the train/test data is stored.
# - train specifies whether to load the training or the test split.
# - setting download to True downloads the data from the internet if it is not available at root.
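#
# Below is a minimal sketch of what that call might look like, assuming the
# standard torchvision.datasets API. The root path, the extra ToTensor
# transform (so each sample comes back as a tensor), and the variable name
# follow the prose above but are otherwise illustrative choices.

from torchvision import datasets
from torchvision.transforms import ToTensor

# Download the training split of Fashion-MNIST to ./data (if it is not
# already there) and convert each image to a tensor on access.
clothing = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor(),
)

# A Dataset can be indexed like a list; each Fashion-MNIST sample is an
# (image, label) pair.
image, label = clothing[0]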
# To write a custom Dataset, import os for file handling, torch for PyTorch, `pandas <https://pandas.pydata.org/>`_ for loading labels, `torchvision <https://pytorch.org/blog/pytorch-1.7-released/>`_ to read image files, and Dataset to implement the Dataset interface.
# The __init__ function runs once, when the Dataset object is first instantiated. In this case we use it to load our annotation labels into memory and to keep track of the directory containing our image files. Note that different types of data can take different __init__ inputs; you are not limited to an annotations file, a directory path, and transforms, but for images this is standard practice.
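#
# A sketch of what such an __init__ might look like. The class name
# ClothingDataset and the parameter names annotations_file, img_dir and
# transform are illustrative, not required by the Dataset interface.

import pandas as pd
from torch.utils.data import Dataset


class ClothingDataset(Dataset):
    def __init__(self, annotations_file, img_dir, transform=None):
        # Load the filename/label table into memory, and remember where the
        # image files live and which transform (if any) to apply to samples.
        self.img_labels = pd.read_csv(annotations_file)
        self.img_dir = img_dir
        self.transform = transform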
#
# A sample CSV annotations file may look as follows:
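#
# (The rows below are purely illustrative; the real file's names and labels
# will differ.)
#
#     image_file,label
#     img_0001.jpg,0
#     img_0002.jpg,3
#     img_0003.jpg,9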
#
# The __len__ function is very simple: here we just need to return the number of samples in our dataset.
#
# Example:
    def __len__(self):
        return len(self.img_labels)
@@ -140,46 +155,42 @@ def __len__(self):
# -----------------
#
# The __getitem__ function is the most important function in the Datasets interface. It takes a tensor or an index as input and returns the loaded sample from your dataset at the given indices.
#
# In this example, if we are given a tensor we convert it to a list containing our index. We then load the file at the given index from our image directory, as well as the image label from our pandas annotations DataFrame. The image and label are then wrapped in a single sample dictionary, which we can apply a Transform to and return. To learn more about Transforms, see the next section of the Blitz.
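#
# A sketch of how __getitem__ might be written for the hypothetical
# ClothingDataset above. It assumes ``import os``, ``import torch`` and
# ``from torchvision.io import read_image`` at the top of the file, and that
# the annotations CSV stores the image filename in its first column and the
# label in its second.

    def __getitem__(self, idx):
        # Accept either a plain index or a tensor of indices.
        if torch.is_tensor(idx):
            idx = idx.tolist()
        # Build the path to the image file and read it as a tensor.
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
        image = read_image(img_path)
        label = self.img_labels.iloc[idx, 1]
        # Wrap the image and label in a single sample dictionary.
        sample = {"image": image, "label": label}
        # Apply a Transform, if one was given, before returning the sample.
        if self.transform:
            sample = self.transform(sample)
        return sample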
# Preparing your data for training with DataLoaders
# -----------------
#
# Now we have an organized mechanism for managing data, which is great, but there is still a lot of manual work we would have to do to train a model with our Dataset.
#
# For example, we would have to manually maintain code for:
# * Batching
# * Shuffling
# * Parallel batch distribution
#
# The PyTorch DataLoader, *torch.utils.data.DataLoader*, is an iterable that handles all of this complexity for us, enabling us to load a dataset and focus on training our model.
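#
# A minimal sketch of wrapping a Dataset in a DataLoader. The batch size and
# worker count are illustrative; ``clothing`` is the Fashion-MNIST Dataset
# loaded earlier.

from torch.utils.data import DataLoader

# Batch 64 samples at a time, reshuffle the data every epoch, and load
# batches in parallel with 4 worker processes.
train_dataloader = DataLoader(clothing, batch_size=64, shuffle=True, num_workers=4)

# Grab the first batch: a [64, 1, 28, 28] tensor of images and a [64] tensor
# of class labels.
images, labels = next(iter(train_dataloader))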