changes to to_traintest_folders function (#109)
* changes to to_traintest_folders function

* flake8 changes
PatBall1 authored Jun 30, 2023
1 parent 524a5be commit 3cae008
Showing 3 changed files with 30 additions and 14 deletions.
25 changes: 17 additions & 8 deletions detectree2/preprocessing/tiling.py
@@ -439,18 +439,23 @@ def record_data(crowns,
f.close()


-def to_traintest_folders(tiles_folder: str = "./",
-                         out_folder: str = "./data/",
-                         test_frac: float = 0.2,
-                         folds: int = 1,
-                         seed: int = None) -> None:
-    """Send tiles to training (+validation) and test dir and automatically make sure no overlap.
+def to_traintest_folders(  # noqa: C901
+        tiles_folder: str = "./",
+        out_folder: str = "./data/",
+        test_frac: float = 0.2,
+        folds: int = 1,
+        strict: bool = False,
+        seed: int = None) -> None:
+    """Send tiles to training (+validation) and test dir
+    With "strict" it is possible to automatically ensure no overlap between train/val and test tiles.
     Args:
         tiles_folder: folder with tiles
         out_folder: folder to save train and test folders
         test_frac: fraction of tiles to be used for testing
         folds: number of folds to split the data into
+        strict: if True, training/validation files will be removed if there is any overlap with test files (inc buffer)
     Returns:
         None
@@ -482,14 +487,18 @@ def to_traintest_folders(tiles_folder: str = "./",

     for i in range(0, len(file_roots)):
         # copy to test
-        if i <= len(file_roots) * test_frac:
+        if i < len(file_roots) * test_frac:
             test_boxes.append(image_details(file_roots[num[i]]))
             shutil.copy((tiles_dir / file_roots[num[i]]).with_suffix(
                 Path(file_roots[num[i]]).suffix + ".geojson"), out_dir / "test")
         else:
             # copy to train
             train_box = image_details(file_roots[num[i]])
-            if not is_overlapping_box(test_boxes, train_box):
+            if strict:  # check if there is overlap with test boxes
+                if not is_overlapping_box(test_boxes, train_box):
+                    shutil.copy((tiles_dir / file_roots[num[i]]).with_suffix(
+                        Path(file_roots[num[i]]).suffix + ".geojson"), out_dir / "train")
+            else:
                 shutil.copy((tiles_dir / file_roots[num[i]]).with_suffix(
                     Path(file_roots[num[i]]).suffix + ".geojson"), out_dir / "train")
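
Note on the `<=` to `<` change: the comparison decides how many shuffled indices are routed to the test folder. The sketch below is not part of the commit; `count_test_tiles` is a hypothetical helper that reproduces only the index comparison from the loop, so the effect of the two operators can be compared directly. It suggests that the old condition always sent at least one tile to test, even with `test_frac=0`.

```python
def count_test_tiles(n_tiles: int, test_frac: float, inclusive: bool) -> int:
    """Count the loop indices that would be routed to the test folder.

    Hypothetical helper for illustration only; it mirrors the comparison
    `i <= n * test_frac` (old) or `i < n * test_frac` (new).
    """
    threshold = n_tiles * test_frac
    if inclusive:
        return sum(1 for i in range(n_tiles) if i <= threshold)
    return sum(1 for i in range(n_tiles) if i < threshold)


# 8 tiles with a quarter requested for testing
old_quarter = count_test_tiles(8, 0.25, inclusive=True)   # indices 0, 1, 2
new_quarter = count_test_tiles(8, 0.25, inclusive=False)  # indices 0, 1

# test_frac=0: the old comparison still matches index 0
old_zero = count_test_tiles(8, 0.0, inclusive=True)
new_zero = count_test_tiles(8, 0.0, inclusive=False)

print(old_quarter, new_quarter, old_zero, new_zero)
```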

19 changes: 13 additions & 6 deletions docs/source/tutorial.rst
@@ -19,9 +19,15 @@ To train a model you will need an orthomosaic (as ``<orthmosaic>.tif``) and
 corresponding tree crown polgons that are readable by Geopandas
 (e.g. ``<crowns_polygon>.gpkg``, ``<crowns_polygon>.shp``). For the best
 results, manual crowns should be supplied as dense clusters rather than
-sparsely scattered across in the landscape
+sparsely scattered across in the landscape. See below for an example of the
+required input crowns and image.

+.. image:: ../../report/figures/Danum_example_data.png
+   :width: 400
+   :alt: Example Danum training data
+   :align: center

+|
 If you would just like to make predictions on an orthomosaic with a pre-trained
 model from the ``model_garden``, skip to part 4 (Generating landscape predictions).

@@ -119,13 +125,14 @@ Send geojsons to train folder (with sub-folders for k-fold cross validation) and
 .. code-block:: python
     data_folder = out_dir # data_folder is the folder where the .png, .tif, .geojson tiles have been stored
-    to_traintest_folders(data_folder, out_dir, test_frac=0.15, folds=5)
+    to_traintest_folders(data_folder, out_dir, test_frac=0.15, strict=True, folds=5)
 .. note::
-    The ``to_traintest_folders`` function automatically removes training/validation geojsons that overlap with test
-    tiles, ensuring strict spatial separation of the test data. However, this can remove a significant proportion of the
-    data available to train on so if validation accuracy is a sufficient test of model performance ``test_frac`` can be
-    set to ``0``. Alternatively, just set a ``test_frac`` value that is smaller than you might otherwise have put.
+    If ``strict=True``, the ``to_traintest_folders`` function will automatically removes training/validation geojsons
+    that have any overlap with test tiles (including the buffers), ensuring strict spatial separation of the test data.
+    However, this can remove a significant proportion of the data available to train on so if validation accuracy is a
+    sufficient test of model performance ``test_frac`` can be set to ``0`` or set ``strict=False`` (which allows for
+    some overlap in the buffers between test and train/val tiles).


 The data has now been tiled and partitioned for model training, tuning and evaluation.
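
Note on the `strict` trade-off described in the tutorial change: the sketch below is an illustration under stated assumptions, not detectree2's implementation. `boxes_overlap` and `split_boxes` are hypothetical names, tiles are represented as plain `(xmin, ymin, xmax, ymax)` tuples, and the shuffling, buffering and file copying done by `to_traintest_folders` are left out. It shows only how a strict split can discard training tiles that a non-strict split keeps.

```python
from typing import List, Tuple

# Assumed box layout for this sketch: (xmin, ymin, xmax, ymax)
Box = Tuple[float, float, float, float]


def boxes_overlap(a: Box, b: Box) -> bool:
    """Return True if two axis-aligned boxes share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def split_boxes(boxes: List[Box], test_frac: float,
                strict: bool) -> Tuple[List[Box], List[Box]]:
    """Split boxes into (train, test); hypothetical stand-in for the real split."""
    threshold = len(boxes) * test_frac
    train: List[Box] = []
    test: List[Box] = []
    for i, box in enumerate(boxes):
        if i < threshold:
            test.append(box)
        elif strict and any(boxes_overlap(box, t) for t in test):
            continue  # strict mode drops training tiles that touch a test tile
        else:
            train.append(box)
    return train, test


# Four tiles in a row; the second one overlaps the first by 2 units
boxes = [(0, 0, 10, 10), (8, 0, 18, 10), (20, 0, 30, 10), (40, 0, 50, 10)]

train_strict, test_strict = split_boxes(boxes, test_frac=0.25, strict=True)
train_loose, test_loose = split_boxes(boxes, test_frac=0.25, strict=False)

print(len(train_strict), len(train_loose))
```

With these inputs the strict split keeps fewer training tiles than the non-strict one, which is the data loss the note warns about.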
Binary file added report/figures/Danum_example_data.png
