Add DETR #9998

Closed
wants to merge 20 commits into from
105 changes: 105 additions & 0 deletions docs/source/model_doc/detr.rst
@@ -0,0 +1,105 @@
..
    Copyright 2020 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

DETR
-----------------------------------------------------------------------------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The DETR model was proposed in `End-to-End Object Detection with Transformers
<https://arxiv.org/abs/2005.12872>`__ by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov
and Sergey Zagoruyko. DETR consists of a convolutional backbone followed by an encoder-decoder Transformer, which can be trained
end-to-end for object detection. It removes much of the complexity of models like Faster R-CNN and Mask R-CNN, which
rely on components such as region proposals, non-maximum suppression and anchor generation. Moreover, DETR can be naturally extended
to perform panoptic segmentation by adding a mask head on top of the decoder outputs.

The abstract from the paper is the following:

*We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the detection pipeline,
effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly
encode our prior knowledge about the task. The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based
global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of
learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions
in parallel. The new model is conceptually simple and does not require a specialized library, unlike many other modern detectors. DETR demonstrates
accuracy and run-time performance on par with the well-established and highly-optimized Faster RCNN baseline on the challenging COCO object detection
dataset. Moreover, DETR can be easily generalized to produce panoptic segmentation in a unified manner. We show that it significantly outperforms
competitive baselines.*

The original code can be found `here <https://github.com/facebookresearch/detr>`__.

Here's a TLDR explaining how the model works:

First, an image is sent through a pre-trained convolutional backbone (in the paper, the authors use ResNet-50/ResNet-101). Let's assume we also add a
batch dimension. This means that the input to the backbone is a tensor of shape :obj:`(1, 3, height, width)`, assuming the image has 3 color channels (RGB).
The CNN backbone outputs a new lower-resolution feature map, typically of shape :obj:`(1, 2048, height/32, width/32)`. This is then projected to match
the hidden dimension of the Transformer of DETR, which is :obj:`256` by default, using a :obj:`nn.Conv2d` layer. So now, we have a tensor of shape
:obj:`(1, 256, height/32, width/32)`. Next, the feature map is flattened and transposed to obtain a tensor of shape :obj:`(batch_size, seq_len, d_model)` =
:obj:`(1, width/32*height/32, 256)`. So a difference with NLP models is that the sequence length is typically longer than usual, but with a smaller
:obj:`d_model` (which in NLP is typically 768 or higher).
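
The shape bookkeeping above can be traced with a small sketch. It uses plain Python rather than the model itself; the helper name
``detr_shapes`` is hypothetical, and rounding the feature map size up (``ceil``) is an assumption about how the backbone handles image
sizes that are not multiples of 32.

```python
import math


def detr_shapes(height, width, batch_size=1, backbone_channels=2048, d_model=256, stride=32):
    """Trace tensor shapes from the input image to the sequence fed to the encoder."""
    # Assumption: the backbone rounds up, so each spatial size becomes ceil(size / stride).
    feat_h = math.ceil(height / stride)
    feat_w = math.ceil(width / stride)
    return {
        "image": (batch_size, 3, height, width),
        "backbone": (batch_size, backbone_channels, feat_h, feat_w),
        # The 1x1 convolution only changes the channel dimension.
        "projected": (batch_size, d_model, feat_h, feat_w),
        # Flatten the spatial grid and move channels last: (batch_size, seq_len, d_model).
        "sequence": (batch_size, feat_h * feat_w, d_model),
    }


shapes = detr_shapes(height=800, width=1066)
for name, shape in shapes.items():
    print(f"{name:>9}: {shape}")
```

For an 800 x 1066 image this gives a 25 x 34 feature map, i.e. a sequence of 850 positions with :obj:`d_model` = 256.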

Next, this is sent through the encoder, outputting :obj:`encoder_hidden_states` of the same shape (you can consider these as image features). Next, so-called
**object queries** are sent through the decoder. This is a tensor of shape :obj:`(batch_size, num_queries, d_model)`, with :obj:`num_queries` typically set
to 100, and it is initialized with zeros. Each object query looks for a particular object in the image. The decoder updates these object queries through
multiple self-attention and encoder-decoder attention layers to output :obj:`decoder_hidden_states` of the same shape: :obj:`(batch_size, num_queries, d_model)`.
Finally, two heads are added on top for object detection: a linear layer for classifying each object query into one of the object classes or "no object", and an MLP
to predict a bounding box for each query.
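
To illustrate how the classification head's output can be read, here is a minimal sketch in plain Python. It is not the model's
post-processing code: the helper ``keep_detections``, the score threshold, and placing "no object" at the last class index are all
assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def keep_detections(class_logits, threshold=0.5):
    """Keep the queries whose most likely class is a real object.

    class_logits holds one list of (num_classes + 1) scores per query.
    Assumption: the last index stands for "no object".
    """
    detections = []
    for query_idx, logits in enumerate(class_logits):
        probs = softmax(logits)
        no_object = len(probs) - 1
        label = max(range(len(probs)), key=probs.__getitem__)
        if label != no_object and probs[label] >= threshold:
            detections.append((query_idx, label, probs[label]))
    return detections


logits = [
    [4.0, 0.0, 0.0],  # query 0: confident about class 0
    [0.0, 0.0, 5.0],  # query 1: "no object"
    [0.0, 3.0, 0.0],  # query 2: confident about class 1
]
detections = keep_detections(logits)
for query_idx, label, score in detections:
    print(f"query {query_idx}: class {label} (score {score:.2f})")
```

Only queries 0 and 2 survive here; query 1 is dropped because its most likely class is "no object".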

The model is trained using a **bipartite matching loss**: the predicted classes and bounding boxes of each of the
N = 100 object queries are compared to the ground truth annotations, padded up to the same length N (so if an image only contains 4 objects, the remaining
96 annotations simply have "no object" as class and "no bounding box" as bounding box). The `Hungarian matching algorithm <https://en.wikipedia.org/wiki/Hungarian_algorithm>`__ is used to find an optimal
one-to-one mapping between the N queries and the N annotations. Next, standard cross-entropy (for the classes) and a linear combination of the L1 and
generalized IoU losses (for the bounding boxes) are used to optimize the parameters of the model.
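
The matching step can be sketched on a toy example. The code below is an illustration under simplifying assumptions, not the
implementation from this PR: it searches all permutations by brute force (only feasible for a handful of queries, whereas the Hungarian
algorithm finds the same optimum in polynomial time, e.g. via ``scipy.optimize.linear_sum_assignment``), and its cost uses only the
class probability and an L1 box distance. The names ``match_cost`` and ``bipartite_match`` are hypothetical.

```python
from itertools import permutations

NO_OBJECT = None  # padding target for queries without a ground truth object


def match_cost(pred, target):
    """Cost of assigning one prediction to one target slot (lower is better)."""
    if target is NO_OBJECT:
        return 0.0  # padded slots cost the same for every query, so they do not affect the matching
    class_probs, box = pred
    label, target_box = target
    l1 = sum(abs(p - t) for p, t in zip(box, target_box))
    return -class_probs[label] + l1


def bipartite_match(preds, targets):
    """Return {query index: target index or None} minimizing the total cost."""
    padded = list(targets) + [NO_OBJECT] * (len(preds) - len(targets))
    best_perm, best_cost = None, float("inf")
    # perm[t] is the query assigned to target slot t.
    for perm in permutations(range(len(preds))):
        cost = sum(match_cost(preds[q], padded[t]) for t, q in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return {q: (t if t < len(targets) else None) for t, q in enumerate(best_perm)}


# Three queries, two ground truth objects; each prediction is (class probabilities, box).
preds = [
    ([0.1, 0.9], (0.5, 0.5, 0.2, 0.2)),
    ([0.8, 0.2], (0.2, 0.2, 0.1, 0.1)),
    ([0.5, 0.5], (0.9, 0.9, 0.1, 0.1)),
]
targets = [
    (0, (0.2, 0.2, 0.1, 0.1)),
    (1, (0.5, 0.5, 0.2, 0.2)),
]
assignment = bipartite_match(preds, targets)
print(assignment)
```

Query 1 is matched to the first annotation, query 0 to the second, and query 2 to the padded "no object" slot. The losses are then
computed on these matched pairs.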

Tips:

- DETR uses so-called **object queries** to detect objects in an image. The number of queries determines the maximum number of objects that
  can be detected in a single image, and is set to 100 by default (see parameter :obj:`num_queries` of :class:`~transformers.DetrConfig`).
  Note that it's good to have some slack (in COCO, the authors used 100, while the maximum number of objects in a COCO image is ~70).
- The decoder of DETR updates the query embeddings in parallel. This is different from language models like GPT-2, which use autoregressive
  decoding. Hence, no causal attention mask is used.
- DETR adds position embeddings to the hidden states at each self-attention and cross-attention layer before projecting to queries and keys.
  For the position embeddings of the image, one can choose between fixed sinusoidal or learned absolute position embeddings. By default,
  the parameter :obj:`position_embedding_type` of :class:`~transformers.DetrConfig` is set to :obj:`"sine"`.
- During training, the authors of DETR found it helpful to use auxiliary losses in the decoder, especially to help the model output the correct
  number of objects of each class. If you set the parameter :obj:`auxiliary_loss` of :class:`~transformers.DetrConfig` to :obj:`True`, then prediction
  feedforward neural networks (FFNs) and Hungarian losses are added after each decoder layer (with the FFNs sharing parameters).
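
The difference between parallel and autoregressive decoding mentioned in the tips can be made concrete with a toy attention mask.
This is only a sketch: the helper ``attention_mask`` is hypothetical, and a value of 1 is taken to mean "this query position may attend
to that key position".

```python
def attention_mask(num_queries, causal):
    """Build a num_queries x num_queries mask; 1 means the row position may attend to the column position."""
    return [
        [1 if (not causal or key <= query) else 0 for key in range(num_queries)]
        for query in range(num_queries)
    ]


causal_mask = attention_mask(4, causal=True)     # autoregressive decoders such as GPT-2
parallel_mask = attention_mask(4, causal=False)  # parallel decoding: every query sees every other query

for row in causal_mask:
    print(row)
for row in parallel_mask:
    print(row)
```

With the causal mask each position only sees itself and earlier positions, whereas without it all object queries attend to each other
in a single pass.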

DetrConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DetrConfig
    :members:


DetrTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DetrTokenizer
    :members: __call__


DetrModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DetrModel
    :members: forward


DetrForObjectDetection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DetrForObjectDetection
    :members: forward



18 changes: 18 additions & 0 deletions src/transformers/__init__.py
@@ -134,6 +134,7 @@
"Wav2Vec2FeatureExtractor",
"Wav2Vec2Processor",
],
"models.detr": ["DETR_PRETRAINED_CONFIG_ARCHIVE_MAP", "DetrConfig", "DetrTokenizer"],
"models.convbert": ["CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "ConvBertConfig", "ConvBertTokenizer"],
"models.albert": ["ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "AlbertConfig"],
"models.auto": [
@@ -288,6 +289,7 @@
# tokenziers-backed objects
if is_tokenizers_available():
# Fast tokenizers
_import_structure["models.detr"].append("DetrTokenizerFast")
_import_structure["models.convbert"].append("ConvBertTokenizerFast")
_import_structure["models.albert"].append("AlbertTokenizerFast")
_import_structure["models.bart"].append("BartTokenizerFast")
@@ -376,6 +378,14 @@
_import_structure["modeling_utils"] = ["Conv1D", "PreTrainedModel", "apply_chunking_to_forward", "prune_layer"]
# PyTorch models structure

_import_structure["models.detr"].extend(
[
"DETR_PRETRAINED_MODEL_ARCHIVE_LIST",
"DetrForObjectDetection",
"DetrModel",
]
)

_import_structure["models.wav2vec2"].extend(
[
"WAV_2_VEC_2_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -1297,6 +1307,7 @@
load_tf2_weights_in_pytorch_model,
)
from .models.albert import ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, AlbertConfig
from .models.detr import DETR_PRETRAINED_CONFIG_ARCHIVE_MAP, DetrConfig, DetrTokenizer
from .models.auto import (
ALL_PRETRAINED_CONFIG_ARCHIVE_MAP,
CONFIG_MAPPING,
@@ -1452,6 +1463,7 @@
from .utils.dummy_sentencepiece_objects import *

if is_tokenizers_available():
from .models.detr import DetrTokenizerFast
from .models.albert import AlbertTokenizerFast
from .models.bart import BartTokenizerFast
from .models.barthez import BarthezTokenizerFast
@@ -1491,6 +1503,12 @@
# Modeling
if is_torch_available():

from .models.detr import (
DETR_PRETRAINED_MODEL_ARCHIVE_LIST,
DetrForObjectDetection,
DetrModel,
)

# Benchmarks
from .benchmark.benchmark import PyTorchBenchmark
from .benchmark.benchmark_args import PyTorchBenchmarkArguments
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -17,6 +17,7 @@
# limitations under the License.

from . import (
detr,
albert,
auto,
bart,
4 changes: 4 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -19,6 +19,7 @@

from ...configuration_utils import PretrainedConfig
from ..albert.configuration_albert import ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, AlbertConfig
from ..detr.configuration_detr import DETR_PRETRAINED_CONFIG_ARCHIVE_MAP, DetrConfig
from ..bart.configuration_bart import BART_PRETRAINED_CONFIG_ARCHIVE_MAP, BartConfig
from ..bert.configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BertConfig
from ..bert_generation.configuration_bert_generation import BertGenerationConfig
@@ -75,6 +76,7 @@
(key, value)
for pretrained_map in [
# Add archive maps here
DETR_PRETRAINED_CONFIG_ARCHIVE_MAP,
WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP,
CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
LED_PRETRAINED_CONFIG_ARCHIVE_MAP,
@@ -120,6 +122,7 @@
CONFIG_MAPPING = OrderedDict(
[
# Add configs here
("detr", DetrConfig),
("wav2vec2", Wav2Vec2Config),
("convbert", ConvBertConfig),
("led", LEDConfig),
@@ -171,6 +174,7 @@
MODEL_NAMES_MAPPING = OrderedDict(
[
# Add full (and cased) model names here
("detr", "Detr"),
("wav2vec2", "Wav2Vec2"),
("convbert", "ConvBERT"),
("led", "LED"),
12 changes: 12 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -23,6 +23,10 @@
from ...utils import logging

# Add modeling imports here
from ..detr.modeling_detr import (
DetrForObjectDetection,
DetrModel,
)
from ..albert.modeling_albert import (
AlbertForMaskedLM,
AlbertForMultipleChoice,
@@ -68,6 +72,10 @@
)

# Add modeling imports here
from ..detr.modeling_detr import (
DetrForObjectDetection,
DetrModel,
)
from ..convbert.modeling_convbert import (
ConvBertForMaskedLM,
ConvBertForMultipleChoice,
@@ -258,6 +266,7 @@
XLNetModel,
)
from .configuration_auto import (
DetrConfig,
AlbertConfig,
AutoConfig,
BartConfig,
@@ -313,6 +322,7 @@
MODEL_MAPPING = OrderedDict(
[
# Base model mapping
(DetrConfig, DetrModel),
(Wav2Vec2Config, Wav2Vec2Model),
(ConvBertConfig, ConvBertModel),
(LEDConfig, LEDModel),
@@ -396,6 +406,7 @@
MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(
[
# Model with LM heads mapping

(Wav2Vec2Config, Wav2Vec2ForMaskedLM),
(ConvBertConfig, ConvBertForMaskedLM),
(LEDConfig, LEDForConditionalGeneration),
@@ -495,6 +506,7 @@
MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING = OrderedDict(
[
# Model for Seq2Seq Causal LM mapping

(LEDConfig, LEDForConditionalGeneration),
(BlenderbotSmallConfig, BlenderbotSmallForConditionalGeneration),
(MT5Config, MT5ForConditionalGeneration),
71 changes: 71 additions & 0 deletions src/transformers/models/detr/__init__.py
@@ -0,0 +1,71 @@
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.

# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...file_utils import _BaseLazyModule, is_torch_available, is_tokenizers_available
_import_structure = {
    "configuration_detr": ["DETR_PRETRAINED_CONFIG_ARCHIVE_MAP", "DetrConfig"],
    "tokenization_detr": ["DetrTokenizer"],
}

if is_tokenizers_available():
    _import_structure["tokenization_detr_fast"] = ["DetrTokenizerFast"]

if is_torch_available():
    _import_structure["modeling_detr"] = [
        "DETR_PRETRAINED_MODEL_ARCHIVE_LIST",
        "DetrForObjectDetection",
        "DetrModel",
        "DetrPreTrainedModel",
    ]




if TYPE_CHECKING:
    from .configuration_detr import DETR_PRETRAINED_CONFIG_ARCHIVE_MAP, DetrConfig
    from .tokenization_detr import DetrTokenizer

    if is_tokenizers_available():
        from .tokenization_detr_fast import DetrTokenizerFast

    if is_torch_available():
        from .modeling_detr import (
            DETR_PRETRAINED_MODEL_ARCHIVE_LIST,
            DetrForObjectDetection,
            DetrModel,
            DetrPreTrainedModel,
        )


else:
    import importlib
    import os
    import sys

    class _LazyModule(_BaseLazyModule):
        """
        Module class that surfaces all objects but only performs associated imports when the objects are requested.
        """

        __file__ = globals()["__file__"]
        __path__ = [os.path.dirname(__file__)]

        def _get_module(self, module_name: str):
            return importlib.import_module("." + module_name, self.__name__)

    sys.modules[__name__] = _LazyModule(__name__, _import_structure)
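
The lazy-import pattern used in this file can be sketched in isolation. The class below is a hypothetical stand-in, not
``_BaseLazyModule`` itself: it assumes the same ``{module name: [object names]}`` structure, but resolves standard-library modules with
absolute imports so that the example runs on its own (the file above uses relative imports within the package instead).

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Hypothetical stand-in: objects are listed up front, but imported only on first access."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Invert {module: [objects]} into {object: module} for quick lookup.
        self._object_to_module = {
            obj: module for module, objects in import_structure.items() for obj in objects
        }
        self.loaded = []  # modules imported so far, kept only for illustration

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal lookup fails, so cached objects skip it.
        object_to_module = self.__dict__.get("_object_to_module", {})
        if name not in object_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {name!r}")
        module_name = object_to_module[name]
        module = importlib.import_module(module_name)
        self.loaded.append(module_name)
        value = getattr(module, name)
        setattr(self, name, value)  # cache so the import runs once per object
        return value


# Standard-library modules stand in for "configuration_detr", "modeling_detr", etc.
lazy = LazyModule("lazy_demo", {"json": ["dumps"], "math": ["sqrt"]})
print(lazy.loaded)   # nothing has been imported yet
print(lazy.sqrt(9))  # first access triggers the import of "math"
print(lazy.loaded)
```

The design choice is that importing the package stays cheap: a heavy submodule is only loaded once one of its objects is actually requested.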