(DETR) End-to-End Object Detection with Transformers
Title: End-to-end, Transformer-based object detection
Authors: Facebook AI
Date: May 28, 2020
Venue: ECCV 2020
⭐️ Abstract
We present a new method that views object detection as a direct set prediction problem.
Object detection is framed as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly encode our prior knowledge about the task.
Streamlines the detection pipeline and removes many hand-designed components: NMS and anchor generation.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.
The model's name: DETR. Main ingredients: a set-based global loss, bipartite matching (the Hungarian algorithm), and a Transformer encoder-decoder detection architecture.
Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel.
Given a fixed small set of learned object queries, DETR reasons about the relations between objects and the global image context, and outputs the final set of predictions in parallel.
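To make "learned object queries" concrete, here is a minimal sketch (in PyTorch, with illustrative dimensions; not the authors' code): the queries are a learned embedding table fed to the decoder as input slots, and all slots are decoded in a single parallel forward pass.

```python
import torch
import torch.nn as nn

num_queries, d_model = 100, 256                 # DETR uses 100 queries of width 256
queries = nn.Embedding(num_queries, d_model)    # the learned object queries

decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

memory = torch.randn(1500, 1, d_model)          # encoder output: flattened image features
tgt = queries.weight.unsqueeze(1)               # (num_queries, batch=1, d_model)

# All 100 object slots attend to the image features and to each other,
# and are decoded in one forward pass -- no autoregressive loop.
out = decoder(tgt, memory)                      # (100, 1, 256): one embedding per prediction
```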
The new model is conceptually simple and does not require a specialized library, unlike many other modern detectors.
DETR is conceptually simple and, unlike many other detectors, requires no specialized library.
(Result) DETR demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster R-CNN baseline on the challenging COCO object detection dataset.
DETR's accuracy and runtime performance both match the mature, highly optimized Faster R-CNN.
Benchmark dataset: COCO.
Moreover, DETR can be easily generalized to produce panoptic segmentation in a unified manner. We show that it significantly outperforms competitive baselines.
Task generalization: DETR extends readily to panoptic segmentation.
⭐️ Introduction
Paragraph 1: Modern detectors
The goal of object detection is to predict a set of bounding boxes and category labels for each object of interest.
Definition of the object detection task.
Modern detectors address this set prediction task in an indirect way, by defining surrogate regression and classification problems on a large set of proposals [37,5], anchors [23], or window centers [53,46].
Current detectors solve this set prediction task indirectly, via surrogate regression and classification problems on proposals, anchors, or window centers.
- Their performances are significantly influenced by postprocessing steps to collapse near-duplicate predictions, by the design of the anchor sets and by the heuristics that assign target boxes to anchors [52].
Their performance is significantly affected by post-processing steps that collapse near-duplicate predictions, by the design of the anchor sets, and by the heuristics that assign target boxes to anchors [52].
To simplify these pipelines, we propose a direct set prediction approach to bypass the surrogate tasks.
To simplify these pipelines, a direct set prediction approach is proposed that bypasses the surrogate tasks.
This end-to-end philosophy has led to significant advances in complex structured prediction tasks such as machine translation or speech recognition, but not yet in object detection: previous attempts [43,16,4,39] either add other forms of prior knowledge, or have not proven to be competitive with strong baselines on challenging benchmarks.
This end-to-end philosophy has driven major advances in complex structured prediction tasks such as machine translation and speech recognition, but not yet in object detection: previous attempts [43,16,4,39] either add other forms of prior knowledge or have not proven competitive with strong baselines on challenging benchmarks.
This paper aims to bridge this gap.
Figure 1
- DETR predicts the final set of detections directly and in parallel.
- DETR combines a CNN backbone with a Transformer encoder-decoder.
- During training, bipartite matching associates predictions with ground-truth boxes; predictions left unmatched should yield a "no object" class prediction.
Paragraph 2: based on transformers & self-attention mechanisms
We streamline the training pipeline by viewing object detection as a direct set prediction problem.
The training pipeline is streamlined by viewing object detection as a direct set prediction problem.
We adopt an encoder-decoder architecture based on transformers [47], a popular architecture for sequence prediction.
An encoder-decoder architecture based on Transformers is adopted.
The self-attention mechanisms of transformers, which explicitly model all pairwise interactions between elements in a sequence, make these architectures particularly suitable for specific constraints of set prediction such as removing duplicate predictions.
The self-attention mechanism of Transformers explicitly models all pairwise interactions between the elements of a sequence, which makes the architecture particularly suitable for the specific constraints of set prediction, such as removing duplicate predictions.
(In other words, self-attention suits object detection because it exposes the relation between every pair of elements.)
Paragraph 3: trained end-to-end
Our DEtection TRansformer (DETR, see Figure 1) predicts all objects at once, and is trained end-to-end with a set loss function which performs bipartite matching between predicted and ground-truth objects.
- Predicts all objects at once; trained end-to-end.
- The set loss performs bipartite matching between predicted and ground-truth objects.
DETR simplifies the detection pipeline by dropping multiple hand-designed components that encode prior knowledge, like spatial anchors or non-maximal suppression.
DETR simplifies the detection pipeline by dropping multiple hand-designed components that encode prior knowledge, such as spatial anchors and non-maximum suppression.
Unlike most existing detection methods, DETR doesn’t require any customized layers, and thus can be reproduced easily in any framework that contains standard CNN and transformer classes.1
Unlike most existing detection methods, DETR needs no customized layers, so it can be reproduced easily in any framework that provides standard CNN and Transformer classes.
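As a rough illustration of that claim, the following sketch builds a DETR-style detector entirely from standard PyTorch/torchvision classes. It is a simplified sketch, not the reference implementation: positional encodings are omitted and all names and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MiniDETR(nn.Module):
    def __init__(self, num_classes, d_model=256, nheads=8, num_queries=100):
        super().__init__()
        backbone = resnet50()
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)    # project CNN features to d_model
        self.transformer = nn.Transformer(d_model, nheads, 6, 6)
        self.query_embed = nn.Embedding(num_queries, d_model)  # learned object queries
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for the "no object" class
        self.bbox_head = nn.Linear(d_model, 4)                 # (cx, cy, w, h)

    def forward(self, images):                     # images: (B, 3, H, W)
        feat = self.proj(self.backbone(images))    # (B, d_model, h, w)
        src = feat.flatten(2).permute(2, 0, 1)     # (h*w, B, d_model) feature sequence
        tgt = self.query_embed.weight.unsqueeze(1).repeat(1, feat.shape[0], 1)
        hs = self.transformer(src, tgt)            # (num_queries, B, d_model)
        # Boxes are absolute, normalized image coordinates -- no anchors involved.
        return self.class_head(hs), self.bbox_head(hs).sigmoid()
```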
Paragraph 4: bipartite matching loss
Compared to most previous work on direct set prediction, the main features of DETR are the conjunction of the bipartite matching loss and transformers with (non-autoregressive) parallel decoding [29,12,10,8].
DETR's key combination: a bipartite matching loss plus Transformers with (non-autoregressive) parallel decoding.
📢 non-autoregressive: all predictions are emitted simultaneously, not one token at a time.
In contrast, previous work focused on autoregressive decoding with RNNs [43,41,30,36,42].
Previous work instead relied on autoregressive decoding with RNNs.
Our matching loss function uniquely assigns a prediction to a ground truth object, and is invariant to a permutation of predicted objects, so we can emit them in parallel.
The matching loss assigns each prediction uniquely to a ground-truth object and is invariant to permutations of the predictions, so the predictions can be emitted in parallel.
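A small sketch of the matching step, using SciPy's Hungarian algorithm on a toy cost matrix (the real matching cost combines class probabilities and box terms; the numbers here are made up):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j]: cost of matching prediction i to ground-truth object j
cost = np.array([[0.9, 0.1, 0.5],
                 [0.4, 0.8, 0.2],
                 [0.3, 0.6, 0.7]])

pred_idx, gt_idx = linear_sum_assignment(cost)        # optimal one-to-one assignment
print(list(zip(pred_idx.tolist(), gt_idx.tolist())))  # [(0, 1), (1, 2), (2, 0)]

# Permuting the rows (the predictions) permutes the returned indices but
# leaves the optimal total cost unchanged: the loss is permutation-invariant.
```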
Paragraph 5: baseline
We evaluate DETR on one of the most popular object detection datasets, COCO [24], against a very competitive Faster R-CNN baseline [37].
Dataset: COCO, one of the most popular object detection datasets.
Baseline model: Faster R-CNN.
More precisely, DETR demonstrates significantly better performance on large objects, a result likely enabled by the non-local computations of the transformer.
DETR performs significantly better on large objects.
Likely reason: the Transformer's non-local computation. (Self-attention models the relation between every pair of elements, i.e., it is a global modeling method; the flip side is that it also attends to unimportant elements.)
It obtains, however, lower performances on small objects. We expect that future work will improve this aspect in the same way the development of FPN [22] did for Faster R-CNN.
However, performance on small objects is weaker. Future work may improve this, much as FPN [22] did for Faster R-CNN.
Paragraph 6: how DETR differs from standard detectors
The new model requires extra-long training schedule and benefits from auxiliary decoding losses in the transformer.
The new model (DETR) requires an extra-long training schedule and benefits from auxiliary decoding losses in the Transformer.
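A minimal sketch of what "auxiliary decoding losses" means here, under the assumption that the same set loss is simply applied to every intermediate decoder layer's predictions (the function names are hypothetical):

```python
def loss_with_aux(decoder_outputs, targets, set_loss):
    # decoder_outputs: one set of predictions per decoder layer (final layer included)
    # set_loss: the bipartite-matching set loss described above
    return sum(set_loss(preds, targets) for preds in decoder_outputs)
```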
Paragraph 7: extension to more complex tasks
The design ethos of DETR extends easily to more complex tasks.
The design extends readily to other tasks.
In our experiments, we show that a simple segmentation head trained on top of a pretrained DETR outperforms competitive baselines on Panoptic Segmentation [19], a challenging pixel-level recognition task that has recently gained popularity.
(Panoptic segmentation) A simple segmentation head trained on top of a pretrained DETR beats competitive baselines on this challenging, recently popular pixel-level recognition task.
⭐️ Related work
Our work builds on prior work in several domains: bipartite matching losses for set prediction, encoder-decoder architectures based on the transformer, parallel decoding, and object detection methods.
Related areas DETR builds on:
- bipartite matching losses for set prediction
- encoder-decoder architectures based on the Transformer
- parallel decoding
- object detection methods
Related work 1: Set Prediction → the Hungarian loss
There is no canonical deep learning model to directly predict sets.
No standard deep learning model exists for set prediction.
The basic set prediction task is multilabel classification for which the baseline approach, one-vs-rest, does not apply to problems such as detection where there is an underlying structure between elements (i.e., near-identical boxes).
The basic set prediction task is multilabel classification, but its baseline one-vs-rest approach does not apply to detection.
Why: detection has an underlying structure between elements (e.g., near-identical boxes).
The first difficulty in these tasks is to avoid near-duplicates.
First challenge in detection: removing near-duplicates.
Most current detectors use postprocessing such as non-maximal suppression to address this issue, but direct set prediction is postprocessing-free.
Conventional detectors: handle near-duplicates with post-processing (NMS).
Direct set prediction: requires no post-processing.
They need global inference schemes that model interactions between all predicted elements to avoid redundancy.
Set prediction needs global inference: schemes that model the interactions between all predicted elements to avoid redundancy (the role IoU tests and NMS play in conventional pipelines).
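For contrast, here is a minimal greedy NMS, the kind of hand-designed, non-differentiable post-processing that direct set prediction removes (a simplified single-class sketch using torchvision's `box_iou`):

```python
import torch
from torchvision.ops import box_iou

def greedy_nms(boxes, scores, iou_threshold=0.5):
    # boxes: (N, 4) in (x1, y1, x2, y2); scores: (N,)
    order = scores.argsort(descending=True)   # visit boxes from most to least confident
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        ious = box_iou(boxes[i].unsqueeze(0), boxes[order[1:]]).squeeze(0)
        order = order[1:][ious <= iou_threshold]   # suppress near-duplicate boxes
    return keep
```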
For constant-size set prediction, dense fully connected networks [9] are sufficient but costly. A general approach is to use auto-regressive sequence models such as recurrent neural networks [48].
For fixed-size set prediction: dense fully connected networks suffice, but are costly.
A more general approach: autoregressive sequence models such as RNNs.
In all cases, the loss function should be invariant by a permutation of the predictions.
The set prediction loss should be independent of the ordering of the predictions.
The usual solution is to design a loss based on the Hungarian algorithm [20], to find a bipartite matching between ground-truth and prediction.
How is order invariance achieved? With the Hungarian algorithm: find a bipartite matching between ground truth and predictions.
This enforces permutation-invariance, and guarantees that each target element has a unique match.
Properties:
- permutation invariance
- each target element gets a unique match
We follow the bipartite matching loss approach. In contrast to most prior work however, we step away from autoregressive models and use transformers with parallel decoding.
- DETR follows the bipartite matching loss approach.
- It uses Transformers with parallel decoding.
- It does not use autoregressive models.
Related work 2: Transformers and Parallel Decoding
Paragraph 1: what Transformers are, and their advantages
Transformers were introduced by Vaswani et al . [47] as a new attention-based building block for machine translation.
Attention mechanisms [2] are neural network layers that aggregate information from the entire input sequence.
Transformers introduced self-attention layers, which, similarly to Non-Local Neural Networks [49], scan through each element of a sequence and update it by aggregating information from the whole sequence.
Self-attention layers, similar to Non-Local Neural Networks, scan each element of a sequence and update it by aggregating information from the whole sequence.
One of the main advantages of attention-based models is their global computations and perfect memory, which makes them more suitable than RNNs on long sequences. Transformers are now replacing RNNs in many problems in natural language processing, speech processing and computer vision [8,27,45,34,31].
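A bare-bones sketch of the mechanism (not DETR's implementation): scaled dot-product self-attention computes all pairwise interaction scores, so each updated element is a weighted aggregate over the entire sequence.

```python
import torch
import torch.nn.functional as F

def self_attention(x, Wq, Wk, Wv):
    # x: (seq_len, d); Wq, Wk, Wv: (d, d) learned projection matrices
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / K.shape[-1] ** 0.5     # (seq_len, seq_len): every pairwise interaction
    return F.softmax(scores, dim=-1) @ V      # each output row aggregates the whole sequence

x = torch.randn(10, 64)                       # a sequence of 10 elements
Wq, Wk, Wv = (torch.randn(64, 64) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)           # (10, 64)
```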
Paragraph 2: drawbacks of autoregressive Transformers, and our approach
Transformers were first used in auto-regressive models, following early sequence-to-sequence models [44], generating output tokens one by one.
Transformers were first applied as autoregressive models, following early seq2seq models, generating output tokens one by one.
However, the prohibitive inference cost (proportional to output length, and hard to batch) lead to the development of parallel sequence generation, in the domains of audio [29], machine translation [12,10], word representation learning [8], and more recently speech recognition [6].
Drawback of autoregressive decoding: inference cost proportional to the output length and hard to batch, which motivated parallel sequence generation.
We also combine transformers and parallel decoding for their suitable trade-off between computational cost and the ability to perform the global computations required for set prediction.
Our approach: Transformers + parallel decoding.
This trades off computational cost against the ability to perform the global computations that set prediction requires.
Related work 3: Object detection
Paragraph 1: what modern detectors predict against
Most modern object detection methods make predictions relative to some initial guesses.
Modern detection methods make predictions relative to some initial guesses.
Two-stage detectors [37,5] predict boxes w.r.t. proposals, whereas single-stage methods make predictions w.r.t. anchors [23] or a grid of possible object centers [53,46].
Two-stage vs. single-stage detectors:
- two-stage: predict boxes w.r.t. proposals
- single-stage: predict w.r.t. anchors or a grid of possible object centers
Recent work [52] demonstrates that the final performance of these systems heavily depends on the exact way these initial guesses are set.
Drawback: the final performance depends heavily on how these initial guesses are set.
In our model we are able to remove this hand-crafted process and streamline the detection process by directly predicting the set of detections with absolute box prediction w.r.t. the input image rather than an anchor.
Our approach:
How?
- remove this hand-crafted process
- streamline the detection pipeline
What?
- directly predict the set of detections
- boxes are predicted absolutely w.r.t. the input image, not relative to an anchor (see the sketch below)
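A sketch of the difference between the two box parameterizations (illustrative code, not taken from any particular detector): anchor-based heads decode offsets against a predefined box, while a DETR-style head squashes its raw output directly into normalized image coordinates.

```python
import torch

def decode_anchor_relative(anchor, t):
    # anchor, t: (..., 4) tensors; anchor = (cx, cy, w, h), t = (tx, ty, tw, th)
    ax, ay, aw, ah = anchor.unbind(-1)
    tx, ty, tw, th = t.unbind(-1)
    return torch.stack([ax + tx * aw, ay + ty * ah,
                        aw * tw.exp(), ah * th.exp()], dim=-1)

def decode_absolute(raw_logits):
    # DETR-style: squash the head's raw output into [0, 1] image coordinates
    return raw_logits.sigmoid()               # (cx, cy, w, h) as fractions of the image
```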
Part 1: Set-based loss
Paragraph 2
Several object detectors [9,25,35] used the bipartite matching loss. However, in these early deep learning models, the relation between different predictions was modeled with convolutional or fully-connected layers only, and a hand-designed NMS post-processing can improve their performance. More recent detectors [37,23,53] use non-unique assignment rules between ground truth and predictions together with an NMS.
Paragraph 3
Learnable NMS methods [16,4] and relation networks [17] explicitly model relations between different predictions with attention. Using direct set losses, they do not require any post-processing steps. However, these methods employ additional hand-crafted context features like proposal box coordinates to model relations between detections efficiently, while we look for solutions that reduce the prior knowledge encoded in the model.
Part 2: Recurrent detectors
Paragraph 4
Closest to our approach are end-to-end set predictions for object detection [43] and instance segmentation [41,30,36,42]. Similarly to us, they use bipartite-matching losses with encoder-decoder architectures based on CNN activations to directly produce a set of bounding boxes. These approaches, however, were only evaluated on small datasets and not against modern baselines. In particular, they are based on autoregressive models (more precisely RNNs), so they do not leverage the recent transformers with parallel decoding.
⭐️ Conclusion
Paragraph 1
We presented DETR, a new design for object detection systems based on transformers and bipartite matching loss for direct set prediction. The approach achieves comparable results to an optimized Faster R-CNN baseline on the challenging COCO dataset. DETR is straightforward to implement and has a flexible architecture that is easily extensible to panoptic segmentation, with competitive results. In addition, it achieves significantly better performance on large objects than Faster R-CNN, likely thanks to the processing of global information performed by the self-attention.
Paragraph 2
This new design for detectors also comes with new challenges, in particular regarding training, optimization and performances on small objects. Current detectors required several years of improvements to cope with similar issues, and we expect future work to successfully address them for DETR.
⭐️ The DETR model
Two ingredients are essential for direct set predictions in detection: (1) a set prediction loss that forces unique matching between predicted and ground truth boxes; (2) an architecture that predicts (in a single pass) a set of objects and models their relation. We describe our architecture in detail in Figure 2.
Object detection set prediction loss