(DETR) End-to-End Object Detection with Transformers
Title: End-to-end, Transformer-based object detection
Authors: Facebook AI
Date: May 28, 2020
Venue: ECCV 2020
⭐️ Abstract
We present a new method that views object detection as a direct set prediction problem.
Object detection is framed as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly encode our prior knowledge about the task.
Streamlines the detection pipeline and removes many hand-designed components: NMS and anchor generation.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.
The model's name: DETR. Main ingredients: a set-based global loss, bipartite matching (the Hungarian algorithm), and a Transformer encoder-decoder detection architecture.
Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel.
Given a fixed small set of learned object queries, DETR reasons about the relations between objects and the global image context, and outputs the final set of predictions in parallel.
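To make "learned object queries" concrete, here is a minimal sketch (in PyTorch, with illustrative dimensions; not the authors' code): the queries are a learned embedding table fed to the decoder as input slots, and all slots are decoded in a single parallel forward pass.

```python
import torch
import torch.nn as nn

num_queries, d_model = 100, 256                 # DETR uses 100 queries of width 256
queries = nn.Embedding(num_queries, d_model)    # the learned object queries

decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

memory = torch.randn(1500, 1, d_model)          # encoder output: flattened image features
tgt = queries.weight.unsqueeze(1)               # (num_queries, batch=1, d_model)

# All 100 object slots attend to the image features and to each other,
# and are decoded in one forward pass -- no autoregressive loop.
out = decoder(tgt, memory)                      # (100, 1, 256): one embedding per prediction
```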
The new model is conceptually simple and does not require a specialized library, unlike many other modern detectors.
DETR is conceptually simple and, unlike many other detectors, requires no specialized library.
(Result) DETR demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster R-CNN baseline on the challenging COCO object detection dataset.
DETR's accuracy and runtime performance both match the mature, highly optimized Faster R-CNN.
Benchmark dataset: COCO.
Moreover, DETR can be easily generalized to produce panoptic segmentation in a unified manner. We show that it significantly outperforms competitive baselines.
Task generalization: DETR extends readily to panoptic segmentation.
⭐️ Introduction
Paragraph 1: Modern detectors
The goal of object detection is to predict a set of bounding boxes and category labels for each object of interest.
Definition of the object detection task.
Modern detectors address this set prediction task in an indirect way, by defining surrogate regression and classification problems on a large set of proposals [37,5], anchors [23], or window centers [53,46].
Current detectors solve this set prediction task indirectly, via surrogate regression and classification problems on proposals, anchors, or window centers.
- Their performances are significantly influenced by postprocessing steps to collapse near-duplicate predictions, by the design of the anchor sets and by the heuristics that assign target boxes to anchors [52].
Their performance is significantly affected by post-processing steps that collapse near-duplicate predictions, by the design of the anchor sets, and by the heuristics that assign target boxes to anchors [52].
To simplify these pipelines, we propose a direct set prediction approach to bypass the surrogate tasks.
To simplify these pipelines, a direct set prediction approach is proposed that bypasses the surrogate tasks.
This end-to-end philosophy has led to significant advances in complex structured prediction tasks such as machine translation or speech recognition, but not yet in object detection: previous attempts [43,16,4,39] either add other forms of prior knowledge, or have not proven to be competitive with strong baselines on challenging benchmarks.
This end-to-end philosophy has driven major advances in complex structured prediction tasks such as machine translation and speech recognition, but not yet in object detection: previous attempts [43,16,4,39] either add other forms of prior knowledge or have not proven competitive with strong baselines on challenging benchmarks.
This paper aims to bridge this gap.
Figure 1
- DETR predicts the final set of detections directly and in parallel.
- DETR combines a CNN backbone with a Transformer encoder-decoder.
- During training, bipartite matching associates predictions with ground-truth boxes; predictions left unmatched should yield a "no object" class prediction.
Paragraph 2: based on transformers & self-attention mechanisms
We streamline the training pipeline by viewing object detection as a direct set prediction problem.
The training pipeline is streamlined by viewing object detection as a direct set prediction problem.
We adopt an encoder-decoder architecture based on transformers [47], a popular architecture for sequence prediction.
An encoder-decoder architecture based on Transformers is adopted.
The self-attention mechanisms of transformers, which explicitly model all pairwise interactions between elements in a sequence, make these architectures particularly suitable for specific constraints of set prediction such as removing duplicate predictions.
The self-attention mechanism of Transformers explicitly models all pairwise interactions between the elements of a sequence, which makes the architecture particularly suitable for the specific constraints of set prediction, such as removing duplicate predictions.
(In other words, self-attention suits object detection because it exposes the relation between every pair of elements.)
Paragraph 3: trained end-to-end
Our DEtection TRansformer (DETR, see Figure 1) predicts all objects at once, and is trained end-to-end with a set loss function which performs bipartite matching between predicted and ground-truth objects.
- Predicts all objects at once; trained end-to-end.
- The set loss performs bipartite matching between predicted and ground-truth objects.
DETR simplifies the detection pipeline by dropping multiple hand-designed components that encode prior knowledge, like spatial anchors or non-maximal suppression.
DETR simplifies the detection pipeline by dropping multiple hand-designed components that encode prior knowledge, such as spatial anchors and non-maximum suppression.
Unlike most existing detection methods, DETR doesn’t require any customized layers, and thus can be reproduced easily in any framework that contains standard CNN and transformer classes.1
Unlike most existing detection methods, DETR needs no customized layers, so it can be reproduced easily in any framework that provides standard CNN and Transformer classes.
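As a rough illustration of that claim, the following sketch builds a DETR-style detector entirely from standard PyTorch/torchvision classes. It is a simplified sketch, not the reference implementation: positional encodings are omitted and all names and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MiniDETR(nn.Module):
    def __init__(self, num_classes, d_model=256, nheads=8, num_queries=100):
        super().__init__()
        backbone = resnet50()
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)    # project CNN features to d_model
        self.transformer = nn.Transformer(d_model, nheads, 6, 6)
        self.query_embed = nn.Embedding(num_queries, d_model)  # learned object queries
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for the "no object" class
        self.bbox_head = nn.Linear(d_model, 4)                 # (cx, cy, w, h)

    def forward(self, images):                     # images: (B, 3, H, W)
        feat = self.proj(self.backbone(images))    # (B, d_model, h, w)
        src = feat.flatten(2).permute(2, 0, 1)     # (h*w, B, d_model) feature sequence
        tgt = self.query_embed.weight.unsqueeze(1).repeat(1, feat.shape[0], 1)
        hs = self.transformer(src, tgt)            # (num_queries, B, d_model)
        # Boxes are absolute, normalized image coordinates -- no anchors involved.
        return self.class_head(hs), self.bbox_head(hs).sigmoid()
```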
Paragraph 4: bipartite matching loss
Compared to most previous work on direct set prediction, the main features of DETR are the conjunction of the bipartite matching loss and transformers with (non-autoregressive) parallel decoding [29,12,10,8].
DETR's key combination: a bipartite matching loss plus Transformers with (non-autoregressive) parallel decoding.
📢 non-autoregressive: all predictions are emitted simultaneously, not one token at a time.
In contrast, previous work focused on autoregressive decoding with RNNs [43,41,30,36,42].
Previous work instead relied on autoregressive decoding with RNNs.
Our matching loss function uniquely assigns a prediction to a ground truth object, and is invariant to a permutation of predicted objects, so we can emit them in parallel.
The matching loss assigns each prediction uniquely to a ground-truth object and is invariant to permutations of the predictions, so the predictions can be emitted in parallel.
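A small sketch of the matching step, using SciPy's Hungarian algorithm on a toy cost matrix (the real matching cost combines class probabilities and box terms; the numbers here are made up):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j]: cost of matching prediction i to ground-truth object j
cost = np.array([[0.9, 0.1, 0.5],
                 [0.4, 0.8, 0.2],
                 [0.3, 0.6, 0.7]])

pred_idx, gt_idx = linear_sum_assignment(cost)        # optimal one-to-one assignment
print(list(zip(pred_idx.tolist(), gt_idx.tolist())))  # [(0, 1), (1, 2), (2, 0)]

# Permuting the rows (the predictions) permutes the returned indices but
# leaves the optimal total cost unchanged: the loss is permutation-invariant.
```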
Paragraph 5: baseline
We evaluate DETR on one of the most popular object detection datasets, COCO [24], against a very competitive Faster R-CNN baseline [37].
Dataset: COCO, one of the most popular object detection datasets.
Baseline model: Faster R-CNN.
More precisely, DETR demonstrates significantly better performance on large objects, a result likely enabled by the non-local computations of the transformer.
DETR performs significantly better on large objects.
Likely reason: the Transformer's non-local computation. (Self-attention models the relation between every pair of elements, i.e., it is a global modeling method; the flip side is that it also attends to unimportant elements.)
It obtains, however, lower performances on small objects. We expect that future work will improve this aspect in the same way the development of FPN [22] did for Faster R-CNN.
However, performance on small objects is weaker. Future work may improve this, much as FPN [22] did for Faster R-CNN.
Paragraph 6: how DETR differs from standard detectors
The new model requires extra-long training schedule and benefits from auxiliary decoding losses in the transformer.
The new model (DETR) requires an extra-long training schedule and benefits from auxiliary decoding losses in the Transformer.
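A minimal sketch of what "auxiliary decoding losses" means here, under the assumption that the same set loss is simply applied to every intermediate decoder layer's predictions (the function names are hypothetical):

```python
def loss_with_aux(decoder_outputs, targets, set_loss):
    # decoder_outputs: one set of predictions per decoder layer (final layer included)
    # set_loss: the bipartite-matching set loss described above
    return sum(set_loss(preds, targets) for preds in decoder_outputs)
```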
Paragraph 7: extension to more complex tasks
The design ethos of DETR extends easily to more complex tasks.
The design extends readily to other tasks.
In our experiments, we show that a simple segmentation head trained on top of a pretrained DETR outperforms competitive baselines on Panoptic Segmentation [19], a challenging pixel-level recognition task that has recently gained popularity.
(Panoptic segmentation) A simple segmentation head trained on top of a pretrained DETR beats competitive baselines on this challenging, recently popular pixel-level recognition task.
⭐️ Related work
Our work builds on prior work in several domains: bipartite matching losses for set prediction, encoder-decoder architectures based on the transformer, parallel decoding, and object detection methods.
Related areas DETR builds on:
- bipartite matching losses for set prediction
- encoder-decoder architectures based on the Transformer
- parallel decoding
- object detection methods
Related work 1: Set Prediction → the Hungarian loss
There is no canonical deep learning model to directly predict sets.
No standard deep learning model exists for set prediction.
The basic set prediction task is multilabel classification for which the baseline approach, one-vs-rest, does not apply to problems such as detection where there is an underlying structure between elements (i.e., near-identical boxes).
The basic set prediction task is multilabel classification, but its baseline one-vs-rest approach does not apply to detection.
Why: detection has an underlying structure between elements (e.g., near-identical boxes).
The first difficulty in these tasks is to avoid near-duplicates.
First challenge in detection: removing near-duplicates.
Most current detectors use postprocessing such as non-maximal suppression to address this issue, but direct set prediction is postprocessing-free.
Conventional detectors: handle near-duplicates with post-processing (NMS).
Direct set prediction: requires no post-processing.
They need global inference schemes that model interactions between all predicted elements to avoid redundancy.
Set prediction needs global inference: schemes that model the interactions between all predicted elements to avoid redundancy (the role IoU tests and NMS play in conventional pipelines).
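For contrast, here is a minimal greedy NMS, the kind of hand-designed, non-differentiable post-processing that direct set prediction removes (a simplified single-class sketch using torchvision's `box_iou`):

```python
import torch
from torchvision.ops import box_iou

def greedy_nms(boxes, scores, iou_threshold=0.5):
    # boxes: (N, 4) in (x1, y1, x2, y2); scores: (N,)
    order = scores.argsort(descending=True)   # visit boxes from most to least confident
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        ious = box_iou(boxes[i].unsqueeze(0), boxes[order[1:]]).squeeze(0)
        order = order[1:][ious <= iou_threshold]   # suppress near-duplicate boxes
    return keep
```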
For constant-size set prediction, dense fully connected networks [9] are sufficient but costly. A general approach is to use auto-regressive sequence models such as recurrent neural networks [48].
For fixed-size set prediction: dense fully connected networks suffice, but are costly.
A more general approach: autoregressive sequence models such as RNNs.
In all cases, the loss function should be invariant by a permutation of the predictions.
The set prediction loss should be independent of the ordering of the predictions.
The usual solution is to design a loss based on the Hungarian algorithm [20], to find a bipartite matching between ground-truth and prediction.
How is order invariance achieved? With the Hungarian algorithm: find a bipartite matching between ground truth and predictions.
This enforces permutation-invariance, and guarantees that each target element has a unique match.
Properties:
- permutation invariance
- each target element gets a unique match
We follow the bipartite matching loss approach. In contrast to most prior work however, we step away from autoregressive models and use transformers with parallel decoding.
- DETR follows the bipartite matching loss approach.
- It uses Transformers with parallel decoding.
- It does not use autoregressive models.
Related work 2: Transformers and Parallel Decoding
Paragraph 1: what Transformers are, and their advantages
Transformers were introduced by Vaswani et al . [47] as a new attention-based building block for machine translation.
Attention mechanisms [2] are neural network layers that aggregate information from the entire input sequence.
Transformers introduced self-attention layers, which, similarly to Non-Local Neural Networks [49], scan through each element of a sequence and update it by aggregating information from the whole sequence.
Self-attention layers, similar to Non-Local Neural Networks, scan each element of a sequence and update it by aggregating information from the whole sequence.
One of the main advantages of attention-based models is their global computations and perfect memory, which makes them more suitable than RNNs on long sequences. Transformers are now replacing RNNs in many problems in natural language processing, speech processing and computer vision [8,27,45,34,31].
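A bare-bones sketch of the mechanism (not DETR's implementation): scaled dot-product self-attention computes all pairwise interaction scores, so each updated element is a weighted aggregate over the entire sequence.

```python
import torch
import torch.nn.functional as F

def self_attention(x, Wq, Wk, Wv):
    # x: (seq_len, d); Wq, Wk, Wv: (d, d) learned projection matrices
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / K.shape[-1] ** 0.5     # (seq_len, seq_len): every pairwise interaction
    return F.softmax(scores, dim=-1) @ V      # each output row aggregates the whole sequence

x = torch.randn(10, 64)                       # a sequence of 10 elements
Wq, Wk, Wv = (torch.randn(64, 64) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)           # (10, 64)
```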
Paragraph 2: drawbacks of autoregressive Transformers, and our approach
Transformers were first used in auto-regressive models, following early sequence-to-sequence models [44], generating output tokens one by one.
Transformers were first applied as autoregressive models, following early seq2seq models, generating output tokens one by one.
However, the prohibitive inference cost (proportional to output length, and hard to batch) lead to the development of parallel sequence generation, in the domains of audio [29], machine translation [12,10], word representation learning [8], and more recently speech recognition [6].
Drawback of autoregressive decoding: inference cost proportional to the output length and hard to batch, which motivated parallel sequence generation.
We also combine transformers and parallel decoding for their suitable trade-off between computational cost and the ability to perform the global computations required for set prediction.
Our approach: Transformers + parallel decoding.
This trades off computational cost against the ability to perform the global computations that set prediction requires.
Related work 3: Object detection
Paragraph 1: what modern detectors predict against
Most modern object detection methods make predictions relative to some initial guesses.
Modern detection methods make predictions relative to some initial guesses.
Two-stage detectors [37,5] predict boxes w.r.t. proposals, whereas single-stage methods make predictions w.r.t. anchors [23] or a grid of possible object centers [53,46].
Two-stage vs. single-stage detectors:
- two-stage: predict boxes w.r.t. proposals
- single-stage: predict w.r.t. anchors or a grid of possible object centers
Recent work [52] demonstrates that the final performance of these systems heavily depends on the exact way these initial guesses are set.
Drawback: the final performance depends heavily on how these initial guesses are set.
In our model we are able to remove this hand-crafted process and streamline the detection process by directly predicting the set of detections with absolute box prediction w.r.t. the input image rather than an anchor.
Our approach:
How?
- remove this hand-crafted process
- streamline the detection pipeline
What?
- directly predict the set of detections
- boxes are predicted absolutely w.r.t. the input image, not relative to an anchor (see the sketch below)
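A sketch of the difference between the two box parameterizations (illustrative code, not taken from any particular detector): anchor-based heads decode offsets against a predefined box, while a DETR-style head squashes its raw output directly into normalized image coordinates.

```python
import torch

def decode_anchor_relative(anchor, t):
    # anchor, t: (..., 4) tensors; anchor = (cx, cy, w, h), t = (tx, ty, tw, th)
    ax, ay, aw, ah = anchor.unbind(-1)
    tx, ty, tw, th = t.unbind(-1)
    return torch.stack([ax + tx * aw, ay + ty * ah,
                        aw * tw.exp(), ah * th.exp()], dim=-1)

def decode_absolute(raw_logits):
    # DETR-style: squash the head's raw output into [0, 1] image coordinates
    return raw_logits.sigmoid()               # (cx, cy, w, h) as fractions of the image
```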
Part 1: Set-based loss
Paragraph 2
Several object detectors [9,25,35] used the bipartite matching loss. However, in these early deep learning models, the relation between different predictions was modeled with convolutional or fully-connected layers only, and a hand-designed NMS post-processing can improve their performance. More recent detectors [37,23,53] use non-unique assignment rules between ground truth and predictions together with an NMS.
Paragraph 3
Learnable NMS methods [16,4] and relation networks [17] explicitly model relations between different predictions with attention. Using direct set losses, they do not require any post-processing steps. However, these methods employ additional hand-crafted context features like proposal box coordinates to model relations between detections efficiently, while we look for solutions that reduce the prior knowledge encoded in the model.
Part 2: Recurrent detectors
Paragraph 4
Closest to our approach are end-to-end set predictions for object detection [43] and instance segmentation [41,30,36,42]. Similarly to us, they use bipartite-matching losses with encoder-decoder architectures based on CNN activations to directly produce a set of bounding boxes. These approaches, however, were only evaluated on small datasets and not against modern baselines. In particular, they are based on autoregressive models (more precisely RNNs), so they do not leverage the recent transformers with parallel decoding.
⭐️ Conclusion
Paragraph 1
We presented DETR, a new design for object detection systems based on transformers and bipartite matching loss for direct set prediction. The approach achieves comparable results to an optimized Faster R-CNN baseline on the challenging COCO dataset. DETR is straightforward to implement and has a flexible architecture that is easily extensible to panoptic segmentation, with competitive results. In addition, it achieves significantly better performance on large objects than Faster R-CNN, likely thanks to the processing of global information performed by the self-attention.
Paragraph 2
This new design for detectors also comes with new challenges, in particular regarding training, optimization and performances on small objects. Current detectors required several years of improvements to cope with similar issues, and we expect future work to successfully address them for DETR.
⭐️ The DETR model
Two ingredients are essential for direct set predictions in detection: (1) a set prediction loss that forces unique matching between predicted and ground truth boxes; (2) an architecture that predicts (in a single pass) a set of objects and models their relation. We describe our architecture in detail in Figure 2.
Object detection set prediction loss