Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection (Paper Explained)

Channel: Yannic Kilcher

Subscribers: 301,000

Published on June 28, 2020 3:09:19 PM
Video Link: https://www.youtube.com/watch?v=eI8xTdcZ6VY

Duration: 34:22

12,059 views

401

Object detection often does not occur in a vacuum. Static cameras, such as wildlife traps, collect lots of irregularly sampled data over a large time frame and often capture repeating or similar events. This model learns to dynamically incorporate other frames taken by the same camera into its object detection pipeline.

OUTLINE:
0:00 - Intro & Overview
1:10 - Problem Formulation
2:10 - Static Camera Data
6:45 - Architecture Overview
10:00 - Short-Term Memory
15:40 - Long-Term Memory
20:10 - Quantitative Results
22:30 - Qualitative Results
30:10 - False Positives
32:50 - Appendix & Conclusion

Paper: https://arxiv.org/abs/1912.03538

My Video On Attention Is All You Need: https://youtu.be/iDulhoQ2pro

Abstract:
In static monitoring cameras, useful contextual information can stretch far beyond the few seconds typical video understanding models might see: subjects may exhibit similar behavior over multiple days, and background objects remain static. Due to power and storage constraints, sampling frequencies are low, often no faster than one frame per second, and sometimes are irregular due to the use of a motion trigger. In order to perform well in this setting, models must be robust to irregular sampling rates. In this paper we propose a method that leverages temporal context from the unlabeled frames of a novel camera to improve performance at that camera. Specifically, we propose an attention-based approach that allows our model, Context R-CNN, to index into a long term memory bank constructed on a per-camera basis and aggregate contextual features from other frames to boost object detection performance on the current frame.
We apply Context R-CNN to two settings: (1) species detection using camera traps, and (2) vehicle detection in traffic cameras, showing in both settings that Context R-CNN leads to performance gains over strong baselines. Moreover, we show that increasing the contextual time horizon leads to improved results. When applied to camera trap data from the Snapshot Serengeti dataset, Context R-CNN with context from up to a month of images outperforms a single-frame baseline by 17.9% mAP, and outperforms S3D (a 3D convolution-based baseline) by 11.2% mAP.
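
To make the "index into a long term memory bank" idea from the abstract concrete, here is a minimal NumPy sketch of attention from the current frame's box features over a per-camera memory bank. It is an illustration under stated assumptions, not the authors' implementation: the function name attend_to_memory, the projection matrices w_q, w_k, w_v and w_out, the feature sizes, the residual addition, and the random inputs are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attend_to_memory(box_features, memory_bank, w_q, w_k, w_v, w_out):
    """Aggregate context from a memory bank for each box in the current frame.

    box_features: (n_boxes, d) features of candidate boxes in the current frame.
    memory_bank:  (n_memory, d) features stored from other frames of the same camera.
    w_q, w_k, w_v, w_out: (d, d) projection matrices (hypothetical, for illustration).
    Returns the context-augmented box features and the attention weights.
    """
    queries = box_features @ w_q          # (n_boxes, d)
    keys = memory_bank @ w_k              # (n_memory, d)
    values = memory_bank @ w_v            # (n_memory, d)

    # Scaled dot-product similarity between every box and every memory entry.
    scores = queries @ keys.T / np.sqrt(queries.shape[1])   # (n_boxes, n_memory)
    weights = softmax(scores, axis=1)     # each row sums to 1

    # Weighted sum of memory values, projected and added back to the box features.
    context = weights @ values            # (n_boxes, d)
    refined = box_features + context @ w_out
    return refined, weights


# Toy usage: 3 boxes in the current frame, 50 memory entries, 8-dim features.
rng = np.random.default_rng(0)
d = 8
box_features = rng.normal(size=(3, d))
memory_bank = rng.normal(size=(50, d))
w_q, w_k, w_v, w_out = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))

refined, weights = attend_to_memory(box_features, memory_bank, w_q, w_k, w_v, w_out)
print(refined.shape, weights.shape)
```

Because the memory bank is treated as an unordered set of feature vectors, nothing in this sketch depends on the frames being evenly spaced in time, which is the property the abstract asks for under irregular sampling. Widening the contextual time horizon corresponds to adding more rows to memory_bank.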

Authors: Sara Beery, Guanhang Wu, Vivek Rathod, Ronny Votel, Jonathan Huang

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher

Other Videos By Yannic Kilcher

2020-07-08	Addendum for Supermasks in Superposition: A Closer Look (Paper Explained)
2020-07-07	SupSup: Supermasks in Superposition (Paper Explained)
2020-07-06	[Live Machine Learning Research] Plain Self-Ensembles (I actually DISCOVER SOMETHING) - Part 1
2020-07-05	SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization (Paper Explained)
2020-07-04	Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (Paper Explained)
2020-07-03	On the Measure of Intelligence by François Chollet - Part 4: The ARC Challenge (Paper Explained)
2020-07-02	BERTology Meets Biology: Interpreting Attention in Protein Language Models (Paper Explained)
2020-07-01	GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (Paper Explained)
2020-06-30	Object-Centric Learning with Slot Attention (Paper Explained)
2020-06-29	Set Distribution Networks: a Generative Model for Sets of Images (Paper Explained)
2020-06-28	Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection (Paper Explained)
2020-06-27	Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures (Paper Explained)
2020-06-26	On the Measure of Intelligence by François Chollet - Part 3: The Math (Paper Explained)
2020-06-25	Discovering Symbolic Models from Deep Learning with Inductive Biases (Paper Explained)
2020-06-24	How I Read a Paper: Facebook's DETR (Video Tutorial)
2020-06-23	RepNet: Counting Out Time - Class Agnostic Video Repetition Counting in the Wild (Paper Explained)
2020-06-22	[Drama] Yann LeCun against Twitter on Dataset Bias
2020-06-21	SIREN: Implicit Neural Representations with Periodic Activation Functions (Paper Explained)
2020-06-20	Big Self-Supervised Models are Strong Semi-Supervised Learners (Paper Explained)
2020-06-19	On the Measure of Intelligence by François Chollet - Part 2: Human Priors (Paper Explained)
2020-06-18	Image GPT: Generative Pretraining from Pixels (Paper Explained)

Tags: deep learning, machine learning, arxiv, explained, neural networks, artificial intelligence, paper, vision, cnn, convolutional neural network, coco, object detection, region of interest, rcnn, r-cnn, attention, attention mechanism, google, caltech, gazelle, wildlife, wild trap, traffic, object, car, bus, vehicle, lighting, time, sampling, frames, memory, long-term, query
