Overview Video
Demo showing closed-loop target tracking with dynamic target switching across diverse environments.
Motivation and Contribution
Challenge: Autonomous indoor drones must learn new object classes in real time while limiting catastrophic forgetting. Most UAV datasets focus on outdoor scenes and offer few temporally coherent indoor videos. Indoor deployments demand efficient class-incremental learning under strict compute and memory budgets.
Our Solution: We introduce an indoor drone dataset of 14,400 frames with semi-automatic annotations (98.6% first-pass agreement), and benchmark three replay-based Class-Incremental Learning (CIL) strategies using YOLOv11-nano as a resource-efficient detector. Our best method, Forgetting-Aware Replay (FAR), achieves 82.96% ACC (mAP50-95) with only 5% replay.
Key Contributions
- Indoor UAV dataset (UAV-IndoorCL): 14,400 temporally coherent frames with drone-to-drone and drone-to-ground-vehicle interactions, annotated via a semi-automatic GroundingSAM pipeline with 98.6% first-pass labeling agreement.
- Class-incremental benchmark: Controlled evaluation of three replay-based CIL strategies (ER, MIR, FAR) under strict memory budgets (5–50% replay) on resource-constrained YOLOv11-nano.
- Forgetting-Aware Replay: FAR consistently outperforms other methods at low budgets by prioritizing samples with the largest recall degradation, maintaining accuracy near the joint-training upper bound.
- Attention analysis & deployment: Grad-CAM reveals human-dominant saliency bias in mixed scenes that impairs drone localization. Closed-loop PID tracking validates the learned model in unseen environments.
UAV-IndoorCL Dataset
To address the outdoor bias in existing drone datasets, we collected an indoor aerial dataset in a controlled laboratory setting. Two human-piloted drones executed repeated circular trajectories around classroom objects, yielding four video streams: two on-board drone views and two third-person recordings. We sampled each stream at 1 FPS, producing 14,400 frames at 3840×2160 resolution.
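As a rough illustration of the sampling step, the snippet below extracts frames at 1 FPS from a video with OpenCV; the paths and file naming are placeholders, not the authors' actual tooling.

```python
import cv2
from pathlib import Path

def sample_frames(video_path: str, out_dir: str, target_fps: float = 1.0) -> int:
    """Save roughly one frame per second from a video stream (placeholder paths)."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if metadata is missing
    step = max(1, round(native_fps / target_fps))    # keep every `step`-th frame
    Path(out_dir).mkdir(parents=True, exist_ok=True)

    saved, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:06d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```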
Semi-Automatic Labeling Pipeline
Pseudo-Label Examples
Dataset Comparison
| Paper | Dataset | #Videos | #Images | Indoor | Temporal | CIL |
|---|---|---|---|---|---|---|
| RT Flying OD | Dataset 1 | 0 | 15,064 | ✗ | ✗ | ✗ |
| RT Flying OD | Dataset 2 | 0 | 11,998 | ✗ | ✗ | ✗ |
| Dogfight | NPS-Drones | 14 | 70,250 | ✗ | ✓ | ✗ |
| Dogfight | FL-Drones | 50 | 38,948 | ✗ | ✓ | ✗ |
| Ours | UAV-IndoorCL | 4 | 14,400 | ✓ | ✓ | ✓ |
Table 1. Dataset comparison with prior drone-centric detection datasets. Our UAV-IndoorCL dataset provides indoor scenes, temporal coherence, and a class-incremental learning protocol.
Task Design
We define five sequential tasks, each introducing one new class: three ground vehicle types (Tasks 1–3) with fine-grained visual differences, aerial drones (Task 4), and humans (Task 5). This design stress-tests replay under subtle inter-class differences and induces natural distribution shift across tasks.
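For concreteness, the task stream can be written as an ordered class schedule like the one below; the class names are illustrative stand-ins, since the text only specifies the category types.

```python
# Hypothetical task schedule: each task introduces exactly one new class.
# Class names are illustrative; Tasks 1-3 cover fine-grained ground-vehicle types.
TASK_SCHEDULE = {
    "T1": ["ground_vehicle_a"],
    "T2": ["ground_vehicle_b"],
    "T3": ["ground_vehicle_c"],
    "T4": ["drone"],
    "T5": ["human"],
}

def classes_up_to(task_id: str) -> list[str]:
    """Cumulative label space visible to the detector after finishing `task_id`."""
    seen = []
    for tid, new_classes in TASK_SCHEDULE.items():
        seen += new_classes
        if tid == task_id:
            break
    return seen
```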
Our Approach
We train YOLOv11-nano for indoor detection of drones and ground vehicles using a class-incremental learning pipeline with replay-based continual learning. Training proceeds over a task stream {T1, …, T5} where each task introduces exactly one new class.
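A minimal sketch of the replay-based task loop is shown below, assuming a generic `train_detector` callback and an ordered task dictionary (these helper names are ours, not the released code); each task's training set is merged with a small replay subset drawn from earlier tasks.

```python
import random

def run_task_stream(tasks, train_detector, replay_fraction=0.05, seed=0):
    """Sketch of the replay-based task loop.

    `tasks` is an ordered dict mapping task id -> list of image paths;
    `train_detector(model, images)` is a user-supplied training callback
    (e.g. a wrapper around the YOLOv11-nano training call).
    """
    rng = random.Random(seed)
    replay_buffer, seen = [], []
    model = None

    for task_id, images in tasks.items():
        train_set = list(images) + replay_buffer    # current task + replayed past data
        model = train_detector(model, train_set)

        # Refill the buffer with a budget-limited subset of everything seen so far.
        # Uniform sampling corresponds to ER; MIR/FAR replace this selection step.
        seen += list(images)
        budget = max(1, int(replay_fraction * len(seen)))
        replay_buffer = rng.sample(seen, min(budget, len(seen)))
    return model
```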
Continual Learning Pipeline
Replay Strategies
- Naïve Fine-Tuning (lower bound): Sequential training on each task using only current-task data, no replay or regularization. Serves as a forgetting reference.
- Experience Replay (ER): Uniformly samples a fixed-budget subset of prior-task images and merges them with the current task training set.
- Maximally Interfered Retrieval (MIR): Interference-based selection; picks the K images with the lowest image-level detection recall@0.5 under the current model, drawn from a capped candidate pool.
- Forgetting-Aware Replay (FAR): Prioritizes replay by per-image forgetting, measured as the recall drop max(0, recall_baseline − recall_current), selecting the images with the largest degradation (see the sketch after this list).
- Joint Training (upper bound): Trains on the union of all task data simultaneously, providing a performance ceiling.
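Below is a minimal sketch of the FAR and MIR selection rules, assuming per-image recall@0.5 has been logged both when an image's task was first learned and under the current model; the dictionary fields and function names are ours, not the released code.

```python
def far_select(candidates, k):
    """FAR: keep the k images whose recall dropped the most since their task was learned."""
    def forgetting(c):
        # max(0, recall_baseline - recall_current): only degradation counts
        return max(0.0, c["recall_baseline"] - c["recall_current"])
    return sorted(candidates, key=forgetting, reverse=True)[:k]


def mir_select(candidates, k):
    """MIR: keep the k images on which the current model scores the lowest recall@0.5."""
    return sorted(candidates, key=lambda c: c["recall_current"])[:k]
```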
All methods share the same training configuration: AdamW optimizer, 640×640 input, moderate augmentations (HSV, geometric, mosaic, mixup, copy-paste), and early stopping with patience 3 epochs.
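For illustration, an Ultralytics-style training call covering these shared settings is sketched below; the dataset YAML and the numeric augmentation values are placeholders, since the text only names the ingredients, not their magnitudes.

```python
from ultralytics import YOLO

# Sketch of the shared configuration: YOLOv11-nano, AdamW, 640x640 inputs,
# HSV/geometric/mosaic/mixup/copy-paste augmentation, early stopping (patience 3).
# The data path and numeric values are placeholders, not the paper's exact settings.
model = YOLO("yolo11n.pt")          # "yolo11n-seg.pt" for the segmentation variant
model.train(
    data="uav_indoorcl_task1.yaml",  # placeholder per-task dataset config
    epochs=100,
    patience=3,                      # early stopping
    imgsz=640,
    optimizer="AdamW",
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,        # HSV augmentation
    degrees=10.0, translate=0.1, scale=0.5,   # geometric augmentation
    mosaic=1.0, mixup=0.1, copy_paste=0.1,
)
```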
Results
Continual Learning Performance
| Buffer | ER ACC ↑ | FAR ACC ↑ | MIR ACC ↑ | ER BWT ↑ | FAR BWT ↑ | MIR BWT ↑ |
|---|---|---|---|---|---|---|
| *Box Detection* | | | | | | |
| 5% | 61.18 ± 8.18 | **82.96 ± 4.18** | 78.43 ± 3.45 | −11.65 ± 10.44 | **−5.28 ± 5.25** | −9.47 ± 2.09 |
| 10% | 75.07 ± 6.77 | **86.48 ± 3.00** | 82.24 ± 3.73 | −3.44 ± 4.99 | **−1.42 ± 1.83** | −5.61 ± 4.45 |
| 25% | 85.11 ± 2.52 | **87.86 ± 0.18** | 85.61 ± 1.88 | 2.84 ± 3.61 | **3.21 ± 0.53** | 2.03 ± 1.64 |
| 50% | **88.69 ± 1.02** | 87.92 ± 0.45 | 87.30 ± 0.43 | **3.66 ± 0.41** | 3.64 ± 1.05 | 2.40 ± 1.07 |
| *Instance Segmentation* | | | | | | |
| 5% | 56.93 ± 5.66 | **77.74 ± 3.72** | 74.39 ± 4.60 | −11.76 ± 6.98 | **−4.22 ± 4.58** | −7.51 ± 2.99 |
| 10% | 69.18 ± 2.76 | **79.61 ± 2.87** | 77.08 ± 1.65 | −4.09 ± 5.45 | **−2.29 ± 3.16** | −5.08 ± 2.90 |
| 25% | 78.59 ± 2.62 | **80.37 ± 0.09** | 78.43 ± 0.82 | **2.87 ± 3.81** | 1.02 ± 0.83 | 0.19 ± 1.57 |
| 50% | **80.87 ± 0.30** | 80.59 ± 0.82 | 79.71 ± 0.84 | **2.83 ± 1.08** | 2.54 ± 2.00 | 2.18 ± 1.29 |
Table 2. Continual learning results (mAP50-95 ACC and BWT) across replay buffer budgets. Values are mean ± standard deviation over 3 seeds. Best per row and metric is bolded. FAR consistently leads at small budgets (5–10%), while all replay methods converge at 25–50%.
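ACC and BWT follow the standard continual learning definitions; the sketch below shows how they can be computed from a per-task performance matrix, assuming the usual convention that R[i, j] is the mAP50-95 on task j after training through task i (the paper's exact protocol may differ in details).

```python
import numpy as np

def acc_bwt(R: np.ndarray) -> tuple[float, float]:
    """Standard CIL metrics from a T x T performance matrix.

    R[i, j] = mAP50-95 on task j after finishing training on task i (assumed convention).
    ACC averages the final row; BWT averages how much each earlier task changed
    between when it was learned and the end of the task stream.
    """
    T = R.shape[0]
    acc = R[-1].mean()
    bwt = np.mean([R[-1, j] - R[j, j] for j in range(T - 1)])
    return float(acc), float(bwt)
```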
Performance Trajectories
Task-wise Performance Trajectories. mAP50-95 for box detection and instance segmentation at 10% and 5% replay (top row), and 25% and 50% replay (bottom row). Naïve fine-tuning exhibits severe forgetting on earlier tasks, while replay-based methods improve retention. FAR and MIR remain closest to the joint-training upper bound under these low-memory settings.
Inference Latency
| Platform | Mean (ms) | Median (ms) | Std Dev (ms) | FPS |
|---|---|---|---|---|
| RTX 4060Ti GPU | 6.73 | 6.67 | 0.55 | 148.7 |
| Ryzen 3700x CPU | 51.27 | 51.28 | 5.73 | 19.5 |
| Raspberry Pi 5 (4GB) | 207.15 | 206.99 | 1.51 | 4.8 |
Table 3. Inference latency (ms) over 30 runs on video frames. YOLOv11-nano enables high-frequency real-time deployment even on edge hardware.
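The latency figures can be reproduced with a simple timing loop like the one below; the model path and input frame are placeholders, and we assume warm-up runs are excluded from the statistics.

```python
import time
import numpy as np
from ultralytics import YOLO

def benchmark(model_path="yolo11n.pt", runs=30, warmup=5, imgsz=640):
    """Time single-frame inference and report mean/median/std latency and FPS."""
    model = YOLO(model_path)
    frame = np.random.randint(0, 255, (imgsz, imgsz, 3), dtype=np.uint8)  # dummy frame

    for _ in range(warmup):                       # exclude warm-up from statistics
        model.predict(frame, verbose=False)

    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        model.predict(frame, verbose=False)
        times.append((time.perf_counter() - t0) * 1000.0)  # milliseconds

    times = np.array(times)
    print(f"mean {times.mean():.2f} ms | median {np.median(times):.2f} ms | "
          f"std {times.std():.2f} ms | {1000.0 / times.mean():.1f} FPS")
```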
Grad-CAM Analysis
Attention Patterns in Mixed vs. Drone-Only Scenes
These qualitative observations mirror the quantitative results: with tight memory, human-centric clutter increases interference, whereas drone-only scenes yield cleaner multi-target attention and more robust tracking.
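For readers unfamiliar with the technique, a generic Grad-CAM sketch is given below on a classification backbone; this is an illustration of the method only, not the authors' detector analysis pipeline, and the choice of ResNet-18 and target layer is ours.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18, ResNet18_Weights

def grad_cam(model, layer, x, class_idx):
    """Generic Grad-CAM: weight the chosen layer's activations by the spatially
    pooled gradient of the target score, then ReLU and upsample to input size."""
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

    score = model(x)[0, class_idx]
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    weights = grads["g"].mean(dim=(2, 3), keepdim=True)            # pooled gradients
    cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))   # weighted activations
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

# Illustration only (not the detector pipeline):
model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()
x = torch.randn(1, 3, 224, 224, requires_grad=True)
heatmap = grad_cam(model, model.layer4, x, class_idx=0)
```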
Attention Evolution Across Replay Strategies
Qualitative Results
We validate the FAR model (5% replay) in environments not seen during training: an indoor laboratory without the textured carpet present in the training data, and outdoor settings for drone/person tracking.
Cross-Environment Target Following
The UAV tracks ground vehicles and aerial drones across indoor classrooms, constrained spaces, and outdoor scenes, demonstrating cross-environment robustness despite training exclusively on indoor data with textured flooring. Closed-loop tracking uses a PID controller to center the selected target in the camera frame.
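Closed-loop tracking is described as a PID controller that keeps the selected detection centred in the frame; a minimal sketch under that assumption is shown below, with illustrative gains, axes, and command interface that are not taken from the released code.

```python
class PID:
    """Textbook PID controller on a scalar error signal."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error, dt):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt if dt > 0 else 0.0
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def track_step(bbox, frame_w, frame_h, yaw_pid, pitch_pid, dt):
    """Turn the pixel offset of the target's bounding-box centre into normalized
    yaw/pitch rate commands (illustrative interface, not the authors' controller)."""
    cx, cy = (bbox[0] + bbox[2]) / 2.0, (bbox[1] + bbox[3]) / 2.0
    err_x = (cx - frame_w / 2.0) / (frame_w / 2.0)   # horizontal error in [-1, 1]
    err_y = (cy - frame_h / 2.0) / (frame_h / 2.0)   # vertical error in [-1, 1]
    return yaw_pid.step(err_x, dt), pitch_pid.step(err_y, dt)
```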
Contact
For questions and collaboration inquiries, please contact the authors through the GitHub repository or academic channels.
Acknowledgements
This work is supported in part by projects "Romanian Hub for Artificial Intelligence - HRIA", Smart Growth, Digitization and Financial Instruments Program, 2021-2027 (MySMIS No. 334906), European Health and Digital Executive Agency (HADEA) through DIGITWIN4CIUE (Grant No. 101084054), and "European Lighthouse of AI for Sustainability - ELIAS", Horizon Europe program (Grant No. 101120237).
Citation - Waiting for proceedings
```bibtex
@article{nae2026learningflyreplaybasedcontinual,
  title   = {Learning on the Fly: Replay-Based Continual Object Perception for Indoor Drones},
  author  = {Nae, Sebastian-Ion and Barbu, Mihai-Eugen and Mocanu, Sebastian and Leordeanu, Marius},
  journal = {arXiv preprint arXiv:2602.13440},
  year    = {2026}
}
```