Overview Video
Demo showing closed-loop target tracking with dynamic target switching across diverse environments.
Motivation and Contribution
Challenge: Autonomous indoor drones must learn new object classes in real time while limiting catastrophic forgetting. Most UAV datasets focus on outdoor scenes and offer few temporally coherent indoor videos. Indoor deployments demand efficient class-incremental learning under strict compute and memory budgets.
Our Solution: We introduce an indoor drone dataset of 14,400 frames with semi-automatic annotations (98.6% first-pass agreement), and benchmark three replay-based Class-Incremental Learning (CIL) strategies using YOLOv11-nano as a resource-efficient detector. Our best method, Forgetting-Aware Replay (FAR), achieves 82.96% ACC (mAP50-95) with only 5% replay.
Key Contributions
- Indoor UAV dataset (UAV-IndoorCL): 14,400 temporally coherent frames with drone-to-drone and drone-to-ground-vehicle interactions, annotated via a semi-automatic GroundingSAM pipeline with 98.6% first-pass labeling agreement.
- Class-incremental benchmark: Controlled evaluation of three replay-based CIL strategies (ER, MIR, FAR) under strict memory budgets (5–50% replay) on resource-constrained YOLOv11-nano.
- Forgetting-Aware Replay: FAR consistently outperforms other methods at low budgets by prioritizing samples with the largest recall degradation, maintaining accuracy near the joint-training upper bound.
- Attention analysis & deployment: Grad-CAM reveals human-dominant saliency bias in mixed scenes that impairs drone localization. Closed-loop PID tracking validates the learned model in unseen environments.
UAV-IndoorCL Dataset
To address the outdoor bias in existing drone datasets, we collected an indoor aerial dataset in a controlled laboratory setting. Two human-piloted drones executed repeated circular trajectories around classroom objects, yielding four video streams: two on-board drone views and two third-person recordings. We sampled each stream at 1 FPS, producing 14,400 frames at 3840×2160 resolution.
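As a rough illustration of the sampling step, the snippet below extracts frames at 1 FPS from a video with OpenCV; the paths and file naming are placeholders, not the authors' actual tooling.

```python
import cv2
from pathlib import Path

def sample_frames(video_path: str, out_dir: str, target_fps: float = 1.0) -> int:
    """Save roughly one frame per second from a video stream (placeholder paths)."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if metadata is missing
    step = max(1, round(native_fps / target_fps))    # keep every `step`-th frame
    Path(out_dir).mkdir(parents=True, exist_ok=True)

    saved, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:06d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```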
Semi-Automatic Labeling Pipeline
Pseudo-Label Examples
Dataset Comparison
| Paper | Dataset | #Videos | #Images | Indoor | Temporal | CIL |
|---|---|---|---|---|---|---|
| RT Flying OD | Dataset 1 | 0 | 15,064 | ✗ | ✗ | ✗ |
| RT Flying OD | Dataset 2 | 0 | 11,998 | ✗ | ✗ | ✗ |
| Dogfight | NPS-Drones | 14 | 70,250 | ✗ | ✓ | ✗ |
| Dogfight | FL-Drones | 50 | 38,948 | ✗ | ✓ | ✗ |
| Ours | UAV-IndoorCL | 4 | 14,400 | ✓ | ✓ | ✓ |
Table 1. Dataset comparison with prior drone-centric detection datasets. Our UAV-IndoorCL dataset provides indoor scenes, temporal coherence, and a class-incremental learning protocol.
Task Design
We define five sequential tasks, each introducing one new class: three ground vehicle types (Tasks 1–3) with fine-grained visual differences, aerial drones (Task 4), and humans (Task 5). This design stress-tests replay under subtle inter-class differences and induces natural distribution shift across tasks.
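For concreteness, the task stream can be written as an ordered class schedule like the one below; the class names are illustrative stand-ins, since the text only specifies the category types.

```python
# Hypothetical task schedule: each task introduces exactly one new class.
# Class names are illustrative; Tasks 1-3 cover fine-grained ground-vehicle types.
TASK_SCHEDULE = {
    "T1": ["ground_vehicle_a"],
    "T2": ["ground_vehicle_b"],
    "T3": ["ground_vehicle_c"],
    "T4": ["drone"],
    "T5": ["human"],
}

def classes_up_to(task_id: str) -> list[str]:
    """Cumulative label space visible to the detector after finishing `task_id`."""
    seen = []
    for tid, new_classes in TASK_SCHEDULE.items():
        seen += new_classes
        if tid == task_id:
            break
    return seen
```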
Our Approach
We train YOLOv11-nano for indoor detection of drones and ground vehicles using a class-incremental learning pipeline with replay-based continual learning. Training proceeds over a task stream {T1, …, T5} where each task introduces exactly one new class.
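A minimal sketch of the replay-based task loop is shown below, assuming a generic `train_detector` callback and an ordered task dictionary (these helper names are ours, not the released code); each task's training set is merged with a small replay subset drawn from earlier tasks.

```python
import random

def run_task_stream(tasks, train_detector, replay_fraction=0.05, seed=0):
    """Sketch of the replay-based task loop.

    `tasks` is an ordered dict mapping task id -> list of image paths;
    `train_detector(model, images)` is a user-supplied training callback
    (e.g. a wrapper around the YOLOv11-nano training call).
    """
    rng = random.Random(seed)
    replay_buffer, seen = [], []
    model = None

    for task_id, images in tasks.items():
        train_set = list(images) + replay_buffer    # current task + replayed past data
        model = train_detector(model, train_set)

        # Refill the buffer with a budget-limited subset of everything seen so far.
        # Uniform sampling corresponds to ER; MIR/FAR replace this selection step.
        seen += list(images)
        budget = max(1, int(replay_fraction * len(seen)))
        replay_buffer = rng.sample(seen, min(budget, len(seen)))
    return model
```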
Continual Learning Pipeline
Replay Strategies
- Naïve Fine-Tuning (lower bound): Sequential training on each task using only current-task data, no replay or regularization. Serves as a forgetting reference.
- Experience Replay (ER): Uniformly samples a fixed-budget subset of prior-task images and merges them with the current task training set.
- Maximally Interfered Retrieval (MIR): Interference-based selection; picks the K images with the lowest image-level detection recall@0.5 under the current model, drawn from a capped candidate pool.
- Forgetting-Aware Replay (FAR): Prioritizes replay by per-image forgetting, measured as the recall drop max(0, recall_baseline − recall_current), selecting the images with the largest degradation (see the sketch after this list).
- Joint Training (upper bound): Trains on the union of all task data simultaneously, providing a performance ceiling.
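Below is a minimal sketch of the FAR and MIR selection rules, assuming per-image recall@0.5 has been logged both when an image's task was first learned and under the current model; the dictionary fields and function names are ours, not the released code.

```python
def far_select(candidates, k):
    """FAR: keep the k images whose recall dropped the most since their task was learned."""
    def forgetting(c):
        # max(0, recall_baseline - recall_current): only degradation counts
        return max(0.0, c["recall_baseline"] - c["recall_current"])
    return sorted(candidates, key=forgetting, reverse=True)[:k]


def mir_select(candidates, k):
    """MIR: keep the k images on which the current model scores the lowest recall@0.5."""
    return sorted(candidates, key=lambda c: c["recall_current"])[:k]
```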
All methods share the same training configuration: AdamW optimizer, 640×640 input, moderate augmentations (HSV, geometric, mosaic, mixup, copy-paste), and early stopping with patience 3 epochs.
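For illustration, an Ultralytics-style training call covering these shared settings is sketched below; the dataset YAML and the numeric augmentation values are placeholders, since the text only names the ingredients, not their magnitudes.

```python
from ultralytics import YOLO

# Sketch of the shared configuration: YOLOv11-nano, AdamW, 640x640 inputs,
# HSV/geometric/mosaic/mixup/copy-paste augmentation, early stopping (patience 3).
# The data path and numeric values are placeholders, not the paper's exact settings.
model = YOLO("yolo11n.pt")          # "yolo11n-seg.pt" for the segmentation variant
model.train(
    data="uav_indoorcl_task1.yaml",  # placeholder per-task dataset config
    epochs=100,
    patience=3,                      # early stopping
    imgsz=640,
    optimizer="AdamW",
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,        # HSV augmentation
    degrees=10.0, translate=0.1, scale=0.5,   # geometric augmentation
    mosaic=1.0, mixup=0.1, copy_paste=0.1,
)
```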
Results
Continual Learning Performance
| Buffer | ER ACC ↑ | FAR ACC ↑ | MIR ACC ↑ | ER BWT ↑ | FAR BWT ↑ | MIR BWT ↑ |
|---|---|---|---|---|---|---|
| *Box Detection* | | | | | | |
| 5% | 61.18 ± 8.18 | **82.96 ± 4.18** | 78.43 ± 3.45 | −11.65 ± 10.44 | **−5.28 ± 5.25** | −9.47 ± 2.09 |
| 10% | 75.07 ± 6.77 | **86.48 ± 3.00** | 82.24 ± 3.73 | −3.44 ± 4.99 | **−1.42 ± 1.83** | −5.61 ± 4.45 |
| 25% | 85.11 ± 2.52 | **87.86 ± 0.18** | 85.61 ± 1.88 | 2.84 ± 3.61 | **3.21 ± 0.53** | 2.03 ± 1.64 |
| 50% | **88.69 ± 1.02** | 87.92 ± 0.45 | 87.30 ± 0.43 | **3.66 ± 0.41** | 3.64 ± 1.05 | 2.40 ± 1.07 |
| *Instance Segmentation* | | | | | | |
| 5% | 56.93 ± 5.66 | **77.74 ± 3.72** | 74.39 ± 4.60 | −11.76 ± 6.98 | **−4.22 ± 4.58** | −7.51 ± 2.99 |
| 10% | 69.18 ± 2.76 | **79.61 ± 2.87** | 77.08 ± 1.65 | −4.09 ± 5.45 | **−2.29 ± 3.16** | −5.08 ± 2.90 |
| 25% | 78.59 ± 2.62 | **80.37 ± 0.09** | 78.43 ± 0.82 | **2.87 ± 3.81** | 1.02 ± 0.83 | 0.19 ± 1.57 |
| 50% | **80.87 ± 0.30** | 80.59 ± 0.82 | 79.71 ± 0.84 | **2.83 ± 1.08** | 2.54 ± 2.00 | 2.18 ± 1.29 |
Table 2. Continual learning results (mAP50-95 ACC and BWT) across replay buffer budgets. Values are mean ± standard deviation over 3 seeds. Best per row and metric is bolded. FAR consistently leads at small budgets (5–10%), while all replay methods converge at 25–50%.
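ACC and BWT follow the standard continual learning definitions; the sketch below shows how they can be computed from a per-task performance matrix, assuming the usual convention that R[i, j] is the mAP50-95 on task j after training through task i (the paper's exact protocol may differ in details).

```python
import numpy as np

def acc_bwt(R: np.ndarray) -> tuple[float, float]:
    """Standard CIL metrics from a T x T performance matrix.

    R[i, j] = mAP50-95 on task j after finishing training on task i (assumed convention).
    ACC averages the final row; BWT averages how much each earlier task changed
    between when it was learned and the end of the task stream.
    """
    T = R.shape[0]
    acc = R[-1].mean()
    bwt = np.mean([R[-1, j] - R[j, j] for j in range(T - 1)])
    return float(acc), float(bwt)
```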
Performance Trajectories
Task-wise Performance Trajectories. mAP50-95 for box detection and instance segmentation at 10% and 5% replay (top row), and 25% and 50% replay (bottom row). Naïve fine-tuning exhibits severe forgetting on earlier tasks, while replay-based methods improve retention. FAR and MIR remain closest to the joint-training upper bound under these low-memory settings.
Inference Latency
| Platform | Mean (ms) | Median (ms) | Std Dev (ms) | FPS |
|---|---|---|---|---|
| RTX 4060Ti GPU | 6.73 | 6.67 | 0.55 | 148.7 |
| Ryzen 3700x CPU | 51.27 | 51.28 | 5.73 | 19.5 |
| Raspberry Pi 5 (4GB) | 207.15 | 206.99 | 1.51 | 4.8 |
Table 3. Inference latency (ms) over 30 runs on video frames. YOLOv11-nano enables high-frequency real-time deployment even on edge hardware.
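The latency figures can be reproduced with a simple timing loop like the one below; the model path and input frame are placeholders, and we assume warm-up runs are excluded from the statistics.

```python
import time
import numpy as np
from ultralytics import YOLO

def benchmark(model_path="yolo11n.pt", runs=30, warmup=5, imgsz=640):
    """Time single-frame inference and report mean/median/std latency and FPS."""
    model = YOLO(model_path)
    frame = np.random.randint(0, 255, (imgsz, imgsz, 3), dtype=np.uint8)  # dummy frame

    for _ in range(warmup):                       # exclude warm-up from statistics
        model.predict(frame, verbose=False)

    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        model.predict(frame, verbose=False)
        times.append((time.perf_counter() - t0) * 1000.0)  # milliseconds

    times = np.array(times)
    print(f"mean {times.mean():.2f} ms | median {np.median(times):.2f} ms | "
          f"std {times.std():.2f} ms | {1000.0 / times.mean():.1f} FPS")
```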
Grad-CAM Analysis
Attention Patterns in Mixed vs. Drone-Only Scenes
These qualitative observations mirror the quantitative results: with tight memory, human-centric clutter increases interference, whereas drone-only scenes yield cleaner multi-target attention and more robust tracking.
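For readers unfamiliar with the technique, a generic Grad-CAM sketch is given below on a classification backbone; this is an illustration of the method only, not the authors' detector analysis pipeline, and the choice of ResNet-18 and target layer is ours.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18, ResNet18_Weights

def grad_cam(model, layer, x, class_idx):
    """Generic Grad-CAM: weight the chosen layer's activations by the spatially
    pooled gradient of the target score, then ReLU and upsample to input size."""
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

    score = model(x)[0, class_idx]
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    weights = grads["g"].mean(dim=(2, 3), keepdim=True)            # pooled gradients
    cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))   # weighted activations
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

# Illustration only (not the detector pipeline):
model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()
x = torch.randn(1, 3, 224, 224, requires_grad=True)
heatmap = grad_cam(model, model.layer4, x, class_idx=0)
```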
Attention Evolution Across Replay Strategies
Qualitative Results
We validate the FAR model (5% replay) in environments not seen during training: an indoor laboratory without the textured carpet present in the training data, and outdoor settings for drone/person tracking.
Cross-Environment Target Following
The UAV tracks ground vehicles and aerial drones across indoor classrooms, constrained spaces, and outdoor scenes, demonstrating cross-environment robustness despite training exclusively on indoor data with textured flooring. Closed-loop tracking uses a PID controller to center the selected target in the camera frame.
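Closed-loop tracking is described as a PID controller that keeps the selected detection centred in the frame; a minimal sketch under that assumption is shown below, with illustrative gains, axes, and command interface that are not taken from the released code.

```python
class PID:
    """Textbook PID controller on a scalar error signal."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error, dt):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt if dt > 0 else 0.0
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def track_step(bbox, frame_w, frame_h, yaw_pid, pitch_pid, dt):
    """Turn the pixel offset of the target's bounding-box centre into normalized
    yaw/pitch rate commands (illustrative interface, not the authors' controller)."""
    cx, cy = (bbox[0] + bbox[2]) / 2.0, (bbox[1] + bbox[3]) / 2.0
    err_x = (cx - frame_w / 2.0) / (frame_w / 2.0)   # horizontal error in [-1, 1]
    err_y = (cy - frame_h / 2.0) / (frame_h / 2.0)   # vertical error in [-1, 1]
    return yaw_pid.step(err_x, dt), pitch_pid.step(err_y, dt)
```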
Contact
For questions and collaboration inquiries, please contact the authors through the GitHub repository or academic channels.
Acknowledgements
This work is supported in part by projects "Romanian Hub for Artificial Intelligence - HRIA", Smart Growth, Digitization and Financial Instruments Program, 2021-2027 (MySMIS No. 334906), European Health and Digital Executive Agency (HADEA) through DIGITWIN4CIUE (Grant No. 101084054), and "European Lighthouse of AI for Sustainability - ELIAS", Horizon Europe program (Grant No. 101120237).
Citation - Waiting for proceedings
```bibtex
@article{nae2026learningflyreplaybasedcontinual,
  title   = {Learning on the Fly: Replay-Based Continual Object Perception for Indoor Drones},
  author  = {Nae, Sebastian-Ion and Barbu, Mihai-Eugen and Mocanu, Sebastian and Leordeanu, Marius},
  journal = {arXiv preprint arXiv:2602.13440},
  year    = {2026}
}
```