Overview Video
A demo comparing the teacher and student methods, showcasing examples where the teacher is faster, where the student is faster, and where the teacher fails to complete the task.
Motivation and Contribution
Challenge: Classical IBVS methods suffer from numerical instabilities and singularities, while marker-based approaches (ArUco, AprilTags) limit deployment in dynamic indoor environments. GPS-denied scenarios demand efficient, marker-free visual servoing for quadrotor control.
Our Solution: We present a self-supervised neuro-analytical framework featuring a Numerically Stable, Efficient, and Reduced (NSER) Image-Based Visual Servoing (IBVS) teacher model, distilled into a lightweight 1.7M-parameter student network that runs 11x faster in real time with improved control accuracy.
Key Contributions
- Stable analytical teacher: An improved IBVS controller that resolves numerical instabilities through reduced classical equations, enabling robust marker-free control.
- Two-stage segmentation: YOLOv11 + U-Net mask splitter for anterior-posterior vehicle segmentation, accurately estimating target orientation.
- Efficient knowledge distillation: A dual-path system transferring geometric visual servoing knowledge from the teacher to a compact student network that outperforms it while remaining suitable for onboard deployment.
- Practical sim-to-real transfer: Digital-twin training with real-world fine-tuning, validated in GPS-denied indoor environments with minimal hardware.
Visual Performance for Teacher
Visual Performance for Student
Our Approach
We propose a Teacher-Student architecture to combine the stability of analytical methods with the efficiency of neural networks. The Teacher (NSER-IBVS) uses a numerically stable analytic control law to generate robust velocity commands. The Student, a lightweight CNN, learns to regress these commands directly from raw images, bypassing the expensive feature extraction pipeline.
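As a rough illustration of the student's role, the sketch below shows a lightweight PyTorch ConvNet regressing the three velocity commands (linear x, y and yaw rate) directly from an RGB frame. The layer widths and the loss shown are assumptions for illustration, not the paper's exact 1.7M-parameter architecture.

```python
import torch
import torch.nn as nn

class StudentNet(nn.Module):
    """Illustrative lightweight ConvNet that regresses velocity commands
    (vx, vy, yaw_rate) directly from an image. Layer sizes are assumptions,
    not the paper's exact 1.7M-parameter design."""

    def __init__(self, n_commands: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> fixed-size embedding
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_commands),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))

# Distillation target: match the teacher's velocity commands, e.g.
# loss = torch.nn.functional.mse_loss(student(img), teacher_commands)
```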
1. Knowledge Distillation & Self-Supervised Learning Pipeline
2. Numerically Stable IBVS Teacher (NSER)
3. Two-Stage Target Segmentation
Why Mask Splitting Matters: Solving Keypoint Ordering
Standard bounding-box approaches suffer from keypoint ordering ambiguity: the four corners can be assigned inconsistently across frames, destabilizing the IBVS control loop. Our mask splitter determines which part of the vehicle is the front and which is the back, enabling a consistent clockwise ordering of keypoints for stable visual servoing, as sketched below.
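A minimal sketch of how the anterior (front) mask can pin down the keypoint order is shown below. The helper is hypothetical; the paper's exact ordering rule may differ.

```python
import numpy as np

def order_keypoints(corners: np.ndarray, anterior_mask: np.ndarray) -> np.ndarray:
    """Order 4 corner keypoints clockwise, starting from the corner nearest
    the vehicle's front. `corners` is (4, 2) in pixel coordinates;
    `anterior_mask` is a binary HxW output of the mask splitter."""
    center = corners.mean(axis=0)
    # Sort corners by angle around the box center. With the image y-axis
    # pointing down, ascending angle corresponds to clockwise on screen.
    angles = np.arctan2(corners[:, 1] - center[1], corners[:, 0] - center[0])
    cw = corners[np.argsort(angles)]
    # Centroid of the anterior (front) half of the vehicle.
    ys, xs = np.nonzero(anterior_mask)
    front = np.array([xs.mean(), ys.mean()])
    # Rotate the cyclic order so index 0 is always the front-most corner,
    # making the ordering consistent across frames.
    start = np.argmin(np.linalg.norm(cw - front, axis=1))
    return np.roll(cw, -start, axis=0)
```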
4. Student Neural Network
Goal State Reference
The IBVS controller computes velocity commands by comparing current keypoints to a reference (goal) configuration. Below are the reference images used for real-world and simulated environments:
This integrated approach combines the robustness of analytical control theory with the efficiency and adaptability of neural networks, enabling practical deployment on resource-constrained aerial platforms.
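For reference, the textbook IBVS control law that this keypoint comparison feeds can be sketched as below. This is the classical formulation, not the paper's reduced NSER equations; the constant depth Z and gain lam are illustration-only assumptions.

```python
import numpy as np

def classical_ibvs(s: np.ndarray, s_star: np.ndarray, Z: float,
                   lam: float = 0.5) -> np.ndarray:
    """Textbook IBVS: v = -lam * pinv(L) @ (s - s*).
    s, s_star: (4, 2) current and goal keypoints in normalized image
    coordinates; returns a 6-DOF camera velocity command."""
    rows = []
    for x, y in s:
        # Interaction (image Jacobian) matrix of a point feature at depth Z.
        rows.append([-1 / Z, 0, x / Z, x * y, -(1 + x * x), y])
        rows.append([0, -1 / Z, y / Z, 1 + y * y, -x * y, -x])
    L = np.asarray(rows)                 # (8, 6) stacked interaction matrix
    e = (s - s_star).reshape(-1)         # stacked feature error
    return -lam * np.linalg.pinv(L) @ e
```

The pseudo-inverse in this classical form is the main source of the singularities and numerical instabilities mentioned above, which is exactly what the reduced NSER formulation is designed to avoid.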
Experimental Setup
Evaluation Framework: High-fidelity digital twin simulator + real-world indoor GPS-denied flights
Evaluation Metrics
- Flight performance: Distance, time
- Control accuracy: Final norm error (px)
- Tracking quality: IoU over the last 3 s (see the metric sketch below)
- Efficiency: Inference time, FPS
Teacher: Numerically stable IBVS
Student: 1.7M ConvNet (11x faster)
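To make the accuracy and tracking metrics concrete, the sketch below shows one plausible way to compute them from per-frame logs. The variable names, logging format, and aggregation choices are assumptions.

```python
import numpy as np

def final_norm_error(kp_log: np.ndarray, kp_goal: np.ndarray,
                     fps: float, window_s: float = 3.0) -> float:
    """Median L2 norm (px) of the stacked 4-corner error over the
    last `window_s` seconds. kp_log: (T, 4, 2); kp_goal: (4, 2)."""
    last = kp_log[-int(window_s * fps):]
    errs = np.linalg.norm((last - kp_goal).reshape(len(last), -1), axis=1)
    return float(np.median(errs))

def final_iou(mask_log: np.ndarray, mask_goal: np.ndarray,
              fps: float, window_s: float = 3.0) -> float:
    """Mean IoU between predicted and goal masks over the last 3 seconds.
    mask_log: (T, H, W) binary; mask_goal: (H, W) binary."""
    last = mask_log[-int(window_s * fps):].astype(bool)
    goal = mask_goal.astype(bool)
    inter = (last & goal).reshape(len(last), -1).sum(axis=1)
    union = (last | goal).reshape(len(last), -1).sum(axis=1)
    return float((inter / np.maximum(union, 1)).mean())
```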
Digital Twin Generation
Our simulation environment addresses the Sim-to-Real gap using a high-fidelity pipeline:
- Physics Engine: Parrot Sphinx for accurate aerodynamic modeling.
- Rendering: Unreal Engine 4 for photorealistic visual feedback.
- Assets: Custom .FBX vehicle and environment assets matching real-world measurements.
Hardware Requirements
- Simulation: Ubuntu 22.04/24.04, NVIDIA GPU (CUDA), 8GB+ RAM
- Real-World: Parrot Anafi 4K, laptop with WiFi + GPU
- Environment: Indoor with Lambertian floor surface
Mission Termination
- Hard Goal: Median error < 40px for 3 consecutive seconds
- Soft Goal: Median error < 80px AND all velocities = 0 for 3s
- Timeout: 75 seconds maximum flight duration
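A minimal check implementing these criteria might look like the sketch below, assuming one call per frame with the current pixel error norm and commanded velocities. The rolling-median detail is an assumption.

```python
from collections import deque
import numpy as np

def make_termination_checker(fps: float, timeout_s: float = 75.0):
    """Returns a per-frame checker for the termination criteria above."""
    window = deque(maxlen=int(3.0 * fps))  # rolling 3-second window
    frames = 0

    def check(err_px: float, velocities: np.ndarray) -> str | None:
        nonlocal frames
        frames += 1
        window.append((err_px, float(np.abs(velocities).max())))
        if len(window) == window.maxlen:  # a full 3 s of history
            med_err = np.median([e for e, _ in window])
            if med_err < 40.0:
                return "hard_goal"
            if med_err < 80.0 and all(v == 0.0 for _, v in window):
                return "soft_goal"
        if frames / fps >= timeout_s:
            return "timeout"
        return None

    return check
```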
Pre-trained Models
All models are available in the code repository.
| Model | Params | Description |
|---|---|---|
| YOLOv11n Segmentation | 2.84M | Vehicle segmentation (sim / real) |
| Mask Splitter (U-Net) | 1.94M | Anterior-posterior splitting (sim / real) |
| Student Network | 1.7M | Direct velocity regression (sim-pretrained + real fine-tuned) |
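Loading might look like the snippet below, assuming the YOLOv11 checkpoint uses the standard Ultralytics format and the other two models are plain PyTorch modules exposed by the repository. All file names here are illustrative.

```python
from ultralytics import YOLO  # pip install ultralytics

# Illustrative checkpoint path; see the repository for the actual file names.
segmenter = YOLO("weights/yolo11n_seg_vehicle.pt")
results = segmenter("frame.jpg")  # instance masks for the vehicle

# The mask splitter (U-Net) and student are plain PyTorch modules; assuming
# the repository exposes their classes, loading would look like:
#   splitter = MaskSplitterUNet()
#   splitter.load_state_dict(torch.load("weights/mask_splitter_real.pt"))
#   student = StudentNet()  # see the earlier sketch
#   student.load_state_dict(torch.load("weights/student_real_finetuned.pt"))
```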
Visual Results
Control command and error evolutions over time
Temporal Evolution of Control. Comparison of control commands and error evolution on novel test sequences across 8 different starting points. Top row: real-world flights. Bottom row: simulation. Solid lines represent mean values; shaded areas indicate variability across runs. All control commands and errors converge toward zero, indicating robust trajectory tracking. Note the striking similarity in control behavior between the complex analytical Teacher (panels 1 and 3) and the lightweight Student ConvNet (panels 2 and 4) in both domains, validating our Sim-to-Real transfer pipeline.
Command Distribution Comparison
Sim-to-Real Domain Alignment. Aggregated probability density functions of control commands (linear velocities x, y and angular yaw rate rot) across all experiments. The strong overlap between the Real-World and Digital-Twin distributions validates the fidelity of our simulation environment.
Trajectory Analysis
2D Flight Trajectories. Trajectories of teacher and student simulation flights, with mean and standard deviation across 4 starting poses. Left: front approaches. Right: up approaches. Green circles indicate starting points; the star marks the goal pose. Solid lines show the mean Teacher (NSER IBVS) trajectory and dashed lines the mean Student trajectory; shaded regions indicate variability across runs. While the Student displays slightly more path variation, it frequently achieves a shorter average path to the target than the analytical Teacher.
Numeric Results
| Direction | Method | Sim Distance (m) / Time (s) ↓ | Sim Norm Error (px) ↓ | Sim IoU ↑ | Real Distance (m) / Time (s) ↓ | Real Norm Error (px) ↓ | Real IoU ↑ |
|---|---|---|---|---|---|---|---|
| Up-Left | Teacher | 5.312 / 23.466 | 29.256 | 0.530 | 5.798 / 36.721 | 30.134 | 0.620 |
| | Student | 5.193 / 24.164 | 13.319 | 0.759 | 5.735 / 43.334 | 28.600 | 0.6263 |
| Up-Right | Teacher | 5.675 / 24.226 | 31.800 | 0.503 | 5.622 / 41.581 | 31.499 | 0.621 |
| | Student | 6.064 / 28.298 | 13.172 | 0.766 | 5.716 / 45.885 | 22.802 | 0.6919 |
| Front-Left | Teacher | 6.196 / 27.315 | 30.706 | 0.517 | 6.493 / 37.535 | 28.540 | 0.611 |
| | Student | 6.041 / 27.917 | 13.430 | 0.758 | 6.490 / 47.238 | 33.981 | 0.560 |
| Front-Right | Teacher | 6.846 / 32.358 | 32.608 | 0.488 | 6.197 / 37.166 | 33.150 | 0.658 |
| | Student | 7.043 / 35.535 | 18.028 | 0.718 | 6.316 / 46.363 | 31.035 | 0.627 |
| Left | Teacher | 4.177 / 20.228 | 31.453 | 0.519 | 4.559 / 29.921 | 28.015 | 0.629 |
| | Student | 4.089 / 20.481 | 13.243 | 0.762 | 5.065 / 43.841 | 30.853 | 0.6482 |
| Right | Teacher | 4.317 / 19.637 | 31.137 | 0.494 | 4.831 / 41.409 | 32.423 | 0.612 |
| | Student | 4.518 / 21.987 | 13.798 | 0.759 | 4.811 / 57.245 | 43.672 | 0.500 |
| Down-Left | Teacher | 2.779 / 15.988 | 28.473 | 0.518 | 4.384 / 31.622 | 28.000 | 0.611 |
| | Student | 2.777 / 14.900 | 13.257 | 0.763 | 4.326 / 41.044 | 39.531 | 0.5253 |
| Down-Right | Teacher | 2.893 / 13.667 | 22.618 | 0.606 | 4.137 / 33.523 | 27.890 | 0.654 |
| | Student | 3.145 / 17.035 | 15.839 | 0.728 | 3.938 / 38.001 | 36.195 | 0.5478 |
| Mean | Teacher | 4.774 / 22.111 | 29.756 | 0.522 | 5.253 / 36.185 | 29.956 | 0.627 |
| | Student | 4.859 / 23.790 | 14.261 | 0.752 | 5.300 / 45.369 | 33.334 | 0.591 |
Table 1. Teacher-student comparison across different starting positions. The left half shows results in the simulator; the right half shows real-world flights. Metrics include total flight distance/time, final norm error in pixels (L2 norm of the error vector over all 4 corner points combined), and final IoU (both computed over the last 3 seconds of flight). The teacher is slightly faster in flight time; the student is 11x faster in computation time (see Tab. 2). The student is slightly more accurate (lower final errors) in simulation, where it was trained more extensively, but slightly less accurate in the real world, where it was fine-tuned on far fewer flights.
Inference Time Analysis
| Evaluator | Avg (ms) | Std (ms) | Med (ms) | Min (ms) | Max (ms) | FPS |
|---|---|---|---|---|---|---|
| NSER IBVS | 20.69 | 7.63 | 24.56 | 6.45 | 82.55 | 48.30 |
| Student | 1.85 | 0.93 | 1.84 | 1.79 | 235.64 | 540.8 |
Table 2. Computation times (in milliseconds) over 30 trials. The small 1.7M-parameter student ConvNet is 11x faster than the teacher.
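A minimal timing harness of the kind that could produce such numbers is sketched below; warm-up handling and CUDA synchronization details are assumptions.

```python
import time
import numpy as np
import torch

@torch.no_grad()
def benchmark(model: torch.nn.Module, x: torch.Tensor, trials: int = 30) -> dict:
    """Per-inference latency in milliseconds over `trials` runs."""
    model.eval()
    times = []
    for _ in range(trials):
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # finish pending GPU work before timing
        t0 = time.perf_counter()
        model(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        times.append((time.perf_counter() - t0) * 1e3)
    t = np.array(times)
    return {"avg": t.mean(), "std": t.std(), "med": np.median(t),
            "min": t.min(), "max": t.max(), "fps": 1000.0 / t.mean()}

# Example: benchmark(StudentNet(), torch.randn(1, 3, 224, 224))
```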
Contact
For questions and collaboration inquiries, please contact the authors through the GitHub repository or academic channels.
Acknowledgements
This work is supported by projects "Romanian Hub for Artificial Intelligence - HRIA", Smart Growth, Digitization and Financial Instruments Program, 2021-2027 (MySMIS No. 334906), European Health and Digital Executive Agency (HADEA) through DIGITWIN4CIUE (Grant No. 101084054), and "European Lighthouse of AI for Sustainability - ELIAS", Horizon Europe program (Grant No. 101120237).
Citation
@InProceedings{Mocanu_2025_ICCV,
author = {Mocanu, Sebastian and Nae, Sebastian-Ion and Barbu, Mihai-Eugen and Leordeanu, Marius},
title = {Efficient Self-Supervised Neuro-Analytic Visual Servoing for Real-time Quadrotor Control},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
month = {October},
year = {2025},
pages = {1744-1753}
}