Efficient Self-Supervised Neuro-Analytic Visual Servoing for Real-time Quadrotor Control

ICCV 2025 - The 21st Embedded Vision Workshop (Oral)
1National University of Science and Technology POLITEHNICA Bucharest, Romania
2Institute of Mathematics "Simion Stoilow" of the Romanian Academy, Romania
3NORCE Norwegian Research Center, Norway

Overview Video

Demo comparing the teacher and student methods, showcasing examples where the teacher is faster, where the student is faster, and where the teacher fails to complete the task.

Motivation and Contribution

Challenge: Classical IBVS methods suffer from numerical instabilities and singularities, while marker-based approaches (ArUco, AprilTags) limit deployment in dynamic indoor environments. GPS-denied scenarios demand efficient, marker-free visual servoing for quadrotor control.

Our Solution: We present a self-supervised neuro-analytical framework featuring a Numerically Stable Efficient and Reduced (NSER) Image-Based Visual Servoing (IBVS) teacher model, distilled into a lightweight 1.7M-parameter student network that runs 11x faster in real time with improved control accuracy.

Key Contributions

  • Stable analytical teacher: Improved IBVS controller solving numerical instabilities through reduced classical equations, enabling robust marker-free control.
  • Two-stage segmentation: YOLOv11 + U-Net mask splitter for anterior-posterior vehicle segmentation, accurately estimating target orientation.
  • Efficient knowledge distillation: Dual-path system transferring geometric visual servoing from the teacher to a compact student neural network that outperforms the teacher while remaining suitable for onboard deployment.
  • Practical sim-to-real transfer: Digital-twin training with real-world fine-tuning, validated in GPS-denied indoor environments with minimal hardware.

Visual Performance for Teacher

Front-Left Approach
Up-Left Approach
Up-Right Approach
Left Approach
Right Approach
Down-Left Approach
Down-Right Approach
Front-Center Approach (Teacher Fails)

Visual Performance for Student

Front-Left Approach
Up-Left Approach
Up-Right Approach
Left Approach
Right Approach
Down-Left Approach
Down-Right Approach
Front-Center Approach (Student Succeeds)

Our Approach

We propose a Teacher-Student architecture to combine the stability of analytical methods with the efficiency of neural networks. The Teacher (NSER-IBVS) uses a numerically stable analytic control law to generate robust velocity commands. The Student, a lightweight CNN, learns to regress these commands directly from raw images, bypassing the expensive feature extraction pipeline.

1. Knowledge Distillation & Self-Supervised Learning Pipeline

Diagram showing teacher-student knowledge distillation pipeline with the teacher IBVS model transferring knowledge to a lightweight student neural network
Lightweight 1.7M-parameter student network trained via knowledge distillation from the teacher. Simulator-based data generation with sim-to-real transfer achieves 11x faster inference (1.85 ms vs. 20.69 ms).
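
As a rough sketch of the distillation step, the loop below assumes a dataset of (frame, teacher command) pairs logged while the NSER-IBVS teacher flies in the digital twin; the MSE objective, optimizer settings, and 4-D command layout are illustrative assumptions rather than the paper's exact training recipe.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def distill(student: nn.Module, loader: DataLoader, epochs: int = 20) -> nn.Module:
    """Train the student to imitate the teacher's velocity commands.

    `loader` is assumed to yield (frame, teacher_command) pairs logged while
    the NSER-IBVS teacher flies in the digital twin; the command is assumed
    to be a 4-D vector, e.g. [vx, vy, vz, yaw_rate].
    """
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
    criterion = nn.MSELoss()  # plain behavioral cloning of the teacher's commands
    for _ in range(epochs):
        for frames, teacher_cmds in loader:
            pred_cmds = student(frames)
            loss = criterion(pred_cmds, teacher_cmds)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```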

2. Numerically Stable IBVS Teacher (NSER)

Architecture diagram of the NSER IBVS teacher showing the reduced formulation for numerical stability
Analytically stable Image-Based Visual Servoing controller addressing classical IBVS numerical instabilities through reduced formulations and robust error computation from corner points.
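
For context, the following sketch shows the classical point-feature IBVS law that the NSER teacher reduces and stabilizes; the depth estimates, gain, and keypoint layout are illustrative assumptions, and the actual reduced formulation is the one described in the paper.

```python
import numpy as np

def interaction_matrix(x: float, y: float, Z: float) -> np.ndarray:
    """Classical interaction matrix of one image point (x, y), in normalized
    camera coordinates, at estimated depth Z."""
    return np.array([
        [-1.0 / Z, 0.0,      x / Z, x * y,        -(1.0 + x * x),  y],
        [0.0,     -1.0 / Z,  y / Z, 1.0 + y * y,  -x * y,         -x],
    ])

def ibvs_command(points, goal_points, depths, gain=0.5) -> np.ndarray:
    """Velocity command v = -gain * L^+ (s - s*) from current and goal keypoints.

    `points` and `goal_points` are Nx2 arrays (the four ordered corners),
    `depths` are per-point depth estimates; returns the 6-DoF camera twist
    [vx, vy, vz, wx, wy, wz].
    """
    error = (np.asarray(points) - np.asarray(goal_points)).reshape(-1)
    L = np.vstack([interaction_matrix(x, y, Z)
                   for (x, y), Z in zip(points, depths)])  # 2N x 6
    return -gain * np.linalg.pinv(L) @ error
```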

3. Two-Stage Target Segmentation

Diagram of the two-stage segmentation pipeline: YOLOv11 detection followed by U-Net mask splitter for anterior-posterior separation
YOLOv11 detection refined with U-Net segmentation splitter to distinguish anterior-posterior vehicle halves for precise orientation estimation.
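
A minimal sketch of how the two stages could be chained, assuming the Ultralytics YOLO API for stage 1 and a U-Net splitter saved as a full PyTorch module for stage 2; the weight filenames and the splitter's input format are assumptions, not the repository's exact interface.

```python
import numpy as np
import torch
from ultralytics import YOLO

# Placeholder weight paths; the actual checkpoints are provided in the repository.
detector = YOLO("yolo11n-seg.pt")           # stage 1: vehicle segmentation
splitter = torch.load("mask_splitter.pt")   # stage 2: U-Net anterior/posterior splitter
splitter.eval()

def segment_target(frame: np.ndarray):
    """Return (vehicle, front, back) binary masks for the first detection, or None."""
    result = detector(frame)[0]
    if result.masks is None:                # no vehicle found in this frame
        return None
    vehicle = result.masks.data[0].cpu().numpy()           # HxW stage-1 mask
    inp = torch.from_numpy(vehicle)[None, None].float()    # 1x1xHxW splitter input
    with torch.no_grad():
        halves = torch.sigmoid(splitter(inp))[0]           # 2xHxW front/back probabilities
    front, back = (halves > 0.5).cpu().numpy()
    return vehicle, front, back
```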

Why Mask Splitting Matters: Solving Keypoint Ordering

Naive keypoint detection showing ambiguous corner assignment from YOLO bounding box
Naive: YOLO bounding box corners lack orientation awareness
Analytically recomputed keypoints with improved but still ambiguous ordering
Analytical: Recomputed corners, still orientation-ambiguous
Mask splitter output with consistently ordered keypoints based on front-back segmentation
Ours: Mask splitter enables consistent pose-based ordering

Standard bounding box approaches suffer from keypoint ordering ambiguity: the four corners can be assigned inconsistently across frames, destabilizing the IBVS control loop. Our mask splitter determines which part of the vehicle is the front and which is the back, enabling a consistent clockwise ordering of keypoints for stable visual servoing.
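
The sketch below illustrates this idea under the assumption that the keypoints are the corners of an oriented box fitted to the vehicle mask: the front-half centroid fixes the first corner and the remaining corners follow in clockwise image order.

```python
import cv2
import numpy as np

def ordered_keypoints(vehicle_mask: np.ndarray, front_mask: np.ndarray) -> np.ndarray:
    """Return the four oriented-box corners in a pose-consistent clockwise order.

    The corner closest to the front-half centroid is taken as the first keypoint,
    so the ordering no longer flips when the target's orientation changes.
    """
    pts = cv2.findNonZero(vehicle_mask.astype(np.uint8))
    corners = cv2.boxPoints(cv2.minAreaRect(pts))             # 4x2 oriented box corners
    center = corners.mean(axis=0)
    front_c = np.flip(np.argwhere(front_mask).mean(axis=0))   # front centroid as (x, y)

    # Sort by angle around the box centre; with the image y-axis pointing down,
    # increasing angle corresponds to clockwise order on screen.
    angles = np.arctan2(corners[:, 1] - center[1], corners[:, 0] - center[0])
    cw = corners[np.argsort(angles)]

    # Rotate the cyclic order so it starts at the corner nearest the front half.
    start = int(np.argmin(np.linalg.norm(cw - front_c, axis=1)))
    return np.roll(cw, -start, axis=0)
```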

4. Student Neural Network

Diagram of the student architecture: a frame is passed through the neural network layers to produce angular and linear velocities.
Compact 1.7M parameter CNN directly regresses velocity commands from RGB input, bypassing explicit visual servoing computation.
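
An illustrative PyTorch skeleton with the same input/output contract is shown below; the layer sizes and the 4-D command vector are assumptions and do not reproduce the exact published architecture.

```python
import torch
import torch.nn as nn

class StudentNet(nn.Module):
    """Illustrative skeleton with the student's input/output contract:
    an RGB frame in, a velocity command out. Layer sizes are placeholders
    and do not reproduce the exact 1.7M-parameter architecture."""

    def __init__(self, n_commands: int = 4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, n_commands),   # e.g. [vx, vy, vz, yaw_rate]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(x))

# One 224x224 RGB frame in, one velocity command vector out.
cmd = StudentNet()(torch.zeros(1, 3, 224, 224))
```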

Goal State Reference

The IBVS controller computes velocity commands by comparing current keypoints to a reference (goal) configuration. Below are the reference images used for real-world and simulated environments:

Reference goal image for real-world flights showing target vehicle at desired pose
Real-World Reference
Reference goal image for simulated flights showing target vehicle at desired pose
Simulator Reference

This integrated approach combines the robustness of analytical control theory with the efficiency and adaptability of neural networks, enabling practical deployment on resource-constrained aerial platforms.

Experimental Setup

Diagram showing 8 directional starting positions around the target vehicle: up-left, up-right, front-left, front-right, left, right, down-left, down-right
8 directional starting positions

Evaluation Framework: High-fidelity digital twin simulator + real-world indoor GPS-denied flights

Evaluation Metrics

  • Flight performance: Distance, time
  • Control accuracy: Final norm error (px)
  • Tracking quality: IoU (last 3s)
  • Efficiency: Inference time, FPS

Teacher: Numerically stable IBVS
Student: 1.7M ConvNet (11x faster)
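
A minimal sketch of how the endpoint metrics listed above could be computed from logged flight data, assuming per-frame keypoints, masks, and timestamps are recorded; aggregating with a median/mean over the last 3 seconds mirrors the table captions, though the paper's exact aggregation may differ.

```python
import numpy as np

def final_norm_error(keypoints, goal_keypoints, timestamps, window_s=3.0) -> float:
    """Median L2 norm (px) of the stacked 4-corner error over the last seconds."""
    t = np.asarray(timestamps)
    recent = t >= t[-1] - window_s
    diff = np.asarray(keypoints)[recent] - np.asarray(goal_keypoints)   # K x 4 x 2
    return float(np.median(np.linalg.norm(diff.reshape(diff.shape[0], -1), axis=1)))

def final_iou(masks, goal_mask, timestamps, window_s=3.0) -> float:
    """Mean IoU between predicted target masks and the goal mask over the last seconds."""
    t = np.asarray(timestamps)
    ious = []
    for i in np.where(t >= t[-1] - window_s)[0]:
        inter = np.logical_and(masks[i], goal_mask).sum()
        union = np.logical_or(masks[i], goal_mask).sum()
        ious.append(inter / max(union, 1))
    return float(np.mean(ious))
```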

Digital Twin Generation

Our simulation environment addresses the Sim-to-Real gap using a high-fidelity pipeline:

  • Physics Engine: Parrot Sphinx for accurate aerodynamic modeling.
  • Rendering: Unreal Engine 4 for photorealistic visual feedback.
  • Assets: Custom .FBX vehicle and environment assets matching real-world measurements.
Drone flying in the simulator with the Teacher model running, shown from the drone and a third-person perspective
Teacher model in the simulated environment, shown from both the drone's and a third-person observer's perspective

Hardware Requirements

  • Simulation: Ubuntu 22.04/24.04, NVIDIA GPU (CUDA), 8GB+ RAM
  • Real-World: Parrot Anafi 4K, laptop with WiFi + GPU
  • Environment: Indoor with Lambertian floor surface

Mission Termination

  • Hard Goal: Median error < 40px for 3 consecutive seconds
  • Soft Goal: Median error < 80px AND all velocities = 0 for 3s
  • Timeout: 75 seconds maximum flight duration
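
A sketch of how these checks could be evaluated at every control step, assuming per-frame error and velocity histories; the thresholds follow the criteria above, everything else is illustrative.

```python
import numpy as np

def mission_status(errors_px, velocities, timestamps,
                   window_s=3.0, timeout_s=75.0):
    """Evaluate the hard-goal, soft-goal, and timeout conditions listed above.

    `errors_px` and `velocities` are per-frame histories for the current flight;
    returns a status string once a condition fires, otherwise None.
    """
    t = np.asarray(timestamps)
    if t[-1] >= timeout_s:
        return "timeout"
    recent = t >= t[-1] - window_s
    median_err = np.median(np.asarray(errors_px)[recent])
    if median_err < 40.0:
        return "hard_goal"
    if median_err < 80.0 and np.all(np.asarray(velocities)[recent] == 0):
        return "soft_goal"
    return None
```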

Pre-trained Models

All models are available in the code repository.

Pre-trained model weights and specifications
Model | Params | Description
YOLOv11n Segmentation | 2.84M | Vehicle segmentation (sim / real)
Mask Splitter (U-Net) | 1.94M | Anterior-posterior splitting (sim / real)
Student Network | 1.7M | Direct velocity regression (sim-pretrained + real fine-tuned)

Visual Results

Control command and error evolutions over time

Graph showing control command and error evolution over time for the teacher IBVS in real flight conditions
1. Real flight - Teacher (IBVS)
Graph showing control command and error evolution over time for the student network in real flight conditions
2. Real flight - Student
Graph showing control command and error evolution over time for the teacher IBVS in simulation
3. Simulation - Teacher (IBVS)
Graph showing control command and error evolution over time for the student network in simulation
4. Simulation - Student

Temporal Evolution of Control. Comparison of control commands and error evolution on novel test sequences across 8 different starting points. Top Row: Real-world flights. Bottom Row: Simulation. Solid lines represent mean values; shaded areas indicate variability across runs. All control commands and errors converge toward zero, indicating robust trajectory tracking. Note the striking similarity in control behavior between the complex analytical Teacher (1 and 3) and the lightweight Student ConvNet (2 and 4) in both domains, validating our Sim-to-Real transfer pipeline.

Command Distribution Comparison

Probability density plots comparing control command distributions between real-world and digital-twin flights
Real-world vs. digital-twin distributions of commands (linear and angular yaw velocities), aggregated over all experiments.

Sim-to-Real Domain Alignment. Aggregated probability density functions of control commands (linear velocities x, y and angular yaw rate rot) across all experiments. The strong overlap between the Real-World and Digital-Twin distributions validates the fidelity of our simulation environment.

Trajectory Analysis

Visualization of drone flight trajectories from front-left and front-right starting positions
Front-Left & Front-Right Initialization Trajectories
Visualization of drone flight trajectories from up-left and up-right starting positions
Up-Left & Up-Right Initialization Trajectories

2D Flight Trajectories. Trajectories of teacher and student simulation flights with mean and standard deviation across 4 starting poses. Left: Front approaches. Right: Up approaches. Green circles indicate starting points; the star represents the goal pose.
Lines show the mean trajectory over the experiments (solid lines = Teacher NSER IBVS, dashed lines = Student); shaded regions indicate trajectory variability across runs. Note that while the Student shows slightly more path variation, it frequently achieves a shorter average path to the target than the analytical Teacher.

Numeric Results

Teacher-student comparison across different starting positions
Direction | Model | SIM Distance (m) / Time (s) ↓ | SIM Norm Error (px) ↓ | SIM IoU ↑ | Real Distance (m) / Time (s) ↓ | Real Norm Error (px) ↓ | Real IoU ↑
Up-Left | Teacher | 5.312 / 23.466 | 29.256 | 0.530 | 5.798 / 36.721 | 30.134 | 0.620
Up-Left | Student | 5.193 / 24.164 | 13.319 | 0.759 | 5.735 / 43.334 | 28.600 | 0.6263
Up-Right | Teacher | 5.675 / 24.226 | 31.800 | 0.503 | 5.622 / 41.581 | 31.499 | 0.621
Up-Right | Student | 6.064 / 28.298 | 13.172 | 0.766 | 5.716 / 45.885 | 22.802 | 0.6919
Front-Left | Teacher | 6.196 / 27.315 | 30.706 | 0.517 | 6.493 / 37.535 | 28.540 | 0.611
Front-Left | Student | 6.041 / 27.917 | 13.430 | 0.758 | 6.490 / 47.238 | 33.981 | 0.560
Front-Right | Teacher | 6.846 / 32.358 | 32.608 | 0.488 | 6.197 / 37.166 | 33.150 | 0.658
Front-Right | Student | 7.043 / 35.535 | 18.028 | 0.718 | 6.316 / 46.363 | 31.035 | 0.627
Left | Teacher | 4.177 / 20.228 | 31.453 | 0.519 | 4.559 / 29.921 | 28.015 | 0.629
Left | Student | 4.089 / 20.481 | 13.243 | 0.762 | 5.065 / 43.841 | 30.853 | 0.6482
Right | Teacher | 4.317 / 19.637 | 31.137 | 0.494 | 4.831 / 41.409 | 32.423 | 0.612
Right | Student | 4.518 / 21.987 | 13.798 | 0.759 | 4.811 / 57.245 | 43.672 | 0.500
Down-Left | Teacher | 2.779 / 15.988 | 28.473 | 0.518 | 4.384 / 31.622 | 28.000 | 0.611
Down-Left | Student | 2.777 / 14.900 | 13.257 | 0.763 | 4.326 / 41.044 | 39.531 | 0.5253
Down-Right | Teacher | 2.893 / 13.667 | 22.618 | 0.606 | 4.137 / 33.523 | 27.890 | 0.654
Down-Right | Student | 3.145 / 17.035 | 15.839 | 0.728 | 3.938 / 38.001 | 36.195 | 0.5478
Mean | Teacher | 4.774 / 22.111 | 29.756 | 0.522 | 5.253 / 36.185 | 29.956 | 0.627
Mean | Student | 4.859 / 23.790 | 14.261 | 0.752 | 5.300 / 45.369 | 33.334 | 0.591

Table 1. Teacher-student comparison across different starting positions. The left side shows results in the simulator; the right side shows results from flights in the real world. Metrics include total flight distance/time, final norm error in pixels (L2 norm of the error vector for all 4 corner points combined), and final IoU (both the L2 norm and the IoU are computed over the last 3 seconds of the flight). The teacher is slightly faster in flight time; the student is 11x faster in computation time (see Tab. 2). The student is slightly more accurate (lower errors at the destination than the teacher) in simulation, where it was trained more, but slightly less accurate in the real world, where it was trained on far fewer flights.

Inference Time Analysis

Computation times comparison between NSER IBVS and Student model
Evaluator | Avg (ms) | Std (ms) | Med (ms) | Min (ms) | Max (ms) | FPS
NSER IBVS | 20.69 | 7.63 | 24.56 | 6.45 | 82.55 | 48.30
Student | 1.85 | 0.93 | 1.84 | 1.79 | 235.64 | 540.8

Table 2. Computation times (in milliseconds) over 30 trials. The small 1.7M-parameter student ConvNet is 11x faster than the teacher.
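
To reproduce this kind of measurement, a simple timing harness along the following lines could be used, where `fn` stands for either evaluator's per-frame call; the warm-up count and reported statistics are assumptions about the protocol.

```python
import time
import numpy as np

def benchmark(fn, frame, trials=30, warmup=5):
    """Time a single-frame inference call and report stats in milliseconds.

    For a GPU-backed model, wrap the timed call with torch.cuda.synchronize()
    so asynchronous kernel launches are not measured as instantaneous.
    """
    for _ in range(warmup):
        fn(frame)                          # warm-up runs are excluded from stats
    times_ms = []
    for _ in range(trials):
        start = time.perf_counter()
        fn(frame)
        times_ms.append((time.perf_counter() - start) * 1e3)
    times_ms = np.asarray(times_ms)
    return {"avg": times_ms.mean(), "std": times_ms.std(),
            "med": float(np.median(times_ms)),
            "min": times_ms.min(), "max": times_ms.max(),
            "fps": 1e3 / times_ms.mean()}
```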

Contact

For questions and collaboration inquiries, please contact the authors through the GitHub repository or academic channels.

Acknowledgements

This work is supported by projects "Romanian Hub for Artificial Intelligence - HRIA", Smart Growth, Digitization and Financial Instruments Program, 2021-2027 (MySMIS No. 334906), European Health and Digital Executive Agency (HADEA) through DIGITWIN4CIUE (Grant No. 101084054), and "European Lighthouse of AI for Sustainability - ELIAS", Horizon Europe program (Grant No. 101120237).

Citation

@InProceedings{Mocanu_2025_ICCV,
    author    = {Mocanu, Sebastian and Nae, Sebastian-Ion and Barbu, Mihai-Eugen and Leordeanu, Marius},
    title     = {Efficient Self-Supervised Neuro-Analytic Visual Servoing for Real-time Quadrotor Control},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {1744-1753}
}