PhD Student
Vision, Speech, Signal Processing
I am a PhD student at the University of Surrey's People-Centred AI Institute, specializing in multimodal deep learning at the intersection of vision, language, and audio processing. My focus is on leveraging foundation AI models to learn audio-visual correspondence and solve real-world challenges.
I work within the Universal Perception (UP) Lab under the guidance of Dr. Xiatian Zhu and Dr. Diptesh Kanojia.
Before starting my PhD, I worked as a researcher at TCS-Research, Mumbai, under Dr. Sunil Kumar Kopparapu, where I developed solutions for audio event detection, multimodal emotion recognition, and pathological speech processing, contributing to publications and patents in speech and audio signal processing.
(A complete list of my publications is available here).
More Than A Shortcut: A Hyperbolic Approach To Early-Exit Networks
IEEE ICASSP (2026)
Unsupervised Audio-Visual Segmentation with Modality Alignment
AAAI (2025)
AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis
NeurIPS (2024)
Centrality-aware Product Retrieval and Ranking
EMNLP (2024) [Industry Track]
DiffSED: Sound Event Detection with Denoising Diffusion
AAAI (2024) [Oral]
A Novel Metric For Evaluating Audio Caption Similarity
IEEE ICASSP (2023)
Calibration-Free Meta-Learning Based Approach for Subject-Independent EEG Emotion Recognition
Biomedical Signal Processing and Control 72 (2022)
Contrastive Learning of Cough Descriptors for Automatic COVID-19 Preliminary Diagnosis
Interspeech (2021)
Deep Lung Auscultation using Acoustic biomarkers for Abnormal Respiratory Sound Event Detection
IEEE ICASSP (2021)
Deep Encoded Linguistic and Acoustic cues for Attention based End-to-End Speech Emotion Recognition
IEEE ICASSP (2020)
Automatic Speaker Independent Dysarthric Speech Intelligibility Assessment System
Computer Speech and Language (2021)
End-to-End Spoken Language Understanding: Bootstrapping in Low Resource Scenarios
Interspeech (2019)
Research Scientist Intern,
Adobe Research,
London, UK
Present
Multimodal LLMs.
Research Scientist Intern,
Mitsubishi Electric Research Laboratories,
Speech and Audio team, Cambridge, MA, US
2025.09 ~ 2026.03
Geometry-aware spatial room impulse response modeling; 3D Audio-visual correspondence learning.
Research Scientist Intern,
Meta,
Reality Labs Research Audio, Cambridge, UK
2025.03 ~ 2025.09
Real-time sound recognition for smart glasses; Encoding acoustic awareness for AI interactions.
PhD,
Universal Perception (UP) lab,
People-Centred AI, University of Surrey, UK
2022.09 ~ Present
Multimodal AI.
Researcher,
Speech and NLP team, TCS-Research,
Mumbai, India
2019.08 ~ 2022.09
Audio and Speech Signal Processing, Few-shot Audio Event Detection, Audio Captioning.
Research Intern,
Speech and NLP team, TCS-Research,
Mumbai, India
2019.01 ~ 2019.06
End-to-End Spoken Language Understanding.