Swapnil Bhosale | PhD Researcher in Audio-Visual Correspondence Learning

I am a PhD student at the University of Surrey's People-Centred AI Institute, specializing in multimodal deep learning at the intersection of vision, language, and audio processing. My focus is on leveraging foundational AI models to learn audio-visual correspondence and solve real-world challenges. I work in the Universal Perception (UP) Lab under the guidance of Dr. Xiatian Zhu and Dr. Diptesh Kanojia.
Before starting my PhD, I was a researcher at TCS-Research, Mumbai, under Dr. Sunil Kumar Kopparapu, where I worked on audio event detection, multimodal emotion recognition, and pathological speech processing, contributing to publications and patents in speech and audio signal processing.

Selected Publications

(A complete list of my publications is available here.)

Unsupervised Audio-Visual Segmentation with Modality Alignment
AAAI (2025)

AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis
NeurIPS (2024)

Centrality-aware Product Retrieval and Ranking
EMNLP (2024) [Industry Track]

DiffSED: Sound Event Detection with Denoising Diffusion
AAAI (2024) [Oral]

A Novel Metric For Evaluating Audio Caption Similarity
IEEE ICASSP (2023)

Calibration Free Meta learning based approach for Subject Independent EEG Emotion Recognition
Biomedical Signal Processing and Control 72 (2022)

Contrastive Learning of Cough Descriptors for Automatic COVID-19 Preliminary Diagnosis
Interspeech (2021)

Deep Lung Auscultation using Acoustic biomarkers for Abnormal Respiratory Sound Event Detection
IEEE ICASSP (2021)

Deep Encoded Linguistic and Acoustic cues for Attention based End-to-End Speech Emotion Recognition
IEEE ICASSP (2020)

Automatic Speaker Independent Dysarthric Speech Intelligibility Assessment System
Computer Speech and Language 69 (2021)

End-to-End Spoken Language Understanding: Bootstrapping in Low Resource Scenarios
Interspeech (2019)

Research Experience

PhD Researcher
Universal Perception (UP) Lab, People-Centred AI Institute,
University of Surrey, UK.
2022.09 ~ Present
Vision and Language Processing.

Researcher
Speech and NLP team, TCS-Research,
Mumbai, India.
2019.08 ~ 2022.09
Audio and Speech Signal Processing, Few-shot Audio Event Detection, Audio Captioning.

Research Intern
Speech and NLP team, TCS-Research,
Mumbai, India.
2019.01 ~ 2019.06
End-to-End Spoken Language Understanding.