<p><strong>Research Scientist – VLM Generalist</strong></p>
<p><strong>Location:</strong> Remote </p>
<p><strong>About the Role</strong></p>
<p>We’re looking for a Research Scientist with deep expertise in <strong>training and fine-tuning large Vision-Language and Language Models (VLMs / LLMs)</strong> for downstream multimodal tasks. You’ll help push the next frontier of models that reason across <strong>vision, language, and 3D</strong>, bridging research breakthroughs with scalable engineering.</p>
<p><strong>What You’ll Do</strong></p>
<ul>
<li>Design and fine-tune large-scale VLMs / LLMs — and hybrid architectures — for tasks such as visual reasoning, retrieval, 3D understanding, and embodied interaction.</li>
<li>Build robust, efficient training and evaluation pipelines (data curation, distributed training, mixed precision, scalable fine-tuning).</li>
<li>Conduct in-depth analysis of model performance: ablations, bias / robustness checks, and generalisation studies.</li>
<li>Collaborate across research, engineering, and 3D / graphics teams to bring models from prototype to production.</li>
<li>Publish impactful research and help establish best practices for multimodal model adaptation.</li>
</ul>
<p><strong>What You Bring</strong></p>
<ul>
<li>PhD (or equivalent experience) in Machine Learning, Computer Vision, NLP, Robotics, or Computer Graphics.</li>
<li>Proven track record in <strong>fine-tuning or training large-scale VLMs / LLMs</strong> for real-world downstream tasks.</li>
<li>Strong <strong>engineering mindset</strong> — you can design, debug, and scale training systems end-to-end.</li>
<li>Deep understanding of <strong>multimodal alignment and representation learning</strong> (vision–language fusion, CLIP-style pre-training, retrieval-augmented generation).</li>
<li>Familiarity with recent trends, including <strong>video-language and long-context VLMs</strong>, <strong>spatio-temporal grounding</strong>, <strong>agentic multimodal reasoning</strong>, and <strong>Mixture-of-Experts (MoE)</strong> fine-tuning.</li>
<li>Awareness of <strong>3D-aware multimodal models</strong> — using NeRFs, Gaussian splatting, or differentiable renderers for grounded reasoning and 3D scene understanding.</li>
<li>Hands-on experience with PyTorch / DeepSpeed / Ray and distributed or mixed-precision training.</li>
<li>Excellent communication skills and a collaborative mindset.</li>
</ul>
<p><strong>Bonus / Preferred</strong></p>
<ul>
<li>Experience integrating <strong>3D and graphics pipelines</strong> into training workflows (e.g., mesh or point-cloud encoding, differentiable rendering, 3D VLMs).</li>
<li>Research or implementation experience with <strong>vision-language-action models</strong>, <strong>world-model-style architectures</strong>, or <strong>multimodal agents</strong> that perceive and act.</li>
<li>Familiarity with <strong>efficient adaptation methods</strong> — LoRA, QLoRA, adapters, and other parameter-efficient fine-tuning techniques, as well as distillation for edge deployment.</li>
<li>Knowledge of <strong>video and 4D generation</strong> trends, <strong>latent diffusion / rectified flow</strong> methods, or <strong>multimodal retrieval and reasoning pipelines</strong>.</li>
<li>Background in <strong>GPU optimisation, quantisation, or model compression</strong> for real-time inference.</li>
<li>Open-source or publication track record in top-tier ML / CV / NLP venues.</li>
</ul>
<p><strong>Equal Employment Opportunity:</strong></p>
<p>We are an equal opportunity employer and do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, disability, or other legally protected statuses.</p>