Job Description
We are seeking a Machine Learning Researcher to join our team and help advance the state of the art in human-centric generative video models. Your work will focus on improving expression control, lip synchronisation, and overall realism in models such as WAN and Hunyuan. You'll collaborate with a world-class team of researchers and engineers to build systems that generate lifelike talking-head videos from text, audio, or motion signals, pushing the boundaries of neural rendering and avatar animation. We are hiring remotely across the EMEA region.
Key Responsibilities
- Research and develop cutting-edge generative video models, with a focus on controllable facial expression, head motion, and audio-driven lip synchronisation.
- Fine-tune and extend video diffusion models such as WAN and Hunyuan for better visual realism and audio-visual alignment.
- Design robust training pipelines and large-scale video/audio datasets tailored for talking-head synthesis.
- Explore techniques for controllable expression editing, multi-view consistency, and high-fidelity lip sync from speech or text prompts.
- Work closely with product and creative teams to ensure models meet quality and production constraints.
- Stay current with the latest research in video generation, speech-driven animation, and 3D-aware neural rendering.
Must Haves
- Strong background in machine learning and deep learning, especially in generative models for video, vision, or speech.
- Hands-on experience with video synthesis tasks such as face reenactment, lip sync, audio-to-video generation, or avatar animation.
- Proficient in Python and PyTorch; familiar with libraries like MMPose, MediaPipe, DLIB, or image/video generation frameworks.
- Experience training large models and working with high-resolution audio/video datasets.
- Deep understanding of architectures such as transformers, diffusion models, and GANs, and of motion representation techniques.
- Proven ability to work independently and drive research from idea to implementation.
- Strong problem-solving skills and the ability to work autonomously in a remote-first environment.
Nice to Have
- PhD in Computer Vision, Machine Learning, or a related field, with publications in top-tier conferences (CVPR, ICCV, ICLR, NeurIPS, etc.).
- Familiarity with or contributions to open-source projects in lip sync, video generation, or 3D face modelling.
- Experience with real-time inference, model optimisation, or deployment for production applications.
- Knowledge of adjacent areas like emotion modelling, multimodal learning, or audio-driven animation.
- Experience working with or adapting models like WAN, Hunyuan or similar.