High Performance Computing Software Engineer - Supercomputing
Institute Of Foundation Models
Posted 30+ days ago
Experience
1 - 7 Years
Job Location
Education
Bachelor of Science (Computers), Master of Science (Computers)
Nationality
Any Nationality
Gender
Not Mentioned
Vacancy
1 Vacancy
Job Description
Roles & Responsibilities
IFM is building the foundational compute infrastructure that will power tomorrow's breakthroughs in AI and computational science. We're looking for a High Performance Computing Software Engineer to help us design, develop, and operate the software systems that run our large-scale AI workloads.
In this role, you'll work at the intersection of high-performance computing and machine learning. You'll be part of a team responsible for crafting the software stack that enables training of cutting-edge ML models spanning 1000+ GPUs and ensuring our infrastructure is robust, performant, and developer-friendly.
- Design and implement high-performance, distributed software solutions for large-scale AI/ML training.
- Optimize low-level system components including Linux kernel, GPU/accelerator kernels, and interconnects.
- Develop and tune communication libraries such as NCCL, MPI, UCX, RCCL, and RDMA-based systems.
- Partner with ML researchers and engineers to support frameworks like PyTorch, MegatronLM, and DeepSpeed in large-scale production environments.
- Contribute to our scheduling, orchestration, and job management systems, including Slurm and Kubernetes.
- Debug and resolve complex issues across the stack, from kernel to container to model.
- Work closely with hardware vendors, upstream open-source communities, and internal teams to drive performance and reliability improvements.
Desired Candidate Profile
Skills & Experience
- Proven experience developing and optimizing software for large-scale ML workloads (1000+ GPUs preferred).
- Deep understanding of Linux kernel internals and accelerator (GPU) kernel development.
- Proficiency with distributed communication libraries (e.g., NCCL, RCCL, MPI, UCX, SHARP, Libfabric).
- Experience with ML frameworks like PyTorch, TensorFlow, JAX, or MegatronLM.
- Strong knowledge of HPC job scheduling and orchestration tools (e.g., Slurm, Kubernetes, Pyxis).
- Excellent debugging and systems performance tuning skills.
- A collaborative mindset with a focus on shared success and technical excellence.
Company Industry
- Education
- Training
- Teaching
- Academics
Department / Functional Area
- Engineering
Keywords
- High Performance Computing Software Engineer - Supercomputing
Institute Of Foundation Models
https://jobs.lever.co/ifm-us/5c662f6d-0043-4bba-8b58-53dd4afd8e4e