Operations Manager, GPUaaS

Date: 19 Mar 2026

Location: Singapore, Singapore

Company: Singtel Group

About Singtel Digital InfraCo – RE:AI

Singtel Digital InfraCo’s RE:AI division is building Asia’s most advanced and sustainable AI infrastructure ecosystem. RE:AI enables enterprises, research institutions, and digital-native businesses to accelerate innovation through responsible, high-performance AI compute and connectivity solutions.

Be a Part of Something BIG!

The Operations Manager, GPUaaS is responsible for leading the day-to-day operations of Singtel’s GPU-as-a-Service (GPUaaS) platform. This role ensures high levels of system availability, performance, security, and reliability across GPU infrastructure and supporting data centre operations.

The role serves as the primary operational interface with GPU infrastructure engineering teams, collaborating on platform upgrades, observability, security enhancements, and continuous operational improvements.

Make an Impact by

Acting as the overall coordinator and primary point of contact for end-to-end GPUaaS operations, including data centre operations and operational reporting.
Leading daily GPUaaS and data centre operations covering hardware, environmental controls, networking, security, and supporting software platforms.
Managing operations teams, vendors, and consultants during both normal operations and emergency situations.
Coordinating with internal teams and external partners to implement GPUaaS enhancements and data centre initiatives.
Implementing, validating, and continuously improving operational plans to ensure platform stability across GPU hardware, software, and data centre infrastructure (e.g. power and cooling).
Leading incident response and resolution for GPUaaS environments, including root cause analysis (RCA) and timely communication to customers and stakeholders.
Presenting operational status, risks, and improvement plans to senior management and relevant stakeholders.
Ensuring incidents are addressed or escalated in accordance with criticality, impact, and SLA/SLO requirements.
Building and leading a high-performing operations team, fostering collaboration, innovation, and continuous improvement.
Setting clear goals, mentoring team members, and supporting professional development.
Leading security incident management and enforcing security and compliance best practices within the GPUaaS environment.
Monitoring industry security trends and implementing measures to protect customer data and platform integrity.
Participating in scheduled or on-call support outside standard working hours as required.

Skills for Success

Bachelor’s degree in Computer Science, Information Technology, or a related discipline.
Minimum of 8 years’ experience in data centre operations and management, including at least 3 years in a leadership or managerial role.
Strong knowledge of data centre infrastructure, including servers, networking, storage, physical security, and cybersecurity.
Experience with electrical and mechanical systems, maintenance, and facilities operations.
Proven people leadership and vendor management capabilities.
Strong organisational skills and adaptability to changing operational requirements.
Effective interpersonal, communication, and presentation skills.
Experience managing customer interactions and driving service quality improvements.

Desirable qualifications

Experience in Linux and hypervisor administration for GPU infrastructure and GPUaaS.
Ability to solve complex technical problems, with a proactive approach to system operation and optimisation.
Knowledge of storage technologies and experience in capacity planning, troubleshooting, and data protection.
Experience in GPU and GPU infrastructure management, including configuration, monitoring, and performance.
Experience with liquid cooling systems specific to GPU infrastructure operation and monitoring.
Understanding of GPU cluster architectures and operations, including GPU-based systems, collective communication libraries (e.g. NCCL), RDMA and AI/HPC networking (e.g. InfiniBand), and containerised or orchestrated environments supporting AI and HPC workloads.

Rewards that Go Beyond

Flexible work arrangements
Full suite of health and wellness benefits
Ongoing training and development programs
Internal mobility opportunities

Your Career Growth Starts Here. Apply Now!
