Infra Engr, Infra Hybrid IT

Date: 25 Apr 2024

Location: Singapore, Singapore

Company: Singtel Group

RESPONSIBILITIES:

  • Support High-Performance Computing and ITSM.
  • As a Systems Engineer specializing in High-Performance Computing (HPC), you will be a key contributor to the design, implementation, and optimization of complex computational systems. Leveraging your expertise in HPC technologies, you will collaborate with cross-functional teams to ensure the seamless integration and performance of HPC environments.

System Design and Implementation: 

  • Design, implement, and maintain high-performance computing systems to meet the organization's computational needs. 
  • Collaborate with stakeholders to understand performance requirements and hardware specifications. 

Parallel Computing: 

  • Implement and optimize parallel computing techniques to enhance system performance. 
  • Leverage parallel programming languages and frameworks for efficient task execution. 

Cluster Management: 

  • Manage and optimize HPC clusters, ensuring scalability and reliability. 
  • Implement and maintain cluster management tools for efficient resource utilization. 

Performance Tuning: 

  • Analyze and fine-tune system configurations, hardware, and software for optimal performance. 
  • Identify and resolve performance bottlenecks in HPC applications. 

Job Scheduling: 

  • Utilize job scheduling systems to allocate computational resources and manage workloads efficiently. 
  • Collaborate with users to understand job requirements and prioritize computing tasks. 

Networking and Interconnects: 

  • Configure and optimize high-speed interconnects, such as InfiniBand, for fast data transfer between nodes. 
  • Collaborate with network administrators to ensure seamless communication within HPC environments. 

Distributed File Systems: 

  • Implement and manage distributed file systems for efficient data storage and retrieval. 
  • Optimize data access and transfer mechanisms to support large-scale computations. 

Fault Tolerance and Reliability: 

  • Implement strategies for fault tolerance to ensure system reliability during long-running computations. 
  • Troubleshoot and resolve system issues to minimize downtime. 

Documentation: 

  • Create and maintain detailed documentation of HPC system configurations, processes, and best practices. 
  • Develop user guides and training materials for HPC users. 

Stay Updated: 

  • Keep abreast of emerging trends and advancements in HPC technologies. 
  • Evaluate and recommend new hardware and software solutions to enhance system capabilities.

REQUIREMENTS:

  • Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field.
  • Proven experience as a Systems Engineer with a focus on High-Performance Computing.
  • Knowledge of HPC architectures, technologies, and parallel programming languages.

Technical Proficiency:

  • Familiarity with cluster management tools, job scheduling systems, and distributed file systems.
  • Experience with high-speed interconnects (e.g., InfiniBand) and networking in HPC environments.

Problem-Solving Skills:

  • Strong analytical and problem-solving skills to address complex HPC challenges.

Communication:

  • Excellent communication and collaboration skills to work effectively in interdisciplinary teams.