HPC Site Reliability Engineer Job at Trust In SODA, Alameda, CA

  • Trust In SODA
  • Alameda, CA

Job Description

Love solving gnarly problems in AI infrastructure?

Our client is building the AI Native GPU Cloud—and we need a senior HPC Site Reliability Engineer to keep it humming.

You’ll own the reliability and performance of our cutting-edge NVIDIA-based HPC systems. Think DGX clusters, RoCE topologies, and automation pipelines built in Ansible and Terraform. If that lights you up, read on.

This role is remote (US-based) and offers the chance to shape our infrastructure from the ground up. Expect high-impact work, loads of autonomy, and collaboration with smart folks across architecture, engineering, and ops.

You’ll:

  • Set up and optimize HPC clusters and networks (think DGX, HGX, GPUDirect)
  • Debug low-level networking issues with Cisco, Juniper, and more
  • Automate configs with Ansible + Terraform
  • Monitor everything with Grafana, UFM, ELK, NetQ
  • Own 24/7 reliability, on-call, and root cause analysis

This role is perfect if you:

  • Have 6+ years in HPC or networking-heavy roles
  • Know BGP, EVPN, VXLAN, RDMA inside and out
  • Have SRE experience in high-stakes environments
  • Love solving infra puzzles at scale

Bonus points for CCIE/JNCIS, InfiniBand, or cloud/HPC interconnect experience.

Sound like your kind of challenge? Hit apply and let’s talk.

Job Tags

Remote job
