ABOUT ASTRUS

📍 Location: Toronto or Waterloo, Canada


At Astrus, we are using AI to automate microchip design, starting with the biggest bottleneck, analog layout. Our mission is to radically improve global computation and empower chip designers to create the world's most advanced microchips with AI. Astrus is backed by top-tier VC firms: Khosla Ventures, HOF Capital, and 1517 Fund.

ABOUT THE ROLE

We’re scaling one of the largest reinforcement learning systems ever built for chip design — generating and evaluating billions of layouts in simulation.

As the DevOps / ML Infrastructure Lead, you’ll build and scale the core platform that powers our agent training stack. You’ll work at the heart of Astrus’s research loop, owning the systems that manage GPU clusters, orchestrate distributed training, store and version agents, and enable continuous evaluation.

You’ll work closely with the AI team and own infra end-to-end: from CI/CD, cluster management, and deployment tooling, to observability and research infra integrations.

WHAT YOU WILL DO

  • Build and scale cluster infrastructure (GCP, Ray, Kubernetes, Anyscale) for training at massive scale.

  • Design CI pipelines, machine image workflows, and automated teardown / setup for training workloads.

  • Create and maintain artifact registries and structured model storage, with metadata and version tracking.

  • Develop infra tools for research engineers to interact with the cluster (job submission, monitoring, data pipelines).

  • Set up robust observability tools: evaluation monitoring, system health dashboards, and experiment tracking.

  • Manage system integration and model delivery pipelines — bridging research to production.

  • Optimize training throughput, memory footprint, and cost-performance using infra-aware strategies.

  • (Future) Support version-controlled inference endpoints, blue/green deployment, and staged rollouts.

WHO YOU ARE

  • Have 3–7+ years of experience in infrastructure engineering, MLOps, or distributed systems.

  • Are proficient with Kubernetes, GCP, Ray, Terraform, and modern infra-as-code workflows.

  • Have worked with deep learning platforms (JAX, PyTorch, Ray RLlib) at scale.

  • Are comfortable working across infra, scripting, and lightweight cloud DevOps.

  • Care deeply about research velocity — and know how to build tools that unblock AI teams.

  • Bonus: Experience with cluster observability, training performance optimization, or inference infra.

Ready to radically improve global computation? 🚀📈🌎 🤖

Reach out to careers@astrus.ai or Steph Hector for more details.