Infrastructure Software Engineer
Company: Etched
Location: San Jose
Posted on: April 3, 2026
Job Description:
About Etched

Etched is building the world’s first AI inference system
purpose-built for transformers - delivering over 10x higher
performance and dramatically lower cost and latency than a B200.
With Etched ASICs, you can build products that would be impossible
with GPUs, like real-time video generation models and extremely
deep & parallel chain-of-thought reasoning agents. Backed by
hundreds of millions from top-tier investors and staffed by leading
engineers, Etched is redefining the infrastructure layer for the
fastest growing industry in history.

Job Summary

Building
cutting-edge model-specific ASICs requires crafting custom
infrastructure and toolchains to support ultra-fast, reliable, and
scalable development across the stack - from simulation to silicon.
We build this infrastructure as software - and we engineer it with
the same best practices we apply to our products. We bring the same
rigor, design discipline, quality standards, and testing to it that
we bring to our ASIC, software, and platform.

You will lead the
development and adoption of next-generation infrastructure tooling,
enabling Etched ASIC, Software, and Platform engineers to iterate
faster, build more reliably, and push the boundaries of AI
performance. This includes building and scaling our hybrid
high-performance compute (HPC) cluster, optimized for massively
parallel CI, EDA workflows, emulation, and hardware-aware job
execution. You’ll also architect and implement a state-of-the-art
observability stack with LLM integration and a strong emphasis on
streaming health and performance telemetry, log aggregation,
distributed tracing, insight generation, synthetic testing, and
smart alerting - across CI pipelines, simulation clusters, and
service endpoints.

This role demands a strong software engineering
mindset, quality instincts, and a deep understanding of systems. It’s
not just about writing scripts - it’s about writing code that
builds and manages infrastructure with precision, repeatability,
and intent.

Key responsibilities

Design and build the orchestration layers that drive our hybrid
high-performance clusters, enabling simulation, synthesis, and
continuous integration of AI ASICs at unprecedented scale.

Develop and maintain a fully programmable infrastructure control
plane to ensure reproducibility, auditability, and rapid iteration
across the entire stack.

Create tools and abstractions that empower engineers to harness
massive parallelism without worrying about the underlying
complexity.

Prototype and execute workload orchestration and migration
strategies between on-premise and cloud environments, balancing
performance, storage availability and replication, uptime, and cost
across heterogeneous hardware and compute backends.

Implement real-time telemetry and tracing systems that surface
insights from millions of metrics, enabling proactive debugging and
system optimization.

Build a full observability stack that includes dashboards,
alerting, automated responses, and a synthetic testing framework
that exercises infrastructure performance and reliability across
application and data flows, so we stay ahead of issues affecting
development and productivity workflows.

Representative projects

Design and deploy a
fully automated, scalable hybrid HPC cluster, combining bare-metal
servers and switches with cloud instances, provisioned through MaaS
and orchestrated via SLURM and Kubernetes, optimized for mixed EDA
workloads and parallel CI pipelines.

Develop a real-time observability system for ASIC toolchain jobs
and distributed builds, integrating Prometheus, Grafana, and
VictoriaMetrics with streaming telemetry, tracing, and alerting to
detect performance regressions before they hit silicon.

Architect and implement a programmable infrastructure-as-code
control plane, using Terraform, Ansible, and Puppet, to version,
audit, and redeploy every layer of Etched's development stack with
deterministic reproducibility.

Create a zero-downtime interactive development environment that
provisions and connects Jupyter and VS Code sessions to GPUs and
high-memory nodes via a secure zero-trust network, abstracting away
cluster state and machine failures.

Prototype and evaluate dynamic workload migration strategies
between on-premise and cloud environments to optimize for latency,
reliability, and cost across simulation and synthesis pipelines.

Design a synthetic testing and fault injection framework to
validate the behavior of infrastructure under high load, degraded
hardware, and intermittent network partitions - before those
conditions occur in production.

You may be a good fit if you

Are a systems-minded software engineer who loves
building foundational platforms, working close to the metal and
cloud, and solving high-leverage problems at scale.

Are a deeply technical engineer who treats infrastructure as a
software problem - prioritizing clean abstractions, version
control, small change lists, easy rollbacks, testing, and long-term
maintainability over ad hoc configuration.

Have strong programming skills in languages such as Python, Go,
Rust, and C++, and are comfortable building production-grade
tooling.

Possess expert-level knowledge of Linux, virtualization,
containerization, and CI/CD pipelines, with a deep understanding of
how to debug, optimize, and scale complex systems.

Are familiar with Infrastructure as Code tools like OpenTofu,
Ansible, or Puppet, and enjoy designing declarative, reproducible
infrastructure systems.

Understand and use PromQL and other telemetry/query languages, have
used LLMs to extract insight from real-time metrics, and know how
to architect and tune observability stacks.

Have a track record of debugging and resolving difficult
hardware-software integration problems across bare-metal systems,
networks, and distributed workloads.

Can lead and mentor technical teams, guiding design decisions and
helping others develop sound engineering instincts.

Have 8 years of experience in infrastructure engineering, systems
programming, or backend software development - ideally in
environments where performance, scale, or hardware interaction
mattered.

Are driven by curiosity, take initiative, and have an innate sense
of ownership - you thrive in uncharted territory, design for edge
cases, and love making systems more powerful, reliable, and
elegant.

Strong candidates may also have

Familiarity with the Bazel build system

Deep
understanding of ASIC development flows, especially those involving
Synopsys, Cadence, and Verilator, including how EDA tools interact
with infrastructure for simulation, synthesis, and verification.

Hands-on experience architecting systems with AWS, GCP, or Azure,
including hybrid on-prem/cloud deployments, workload migration
strategies, and cloud-native orchestration tooling.

Experience monitoring, provisioning, and debugging bare-metal
servers, network hardware, and high-performance storage systems in
rack-scale environments.

Comfort profiling and optimizing compute environments for
single-threaded latency, memory-bound workloads, or I/O throughput,
especially in the context of simulation or CI performance.

Proficiency building or operating telemetry systems at scale using
Prometheus, Grafana, Loki, VictoriaMetrics, and tools for
distributed tracing, log aggregation, and real-time alerting across
heterogeneous channels (SMS, email, push alerts, etc.).

Benefits

Medical, dental, and vision packages with generous premium
coverage

$500 per month credit for waiving medical benefits

Housing subsidy of $2k per month for those living within walking
distance of the office

Relocation support for those moving to San Jose (Santana Row)

Various wellness benefits covering fitness, mental health, and more

Daily lunch and dinner in our office

How we’re different

Etched believes in the Bitter Lesson. We think most of
the progress in the AI field has come from using more FLOPs to
train and run models, and the best way to get more FLOPs is to
build model-specific hardware. Larger and larger training runs
encourage companies to consolidate around fewer model
architectures, which creates a market for single-model ASICs. We
are a fully in-person team in San Jose (Santana Row), and greatly
value engineering skills. We do not have boundaries between
engineering and research, and we expect all of our technical staff
to contribute to both as needed.