Staff Platform Engineer – Agentic AI Systems

Remote

Full-Time

About TheLoops

TheLoops, an IFS company, is the first enterprise-grade AI Agent platform built for mission-critical applications and industries. Our AI Agents act as digital coworkers that are governed, secure, always learning, and working 24/7 to drive measurable business outcomes. 

As we grow, we’re looking for driven, collaborative, and ambitious individuals to help us deliver the future of AI-powered operations.

About the Role
We’re seeking a Staff Platform Engineer to help shape the future of agentic AI systems. In this role, you will help design the backbone of our real-time, distributed systems. You’ll be at the forefront of building systems that orchestrate massive data flows, reactive services, and agentic workloads: systems that must adapt dynamically and operate reliably under heavy and unpredictable load. You’ll work with tools like Kafka, Akka, stream processing frameworks, and other core distributed technologies, and collaborate across engineering teams to deliver infrastructure that is elastic, fault-tolerant, and observable by design.

If you’re passionate about high-performance computing, resilient architecture, and enabling real-time intelligence at scale, this role is for you.

Responsibilities

  • Design and implement scalable, distributed platform components with technologies like Kafka, Akka (Typed), and gRPC.
  • Architect and optimize data pipelines capable of handling billions of messages/events per day with low latency and high reliability.
  • Lead efforts in agentic scaling – dynamically spawning, routing, and managing autonomous agents (services/functions) in response to workload or demand.
  • Build resilient systems that self-heal, auto-scale, and degrade gracefully under pressure.
  • Define and implement metrics, tracing, and observability for end-to-end system behavior and performance.
  • Collaborate closely with infrastructure, SRE, and product teams to ensure platform scalability aligns with growth and reliability goals.
  • Drive root-cause analysis of performance bottlenecks and propose long-term architectural improvements.
  • Participate in on-call rotations, architecture reviews, and deep technical design sessions.

Minimum Qualifications

  • 5+ years of experience building distributed systems in a high-throughput production environment.
  • Deep expertise with Kafka (topics, partitions, consumers, tuning, schema registry, stream processing).
  • Strong experience with Akka or other actor-based concurrency models; familiarity with Akka Cluster, Sharding, Persistence, or Typed API.
  • Solid programming skills in Java.
  • Understanding of agentic workloads and dynamic system orchestration (e.g., microservices that represent intelligent agents).
  • Experience designing scalable APIs, message protocols (e.g., Protobuf, Avro), and event-driven architectures.
  • Familiarity with cloud-native environments (e.g., Kubernetes, service mesh, container orchestration).

Preferred Qualifications

  • Experience with serverless compute models or function-as-a-service scaling paradigms.
  • Contributions to open-source projects in the distributed systems ecosystem.
  • Experience with AI or ML-driven orchestration or agentic frameworks.
  • Familiarity with operational tooling: Prometheus, Grafana, OpenTelemetry, Kafka monitoring tools, etc.

Let’s build the future of autonomous intelligence — together.
