Site Reliability Engineer, II
At Coursera, we are committed to building a globally diverse team and are thrilled to extend employment opportunities to individuals in any country where we have a legal entity. We require candidates to possess eligible working rights and have a compatible timezone overlap with their team to facilitate seamless collaboration.
Coursera is committed to enabling flexibility and workspace choice for employees. Our interviews and onboarding are entirely virtual, providing a smooth and efficient experience for our candidates. As an employee, you can select your main way of working, whether from home, one of our offices or hubs, or a co-working space near you.
Job Overview:
Our SRE team is part of the Coursera Infrastructure group, which builds the foundation that keeps Coursera reliable, scalable, and efficient. We partner with product and platform teams to deliver resilient systems through automation, observability, and operational excellence. From incident response to infrastructure as code, we enable fast, safe, and cost-aware delivery of global learning experiences.
We are hiring an IC3 Site Reliability Engineer (SRE) based in Canada to join our SRE team. This role will support reliability, observability, infrastructure automation, and cost optimization efforts across multiple services. The engineer will work closely with senior SREs to build scalable and efficient systems using our AWS-based tech stack, and will gain hands-on experience with real-world SRE projects. Joining this team means working on high-impact projects that keep Coursera running smoothly for millions of learners and partners.
Applications are accepted on an ongoing basis until the position is filled.
Requirements:
2+ years of experience in Site Reliability, DevOps, or Backend Engineering roles
Hands-on experience with at least one cloud platform (e.g., AWS, GCP, Azure)
Experience with monitoring and logging tools (e.g., Datadog, CloudWatch, Sumo Logic, Grafana)
Familiarity with Infrastructure as Code tools (e.g., Terraform, Ansible)
Experience writing automation scripts and backend systems in Java, Python, Bash or similar languages
Preferred Qualifications:
Exposure to incident management processes and tools (e.g., PagerDuty)
Familiarity with containerized infrastructure (e.g., Docker, Kubernetes)
Experience working on cost visibility or optimization in cloud environments
Knowledge of version control systems and CI/CD practices
Experience contributing to disaster recovery or multi-region infrastructure
Knowledge of security/compliance practices (e.g., audit logging, access controls)
If this opportunity interests you, you might like these courses on Coursera:
Site Reliability Engineering: Measuring and Managing Reliability – Learn SRE fundamentals including SLIs, SLOs, and error budgets
Introduction to Cloud Computing – Understand core cloud concepts, including AWS services and architecture
Getting Started with Terraform for Cloud Infrastructure Automation – Learn infrastructure-as-code using Terraform with hands-on AWS examples