Technology | Full-time
About Us
Exotel is one of Asia's largest customer communication platforms. We are on a mission to move enterprise customer communication to the cloud. In 2020, we powered over 4 billion calls and connected over 320 million people. We work with some of the most innovative companies, including Ola, Swiggy, Zerodha, Whitehat Jr, Practo, Flipkart, and GoJek. We also power customer communication for some of the top banks in the country. Join us on this journey to improve how companies look at customer communication.
About the Role
The Site Reliability Engineering (SRE) team at Exotel ensures that our large-scale, distributed production systems are reliable, scalable, and efficient. As an SRE, you will own uptime, monitoring, and incident response while driving automation to minimise manual work. You will be the bridge between the infrastructure and engineering teams, ensuring that new services are production-ready.
If you’re passionate about building reliable systems, automating away repetitive tasks, and solving complex production challenges, this is the role for you.
What You’ll Do
- Manage and support production-grade infrastructure across cloud and data centers.
- Take ownership of monitoring and troubleshooting production systems (on-call or shift-based support).
- Deep dive into Linux system internals, networking, and debugging production issues.
- Build and improve observability stacks using Prometheus, Grafana, ELK/EFK, or equivalent.
- Develop and maintain automation scripts/tools (Python, Bash, or similar).
- Work with CI/CD tools (Jenkins, GitHub Actions, GitLab CI) to support reliable deployments.
- Drive incident management, root cause analysis (RCA), and long-term fixes.
- Partner with developers to ensure new features/services are production-ready (monitoring, logging, failover strategies).
- Continuously improve system availability, reliability, and performance through automation and process improvements.
What We’re Looking For (Must-Haves)
- 7+ years of hands-on experience managing production systems at scale.
- Strong proficiency in Linux system administration, internals, and networking.
- Proven experience in monitoring & troubleshooting production systems.
- Hands-on experience with monitoring/alerting/logging tools (Prometheus, Grafana, ELK, Nagios, etc.).
- Proficiency in at least one scripting language (Python, Bash, Go, etc.).
Good-to-Haves
- Experience with CI/CD and deployment automation (Jenkins, GitHub Actions, Ansible, Terraform, etc.).
- Demonstrated ability to automate operational tasks to reduce MTTR.
- Exposure to cloud platforms like AWS (VPC, EC2, RDS, CloudWatch, IAM).
- Strong debugging and root cause analysis skills in complex, distributed environments.
Mindset We Value
- You don’t just fix problems; you build systems to prevent them.
- You believe monitoring + automation = reliability at scale.
- You thrive in high-availability, high-scale environments.
- You have an SRE mindset: you own what you set up, and you engineer for reliability.
Why Exotel?
- Opportunity to work at scale: billions of calls, millions of users.
- Be part of a team that deeply values automation, ownership, and reliability.
- Work with cutting-edge tech and solve complex reliability challenges in real-world production systems.
- Collaborative, fast-paced, and impact-driven environment.