<div>
About HighLevel:HighLevel is a cloud-based, all-in-one white-label marketing and sales platform that empowers marketing agencies, entrepreneurs, and businesses to elevate their digital presence and drive growth. We are proud to support a global and growing community of over 2 million businesses, from marketing agencies to entrepreneurs to small businesses and beyond. Our platform empowers users across industries to streamline operations, drive growth, and crush their goals.
HighLevel processes over 15 billion API hits and handles more than 2.5 billion message events every day. Our platform manages 470 terabytes of data distributed across five databases, operates with a network of over 250 micro-services, and supports over 1 million domain names.
Our People
With over 1,500 team members across 15+ countries, we operate in a global, remote-first environment. We are building more than software; we are building a global community rooted in creativity, collaboration, and impact. We take pride in cultivating a culture where innovation thrives, ideas are celebrated, and people come first, no matter where they call home.
Our Impact
Every month, our platform powers over 1.5 billion messages, helps generate over 200 million leads, and facilitates over 20 million conversations for the more than 2 million businesses we serve. Behind those numbers are real people growing their companies, connecting with customers, and making their mark - and we get to help make that happen.
About the Role:
We are looking for a Site Reliability Engineer to join our team and help ensure the availability, performance, and scalability of our critical systems. You will work closely with development and operations teams to automate processes, enhance system reliability, and improve observability.
Requirements:
- Experience: 4+ years in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles
- Cloud Expertise: Hands-on experience with GCP and AWS
- Infrastructure as Code (IaC): Terraform, Helm, or equivalent tools
- Containerisation & Orchestration: Docker, Kubernetes (GKE)
- Observability: Experience with Prometheus, Grafana, ELK, OpenTelemetry, or similar monitoring/logging tools
- Programming/Scripting: Proficiency in Python, Bash, or Shell scripting. Basic understanding of API parsing and JSON manipulation
- CI/CD Pipelines: Hands-on experience with Jenkins, GitHub Actions, ArgoCD, or similar tools
- Incident Management: Experience with on-call rotations, SLOs, SLIs, SLAs, Escalation Policies, and incident resolution
- Databases: Experience in monitoring MongoDB, Redis, ES, Queue based etc
Responsibilities:
- Develop and improve observability using monitoring, logging, tracing, and alerting tools (Prometheus, Grafana, ELK, OpenTelemetry, etc.)
- Optimize system performance, troubleshoot incidents, and conduct post-mortems/RCA to prevent future issues
- Collaborate with developers to enhance application reliability, scalability, and performance
- Drive cost optimisation efforts in cloud environments.
- Monitor multiple databases (MongoDB, Redis, ES, Queue based etc.)
EEO Statement:
The company is an Equal Opportunity Employer. As an employer subject to affirmative action regulations, we invite you to voluntarily provide the following demographic information. This information is used solely for compliance with government recordkeeping, reporting, and other legal requirements. Providing this information is voluntary and refusal to do so will not affect your application status. This data will be kept separate from your application and will not be used in the hiring decision.
#LI-Remote
#LI-HB1