Lead, Site Reliability Engineer

7 days ago

Southern Manila District, Philippines Royal Caribbean International Full time

Overview

Position Summary: The Lead Site Reliability Engineer (Lead SRE) will report to the SRE Manager in support of the Royal Caribbean website by utilizing application and user performance data to guide informed decision-making. The Lead SRE will use application and user performance metrics collected from various sources and tools to support tasks such as initial triage of critical production incidents, bug analysis, implementation of best practices in site reliability engineering, infrastructure optimization, and seamless collaboration between internal teams and external service providers, among other operational initiatives.

The ideal candidate will have a deep understanding of, and a proven track record in, a senior IT support role and can provide leadership in developing new employees. The ideal candidate will also keep an eye on the rapidly evolving technology landscape and provide leadership over advanced and emerging concepts, researching and implementing proactive and preventative measures that avoid technical incidents. S/he must be able to work with multiple product and project teams simultaneously, thrive in a fast-paced and dynamic environment, and connect unexpected threads across disparate teams.

The role will provide direct leadership over the individual contributors who provide Level 1 and Level 2 support. This leader must be able to direct teams during high-pressure, business-critical incidents to ensure that customer-focused decisions are made to minimize or eliminate impacts on the guest and employee experience.

Essential Duties and Responsibilities

At a high level, responsibilities for this role will include:

Product Health: Provides leadership over a large team of Level 1 and Level 2 support resources. Is responsible for the Incident Management, Application Performance, Configuration Management and Operational Readiness of the products within her/his ownership.
Partners and collaborates closely with stakeholders from the various teams within IT to ensure that performance tools, configuration tools and monitoring tools meet the needs of her/his products.

Incident Management: Is responsible for a team of resources prepared to react quickly to production incidents, with the goal of restoring systems and applications to normal service operation as quickly as possible and minimizing the impact on guest/crew experience or business operations, thus ensuring the best possible service levels and availability are maintained. Reviews ticket analysis and approves closure of tickets/incidents. Understands the architecture of the Royal website and escalates incidents as needed to the appropriate team for further triage. Synthesizes and communicates incident details to the production team and stakeholders, including executive-level stakeholders. Reviews postmortem/RCA documents and follows up.

Application Performance Management (APM): Ensures the proactive monitoring and management of performance and availability of the software applications within the products s/he is responsible for. Strives to detect and diagnose complex application performance problems to maintain an expected level of service. Builds the case for prioritizing bug and enhancement tickets. Creates reports on new deployment build performance for product teams to ensure quality.

Configuration Management: Leads the team(s) in implementing and maintaining the technology standards and practices across product definition and product configuration. Adjusts health thresholds and other monitoring settings based on historical performance. Creates and maintains performance dashboards used by support and product teams. Maintains the alerting, communication, and documentation tool chain to ensure it is up to date and efficient.
Change Control Governance: Ensures all production changes required by the product teams are carried out in a planned and authorized manner, within established change control policies and procedures, and that all changes are thoroughly tested and validated from the monitoring perspective.

Production Operations Readiness: Ensures all product implementations go through an operational readiness review. Establishes and maintains clear communication channels (e.g., Slack, Teams) with the scrum and marketing teams. Ensures all team members are informed about relevant updates and changes that may affect the website.

Qualifications

10+ years in Site Reliability Engineering (SRE), DevOps, or a related IT operations role.
Bachelor’s degree in Computer Science, Information Technology, Computer Engineering, or other relevant advanced degree preferred.
At least 3 years of experience managing teams and collaborating with external service providers.

Knowledge and Skills

Technical Expertise: Proficiency in cloud platforms such as AWS, AWS Elastic Beanstalk; understanding of API design principles: REST, SOAP, Graph; advanced knowledge of monitoring and logging tools (AppDynamics, Datadog, Splunk, New Relic, etc.); strong proficiency in Adobe AEM is crucial for guiding technical initiatives and mentoring teams.

AI & Automation Expertise: Working knowledge of scripting languages (Python, Bash, PowerShell) applied to automate alert routing, incident response, and infrastructure tasks, combined with a proactive mindset to explore and adopt new automation approaches. Hands-on exposure to AI Ops platforms for enhancing anomaly detection, root cause analysis, and incident management, demonstrating a passion for staying ahead of industry trends. Solid understanding of AI/ML and Generative AI techniques aimed at reducing alert noise, predicting incidents, and developing automation workflows, with active interest in piloting innovative solutions.
Familiarity with autonomous AI agents (Agentic Agents) or intelligent automation systems within operational environments, coupled with enthusiasm to experiment with emerging AI-driven tools in SRE.

Problem-Solving Skills: Strong analytical and troubleshooting skills to diagnose and resolve complex production issues swiftly. Ability to develop and implement effective incident response plans.

Communication and Collaboration: Excellent written and verbal communication skills for effective interaction with cross-functional teams and documentation. Ability to collaborate with Development, QA, IT, and external managed service providers to ensure seamless operations.

Work Environment

The Lead SRE may be required to participate in an on-call rotation to handle urgent incidents and ensure 24x7 system reliability. On-call duties may include evenings, weekends, and holidays as needed.

Engineer, Site Reliability

2 weeks ago

Southern Manila District, Philippines Royal Caribbean International Full time

Overview Position Summary: The Site Reliability Engineer (Senior SRE) reports to the SRE Manager in support of the Royal Caribbean website by utilizing application and user performance data to guide informed decision-making. The SRE uses performance metrics from various sources and tools to support tasks such as initial triage of critical production...
Site Reliability Engineer

7 days ago

Southern Manila District, Philippines Vestas Wind Systems AS Full time

Overview Are you ready to guide the development of innovative infrastructure solutions for a technology-focused entity in the renewable energy sector? We are seeking a Senior Systems Engineer committed to automation, monitoring, and asset management—someone who takes charge of what happens next and promotes continuous improvement in our digital landscape....
Senior Site Reliability Engineer

3 weeks ago

Eastern Manila District, Philippines CC.Talent Full time

Senior Site Reliability Engineer (SRE) Senior Site Reliability Engineer (SRE) to join our global infrastructure team. You will be a guardian of our production environment, responsible for its health, performance, and scalability. Your mission is to apply software engineering principles to solve operational problems, automate everything, and ensure our...
Site Reliability Engineer

7 days ago

Manila, Philippines Tata Consultancy Services Full time

Job Description: Site Reliability Engineering (SRE) SME Position Overview We are seeking a highly skilled Site Reliability Engineering (SRE) Subject Matter Expert (SME) to lead and advance our observability, performance engineering, reliability, and AIOps practices. The SME will be responsible for...
Site Reliability Engineer

1 week ago

Manila, Philippines Russell Tobin Full time

We are seeking a highly skilled Site Reliability Engineering (SRE) Subject Matter Expert (SME) to lead and advance our observability, performance engineering, reliability, and AIOps practices. The SME will be responsible for designing, implementing, and evangelizing...
Site Reliability Engineer

6 days ago

Manila, National Capital Region, Philippines HGS Offshore Staffing Solutions Full time ₱2,000,000 - ₱2,500,000 per year

SENIOR SITE RELIABILITY ENGINEER POSITION OVERVIEW We are seeking an experienced Senior AWS Site Reliability Engineer to join our cross-functional cloud platform team. Working alongside a diverse group of DevOps and Site Reliability Engineers, you will combine deep technical expertise in AWS cloud infrastructure with strong leadership capabilities in incident...
Site Reliability Engineer

2 days ago

Manila, National Capital Region, Philippines CDOps Tech Full time ₱120,000 - ₱180,000 per year

About the Opportunity We are seeking a seasoned and passionate Site Reliability Engineer for a high-impact contract engagement with one of our key clients, a leader in the marketing-tech sector. This is not just a typical SRE role; you will be the foundational expert responsible for spearheading the adoption of SRE culture and practices within the client's...
Site Reliability Engineer

2 weeks ago

Manila, National Capital Region, Philippines Russell Tobin Full time ₱120,000 - ₱180,000 per year

We are seeking a highly skilled Site Reliability Engineering (SRE) Subject Matter Expert (SME) to lead and advance our observability, performance engineering, reliability, and AIOps practices. The SME will be responsible for designing, implementing, and evangelizing modern SRE capabilities that improve system reliability, scalability, and efficiency across our...
Site Reliability Engineering Manager

7 days ago

Manila, Philippines Russell Tobin Full time

We are seeking a highly skilled Site Reliability Engineering (SRE) Subject Matter Expert (SME) to lead and advance our observability, performance engineering, reliability, and AIOps practices. The SME will be responsible for designing, implementing, and evangelizing...
Cloud Site Reliability Engineer

3 weeks ago

Manila, Philippines Tyler Technologies Full time

Responsibilities: Implement tooling to monitor AWS EKS-based systems focusing on performance, reliability, and scalability. Ensure that architecture and deployment models are sufficient to support SLA commitments and are well prepared for future problems of scale....
