Lead Site Reliability Engineer

6 days ago


Makati City, National Capital Region, Philippines
RCL REGIONAL OPERATING HEADQUARTERS
Full time
₱120,000 - ₱180,000 per year

Journey with us! Combine your career goals and sense of adventure by joining our incredible team of employees at Royal Caribbean Group. We are proud to offer a competitive compensation and benefits package, and excellent career development opportunities, each offering unique ways to explore the world.

We are proud to be the vacation-industry leader with global brands — including Royal Caribbean International, Celebrity Cruises and Silversea Cruises — the most innovative fleet and private destinations, and the best people. Together, we are dedicated to turning the vacation of a lifetime into a lifetime of vacations for our guests.

Royal Caribbean Group's Global E-Commerce Team has an exciting career opportunity for a full-time Lead Site Reliability Engineer reporting to the SRE Manager.

Position Summary:

The Lead Site Reliability Engineer (Lead SRE) reports to the SRE Manager and supports the Royal Caribbean website, using application and user performance data to guide informed decision-making. The Lead SRE draws on application and user performance metrics collected from various sources and tools to support tasks such as initial triage of critical production incidents, bug analysis, implementation of site reliability engineering best practices, infrastructure optimization, and collaboration between internal teams and external service providers, among other operational initiatives.

Essential Duties and Responsibilities:

At a high level, responsibilities for this role will include:

  1. Product Health. Provides leadership over a large team of Level 1 and Level 2 support resources. Is responsible for the Incident Management, Application Performance, Configuration Management, and Operational Readiness of the products within her/his ownership. Partners and collaborates closely with stakeholders from the various teams within IT to ensure that performance, configuration, and monitoring tools meet the needs of her/his products.
  2. Incident Management. Is responsible for a team of resources prepared to react quickly to production incidents, with the goal of restoring systems/applications to normal service operation as quickly as possible and minimizing the impact on guest/crew experience or business operations, thus ensuring the best possible service levels and availability are maintained. Reviews ticket analysis and approves closure of tickets/incidents. Understands the architecture of the Royal Caribbean website and escalates incidents as needed to the appropriate team for further triage. Synthesizes and communicates incident details to the production team and stakeholders, including executive-level stakeholders. Reviews postmortem / RCA documents and follows up.
  3. Application Performance Management (APM). Ensures the proactive monitoring and management of performance and availability of the software applications within the products s/he is responsible for. Strives to detect and diagnose complex application performance problems to maintain an expected level of service. Builds the case for prioritizing bug and enhancement tickets. Creates reports on the performance of new deployment builds so product teams can ensure quality.
  4. Configuration Management. Leads the team(s) in implementing and maintaining the technology standards and practices across product definition and product configuration. Adjusts health thresholds and other monitoring settings based on historical performance. Creates and maintains performance dashboards used by support and product teams. Maintains the alerting, communication, and documentation toolchain to ensure it is up to date and efficient.
  5. Change Control Governance. Ensures all production changes required by the product teams are carried out in a planned and authorized manner, within established change control policies and procedures, and that all changes are thoroughly tested and validated from the monitoring perspective.
  6. Production Operations Readiness. Ensures all product implementations go through an operational readiness review. Establishes and maintains clear communication channels (e.g., Slack, Teams) with the scrum and marketing teams. Ensures all team members are informed about relevant updates and changes that may affect the website.

Qualifications:

  • 10+ years in Site Reliability Engineering (SRE), DevOps, or a related IT operations role.
  • Bachelor's degree in Computer Science, Information Technology, Computer Engineering, or other relevant advanced degree preferred.
  • At least 3 years of experience managing teams and collaborating with external service providers.

Knowledge and Skills:

  • Technical Expertise:
    • Proficiency in cloud platforms such as AWS, including AWS Elastic Beanstalk.
    • Understanding of API design principles: REST, SOAP, GraphQL.
    • Advanced knowledge of monitoring and logging tools (AppDynamics, Datadog, Splunk, New Relic, etc.).
    • Strong proficiency in Adobe AEM, which is crucial for guiding technical initiatives and mentoring teams.

  • AI & Automation Expertise:
    • Working knowledge of scripting languages (Python, Bash, PowerShell) applied to automate alert routing, incident response, and infrastructure tasks, combined with a proactive mindset to explore and adopt new automation approaches.
    • Hands-on exposure to AIOps platforms for enhancing anomaly detection, root cause analysis, and incident management, demonstrating a passion for staying ahead of industry trends.
    • Solid understanding of AI/ML and Generative AI techniques aimed at reducing alert noise, predicting incidents, and developing automation workflows, with active interest in piloting innovative solutions.
    • Familiarity with autonomous AI agents (agentic AI) or intelligent automation systems within operational environments, coupled with enthusiasm to experiment with emerging AI-driven tools in SRE.

  • Problem-Solving Skills:
    • Strong analytical and troubleshooting skills to diagnose and resolve complex production issues swiftly.
    • Ability to develop and implement effective incident response plans.

  • Communication and Collaboration:
    • Excellent written and verbal communication skills for effective interaction with cross-functional teams and documentation.
    • Ability to collaborate with Development, QA, IT, and external managed service providers to ensure seamless operations.

We know there's a lot to consider. As you go through the application process, our recruiters will be glad to provide guidance and more relevant details to answer any additional questions. Thank you again for your interest in Royal Caribbean Group. We hope to see you onboard soon.

It is the policy of the Company to ensure equal employment and promotion opportunity to qualified candidates without discrimination or harassment on the basis of race, color, religion, national origin, disability, sexual orientation, sexuality, gender identity or expression, marital status, or any other characteristic protected by law. Royal Caribbean Group and each of its subsidiaries prohibit and will not tolerate discrimination or harassment.


