Site Reliability Engineer (SRE) Manager
Euronet
- Las Vegas, NV
- Permanent
- Full-time
- Lead the team in designing, implementing, and maintaining highly available (99.98%), scalable, and secure systems. Develop and implement operational processes and procedures to ensure smooth IT infrastructure and service operations.
- Collaborate with cross-regional teams to implement best practices for building, deploying, and monitoring software systems.
- Stay calm under pressure.
- Manage major incidents through mitigation and resolution, conduct post-incident reviews of all major incidents, and determine the action items needed to prevent similar issues and minimize downtime in future incidents.
- Define and track key performance indicators (KPIs, SLIs, SLAs, SLOs) to measure operational effectiveness.
- Monitor, analyze, and optimize system performance, capacity, and resource utilization.
- Manage budgets and resources effectively.
- Identify and implement continuous improvement initiatives to increase efficiency and reduce risks.
- Lead incident response activities, performing root cause analysis and implementing preventative solutions.
- Drive the development and implementation of automation solutions to streamline operations and reduce manual workloads.
- Manage a team of Site Reliability and DevOps Engineers, including hiring, evaluating, training, and developing team members.
- Build a collaborative and productive team culture.
- Own and maintain the company's cloud infrastructure strategy and SRE team roadmap.
- Evaluate and improve SRE processes and procedures.
- Provide technical expertise by collaborating with stakeholders to make high-level decisions and provide technical direction to team members.
- Participate in deep system design and implementation discussions to ensure high-quality systems are built. Work closely with our Software Development and Engineering teams on platforms before they go live, building reliable, production-ready services and applications.
- Provide rotational on-call support, where you will detect, respond to, triage, and resolve production incidents.
- Bachelor's degree in Computer Science, Engineering, or a related discipline, or equivalent experience.
- Over a decade of experience in IT, including at least two years in a leadership capacity.
- Strong technical background in cloud computing, networking, security, and automation.
- Excellent leadership, communication, and interpersonal skills.
- Strong knowledge of Linux and Windows operating systems and environments.
- Strong knowledge of networking, load balancers, DNS, NTP, and TCP/IP.
- Strong knowledge of AWS technologies (Global Accelerator, ALB, NLB, EKS, EC2, VPC, S3, RDS) or equivalent experience with Google Cloud.
- Experience with containers
- Knowledge of container orchestration.
- Experience with infrastructure automation tools such as Terraform, Ansible, or Puppet/Foreman.
- Experience with web servers such as IIS, Apache, and Nginx.
- Proficiency in the design principles for monitoring and alerting systems.
- Experience with monitoring tools such as Nagios, Icinga, SolarWinds, New Relic, and Grafana.
- Solid scripting skills; experience with Shell, Bash, Ansible, Python, PowerShell, or Ruby.
- Experience setting up CI/CD pipelines (GitLab or Azure DevOps).
- A willingness to learn on the job and take on tasks as needed
- Certifications such as AWS Certified DevOps Engineer or Google Professional Cloud DevOps Engineer are a plus.
- Experience with one or more of the following F5 products: LTM, AWAF, GTM, AFM, BIG-IQ.
- Experience with one or more of the following big data technologies: ELK, Beats, Kafka, Redis, Search Guard.
- Experience with application monitoring tools like Uptrends
- Experience with Postfix
- 401(k) Plan
- Health/Dental/Vision Insurance
- Employee Stock Purchase Plan
- Company-paid Life Insurance
- Company-paid Disability Insurance
- Tuition Reimbursement
- Paid Time Off
- Paid Volunteer Days
- Paid Holidays
- Plus many more employee perks & incentives!