SRE/Infrastructure Engineer
InfoSum
Other Engineering
Basingstoke, UK
Posted on Friday, August 16, 2024
As an SRE/Infrastructure Engineer, you will be responsible for designing, implementing, and maintaining the cloud infrastructure our platform sits on, as well as the monitoring and deployment services that enable the rest of engineering to develop, deliver and maintain our platform services. You will also be instrumental in both monitoring and incident response, playing a key role in ensuring maximum reliability and minimal downtime. You will collaborate with teams across the company, including developers, customer support, product owners and sales, to ensure the reliability, scalability, and performance of our platform.
- Infrastructure Design and Implementation: Assist or lead in the design, deployment, and operation of the infrastructure components required to support our applications and services. This includes managed cloud infrastructure, networking, security, data storage and cloud-hosted services.
- System Automation: Develop and maintain automation and tools to streamline infrastructure provisioning, configuration management, deployment, and monitoring. Implement infrastructure as code (IaC) practices using tools such as Terraform and Ansible.
- Monitoring and Alerting: Implement monitoring solutions to track the health, performance, and availability of infrastructure components and applications. Configure alerting mechanisms to notify teams of potential issues and proactively address them before they impact users.
- Incident Response and Root Cause Analysis: Participate in incident response activities to identify, troubleshoot, and resolve incidents. Communicate incident status and updates to ensure both internal and external customers are fully informed. Conduct root cause analysis to determine the underlying causes of incidents and implement preventive measures to avoid recurrence.
- Performance & Cost Optimization: Analyze system performance metrics and identify opportunities for optimization. Tune infrastructure components and configurations, and implement enhancements that improve performance and resource utilization.
- Security and Compliance: Implement security controls and respond to security incidents in accordance with established policies and procedures.
- Disaster Recovery and High Availability: Design and implement disaster recovery (DR) and high availability (HA) solutions to ensure business continuity and minimize downtime. Develop and test DR plans, implement failover mechanisms, and conduct periodic drills to validate readiness.
- Capacity Planning and Scaling: Monitor resource utilization trends and prepare the infrastructure to handle predicted future changes.
- Documentation and Knowledge Sharing: Create and maintain documentation for infrastructure configurations, procedures, and best practices. Share knowledge and expertise with team members through documentation, training sessions, and mentorship to foster a culture of learning and collaboration.