The Site Reliability Engineer is responsible for establishing a NOC and building foundation services to monitor and maintain a large distributed production environment. The candidate will excel at solving problems of scale, performance and availability. The Site Reliability Engineer will also help develop and deliver innovative solutions, writing applications in Ruby, Java, Bash, Python and JavaScript.
Duties and Responsibilities
- Responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
- Automate tooling to control change management, monitoring and alerting.
- Work closely with the Development and DevOps teams to provide effective feedback in order to improve availability and performance of the application stack.
- Ensure SLAs are met or exceeded.
- Plan and execute installations, upgrades and testing of system/network hardware, hosted applications and third-party software and/or tools
- Analyze and resolve system problems and outages
- Perform system security administration and hardening
- Create and maintain standard operating procedures and ensure help desk requests are recorded and tracked in detail
- Understand all aspects of the technology stack in order to push performance improvements through the system.
- Drive Infrastructure as Code practices throughout the organization
- Provide on-call 24×7 support for all hosted Systems/Networks in rotation with other staff
Qualifications
- Diploma of Technology, Bachelor’s degree in Computer Science or an equivalent combination of education and experience
- Minimum 6 years of experience administering a complex IT environment
- Excellent verbal and written communication skills with the ability to communicate complex subjects to a variety of audiences ranging from management to technical staff
- Advanced knowledge of Linux, VMware vSphere, Docker, Kubernetes, GitLab and Ansible.
- Strong knowledge of IPv4 networking, OSPF and BGP.
- Strong knowledge of DNS, DHCP, LDAP, PKI, email, NTP, SSSD
- Exceptional troubleshooting skills.
- Strong programming and scripting skills in Python, Bash, PowerShell, PHP and SQL.
- Experience maintaining Apache and Tomcat/JBoss application infrastructures.
- Administration and working knowledge of Postgres, MySQL, MSSQL and/or Oracle is an asset
- Ability to multi-task combined with exemplary time management skills
- The ability to travel internationally on a short-term project basis