Job Description
Position: Site Reliability Engineer
Location: 100% Remote
Duration: 12+ Month Contract
Interview: Video
Key Areas of Responsibility:
- This is a strategic and hands-on position where you will work closely with cross-functional teams to identify potential issues and provide innovative insights to optimize system performance, stability, and availability.
- Guide cross-functional teams in managing and supporting their PagerDuty alerts, teams, schedules, escalation policies, and automations.
- Automate alerting and remediation processes to reduce mean time to resolution (MTTR) and improve system uptime.
- Monitor server, network infrastructure, and application performance metrics, and identify patterns and trends to improve system performance and reliability.
- Troubleshoot issues and outages, working closely with development and operations teams to identify root causes and develop solutions.
- Collaborate with cross-functional teams to support incident management, change management, and problem management processes.
- Proactively detect and prevent future problems/incidents and initiate the Problem Management process to allow quicker diagnosis and resolution.
- Develop trend analysis and prepare service improvement plans to address identified gaps.
- Build strong relationships with key stakeholders, including senior management, department heads, and external partners, to ensure their support and engagement in incident management initiatives.
- Foster a culture of continuous improvement, staying abreast of industry trends, emerging technologies, and best practices to enhance incident management capabilities.
- Create dashboards and reports to provide insights into operational performance and health.
- Build automation to optimize processes and workflows within our on-call systems and monitoring platforms.
- Complete any assigned project work or tasks with a view to improving existing processes and capabilities, and seek out automation opportunities.
- Support the on-call rotation and provide off-hours support as required.
Minimum Qualifications:
- Bachelor's Degree in IT, Business Management or a related discipline preferred.
- 5+ years of direct experience working in the observability, operations, or DevOps domains.
- Proficient with PagerDuty and with observability, monitoring, and logging tools such as Datadog, Dynatrace, and Power BI.
- 3+ years of technical experience in systems engineering, SRE, DevOps, or software engineering.
Other Required Qualifications:
- Excellent written and verbal communication skills with the ability to communicate effectively with all stakeholders including senior leadership.
- Strong ability to understand, accurately translate and produce technical information for a general and business audience.
- Strong experience with change, incident, and problem management principles, methodologies, and tools.
- Experience using configuration and change tools such as ServiceNow Change and CMDB, or related tools.
- Experience with project delivery methodologies (Agile, Scrum).
- Hands-on experience with monitoring and performance monitoring tools such as Datadog, Dynatrace, and Splunk.
Preferred Qualifications:
- ITIL v3 Foundation Certification
- Certification in Project Management
- Experience implementing continuous process improvements within a configuration, change, release, or asset management program
- Cloud certifications (Azure, AWS, GCP)
- Direct experience scripting in two of the following languages: Python, PowerShell, Bash.
- Proficient at technical and business writing