Current Jobs

Discover and apply now

Contract

Abu Dhabi, United Arab Emirates

09.03.2026

Responsibilities:

Design, implement, and manage observability platforms covering metrics, logs, and distributed tracing.
Deploy and configure monitoring and alerting tools such as Grafana, Prometheus, Datadog, ELK Stack, or Dynatrace.
Define and implement SLIs, SLOs, and error budgets aligned to service reliability requirements.
Build dashboards and visualizations for operational, performance, and business-level metrics.
Tune alerting thresholds to reduce noise and ensure all alerts are actionable and meaningful.
Collaborate with DevOps, cloud, and application teams to instrument services and workloads for observability.
Support root-cause analysis and performance investigations using observability data and tooling.
Maintain and evolve the observability strategy as infrastructure and application landscapes grow.
Develop runbooks and documentation for observability tooling, monitoring standards, and on-call procedures.

Qualifications and Skills:

4+ years of experience in systems monitoring, observability, or Site Reliability Engineering (SRE) roles.
Hands-on experience with observability tools such as Grafana, Prometheus, Datadog, Dynatrace, or ELK Stack.
Understanding of distributed tracing concepts and tools such as Jaeger, Zipkin, or OpenTelemetry.
Experience instrumenting applications and infrastructure components for monitoring and alerting.
Scripting ability in Python, Bash, or similar for automation and alerting customization.
Knowledge of cloud-native monitoring services such as Amazon CloudWatch, Azure Monitor, or Google Cloud's operations suite.
Datadog or Dynatrace certification, or familiarity with SRE practices and reliability engineering principles, is advantageous.