Hi Soham, you can cover this in two ways: use public traces for realism, and synthetic traces for controlled stress testing.
Public datasets worth checking:
Google cluster traces (Borg) for job/task scheduling and resource usage patterns
Alibaba cluster trace for container workloads and utilization over time
Azure traces and other public workload datasets from academic benchmarking papers
Also look for “cluster trace”, “workload trace”, “autoscaling trace”, “request trace”, “datacenter telemetry”, “Kubernetes trace” on the Hub
If you cannot find a dataset with all signals in one place, a common approach is to fuse:
a request-arrival trace (per service) with
a resource-utilization trace (per node or pod),
then derive autoscaling events by simulating a scaling policy over the fused series.
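Here is a minimal sketch of that fusion step plus policy simulation. Everything in it is illustrative: the column names, timestamps, and values are made up (real traces will have different schemas), and the scaling rule is the standard HPA-style formula desired = ceil(current_replicas * observed_util / target_util), with an assumed 60% utilization target.

```python
import math
import pandas as pd

# Hypothetical per-service request-arrival trace (column names are illustrative)
requests = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=6, freq="1min"),
    "service": "checkout",
    "request_rate": [120, 300, 450, 800, 400, 150],
})
# Hypothetical per-pod CPU-utilization trace, aggregated to the service level
utilization = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=6, freq="1min"),
    "service": "checkout",
    "cpu_util": [0.30, 0.55, 0.70, 0.95, 0.60, 0.35],  # fraction of requested CPU
})

# Fuse the two traces on timestamp + service (real traces may need an asof join
# if their sampling intervals differ)
fused = pd.merge(requests, utilization, on=["timestamp", "service"])

# Derive autoscaling events with an HPA-style rule:
# desired = ceil(current_replicas * observed_util / target_util)
TARGET_UTIL = 0.60  # assumed target; tune per experiment
replicas = 1
events = []
for _, row in fused.iterrows():
    desired = max(1, math.ceil(replicas * row.cpu_util / TARGET_UTIL))
    if desired != replicas:
        events.append({"timestamp": row.timestamp,
                       "from": replicas, "to": desired})
        replicas = desired

print(pd.DataFrame(events))
```

The derived events become supervised labels (scale-up / scale-down / hold) aligned to the fused telemetry rows.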
How I can help you directly:
Provide a ready-to-use synthetic dataset generator that produces time series for CPU, memory, network, request rate, latency, and error rate, plus autoscaling actions under different policies (HPA-style, predictive, RL-style)
Include bursty traffic, diurnal seasonality, noisy telemetry, failures, and multi-service interference
Output formats that plug into training easily, such as Parquet plus a Gym-style environment spec for RL, or a supervised dataset for predicting scale-up and scale-down actions
Add evaluation scripts for cost, latency, SLO violations, and stability metrics, so you can compare heuristics vs. learned policies
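To make the generator idea concrete, here is a small sketch covering diurnal seasonality, bursts, noisy telemetry, HPA-style replica counts, and a toy SLO metric. Every model parameter in it is an assumption for illustration: the per-replica capacity, the 60% utilization target, the queueing-flavored latency curve, and the 200 ms SLO threshold are all placeholders you would calibrate against a real trace.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 24 * 60  # one day at 1-minute resolution
t = np.arange(n)

# Diurnal seasonality: baseline plus a sinusoid peaking mid-day
base = 200 + 150 * np.sin(2 * np.pi * (t - 6 * 60) / (24 * 60))

# Bursty traffic: a few random spikes with exponential decay
bursts = np.zeros(n)
for start in rng.choice(n - 60, size=5, replace=False):
    bursts[start:start + 60] += 400 * np.exp(-np.arange(60) / 15)

# Noisy telemetry: Gaussian measurement noise, clipped at zero
request_rate = np.clip(base + bursts + rng.normal(0, 20, n), 0, None)

# Assumed capacity model: each replica serves ~100 req/min at full CPU
CAP_PER_REPLICA = 100.0
TARGET_UTIL = 0.60
replicas = np.empty(n, dtype=int)
r = 2
for i in range(n):
    replicas[i] = r  # scaling takes effect at the next step
    util = request_rate[i] / (r * CAP_PER_REPLICA)
    r = max(1, int(np.ceil(r * util / TARGET_UTIL)))  # HPA-style rule

cpu_util = request_rate / (replicas * CAP_PER_REPLICA)
# Toy latency model: latency blows up as utilization approaches 1
latency_ms = 20.0 / (1.0 - np.minimum(cpu_util, 0.99))
slo_violations = float((latency_ms > 200).mean())  # assumed 200 ms SLO

df = pd.DataFrame({"minute": t, "request_rate": request_rate,
                   "replicas": replicas, "cpu_util": cpu_util,
                   "latency_ms": latency_ms})
# df.to_parquet("synthetic_autoscaling_trace.parquet")  # needs pyarrow/fastparquet
print(f"SLO violation rate: {slo_violations:.3f}")
```

From the same frame you can label scale-up/scale-down actions for supervised training, or wrap the loop in a Gym-style step function so an RL agent chooses the replica count instead of the HPA rule.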