Inquiry About Datasets for AI-Driven Cloud Load Balancing and Auto-Scaling of Instances

Hi everyone,

I’m currently building a Smart Load Balancer with Auto-Scaling Instances and exploring ways to optimize cloud performance using AI-based techniques.

I’m looking for a dataset that contains:

  • Server or VM utilization data (CPU, memory, network usage)

  • Task or request distribution logs

  • Auto-scaling or workload patterns over time

  • Any real or simulated cloud performance metrics

I’d really appreciate it if anyone could suggest:

  • Publicly available cloud workload datasets

  • Google, Alibaba, or Azure cluster traces

  • Or any datasets that can help in modeling or testing AI-based load balancing algorithms

Thanks in advance for your help and suggestions 🙏

— Soham Kale

I gathered some resources for now.

Hi Soham, you can cover this in two ways: use public traces for realism, and synthetic traces for controlled stress testing.

Public datasets worth checking:

  • Google cluster traces (Borg) for job/task scheduling and resource usage patterns

  • Alibaba cluster trace for container workloads and utilization over time

  • Azure traces and other public workload datasets from academic benchmarking papers

  • Also search the Hub for terms such as “cluster trace”, “workload trace”, “autoscaling trace”, “request trace”, “datacenter telemetry”, and “Kubernetes trace”

If you cannot find a dataset with all signals in one place, a common approach is to fuse:

  • a request arrival trace (per service), plus

  • a resource utilization trace (per node or pod),

and then derive autoscaling events from policy simulation.
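
To make the fusing idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the toy traces, the function names `fuse_traces` and `simulate_policy`, the "one core per replica" sizing, and the 60% target. The scaling rule follows the proportional form used by the Kubernetes HPA (desired = ceil(current × metric / target)), but real traces will need resampling and alignment before a join like this works.

```python
import math

def fuse_traces(requests, cpu_cores):
    """Join (timestamp, rps) rows with (timestamp, total_cores_used) rows.

    Only timestamps present in both traces are kept (inner join).
    """
    cores_by_ts = dict(cpu_cores)
    return [(ts, rps, cores_by_ts[ts]) for ts, rps in requests if ts in cores_by_ts]

def simulate_policy(fused, target_util=0.6, start_replicas=2,
                    min_replicas=1, max_replicas=10):
    """Replay fused rows through a proportional scaling rule.

    Returns a list of (timestamp, old_replicas, new_replicas) events.
    Assumption: each replica has exactly 1 core, so average utilization
    is total cores used divided by the replica count.
    """
    replicas = start_replicas
    events = []
    for ts, _rps, cores in fused:
        util = cores / replicas
        desired = math.ceil(replicas * util / target_util)
        desired = max(min_replicas, min(max_replicas, desired))
        if desired != replicas:
            events.append((ts, replicas, desired))
            replicas = desired
    return events

# Toy traces (made up): timestamps in seconds.
requests = [(0, 100), (60, 250), (120, 400), (180, 120)]
cpu_cores = [(0, 1.0), (60, 2.5), (120, 3.9), (240, 1.1)]

fused = fuse_traces(requests, cpu_cores)
events = simulate_policy(fused)
print(events)
```

The derived `events` list can then be stored next to the fused rows as labels, which gives you one table with requests, utilization, and scaling actions even though no single public trace contains all three.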

How I can help you directly:

  • Provide a ready-to-use synthetic dataset generator that produces time series for CPU, memory, network, request rate, latency, and error rate, plus autoscaling actions under different policies (HPA-style, predictive, RL-style)

  • Include bursty traffic, diurnal seasonality, noisy telemetry, failures, and multi-service interference

  • Output formats that plug into training easily, such as Parquet plus a gym-style environment spec for RL, or a supervised dataset for predicting scale-up and scale-down actions

  • Add evaluation scripts for cost, latency, SLO violations, and stability metrics, so you can compare heuristics against learned policies
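
To give a feel for what such a generator and evaluation script could look like, here is a small sketch, not the full tool described above. It covers only request rate with diurnal seasonality, random bursts, and noise, and scores a replica schedule on three simple metrics. All names (`generate_trace`, `evaluate`), the burst probability, and the 50 requests/s per-replica capacity are assumptions chosen for illustration.

```python
import math
import random

def generate_trace(minutes=1440, base_rps=200.0, seed=0):
    """Synthetic per-minute request rate: daily sine wave x bursts x noise."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    rows = []
    for t in range(minutes):
        diurnal = 1.0 + 0.5 * math.sin(2 * math.pi * t / 1440)
        burst = 3.0 if rng.random() < 0.01 else 1.0  # ~1% of minutes spike 3x
        noise = rng.gauss(1.0, 0.05)                 # 5% multiplicative noise
        rps = max(0.0, base_rps * diurnal * burst * noise)
        rows.append({"minute": t, "rps": rps})
    return rows

def evaluate(rows, replicas_per_minute, capacity_rps=50.0):
    """Score a replica schedule against a trace.

    slo_violations: minutes where demand exceeds total capacity
    replica_minutes: simple cost proxy
    scale_changes: stability proxy (number of replica-count changes)
    """
    violations = sum(
        1 for r, n in zip(rows, replicas_per_minute) if r["rps"] > n * capacity_rps
    )
    changes = sum(
        1 for a, b in zip(replicas_per_minute, replicas_per_minute[1:]) if a != b
    )
    return {
        "slo_violations": violations,
        "replica_minutes": sum(replicas_per_minute),
        "scale_changes": changes,
    }

rows = generate_trace()

# Baseline 1: fixed fleet of 6 replicas.
static = [6] * len(rows)

# Baseline 2 (hypothetical upper bound): sees the true demand each minute
# and sizes the fleet for 70% utilization.
clairvoyant = [max(1, math.ceil(r["rps"] / 50.0 / 0.7)) for r in rows]

print("static     ", evaluate(rows, static))
print("clairvoyant", evaluate(rows, clairvoyant))
```

The two baselines bracket what a learned policy can achieve: the static fleet never changes size but may breach capacity during bursts, while the clairvoyant schedule never breaches capacity but changes size very often. A heuristic or learned policy would be scored with the same `evaluate` call.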
