Quantum computing holds immense promise for computational gains once thought unattainable, given the physical limits of conventional (so-called classical) compute resources. Certain classes of problems could benefit greatly from the speedups quantum computing systems can achieve by exploiting quantum mechanical properties such as superposition, entanglement, and interference. During this talk, we'll consider realistic use cases in this pre-fault-tolerant era, strategies for integrating quantum resources with HPC, and approaches for supporting research workloads. Symposium members are encouraged to contribute efforts made by their HPC departments, from both emulation/simulation and hardware integration perspectives; the presentation is intended as an opportunity to explore ideas, share experiences, and gather knowledge to advance quantum-centric HPC in our region.
Achieving efficient utilization of shared compute resources is a primary objective of HPC providers, maximizing value for both researchers and institutions. While providers leverage workload managers to allocate resources across many users over time, scheduling alone cannot prevent allocated resources from sitting idle or underutilized in terms of raw compute and memory. A common cause is misconfigured workload parameters: a user may unintentionally request too many resources, or resources of the wrong type. Between queue wait times and the often-opaque nature of batch execution, users may "fire-and-forget" their batch workloads and remain unaware of serious inefficiencies. To address this issue, we developed a real-time workload monitoring and alerting system that rapidly informs users and HPC administrators of inefficient workloads, even while those workloads are still running. We will present the architecture of our system, which includes components from Slurm, Prometheus, VictoriaMetrics, PostgreSQL, and CHPC software. We will also provide a data-driven analysis of the results we have achieved with the system, as well as lessons learned and our future roadmap.
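The kind of check such a system performs (comparing a running job's requested resources against its measured usage) can be sketched in a few lines. This is a hypothetical illustration under assumed thresholds and field names, not the CHPC implementation; in practice the measurements would come from Slurm accounting and Prometheus/VictoriaMetrics time series rather than a simple data class:

```python
# Hypothetical sketch of an inefficiency check; thresholds, field names,
# and data model are assumptions, not the system described in the talk.
from dataclasses import dataclass


@dataclass
class JobSample:
    """A snapshot of one running job's requested vs. observed usage."""
    job_id: str
    cpus_requested: int
    cpu_util_avg: float      # mean fraction of requested CPUs in use, 0.0-1.0
    mem_requested_gb: float
    mem_peak_gb: float


def flag_inefficiencies(sample: JobSample,
                        cpu_floor: float = 0.25,
                        mem_floor: float = 0.25) -> list[str]:
    """Return human-readable alerts when usage falls below assumed floors."""
    alerts = []
    if sample.cpu_util_avg < cpu_floor:
        alerts.append(
            f"job {sample.job_id}: using {sample.cpu_util_avg:.0%} of "
            f"{sample.cpus_requested} requested CPUs"
        )
    if sample.mem_peak_gb < mem_floor * sample.mem_requested_gb:
        alerts.append(
            f"job {sample.job_id}: peak memory {sample.mem_peak_gb:.1f} GB "
            f"of {sample.mem_requested_gb:.0f} GB requested"
        )
    return alerts
```

Because the check runs against live metrics while the job executes, alerts can reach the user in time to cancel and resubmit with corrected parameters, rather than after the allocation has been wasted.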
This presentation will provide an overview of the Slurm HPC dashboard developed and used at ASU to monitor cluster utilization, GPU resources, and overall system health. I will explain how the dashboard supports day-to-day operations, demonstrate recently added features, and discuss the roadmap for future development as we continue expanding its capabilities to meet growing demands.