This presentation provides a technical overview of machine learning infrastructure on Google Cloud Platform (GCP), focused on hardware and operational efficiency.
We will discuss evaluating hardware accelerators against specific workload requirements and take a deep dive into
Google Tensor Processing Units (TPUs), examining their architecture for large-scale matrix operations and how to
optimize FLOPS per dollar. We will then cover other key considerations for running ML lab environments on GCP. Topics include:
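To make the FLOPS-per-dollar framing concrete, here is a minimal sketch of how one might rank accelerator options by peak throughput per unit of hourly spend. The accelerator names, TFLOPS figures, and hourly costs below are illustrative placeholders, not published GCP pricing or TPU/GPU datasheet numbers.

```python
def flops_per_dollar(peak_tflops: float, hourly_cost_usd: float) -> float:
    """Peak teraFLOPS delivered per dollar of hourly spend."""
    return peak_tflops / hourly_cost_usd

# Illustrative placeholder specs -- not real GCP pricing or hardware numbers.
accelerators = {
    "accelerator_a": {"peak_tflops": 275.0, "hourly_cost_usd": 4.20},
    "accelerator_b": {"peak_tflops": 312.0, "hourly_cost_usd": 6.50},
}

# Rank options by cost efficiency, highest FLOPS per dollar first.
ranked = sorted(
    accelerators.items(),
    key=lambda kv: flops_per_dollar(kv[1]["peak_tflops"], kv[1]["hourly_cost_usd"]),
    reverse=True,
)
for name, spec in ranked:
    ratio = flops_per_dollar(spec["peak_tflops"], spec["hourly_cost_usd"])
    print(f"{name}: {ratio:.1f} TFLOPS per dollar-hour")
```

In practice, peak FLOPS should be discounted by achievable utilization for the target workload before comparing, since sustained throughput on real models rarely matches datasheet peaks.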
- Infrastructure Selection: An overview of GCP’s ML-optimized compute and storage offerings.
- Operational Management: Strategies for capacity planning, cost control, and maximizing goodput.
- Frameworks and Libraries: Tooling that enables model training and serving.
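The goodput mentioned above can be sketched as the fraction of reserved accelerator time spent on useful training work, excluding time lost to preemptions, restarts, and checkpoint recovery. This is a simplified illustrative definition; the durations below are hypothetical.

```python
def goodput(useful_seconds: float, total_reserved_seconds: float) -> float:
    """Fraction of reserved accelerator time spent on useful training steps."""
    return useful_seconds / total_reserved_seconds

# Hypothetical example: a 24-hour reservation with 2 hours lost to
# preemptions and checkpoint restores.
total_seconds = 24 * 3600.0
lost_seconds = 2 * 3600.0
print(f"goodput: {goodput(total_seconds - lost_seconds, total_seconds):.1%}")
```

Capacity planning then reduces to raising this ratio: fewer interruptions, faster checkpoint restore, and less idle time all translate directly into more training progress per dollar reserved.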