Recent advances in machine learning (ML) have enabled machines to understand natural language, hold conversations, and generate images and videos. Modern ML models are programmed and trained using frameworks such as TensorFlow, JAX, and PyTorch. These libraries provide high-level building blocks for ML practitioners, such as linear algebra operations and neural network layers. Importantly, practitioners do not need to worry about how efficiently their models run on hardware, because the framework automatically optimizes the model through an underlying compiler. The efficiency of an ML workload therefore depends heavily on how good that compiler is. Unfortunately, compilers rely on heuristics to solve complex optimization problems, which often results in suboptimal performance.

In this post, we present our work on using ML to improve the efficiency of ML workloads. Prior work has shown that ML can improve the performance of ML programs, but existing datasets for program performance prediction target small sub-programs. At NeurIPS 2023, we introduced a new dataset, "TpuGraphs: A Performance Prediction Dataset on Large Tensor Computational Graphs", and hosted a Kaggle competition on it that attracted 616 teams from 66 countries. We also present a new method for scaling graph neural network (GNN) training to handle large programs represented as graphs. The method enables training on arbitrarily large graphs on a device with limited memory capacity, and it improves the generalization of the model as well.

ML compilers convert ML programs into executables. An ML program can be represented as a computation graph, where a node represents a tensor operation and an edge represents a tensor flowing from one node to another. One important optimization in ML compilers is assigning memory layouts to all intermediate tensors in the program. Making these choices well can yield substantial speedups: we have observed up to a 32% speedup from an optimal layout configuration over the compiler's default configuration on the XLA benchmark suite.

With this motivation, we release TpuGraphs, a dataset for learning cost models for programs running on Google's custom Tensor Processing Units (TPUs). TpuGraphs contains many more graphs than earlier datasets, and its graphs are much larger on average. We also provide baseline learned cost models with the dataset. Participants in our Kaggle competition reported several interesting new techniques they employed.
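To make the computation-graph representation concrete, here is a minimal sketch in JAX (one of the frameworks mentioned above). It traces a small, hypothetical model fragment into JAX's intermediate graph form (a jaxpr), where each equation corresponds to a node (a tensor operation) and each variable to an edge (a tensor flowing between operations). The function and tensor shapes are illustrative assumptions, not part of TpuGraphs.

```python
import jax
import jax.numpy as jnp

# A small model fragment: matmul -> ReLU -> matmul.
def model(x, w1, w2):
    h = jnp.maximum(jnp.dot(x, w1), 0.0)  # nodes: dot, max (ReLU)
    return jnp.dot(h, w2)                 # node: dot

# Illustrative input shapes (assumed for this example).
x = jnp.ones((8, 16))
w1 = jnp.ones((16, 32))
w2 = jnp.ones((32, 4))

# Tracing the function yields its computation graph: each equation in
# the printed jaxpr is a tensor operation, and each intermediate
# variable is a tensor passed from one operation to the next.
print(jax.make_jaxpr(model)(x, w1, w2))
```

A compiler such as XLA consumes a graph like this and decides, among other things, the memory layout of every intermediate tensor; those layout decisions are exactly the choices whose cost TpuGraphs is designed to help predict.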