About
I am Raghu Raja, currently a Principal Engineer at AWS. I work on Machine Learning infrastructure. Since mid-2025, I have been helping build Project Mantle, a new inference engine powering AWS Bedrock. For about two years before that, I focused on the networking needs of model training workloads, which involved thinking about collective communication primitives and algorithms. I started my second stint with AWS in 2023. During my first stint (2017-2021), I was with the AWS HPC organization, where I was the Technical Lead for a team building software for the Elastic Fabric Adapter. I helped develop libfabric and was the maintainer for the EFA libfabric provider. Between these two stints, I spent two years as an Architect at Enfabrica, a stealth startup that has since been acquired by NVIDIA. There, I led development of the Machine Learning software ecosystem.
Before AWS, I was a Senior Engineer in Cray’s Storage R&D organization, working on strategic pathfinding projects (such as this) targeting future-generation supercomputers as well as tactical feature development for current-generation systems.
I attended graduate school at The Ohio State University (Go Bucks!). My dissertation, advised by D. K. Panda and done in collaboration with Lawrence Livermore National Laboratory, covered the intersection of two key aspects of supercomputing systems: scalable networking and efficient parallel I/O.
While I no longer actively publish research papers, you can find some of my academic publications here. I remain involved with several research communities and still serve on steering and technical program committees for various workshops and conferences.
When not in front of a computer, I am likely being goofy with my son Avi and wife Aarthy.