Open Source Summit + Embedded Linux Conference North America...
May 18-20, 2026
Minneapolis, MN
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for Open Source Summit North America 2026 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

This schedule is automatically displayed in Central Daylight Time (UTC-5). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date."

IMPORTANT NOTE: Timing of sessions and room locations are subject to change.


LMCache supports tiered KV caching with CPU memory offloading, extending inference beyond GPU memory limits. But what happens when local CPU memory isn't enough? This session introduces the next tier: offloading KV cache to Amazon SageMaker HyperPod managed storage, expanding cache capacity for large-scale LLM inference.
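
As a rough, hypothetical sketch of how a remote storage tier can sit behind a KV cache, the example below shows what such a backend interface might look like; the class names, methods, and the assumption that HyperPod managed storage is reachable via a filesystem mount are illustrative, not the actual LMCache or HyperPod connector API.

```python
# Hypothetical remote KV-cache backend interface; names and the filesystem-mount
# assumption are illustrative only, not the actual connector API.
import os
from abc import ABC, abstractmethod
from typing import Optional

import torch


class RemoteKVBackend(ABC):
    """Cold tier sitting behind GPU (hot) and CPU (warm) memory."""

    @abstractmethod
    def put(self, key: str, kv_chunk: torch.Tensor) -> None:
        """Persist a cold KV-cache chunk to managed storage."""

    @abstractmethod
    def get(self, key: str) -> Optional[torch.Tensor]:
        """Fetch a KV-cache chunk back into host memory, or return None on a miss."""


class MountedStorageBackend(RemoteKVBackend):
    """Illustrative stand-in for a HyperPod-backed store reachable via a mount path."""

    def __init__(self, mount_path: str):
        self.mount_path = mount_path

    def _path(self, key: str) -> str:
        return os.path.join(self.mount_path, f"{key}.pt")

    def put(self, key: str, kv_chunk: torch.Tensor) -> None:
        torch.save(kv_chunk.cpu(), self._path(key))

    def get(self, key: str) -> Optional[torch.Tensor]:
        try:
            return torch.load(self._path(key), map_location="cpu")
        except FileNotFoundError:
            return None
```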

We'll cover the technical design of the SageMaker HyperPod connector contribution to LMCache. Hot entries stay in GPU memory, warm entries spill to CPU memory, and cold entries persist to HyperPod's managed storage. This three-tier architecture lets organizations cache far more context than local resources allow, reducing redundant computation for repeated prompts and long-context scenarios.
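
A minimal sketch of that hot/warm/cold flow (not LMCache's internal implementation) might look like the following, reusing the hypothetical RemoteKVBackend above; chunk-counted capacities and least-recently-used eviction are assumptions made for illustration.

```python
# Illustrative three-tier lookup and spill logic; capacities, keys, and LRU
# eviction are assumptions, not the LMCache connector's actual policy.
from collections import OrderedDict
from typing import Optional

import torch


class TieredKVCache:
    def __init__(self, backend, gpu_chunks: int, cpu_chunks: int, device: str = "cuda"):
        self.gpu_tier: OrderedDict[str, torch.Tensor] = OrderedDict()  # hot
        self.cpu_tier: OrderedDict[str, torch.Tensor] = OrderedDict()  # warm
        self.backend = backend                                         # cold (remote)
        self.gpu_chunks = gpu_chunks
        self.cpu_chunks = cpu_chunks
        self.device = device

    def get(self, key: str) -> Optional[torch.Tensor]:
        if key in self.gpu_tier:                      # hot hit
            self.gpu_tier.move_to_end(key)
            return self.gpu_tier[key]
        if key in self.cpu_tier:                      # warm hit: promote to GPU
            return self._promote(key, self.cpu_tier.pop(key))
        remote = self.backend.get(key)                # cold hit: fetch from storage
        if remote is not None:
            return self._promote(key, remote)
        return None                                   # full miss: engine recomputes prefill

    def put(self, key: str, chunk: torch.Tensor) -> None:
        self._promote(key, chunk)

    def _promote(self, key: str, chunk: torch.Tensor) -> torch.Tensor:
        # Spill least-recently-used hot entries to CPU memory when the GPU tier is full.
        while len(self.gpu_tier) >= self.gpu_chunks:
            old_key, old_chunk = self.gpu_tier.popitem(last=False)
            self._spill_to_cpu(old_key, old_chunk.cpu())
        self.gpu_tier[key] = chunk.to(self.device)
        return self.gpu_tier[key]

    def _spill_to_cpu(self, key: str, chunk: torch.Tensor) -> None:
        # Spill least-recently-used warm entries to the remote cold tier.
        while len(self.cpu_tier) >= self.cpu_chunks:
            old_key, old_chunk = self.cpu_tier.popitem(last=False)
            self.backend.put(old_key, old_chunk)
        self.cpu_tier[key] = chunk
```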

The session demonstrates the integration in action, showing cache hit rates, latency across tiers, and how the connector handles transitions between local and remote storage. We'll discuss key engineering decisions, including async prefetching and failure handling.
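
The async prefetching and failure handling mentioned above could look roughly like this; the timeout value, chunk keys, and the fall-back-to-recompute convention are assumptions for the sketch rather than details of the actual connector.

```python
# Hedged sketch: prefetch cold chunks off the critical path and degrade to a
# cache miss (prefill recompute) on timeout or storage errors.
import asyncio
import logging
from typing import Optional

import torch

log = logging.getLogger("kv_prefetch")


async def prefetch_chunk(backend, key: str, timeout_s: float = 0.5) -> Optional[torch.Tensor]:
    loop = asyncio.get_running_loop()
    try:
        # Run the blocking remote read in a worker thread, bounded by a timeout,
        # so a slow or failed fetch never stalls the decode loop.
        return await asyncio.wait_for(
            loop.run_in_executor(None, backend.get, key), timeout=timeout_s
        )
    except (asyncio.TimeoutError, OSError) as exc:
        log.warning("prefetch of %s failed (%s); treating as a miss", key, exc)
        return None  # caller recomputes the prefill for this chunk


async def prefetch_many(backend, keys: list[str]) -> dict[str, torch.Tensor]:
    # Issue prefetches for the next expected chunks concurrently.
    results = await asyncio.gather(*(prefetch_chunk(backend, k) for k in keys))
    return {k: r for k, r in zip(keys, results) if r is not None}
```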

Attendees will leave with practical knowledge of how managed cloud storage can extend open source caching frameworks for LLM inference infrastructure.
Speakers

Yihua Cheng

CTO, Tensormesh, Inc.
Yihua Cheng is co-founder and CTO of Tensormesh. He has a deep background in large language models, high-performance computing, and open-source development.
Yihua created LMCache and the vLLM production stack, open-source projects that have collectively earned over 9,000 GitHub...

Ziwen Ning

Open Source Contributor
Ziwen Ning is an open-source contributor to LMCache. He was previously a Senior Software Development Engineer at AWS, working on Amazon SageMaker HyperPod with a focus on building scalable ML infrastructure. Before that at Annapurna Labs, he enhanced the AI/ML experience through the...
Monday May 18, 2026 4:30pm - 5:10pm CDT
211A+B (Level Two)
  Open AI & Data
