
KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation

Machine Learning Research at Apple

In this work, we propose an efficient parallelization scheme, KV-Runahead, to accelerate the prompt phase. The key observation is that the extension phase generates tokens faster than the prompt phase because of the key-value cache (KV-cache).

147
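The observation in the KV-Runahead blurb rests on a basic asymmetry: with a KV-cache, the extension (decode) phase only computes attention for one new query against cached keys and values, while the prompt (prefill) phase has to process every prompt token at once. The toy NumPy sketch below illustrates only that asymmetry; it is not Apple's KV-Runahead scheme, and the dimensions, the `attend` helper, and the random data are all made up for illustration.

```python
# Minimal sketch (not Apple's KV-Runahead) of why decode is cheaper than prefill:
# with a KV-cache, each new token computes attention for a single query against
# cached keys/values, whereas the prompt phase attends over all prompt tokens.
import numpy as np

d = 64                                   # head dimension (illustrative)
rng = np.random.default_rng(0)

def attend(q, K, V):
    """Single-head scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# Prompt (prefill) phase: keys/values for the whole prompt fill the cache.
prompt_len = 1024
K_cache = rng.standard_normal((prompt_len, d))
V_cache = rng.standard_normal((prompt_len, d))

# Extension (decode) phase: one token at a time, reusing the cache.
for _ in range(4):
    q_new = rng.standard_normal(d)
    out = attend(q_new, K_cache, V_cache)                     # one query per step
    K_cache = np.vstack([K_cache, rng.standard_normal(d)])    # append new key
    V_cache = np.vstack([V_cache, rng.standard_normal(d)])    # append new value
```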

Deno KV Is in Open Beta

Hacker News

Deno KV, the easiest way to add a strongly consistent database to your app, is now in open beta.

Database 177


StreamingLLM: tiny tweak to KV LRU improves long conversations

Hacker News

Researchers developed a technique that enables an AI chatbot like ChatGPT to conduct a day-long conversation with a human collaborator without slowing down or crashing, no matter how much text the conversation involves.

AI 181
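The MIT work described above keeps long chats inside a bounded KV cache. A minimal sketch of the underlying idea, a few permanent "attention sink" entries plus a sliding window of recent entries, is below; the class and parameter names are invented here and this is not the researchers' implementation.

```python
# Illustrative toy (not the paper's code): keep a few initial "attention sink"
# entries plus a sliding window of recent entries, and evict everything in
# between so the KV cache stays bounded no matter how long the chat runs.
from collections import deque

class SinkWindowKVCache:
    def __init__(self, num_sinks=4, window=1024):
        self.num_sinks = num_sinks
        self.sinks = []                        # first tokens, never evicted
        self.recent = deque(maxlen=window)     # recent tokens, oldest drop out

    def append(self, kv_entry):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv_entry)
        else:
            self.recent.append(kv_entry)

    def entries(self):
        return self.sinks + list(self.recent)

cache = SinkWindowKVCache(num_sinks=4, window=8)
for t in range(100_000):                       # a "day-long" stream of tokens
    cache.append((f"k{t}", f"v{t}"))
print(len(cache.entries()))                    # always 12: 4 sinks + 8 recent
```

Because the cache never holds more than num_sinks + window entries, both memory use and per-token attention cost stop growing with conversation length, which is what lets the chat run indefinitely.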

Deno KV internals: building a database for the modern web

Hacker News

How we built a performant, scalable, ACID-compliant, JavaScript-native database on FoundationDB.

Database 152

Harvesting Electricity from High-Voltage Transmission Lines Using Fences

Hacker News

When you have a bunch of 230 kV transmission lines running over your property, why not use them for some scientific experiments? This is where the [Double M Innovations] YouTube channel comes into play.

181

Efficient Memory Management for Large Language Model Serving with PagedAttention

Hacker News

Existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. On top of PagedAttention, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage.

Algorithm 181
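The "near-zero waste" claim above comes from managing the KV cache the way an operating system manages paged virtual memory: fixed-size blocks are allocated on demand from a shared pool and returned when a request finishes. The plain-Python sketch below shows only that bookkeeping; it is not vLLM's implementation, and `PagedKVCache`, `BLOCK_SIZE`, and the request ids are illustrative.

```python
# Illustrative bookkeeping only (not vLLM): the KV cache is carved into fixed-size
# physical blocks; each request's block table grows one block at a time, and blocks
# return to the shared free pool when the request finishes.
BLOCK_SIZE = 16                                  # tokens per block (illustrative)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))   # pool of physical block ids
        self.block_tables = {}                       # request id -> [block ids]
        self.lengths = {}                            # request id -> tokens stored

    def append_token(self, req_id):
        table = self.block_tables.setdefault(req_id, [])
        length = self.lengths.get(req_id, 0)
        if length % BLOCK_SIZE == 0:                 # current block full (or none yet)
            table.append(self.free_blocks.pop())     # allocate on demand
        self.lengths[req_id] = length + 1

    def free(self, req_id):
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

pool = PagedKVCache(num_blocks=64)
for _ in range(40):
    pool.append_token("req-A")                       # 40 tokens -> 3 blocks in use
pool.free("req-A")                                   # blocks become reusable again
```

The paper additionally shares physical blocks across requests (with copy-on-write when they diverge), which this sketch omits.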

LLMLingua: Compressing Prompts for Faster Inferencing

Hacker News

To speed up LLM inference and sharpen the model's focus on key information, LLMLingua compresses the prompt and KV-cache, achieving up to 20x compression with minimal performance loss.

136
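To make "compressing the prompt" concrete, the toy function below drops the tokens a crude frequency heuristic deems least informative until a budget is met. This is an invented stand-in, not LLMLingua's method, which scores tokens with a small language model rather than a frequency count.

```python
# Invented stand-in, not LLMLingua's algorithm: drop the tokens judged least
# informative by a crude frequency heuristic until a target budget is met.
from collections import Counter

def compress_prompt(tokens, keep_ratio=0.5):
    counts = Counter(tokens)
    # Toy importance score: rarer tokens are assumed to carry more information.
    by_importance = sorted(range(len(tokens)), key=lambda i: counts[tokens[i]])
    budget = max(1, int(len(tokens) * keep_ratio))
    keep = set(by_importance[:budget])
    return [tok for i, tok in enumerate(tokens) if i in keep]   # preserve order

prompt = "the cat sat on the mat because the mat was warm".split()
print(compress_prompt(prompt, keep_ratio=0.5))   # roughly half the tokens survive
```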