
KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation

Machine Learning Research at Apple

In this work, we propose an efficient parallelization scheme, KV-Runahead, to accelerate the prompt phase. The key observation is that the extension phase generates tokens faster than the prompt phase because of the key-value cache (KV-cache).

147
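The observation in the KV-Runahead blurb rests on a basic asymmetry: with a KV-cache, the extension (decode) phase only computes attention for one new query against cached keys and values, while the prompt (prefill) phase has to process every prompt token at once. The toy NumPy sketch below illustrates only that asymmetry; it is not Apple's KV-Runahead scheme, and the dimensions, the `attend` helper, and the random data are all made up for illustration.

```python
# Minimal sketch (not Apple's KV-Runahead) of why decode is cheaper than prefill:
# with a KV-cache, each new token computes attention for a single query against
# cached keys/values, whereas the prompt phase attends over all prompt tokens.
import numpy as np

d = 64                                   # head dimension (illustrative)
rng = np.random.default_rng(0)

def attend(q, K, V):
    """Single-head scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# Prompt (prefill) phase: keys/values for the whole prompt fill the cache.
prompt_len = 1024
K_cache = rng.standard_normal((prompt_len, d))
V_cache = rng.standard_normal((prompt_len, d))

# Extension (decode) phase: one token at a time, reusing the cache.
for _ in range(4):
    q_new = rng.standard_normal(d)
    out = attend(q_new, K_cache, V_cache)                     # one query per step
    K_cache = np.vstack([K_cache, rng.standard_normal(d)])    # append new key
    V_cache = np.vstack([V_cache, rng.standard_normal(d)])    # append new value
```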

Deno KV Is in Open Beta

Hacker News

Deno KV, the easiest way to add a strongly consistent database to your app, is now in open beta.

Database 177


StreamingLLM: tiny tweak to KV LRU improves long conversations

Hacker News

Researchers developed a technique that enables an AI chatbot like ChatGPT to conduct a day-long conversation with a human collaborator without slowing down or crashing, no matter how much text the conversation involves.

AI 181
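The MIT work described above keeps long chats inside a bounded KV cache. A minimal sketch of the underlying idea, a few permanent "attention sink" entries plus a sliding window of recent entries, is below; the class and parameter names are invented here and this is not the researchers' implementation.

```python
# Illustrative toy (not the paper's code): keep a few initial "attention sink"
# entries plus a sliding window of recent entries, and evict everything in
# between so the KV cache stays bounded no matter how long the chat runs.
from collections import deque

class SinkWindowKVCache:
    def __init__(self, num_sinks=4, window=1024):
        self.num_sinks = num_sinks
        self.sinks = []                        # first tokens, never evicted
        self.recent = deque(maxlen=window)     # recent tokens, oldest drop out

    def append(self, kv_entry):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv_entry)
        else:
            self.recent.append(kv_entry)

    def entries(self):
        return self.sinks + list(self.recent)

cache = SinkWindowKVCache(num_sinks=4, window=8)
for t in range(100_000):                       # a "day-long" stream of tokens
    cache.append((f"k{t}", f"v{t}"))
print(len(cache.entries()))                    # always 12: 4 sinks + 8 recent
```

Because the cache never holds more than num_sinks + window entries, both memory use and per-token attention cost stop growing with conversation length, which is what lets the chat run indefinitely.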

Deno KV internals: building a database for the modern web

Hacker News

How we built a performant, scalable, ACID-compliant, JavaScript-native database on FoundationDB.

Database 152

Harvesting Electricity from High-Voltage Transmission Lines Using Fences

Hacker News

When you have a bunch of 230 kV transmission lines running over your property, why not use them for some scientific experiments? This is where the [Double M Innovations] YouTube channel comes into play.

181

Efficient Memory Management for Large Language Model Serving with PagedAttention

Hacker News

Existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. On top of PagedAttention, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage.

Algorithm 181
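The "near-zero waste" claim above comes from managing the KV cache the way an operating system manages paged virtual memory: fixed-size blocks are allocated on demand from a shared pool and returned when a request finishes. The plain-Python sketch below shows only that bookkeeping; it is not vLLM's implementation, and `PagedKVCache`, `BLOCK_SIZE`, and the request ids are illustrative.

```python
# Illustrative bookkeeping only (not vLLM): the KV cache is carved into fixed-size
# physical blocks; each request's block table grows one block at a time, and blocks
# return to the shared free pool when the request finishes.
BLOCK_SIZE = 16                                  # tokens per block (illustrative)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))   # pool of physical block ids
        self.block_tables = {}                       # request id -> [block ids]
        self.lengths = {}                            # request id -> tokens stored

    def append_token(self, req_id):
        table = self.block_tables.setdefault(req_id, [])
        length = self.lengths.get(req_id, 0)
        if length % BLOCK_SIZE == 0:                 # current block full (or none yet)
            table.append(self.free_blocks.pop())     # allocate on demand
        self.lengths[req_id] = length + 1

    def free(self, req_id):
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

pool = PagedKVCache(num_blocks=64)
for _ in range(40):
    pool.append_token("req-A")                       # 40 tokens -> 3 blocks in use
pool.free("req-A")                                   # blocks become reusable again
```

The paper additionally shares physical blocks across requests (with copy-on-write when they diverge), which this sketch omits.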

LLMLingua: Compressing Prompts for Faster Inferencing

Hacker News

To speed up LLM inference and sharpen the model's focus on key information, LLMLingua compresses the prompt and KV-cache, achieving up to 20x compression with minimal performance loss.

136
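To make "compressing the prompt" concrete, the toy function below drops the tokens a crude frequency heuristic deems least informative until a budget is met. This is an invented stand-in, not LLMLingua's method, which scores tokens with a small language model rather than a frequency count.

```python
# Invented stand-in, not LLMLingua's algorithm: drop the tokens judged least
# informative by a crude frequency heuristic until a target budget is met.
from collections import Counter

def compress_prompt(tokens, keep_ratio=0.5):
    counts = Counter(tokens)
    # Toy importance score: rarer tokens are assumed to carry more information.
    by_importance = sorted(range(len(tokens)), key=lambda i: counts[tokens[i]])
    budget = max(1, int(len(tokens) * keep_ratio))
    keep = set(by_importance[:budget])
    return [tok for i, tok in enumerate(tokens) if i in keep]   # preserve order

prompt = "the cat sat on the mat because the mat was warm".split()
print(compress_prompt(prompt, keep_ratio=0.5))   # roughly half the tokens survive
```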