Machine Learning Research at Apple

Conformal Prediction via Regression-as-Classification

Conformal prediction (CP) for regression can be challenging, especially when the output distribution is heteroscedastic, multimodal, or skewed. Some of the issues can be addressed by estimating a distribution over the output, but in practice such approaches can be sensitive to estimation error and yield unstable intervals. Here, we circumvent these challenges by converting regression to a classification problem and then using CP for classification to obtain CP sets for regression.
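The regression-as-classification idea can be sketched in a few lines: discretize the target into K bins (classes), calibrate standard conformal classification scores, and map the resulting class set back to a union of y-intervals. The toy data, the bin range, and the Gaussian stand-in "classifier" below are illustrative assumptions, not the paper's implementation.

```python
import math
import random

random.seed(0)

# Toy heteroscedastic regression data: y | x ~ Normal(x, 0.1 + x).
def sample(n):
    xs = [random.random() for _ in range(n)]
    ys = [x + random.gauss(0, 0.1 + x) for x in xs]
    return xs, ys

# Step 1: bin the continuous target into K classes.
K, lo, hi = 20, -3.0, 4.0

def to_bin(y):
    j = int((y - lo) / (hi - lo) * K)
    return min(max(j, 0), K - 1)

def bin_interval(j):
    w = (hi - lo) / K
    return (lo + j * w, lo + (j + 1) * w)

# Step 2: a stand-in "classifier" that outputs class probabilities
# (a real model would be learned from data).
def predict_proba(x):
    w = (hi - lo) / K
    centers = [lo + (j + 0.5) * w for j in range(K)]
    s = 0.1 + x
    logits = [math.exp(-0.5 * ((c - x) / s) ** 2) for c in centers]
    z = sum(logits)
    return [l / z for l in logits]

# Step 3: conformal calibration with scores s_i = 1 - p(true bin | x_i).
alpha = 0.1
xs, ys = sample(500)
scores = sorted(1.0 - predict_proba(x)[to_bin(y)] for x, y in zip(xs, ys))
n = len(scores)
q = scores[min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)]

# Step 4: the conformal prediction "set" is a set of bins, i.e. a
# union of intervals in y-space.
def predict_set(x):
    p = predict_proba(x)
    return [bin_interval(j) for j in range(K) if 1.0 - p[j] <= q]

intervals = predict_set(0.5)
```

Because the set is built per class, it can be a union of disjoint intervals, which is exactly what lets this construction track multimodal or skewed output distributions.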

Large Language Models as Generalizable Policies for Embodied Tasks

We show that large language models (LLMs) can be adapted to be generalizable policies for embodied visual tasks. Our approach, called Large LAnguage model Reinforcement Learning Policy (LLaRP), adapts a pre-trained frozen LLM to take as input text instructions and visual egocentric observations and output actions directly in the environment. Using reinforcement learning, we train LLaRP to see and act solely through environmental interactions.

MOFI: Learning Image Representation from Noisy Entity Annotated Images

In this paper, we introduce a novel approach to automatically assign entity labels to images from existing noisy image-text pairs. The approach employs a named entity recognition model to extract entities from text, and uses a CLIP model to select the right entities as the labels of the paired image. The approach is simple, and can be readily scaled up to billions of image-text pairs mined from the web, through which we have successfully created a dataset with 2 million distinct entities.
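The extract-then-select pipeline can be sketched end to end. Everything below is a hypothetical stand-in: the "NER model" just picks capitalized words, and a bag-of-words vector substitutes for CLIP text/image embeddings, so only the selection logic (keep candidates whose embedding is close to the image's) reflects the described approach.

```python
import math

def extract_entities(caption):
    # A real pipeline would run a named entity recognition model; this
    # toy version treats capitalized words as candidate entities.
    return [w.strip(".,?!") for w in caption.split() if w[:1].isupper()]

VOCAB = ["golden", "gate", "bridge", "eiffel", "tower", "photo", "paris"]

def embed(text):
    # Bag-of-words vector as a stand-in for a CLIP embedding.
    words = text.lower().split()
    return [words.count(v) for v in VOCAB]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

# Select, among the NER candidates, the entities whose embedding best
# matches the image embedding (here, a textual stand-in for the image).
def label_image(image_repr, caption, threshold=0.3):
    candidates = extract_entities(caption)
    img = embed(image_repr)
    return [c for c in candidates if cosine(embed(c), img) >= threshold]

labels = label_image("golden gate bridge photo",
                     "Golden Gate Bridge seen from Paris? No, California.")
```

The similarity threshold is what filters out entities that appear in the text but not in the image ("Paris" above), which is the point of the CLIP-based selection step.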

Compressing LLMs: The Truth is Rarely Pure and Never Simple

Despite their remarkable achievements, modern Large Language Models (LLMs) come with exorbitant computational and memory footprints. Recently, several works have shown significant success in training-free and data-free compression (pruning and quantization) of LLMs, achieving 50-60% sparsity and reducing the bit-width down to 3 or 4 bits per weight, with negligible perplexity degradation over the uncompressed baseline.
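As a minimal illustration of the two compression families mentioned, the sketch below applies magnitude pruning to 50% sparsity followed by uniform 4-bit quantization to a tiny weight matrix. This is a generic textbook construction under stated assumptions (symmetric quantization, global magnitude threshold), not the specific methods the paper evaluates.

```python
# Toy weight matrix standing in for one LLM layer (real layers are huge).
W = [[0.8, -0.05, 0.3, -0.7],
     [0.02, -0.4, 0.1, 0.9]]

# Training-free magnitude pruning: zero out the smallest |w| until the
# target sparsity is reached.
def prune(mat, sparsity):
    flat = sorted(abs(w) for row in mat for w in row)
    k = int(len(flat) * sparsity)
    thresh = flat[k - 1] if k > 0 else -1.0
    return [[0.0 if abs(w) <= thresh else w for w in row] for row in mat]

# Uniform symmetric quantization to `bits` bits per weight: snap each
# weight to the nearest multiple of a shared scale.
def quantize(mat, bits):
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(w) for row in mat for w in row) / levels
    return [[round(w / scale) * scale for w in row] for row in mat]

Wp = prune(W, 0.5)          # 50% of the weights become exactly zero
Wq = quantize(Wp, 4)        # remaining weights snap to a 4-bit grid
zeros = sum(w == 0.0 for row in Wq for w in row)  # -> 4 of 8
```

Pruning yields structured zeros (which can be skipped at inference time), while quantization shrinks the storage of the surviving weights; the paper's title hints that the interaction of the two with model quality is less simple than this sketch suggests.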

How Far Are We from Intelligent Visual Deductive Reasoning?

This paper was accepted at the How Far Are We from AGI? workshop at ICLR 2024. Vision-Language Models (VLMs) such as GPT-4V have recently demonstrated incredible strides on diverse vision language tasks. We dig into vision-based deductive reasoning, a more sophisticated but less explored realm, and find previously unexposed blindspots in the current SOTA VLMs.

Pseudo-Generalized Dynamic View Synthesis from a Video

Rendering scenes observed in a monocular video from novel viewpoints is a challenging problem. For static scenes the community has studied both scene-specific optimization techniques, which optimize on every test scene, and generalized techniques, which only run a deep net forward pass on a test scene. In contrast, for dynamic scenes, scene-specific optimization techniques exist, but, to our best knowledge, there is currently no generalized method for dynamic novel view synthesis from a

Guiding Instruction-based Image Editing via Multimodal Large Language Models

Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation via LMs.
