Machine Learning Research at Apple

Conformal Prediction via Regression-as-Classification

Conformal prediction (CP) for regression can be challenging, especially when the output distribution is heteroscedastic, multimodal, or skewed. Some of the issues can be addressed by estimating a distribution over the output, but in practice such approaches can be sensitive to estimation error and yield unstable intervals. Here, we circumvent these challenges by converting regression to a classification problem and then using CP for classification to obtain CP sets for regression.
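The regression-as-classification idea can be sketched in a few lines: discretize the target into K bins (classes), calibrate standard conformal classification scores, and map the resulting class set back to a union of y-intervals. The toy data, the bin range, and the Gaussian stand-in "classifier" below are illustrative assumptions, not the paper's implementation.

```python
import math
import random

random.seed(0)

# Toy heteroscedastic regression data: y | x ~ Normal(x, 0.1 + x).
def sample(n):
    xs = [random.random() for _ in range(n)]
    ys = [x + random.gauss(0, 0.1 + x) for x in xs]
    return xs, ys

# Step 1: bin the continuous target into K classes.
K, lo, hi = 20, -3.0, 4.0

def to_bin(y):
    j = int((y - lo) / (hi - lo) * K)
    return min(max(j, 0), K - 1)

def bin_interval(j):
    w = (hi - lo) / K
    return (lo + j * w, lo + (j + 1) * w)

# Step 2: a stand-in "classifier" that outputs class probabilities
# (a real model would be learned from data).
def predict_proba(x):
    w = (hi - lo) / K
    centers = [lo + (j + 0.5) * w for j in range(K)]
    s = 0.1 + x
    logits = [math.exp(-0.5 * ((c - x) / s) ** 2) for c in centers]
    z = sum(logits)
    return [l / z for l in logits]

# Step 3: conformal calibration with scores s_i = 1 - p(true bin | x_i).
alpha = 0.1
xs, ys = sample(500)
scores = sorted(1.0 - predict_proba(x)[to_bin(y)] for x, y in zip(xs, ys))
n = len(scores)
q = scores[min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)]

# Step 4: the conformal prediction "set" is a set of bins, i.e. a
# union of intervals in y-space.
def predict_set(x):
    p = predict_proba(x)
    return [bin_interval(j) for j in range(K) if 1.0 - p[j] <= q]

intervals = predict_set(0.5)
```

Because the set is built per class, it can be a union of disjoint intervals, which is exactly what lets this construction track multimodal or skewed output distributions.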

Large Language Models as Generalizable Policies for Embodied Tasks

We show that large language models (LLMs) can be adapted to be generalizable policies for embodied visual tasks. Our approach, called Large LAnguage model Reinforcement Learning Policy (LLaRP), adapts a pre-trained frozen LLM to take as input text instructions and visual egocentric observations and output actions directly in the environment. Using reinforcement learning, we train LLaRP to see and act solely through environmental interactions.

MOFI: Learning Image Representation from Noisy Entity Annotated Images

In this paper, we introduce a novel approach to automatically assign entity labels to images from existing noisy image-text pairs. The approach employs a named entity recognition model to extract entities from text, and uses a CLIP model to select the right entities as the labels of the paired image. The approach is simple, and can be readily scaled up to billions of image-text pairs mined from the web, through which we have successfully created a dataset with 2 million distinct entities.
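The extract-then-select pipeline can be sketched end to end. Everything below is a hypothetical stand-in: the "NER model" just picks capitalized words, and a bag-of-words vector substitutes for CLIP text/image embeddings, so only the selection logic (keep candidates whose embedding is close to the image's) reflects the described approach.

```python
import math

def extract_entities(caption):
    # A real pipeline would run a named entity recognition model; this
    # toy version treats capitalized words as candidate entities.
    return [w.strip(".,?!") for w in caption.split() if w[:1].isupper()]

VOCAB = ["golden", "gate", "bridge", "eiffel", "tower", "photo", "paris"]

def embed(text):
    # Bag-of-words vector as a stand-in for a CLIP embedding.
    words = text.lower().split()
    return [words.count(v) for v in VOCAB]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

# Select, among the NER candidates, the entities whose embedding best
# matches the image embedding (here, a textual stand-in for the image).
def label_image(image_repr, caption, threshold=0.3):
    candidates = extract_entities(caption)
    img = embed(image_repr)
    return [c for c in candidates if cosine(embed(c), img) >= threshold]

labels = label_image("golden gate bridge photo",
                     "Golden Gate Bridge seen from Paris? No, California.")
```

The similarity threshold is what filters out entities that appear in the text but not in the image ("Paris" above), which is the point of the CLIP-based selection step.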

Compressing LLMs: The Truth is Rarely Pure and Never Simple

Despite their remarkable achievements, modern Large Language Models (LLMs) come with exorbitant computational and memory footprints. Recently, several works have shown significant success in training-free and data-free compression (pruning and quantization) of LLMs, achieving 50-60% sparsity and reducing the bit-width down to 3 or 4 bits per weight, with negligible perplexity degradation over the uncompressed baseline.
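As a minimal illustration of the two compression families mentioned, the sketch below applies magnitude pruning to 50% sparsity followed by uniform 4-bit quantization to a tiny weight matrix. This is a generic textbook construction under stated assumptions (symmetric quantization, global magnitude threshold), not the specific methods the paper evaluates.

```python
# Toy weight matrix standing in for one LLM layer (real layers are huge).
W = [[0.8, -0.05, 0.3, -0.7],
     [0.02, -0.4, 0.1, 0.9]]

# Training-free magnitude pruning: zero out the smallest |w| until the
# target sparsity is reached.
def prune(mat, sparsity):
    flat = sorted(abs(w) for row in mat for w in row)
    k = int(len(flat) * sparsity)
    thresh = flat[k - 1] if k > 0 else -1.0
    return [[0.0 if abs(w) <= thresh else w for w in row] for row in mat]

# Uniform symmetric quantization to `bits` bits per weight: snap each
# weight to the nearest multiple of a shared scale.
def quantize(mat, bits):
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(w) for row in mat for w in row) / levels
    return [[round(w / scale) * scale for w in row] for row in mat]

Wp = prune(W, 0.5)          # 50% of the weights become exactly zero
Wq = quantize(Wp, 4)        # remaining weights snap to a 4-bit grid
zeros = sum(w == 0.0 for row in Wq for w in row)  # -> 4 of 8
```

Pruning yields structured zeros (which can be skipped at inference time), while quantization shrinks the storage of the surviving weights; the paper's title hints that the interaction of the two with model quality is less simple than this sketch suggests.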

How Far Are We from Intelligent Visual Deductive Reasoning?

This paper was accepted at the How Far Are We from AGI? workshop at ICLR 2024. Vision-Language Models (VLMs) such as GPT-4V have recently demonstrated incredible strides on diverse vision language tasks. We dig into vision-based deductive reasoning, a more sophisticated but less explored realm, and find previously unexposed blindspots in the current SOTA VLMs.

Pseudo-Generalized Dynamic View Synthesis from a Video

Rendering scenes observed in a monocular video from novel viewpoints is a challenging problem. For static scenes the community has studied both scene-specific optimization techniques, which optimize on every test scene, and generalized techniques, which only run a deep net forward pass on a test scene. In contrast, for dynamic scenes, scene-specific optimization techniques exist, but, to our best knowledge, there is currently no generalized method for dynamic novel view synthesis from a

Guiding Instruction-based Image Editing via Multimodal Large Language Models

Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation via LMs.
