Simple guide to training Llama 2 with AWS Trainium on Amazon SageMaker

AWS Machine Learning Blog

Llama 2 by Meta is an example of an LLM offered by AWS. To learn more about Llama 2 on AWS, refer to Llama 2 foundation models from Meta are now available in Amazon SageMaker JumpStart. AWS Trainium instances are available in the US East (N. Virginia) and US West (Oregon) AWS Regions, with general availability most recently announced in the US East (Ohio) Region.
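
As a rough illustration of what the post covers, here is a minimal sketch of deploying a Llama 2 model from SageMaker JumpStart with the SageMaker Python SDK; the model ID and the EULA handling shown are assumptions that may vary by SDK version and Region.

```python
# Minimal sketch (not from the post): deploy a Llama 2 model from SageMaker JumpStart.
# The model_id below is an assumption; check the JumpStart catalog for the exact ID,
# and note that Llama 2 requires accepting Meta's EULA.
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="meta-textgeneration-llama-2-7b")  # assumed model ID
predictor = model.deploy(accept_eula=True)  # EULA flag; mechanism may differ by SDK version

# Simple text-generation request against the deployed endpoint.
response = predictor.predict({"inputs": "Explain AWS Trainium in one sentence."})
print(response)

predictor.delete_endpoint()  # clean up to stop incurring charges
```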

Reduce energy consumption of your machine learning workloads by up to 90% with AWS purpose-built accelerators

Flipboard

Machine learning (ML) engineers have traditionally focused on striking a balance between cost and performance when training and deploying models. This is important because training ML models, and then using the trained models to make predictions (inference), can be highly energy-intensive tasks.

Develop and train large models cost-efficiently with Metaflow and AWS Trainium

AWS Machine Learning Blog

For AWS and Outerbounds customers, the goal is to build a differentiated machine learning and artificial intelligence (ML/AI) system and reliably improve it over time. The AWS Trainium accelerator provides a high-performance, cost-effective, and readily available solution for training and fine-tuning large models.
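
For orientation, a minimal Metaflow flow skeleton is sketched below; the step structure and data paths are hypothetical, and the decorators that would actually place the training step on Trainium capacity are omitted.

```python
# Minimal sketch of a Metaflow flow that wraps a training job. Step names and
# data locations are illustrative only; the Outerbounds/Trainium integration
# (e.g. decorators that run a step on trn1 instances) is not shown.
from metaflow import FlowSpec, step

class TrainLargeModelFlow(FlowSpec):

    @step
    def start(self):
        # Hypothetical location of preprocessed training data.
        self.dataset_path = "s3://my-bucket/preprocessed/"
        self.next(self.train)

    @step
    def train(self):
        # In a real flow this step would launch the actual fine-tuning job
        # on Trainium-backed compute.
        self.model_artifact = f"trained model from {self.dataset_path}"
        self.next(self.end)

    @step
    def end(self):
        print("Artifact:", self.model_artifact)

if __name__ == "__main__":
    TrainLargeModelFlow()
```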

How Booking.com modernized its ML experimentation framework with Amazon SageMaker

AWS Machine Learning Blog

Because they shared in-house resources with other internal teams, the Ranking team's machine learning (ML) scientists often encountered long wait times to access resources for model training and experimentation, which limited their ability to rapidly experiment and innovate. Once a candidate model shows online improvement, it can be deployed to all users.
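
To make the experimentation side concrete, here is a small sketch that tracks a candidate model with SageMaker Experiments before any online test; the experiment, run, and metric names are hypothetical.

```python
# Minimal sketch (hypothetical names): tracking a candidate model's training run
# with SageMaker Experiments so offline results can be compared before deployment.
from sagemaker.experiments.run import Run

with Run(experiment_name="ranking-experiments", run_name="candidate-v2") as run:
    run.log_parameter("learning_rate", 0.001)        # hyperparameter for this candidate
    run.log_metric(name="offline_ndcg", value=0.42)  # offline metric logged before online testing
```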

Scale your machine learning workloads on Amazon ECS powered by AWS Trainium instances

AWS Machine Learning Blog

Running machine learning (ML) workloads with containers is becoming a common practice. The result is an ML development environment that is consistent and portable, and scaling on a cluster becomes much easier. The ML task itself then runs on Amazon ECS.
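
As a sketch of that last step, the snippet below launches a containerized ML task on Amazon ECS with boto3; the cluster and task definition names are hypothetical, and a Trainium-backed setup would register the task definition against trn1 container instances.

```python
# Minimal sketch: launching a containerized ML task on Amazon ECS with boto3.
# Cluster and task definition names are hypothetical placeholders.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

response = ecs.run_task(
    cluster="ml-training-cluster",       # hypothetical cluster name
    taskDefinition="llama2-finetune:1",  # hypothetical task definition
    launchType="EC2",                    # Trainium instances require the EC2 launch type
    count=1,
)

for task in response["tasks"]:
    print("Started task:", task["taskArn"])
```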

Scaling distributed training with AWS Trainium and Amazon EKS

AWS Machine Learning Blog

Recent developments in deep learning have led to increasingly large models such as GPT-3, BLOOM, and OPT, some of which are already in excess of 100 billion parameters. Many enterprise customers choose to deploy their deep learning workloads using Kubernetes—the de facto standard for container orchestration in the cloud.
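
For context on what the training code itself looks like, below is a minimal single-device PyTorch/XLA loop of the kind torch-neuronx runs on a Trainium NeuronCore; the distributed launch on EKS (torchrun, the Neuron device plugin, and so on) is not shown, and the model and data are placeholders.

```python
# Minimal sketch: a single-device training loop on an XLA device, as used by
# torch-neuronx on Trainium. Model, data, and hyperparameters are placeholders.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # resolves to a NeuronCore when torch-neuronx is installed

model = nn.Linear(512, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 512).to(device)           # placeholder input batch
y = torch.randint(0, 10, (64,)).to(device)    # placeholder labels

for _ in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    xm.optimizer_step(optimizer)  # all-reduces gradients across replicas, then steps
    xm.mark_step()                # triggers execution of the accumulated lazy XLA graph
```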

Announcing New Tools for Building with Generative AI on AWS

Flipboard

The seeds of a machine learning (ML) paradigm shift have existed for decades, but with the ready availability of scalable compute capacity, a massive proliferation of data, and the rapid advancement of ML technologies, customers across industries are transforming their businesses.
