Reduce energy consumption of your machine learning workloads by up to 90% with AWS purpose-built accelerators

Machine learning (ML) engineers have traditionally focused on striking a balance between model training and deployment cost on the one hand and performance on the other. For reference, GPT-3, an earlier-generation LLM, has 175 billion parameters and requires months of non-stop training on a cluster of thousands of accelerated processors.

Scale your machine learning workloads on Amazon ECS powered by AWS Trainium instances

AWS Machine Learning Blog

Running machine learning (ML) workloads with containers is becoming a common practice. With containers, scaling on a cluster becomes much easier. Prerequisites: to follow along, familiarity with core AWS services such as Amazon EC2 and Amazon ECS is assumed.
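
As a rough illustration of what the container side of this setup can look like, the sketch below uses boto3 to register an ECS task definition that exposes a Neuron device to the container and then launches it on an EC2-backed (Trainium) cluster. The cluster name, image URI, and device path are illustrative assumptions, not values from the article.

```python
# Minimal sketch, assuming an existing ECS cluster backed by Trainium (trn1)
# container instances and a training image already pushed to ECR.
# All names below are illustrative.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Register a task definition that exposes a Neuron device to the container.
task_def = ecs.register_task_definition(
    family="trainium-training",
    requiresCompatibilities=["EC2"],  # Trainium runs on EC2 capacity, not Fargate
    containerDefinitions=[{
        "name": "trainer",
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/trainer:latest",
        "memory": 16384,
        "command": ["python", "train.py"],
        "linuxParameters": {  # pass the host's first Neuron device through
            "devices": [{"hostPath": "/dev/neuron0", "containerPath": "/dev/neuron0"}]
        },
    }],
)

# Launch the task on the Trainium-backed cluster.
ecs.run_task(
    cluster="trn1-cluster",
    launchType="EC2",
    taskDefinition=task_def["taskDefinition"]["taskDefinitionArn"],
)
```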

Develop and train large models cost-efficiently with Metaflow and AWS Trainium

AWS Machine Learning Blog

For AWS and Outerbounds customers, the goal is to build a differentiated machine learning and artificial intelligence (ML/AI) system and reliably improve it over time. First, the AWS Trainium accelerator provides a high-performance, cost-effective, and readily available solution for training and fine-tuning large models.
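
For readers unfamiliar with Metaflow, the skeleton below shows the general shape of a training flow. It is a generic sketch only; the decorators and resource settings that would actually route the train step onto Trainium-backed compute are not described in this excerpt and are omitted here.

```python
# Generic Metaflow training-flow skeleton (illustrative; not the post's code).
from metaflow import FlowSpec, step


class TrainFlow(FlowSpec):

    @step
    def start(self):
        # Stage data and hyperparameters as flow artifacts.
        self.learning_rate = 1e-4
        self.next(self.train)

    @step
    def train(self):
        # In a Trainium setup, this step would be scheduled onto trn1 compute
        # via the appropriate Metaflow compute decorator (not shown here).
        print(f"training with lr={self.learning_rate}")
        self.next(self.end)

    @step
    def end(self):
        print("done")


if __name__ == "__main__":
    TrainFlow()
```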

Simple guide to training Llama 2 with AWS Trainium on Amazon SageMaker

AWS Machine Learning Blog

Llama 2 by Meta is an example of an LLM offered through AWS. To learn more about Llama 2 on AWS, refer to Llama 2 foundation models from Meta are now available in Amazon SageMaker JumpStart. It is available in the US East (N. Virginia) and US West (Oregon) AWS Regions, with general availability most recently announced in the US East (Ohio) Region.
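
As a hedged sketch of what deploying Llama 2 through SageMaker JumpStart can look like with the SageMaker Python SDK: the model ID, instance defaults, and payload shape below are assumptions for illustration rather than values from the article, and Meta's license must be accepted before the model can be used.

```python
# Sketch: deploy a JumpStart-hosted Llama 2 model and run one inference.
# Assumed model_id; check JumpStart for the exact identifier you want.
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="meta-textgeneration-llama-2-7b")
predictor = model.deploy(accept_eula=True)  # provisions a real-time endpoint

response = predictor.predict({
    "inputs": "Explain AWS Trainium in one sentence.",
    "parameters": {"max_new_tokens": 64},
})
print(response)

predictor.delete_endpoint()  # clean up to stop incurring charges
```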

Fast and cost-effective LLaMA 2 fine-tuning with AWS Trainium

AWS Machine Learning Blog

In this post, we walk through how to fine-tune Llama 2 on AWS Trainium, a purpose-built accelerator for LLM training, to reduce training times and costs. We review the fine-tuning scripts provided by the AWS Neuron SDK (using NeMo Megatron-LM), the various configurations we used, and the throughput results we saw.
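
The post's actual fine-tuning relies on the Neuron SDK's NeMo Megatron scripts; purely to show how ordinary PyTorch code targets a Trainium NeuronCore through the XLA device abstraction that torch-neuronx builds on, here is a minimal, assumption-laden training loop.

```python
# Minimal sketch of PyTorch on a Trainium XLA device (not the post's scripts).
# Requires torch-neuronx, which provides the XLA backend on trn1 instances.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # resolves to a NeuronCore on Trainium

model = torch.nn.Linear(512, 512).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for _ in range(10):
    x = torch.randn(8, 512).to(device)
    loss = model(x).pow(2).mean()   # dummy objective for illustration
    loss.backward()
    optimizer.step()
    xm.mark_step()                  # triggers compilation/execution of the XLA graph
    optimizer.zero_grad()
```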

Frugality meets Accuracy: Cost-efficient training of GPT NeoX and Pythia models with AWS Trainium

AWS Machine Learning Blog

In this post, we’ll summarize the training procedure of GPT NeoX on AWS Trainium, a purpose-built machine learning (ML) accelerator optimized for deep learning training. We’ll outline how we cost-effectively (3.2M tokens/$) trained such models with AWS Trainium without losing any model quality.
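
To make the tokens-per-dollar framing concrete, a back-of-envelope calculation is throughput times seconds per hour divided by hourly instance cost. The numbers below are illustrative assumptions, not the post's measured figures, chosen only so the arithmetic lands near 3.2M tokens/$.

```python
# Back-of-envelope tokens-per-dollar check (assumed numbers, for illustration).
throughput_tokens_per_sec = 19_000   # assumed training throughput
instance_price_per_hour = 21.50      # assumed hourly on-demand price, USD
num_instances = 1

tokens_per_dollar = (
    throughput_tokens_per_sec * 3600 / (instance_price_per_hour * num_instances)
)
print(f"{tokens_per_dollar / 1e6:.1f}M tokens per dollar")  # ~3.2M
```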

Scaling distributed training with AWS Trainium and Amazon EKS

AWS Machine Learning Blog

Recent developments in deep learning have led to increasingly large models such as GPT-3, BLOOM, and OPT, some of which are already in excess of 100 billion parameters. Many enterprise customers choose to deploy their deep learning workloads using Kubernetes—the de facto standard for container orchestration in the cloud.
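
On Kubernetes, Trainium capacity is typically surfaced to pods as an extended resource by the Neuron device plugin. The sketch below uses the official Python client to request it; the image URI and device count are assumptions for illustration, and it presumes the device plugin is installed on the EKS cluster.

```python
# Sketch: request Trainium Neuron devices from an EKS cluster with the
# official Kubernetes Python client. Assumes the Neuron device plugin
# advertises the "aws.amazon.com/neuron" resource on trn1 nodes.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="trn1-trainer"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="trainer",
            image="123456789012.dkr.ecr.us-east-1.amazonaws.com/trainer:latest",  # illustrative
            command=["python", "train.py"],
            resources=client.V1ResourceRequirements(
                # 16 Neuron devices corresponds to a full trn1.32xlarge node.
                limits={"aws.amazon.com/neuron": "16"},
            ),
        )],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```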
