The excitement is building for the fourteenth edition of AWS re:Invent, and as always, Las Vegas is set to host this spectacular event. We'll also explore the robust AWS infrastructure services powering AI innovation, featuring Amazon SageMaker, AWS Trainium, and AWS Inferentia under the AI/ML and Compute topics.
In 2018, I sat in the audience at AWS re:Invent as Andy Jassy announced AWS DeepRacer, a fully autonomous 1/18th scale race car driven by reinforcement learning. At the time, I knew little about AI or machine learning (ML). A winning lap measured in seconds secured the 2018 AWS DeepRacer grand champion title!
In this post, we demonstrate how you can address this requirement by using Amazon SageMaker HyperPod training plans, which can bring down your training cluster procurement wait time. We further guide you through using the training plan to submit SageMaker training jobs or create SageMaker HyperPod clusters.
AWS Trainium and AWS Inferentia based instances, combined with Amazon Elastic Kubernetes Service (Amazon EKS), provide a performant, low-cost framework to run LLMs efficiently in a containerized environment. Solution overview: the steps to implement the solution begin with creating the EKS cluster.
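As a rough sketch of that first step, here is how the EKS cluster might be created with boto3; the cluster name, IAM role ARN, and subnet IDs are placeholder assumptions, and Trainium or Inferentia node groups would be added afterwards.

```python
# Minimal sketch of the "create the EKS cluster" step using boto3.
# Role ARN, subnets, and cluster name are placeholders from a hypothetical account.
import boto3

eks = boto3.client("eks", region_name="us-west-2")

response = eks.create_cluster(
    name="trn1-llm-cluster",                                   # hypothetical cluster name
    version="1.29",
    roleArn="arn:aws:iam::111122223333:role/eksClusterRole",   # placeholder IAM role
    resourcesVpcConfig={
        "subnetIds": ["subnet-aaaa1111", "subnet-bbbb2222"],   # placeholder subnets
    },
)
print(response["cluster"]["status"])  # typically "CREATING"
```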
The process of setting up and configuring a distributed training environment can be complex, requiring expertise in server management, cluster configuration, networking and distributed computing. To simplify infrastructure setup and accelerate distributed training, AWS introduced Amazon SageMaker HyperPod in late 2023.
Scaling machine learning (ML) workflows from initial prototypes to large-scale production deployment can be a daunting task, but the integration of Amazon SageMaker Studio and Amazon SageMaker HyperPod offers a streamlined solution to this challenge. Make sure you have the latest version of the AWS Command Line Interface (AWS CLI).
Machine learning (ML) helps organizations increase revenue, drive business growth, and reduce costs by optimizing core business functions such as supply and demand forecasting, customer churn prediction, credit risk scoring, pricing, predicting late shipments, and many others. For this post, we'll use a provisioned Amazon Redshift cluster.
Amazon SageMaker supports geospatial machine learning (ML) capabilities, allowing data scientists and ML engineers to build, train, and deploy ML models using geospatial data. We use the purpose-built geospatial container with SageMaker Processing jobs for a simplified, managed experience to create and run a cluster.
As AWS environments grow in complexity, troubleshooting issues with resources can become a daunting task. Fortunately, AWS provides a powerful tool called AWS Support Automation Workflows, a collection of curated AWS Systems Manager self-service automation runbooks. The agent uses Anthropic's Claude 3.5.
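For illustration, a hedged sketch of triggering one such AWSSupport-* runbook through Systems Manager with boto3; the runbook name and instance ID below are placeholders, not necessarily the ones the agent in the post uses.

```python
# Hedged sketch of invoking an AWS Support Automation Workflow runbook via SSM.
import boto3

ssm = boto3.client("ssm", region_name="us-east-1")

execution = ssm.start_automation_execution(
    DocumentName="AWSSupport-TroubleshootSSH",              # assumed runbook name
    Parameters={"InstanceId": ["i-0123456789abcdef0"]},     # placeholder instance
)
print(execution["AutomationExecutionId"])
```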
As cluster sizes grow, the likelihood of failure increases due to the number of hardware components involved. Larger clusters, more failures, smaller MTBF: as cluster size increases, the entropy of the system increases, resulting in a lower mean time between failures (MTBF). In tightly coupled distributed training, this implies that if a single instance fails, it stops the entire job.
Meta Llama 3.1 8B and 70B inference is now supported on AWS Trainium and AWS Inferentia instances in Amazon SageMaker JumpStart. Trainium and Inferentia, enabled by the AWS Neuron software development kit (SDK), offer high performance and lower the cost of deploying Meta Llama 3.1 models. Prerequisites include an AWS Identity and Access Management (IAM) role to access SageMaker.
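A minimal sketch of deploying one of these models from SageMaker JumpStart onto an Inferentia2-backed instance with the SageMaker Python SDK; the Neuron model ID and instance type are assumptions, so confirm them in the JumpStart catalog.

```python
# Hedged sketch: deploy a Llama 3.1 8B Neuron model from SageMaker JumpStart.
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(
    model_id="meta-textgenerationneuron-llama-3-1-8b",  # assumed JumpStart Neuron model ID
)
predictor = model.deploy(
    accept_eula=True,
    instance_type="ml.inf2.48xlarge",                   # Inferentia2-backed instance
)
print(predictor.predict({"inputs": "What is AWS Inferentia?"}))
```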
These experiences are made possible by our machine learning (ML) backend engine, with ML models built for video understanding, search, recommendation, advertising, and novel visual effects. By using sophisticated ML algorithms, the platform efficiently scans billions of videos each day.
The potential for such large business value is galvanizing tens of thousands of enterprises to build their generative AI applications on AWS. This post addresses these cost considerations so you can optimize your generative AI costs on AWS; the annual cost figures it presents are directional and based on stated assumptions.
Prerequisites: to implement the proposed solution, make sure that you have an AWS account and a working knowledge of FMs, Amazon Bedrock, Amazon SageMaker, Amazon OpenSearch Service, Amazon S3, and AWS Identity and Access Management (IAM). You also need access to the Amazon Titan Multimodal Embeddings model in Amazon Bedrock.
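For reference, a hedged example of invoking the Amazon Titan Multimodal Embeddings model mentioned in the prerequisites; the image file is a placeholder, and the request shape follows the Bedrock documentation as commonly described.

```python
# Hedged sketch: get a multimodal embedding from Amazon Titan on Bedrock.
import base64
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("product.jpg", "rb") as f:                      # placeholder image
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = bedrock.invoke_model(
    modelId="amazon.titan-embed-image-v1",                # Titan Multimodal Embeddings
    body=json.dumps({"inputText": "red running shoe", "inputImage": image_b64}),
)
embedding = json.loads(response["body"].read())["embedding"]
print(len(embedding))  # embedding dimension
```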
Powered by generative AI services on AWS and the multimodal capabilities of large language models (LLMs), HCLTech's AutoWise Companion provides a seamless and impactful experience. The overall technical architecture is implemented using AWS services and LangChain; AWS Glue is used for data cataloging.
In this post, we demonstrate a solution using Amazon Elastic Kubernetes Service (Amazon EKS) with Amazon Bedrock to build scalable and containerized RAG solutions for your generative AI applications on AWS while bringing your unstructured user file data to Amazon Bedrock in a straightforward, fast, and secure way. The solution uses an Anthropic Claude Sonnet model on Amazon Bedrock.
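As a rough illustration, here is what a retrieval-augmented call from an application pod on EKS might look like against an Amazon Bedrock knowledge base; the knowledge base ID is a placeholder, and the Claude 3 Sonnet model ARN is an assumption based on the excerpt.

```python
# Hedged sketch of a RAG query against Amazon Bedrock Knowledge Bases.
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = agent_runtime.retrieve_and_generate(
    input={"text": "Summarize our refund policy."},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB1234567890",  # placeholder knowledge base ID
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/"
                        "anthropic.claude-3-sonnet-20240229-v1:0",  # assumed model
        },
    },
)
print(response["output"]["text"])
```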
Communication between the two systems was established through Kerberized Apache Livy (HTTPS) connections over AWS PrivateLink. Data exploration and model development were conducted using well-known machine learning (ML) tools such as Jupyter or Apache Zeppelin notebooks. HBase is employed to offer real-time key-based access to data.
Recent developments in machine learning (ML) have led to increasingly large models, some of which require hundreds of billions of parameters. In such distributed environments, observability of both instances and ML chips becomes key to model performance fine-tuning and cost optimization.
In this post, we introduce the AWS Neuron node problem detector and recovery DaemonSet for AWS Trainium and AWS Inferentia on Amazon Elastic Kubernetes Service (Amazon EKS). Install the required AWS Identity and Access Management (IAM) role for the service account and the node problem detector plugin.
When the stakes are high, success requires not just cutting-edge technology, but the ability to operationalize it at scale, a challenge that AWS has consistently solved for customers. To train generative AI models at enterprise scale, ServiceNow uses NVIDIA DGX Cloud on AWS. The team achieved 97.1%
Businesses are under pressure to show return on investment (ROI) from AI use cases, whether predictive machine learning (ML) or generative AI. Only 54% of ML prototypes make it to production, and only 5% of generative AI use cases make it to production. Using SageMaker, you can build, train, and deploy ML models.
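As a rough sketch of that build, train, and deploy flow with the SageMaker Python SDK; the bucket, role, data paths, and the choice of the built-in XGBoost container are illustrative assumptions.

```python
# Hedged end-to-end sketch: train and deploy a model with the SageMaker Python SDK.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder role

xgb = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, "1.7-1"),
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/churn/output",                  # placeholder bucket
    hyperparameters={"objective": "binary:logistic", "num_round": "100"},
)
xgb.fit({"train": TrainingInput("s3://my-bucket/churn/train.csv", content_type="text/csv")})

predictor = xgb.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```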
These recipes include a training stack validated by Amazon Web Services (AWS), which removes the tedious work of experimenting with different model configurations, minimizing the time it takes for iterative evaluation and testing. The launcher interfaces with your cluster through Slurm or Kubernetes native constructs.
Amazon Web Services (AWS) provides the essential compute infrastructure to support these endeavors, offering scalable and powerful resources through Amazon SageMaker HyperPod. To offer a more concrete look at these trends, the following is a deep dive into how climate tech startups are building FMs on AWS.
We walk through the journey Octus took from managing multiple cloud providers and costly GPU instances to implementing a streamlined, cost-effective solution using AWS services including Amazon Bedrock, AWS Fargate, and Amazon OpenSearch Service. Along the way, it also simplified operations, since Octus is more broadly an AWS shop.
We recently announced the general availability of cross-account sharing of Amazon SageMaker Model Registry using AWS Resource Access Manager (AWS RAM) , making it easier to securely share and discover machine learning (ML) models across your AWS accounts.
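For context, a hedged sketch of what the cross-account sharing step might look like with boto3 and AWS RAM; the model package group ARN and consumer account ID are placeholders, and the exact resource types supported for sharing are described in the announcement.

```python
# Hedged sketch: share a SageMaker Model Registry model package group via AWS RAM.
import boto3

ram = boto3.client("ram", region_name="us-east-1")

share = ram.create_resource_share(
    name="shared-model-registry",
    resourceArns=[
        "arn:aws:sagemaker:us-east-1:111122223333:model-package-group/my-models"  # placeholder
    ],
    principals=["444455556666"],      # placeholder consumer AWS account ID
    allowExternalPrincipals=False,
)
print(share["resourceShare"]["resourceShareArn"])
```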
Starting with the AWS Neuron 2.18 release, you can now launch Neuron DLAMIs (AWS Deep Learning AMIs) and Neuron DLCs (AWS Deep Learning Containers) with the latest released Neuron packages on the same day as the Neuron SDK release. AWS DLCs provide a set of Docker images that are pre-installed with deep learning frameworks.
Syngenta and AWS collaborated to develop Cropwise AI , an innovative solution powered by Amazon Bedrock Agents , to accelerate their sales reps’ ability to place Syngenta seed products with growers across North America. The collaboration between Syngenta and AWS showcases the transformative power of LLMs and AI agents.
Build a Search Engine: Setting Up AWS OpenSearch. The post covers what AWS OpenSearch is, what it is commonly used for, its key features, how it works, and why to use it for semantic search.
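As a rough illustration of the semantic search setup the post discusses, here is a minimal sketch of creating a k-NN vector index with the opensearch-py client; the domain endpoint, credentials, and embedding dimension are placeholder assumptions.

```python
# Hedged sketch: create a k-NN index for semantic search in OpenSearch.
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],  # placeholder
    http_auth=("admin", "admin-password"),                                  # placeholder
    use_ssl=True,
)

client.indices.create(
    index="docs-semantic",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "embedding": {"type": "knn_vector", "dimension": 768},  # assumed dimension
            }
        },
    },
)
```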
Amazon Web Services is excited to announce the launch of the AWS Neuron Monitor container, an innovative tool designed to enhance the monitoring capabilities of AWS Inferentia and AWS Trainium chips on Amazon Elastic Kubernetes Service (Amazon EKS).
In the context of enterprise data asset search powered by a metadata catalog hosted on services such as Amazon DataZone, AWS Glue, and other third-party catalogs, knowledge graphs can help integrate this linked data and also enable a scalable search paradigm that integrates metadata that evolves over time.
We demonstrate this solution by walking you through a comprehensive step-by-step guide on how to fine-tune YOLOv8, a real-time object detection model, on Amazon Web Services (AWS) using a custom dataset. You can train foundation models (FMs) for weeks and months without disruption by automatically monitoring and repairing training clusters.
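For context, a minimal sketch of the core Ultralytics fine-tuning call that such a walkthrough typically wraps; the dataset YAML path is a placeholder for your custom dataset definition.

```python
# Minimal sketch: fine-tune YOLOv8 on a custom dataset with the Ultralytics API.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                     # start from pretrained weights
results = model.train(
    data="custom-dataset.yaml",                # placeholder dataset config
    epochs=50,
    imgsz=640,
)
model.export(format="onnx")                    # optional: export for deployment
```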
Although QLoRA helps optimize memory during fine-tuning, we will use Amazon SageMaker Training to spin up a resilient training cluster, manage orchestration, and monitor the cluster for failures. To take complete advantage of this multi-GPU cluster, we use the recent support for QLoRA and PyTorch FSDP on a 24xlarge compute instance.
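A hedged sketch of how such a QLoRA plus FSDP fine-tuning script might be launched as a SageMaker training job; the entry point, instance type, framework versions, and hyperparameters are assumptions, and torch_distributed launches one process per GPU so FSDP can shard the model across them.

```python
# Hedged sketch: launch a QLoRA + FSDP fine-tuning job with the HuggingFace estimator.
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point="train_qlora_fsdp.py",          # hypothetical training script
    source_dir="scripts",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder role
    instance_type="ml.p4d.24xlarge",            # assumed 24xlarge GPU instance
    instance_count=1,
    transformers_version="4.36",
    pytorch_version="2.1",
    py_version="py310",
    distribution={"torch_distributed": {"enabled": True}},
    hyperparameters={"model_id": "meta-llama/Llama-2-7b-hf", "epochs": 1},  # placeholders
)
estimator.fit({"training": "s3://my-bucket/fine-tune/data"})  # placeholder S3 input
```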
The seeds of a machine learning (ML) paradigm shift have existed for decades, but with the ready availability of scalable compute capacity, a massive proliferation of data, and the rapid advancement of ML technologies, customers across industries are transforming their businesses.
Launching a machine learning (ML) training cluster with Amazon SageMaker training jobs is a seamless process that begins with a straightforward API call, AWS Command Line Interface (AWS CLI) command, or AWS SDK interaction. For this post, we demonstrate the SageMaker model parallelism (SMP) library on SageMaker training jobs.
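For illustration, a hedged sketch of that underlying CreateTrainingJob API call via boto3; the container image URI, role, and S3 paths are placeholders, and the same job can equally be launched with the AWS CLI or the SageMaker Python SDK.

```python
# Hedged sketch: launch a SageMaker training job directly through the API.
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

sm.create_training_job(
    TrainingJobName="demo-training-job-001",
    AlgorithmSpecification={
        "TrainingImage": "111122223333.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:latest",  # placeholder image
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder role
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/output"},       # placeholder bucket
    ResourceConfig={
        "InstanceType": "ml.p4d.24xlarge",
        "InstanceCount": 2,
        "VolumeSizeInGB": 200,
    },
    StoppingCondition={"MaxRuntimeInSeconds": 86400},
)
```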
Training an LLM is a compute-intensive and complex process, which is why Fastweb, as a first step in their AI journey, used AWS generative AI and machine learning (ML) services such as Amazon SageMaker HyperPod. The team opted for fine-tuning on AWS.
The compute clusters used in these scenarios are composed of thousands of AI accelerators such as GPUs or AWS Trainium and AWS Inferentia, custom machine learning (ML) chips designed by Amazon Web Services (AWS) to accelerate deep learning workloads in the cloud.
At its core, Ray offers a unified programming model that allows developers to seamlessly scale their applications from a single machine to a distributed cluster. Ray promotes the same coding patterns for both a simple machine learning (ML) experiment and a scalable, resilient production application.
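To illustrate that unified programming model, a minimal sketch: the same remote task runs unchanged on a laptop or, by pointing ray.init() at a cluster address, across a distributed cluster.

```python
# Minimal sketch of Ray's programming model: remote tasks scale from one machine to many.
import ray

ray.init()  # local; use ray.init(address="auto") when attached to a cluster

@ray.remote
def score_batch(batch):
    # stand-in for an ML inference or feature-engineering step
    return sum(batch) / len(batch)

futures = [score_batch.remote(list(range(i, i + 100))) for i in range(0, 1000, 100)]
print(ray.get(futures))
```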
Its scalability and load-balancing capabilities make it ideal for handling the variable workloads typical of machine learning (ML) applications. Amazon SageMaker provides capabilities to remove the undifferentiated heavy lifting of building and deploying ML models. This entire workflow is shown in the following solution diagram.
They use real-time data and machine learning (ML) to offer customized loans that fuel sustainable growth and solve the challenges of accessing capital. This approach combines the efficiency of machine learning with human judgment in the following way: The ML model processes and classifies transactions rapidly.
The rise of generative AI has significantly increased the complexity of building, training, and deploying machine learning (ML) models. It now demands deep expertise, access to vast datasets, and the management of extensive compute clusters.
Understanding these advantages, we partnered with AWS to embark on a journey to develop HxGN Alix, an AI-powered digital worker using AWS generative AI services. The AWS services include Amazon Elastic Kubernetes Service (Amazon EKS), which we used for compute and model deployment.
Amazon SageMaker HyperPod is purpose-built to accelerate foundation model (FM) training, removing the undifferentiated heavy lifting involved in managing and optimizing a large training compute cluster. In this solution, HyperPod cluster instances use the LDAPS protocol to connect to AWS Managed Microsoft AD via a Network Load Balancer (NLB).
Machine learning (ML) engineers have traditionally focused on striking a balance between model training and deployment cost vs. performance. This is important because training ML models and then using the trained models to make predictions (inference) can be highly energy-intensive tasks.
Running machine learning (ML) workloads with containers is becoming a common practice. What you get is an ML development environment that is consistent and portable, and scaling on a cluster becomes much easier. The final step is to run the ML task on Amazon ECS.
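A hedged sketch of that final step with boto3; the cluster name, task definition, and networking values are placeholders, and a GPU-backed EC2 launch type may be more appropriate than Fargate for heavier training workloads.

```python
# Hedged sketch: run a containerized ML task on Amazon ECS.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

response = ecs.run_task(
    cluster="ml-training-cluster",                # placeholder cluster
    taskDefinition="ml-training-task:1",          # placeholder task definition
    launchType="FARGATE",
    count=1,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-aaaa1111"],       # placeholder subnet
            "assignPublicIp": "ENABLED",
        }
    },
)
print(response["tasks"][0]["lastStatus"])
```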