Remove Clustering Remove Data Scientist Remove System Architecture
article thumbnail

Accelerate pre-training of Mistral’s Mathstral model with highly resilient clusters on Amazon SageMaker HyperPod

AWS Machine Learning Blog

The compute clusters used in these scenarios are composed of more than thousands of AI accelerators such as GPUs or AWS Trainium and AWS Inferentia , custom machine learning (ML) chips designed by Amazon Web Services (AWS) to accelerate deep learning workloads in the cloud.

article thumbnail

Real value, real time: Production AI with Amazon SageMaker and Tecton

AWS Machine Learning Blog

Orchestrate with Tecton-managed EMR clusters – After features are deployed, Tecton automatically creates the scheduling, provisioning, and orchestration needed for pipelines that can run on Amazon EMR compute engines. You can view and create EMR clusters directly through the SageMaker notebook.

ML 101
professionals

Sign Up for our Newsletter

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

article thumbnail

Ray jobs on Amazon SageMaker HyperPod: scalable and resilient distributed AI

AWS Machine Learning Blog

At its core, Ray offers a unified programming model that allows developers to seamlessly scale their applications from a single machine to a distributed cluster. A Ray cluster consists of a single head node and a number of connected worker nodes. Ray clusters and Kubernetes clusters pair well together.

article thumbnail

Multi-account support for Amazon SageMaker HyperPod task governance

AWS Machine Learning Blog

Organizations building or adopting generative AI use GPUs to run simulations, run inference (both for internal or external usage), build agentic workloads, and run data scientists’ experiments. The workloads range from ephemeral single-GPU experiments run by scientists to long multi-node continuous pre-training runs.

article thumbnail

Transforming financial analysis with CreditAI on Amazon Bedrock: Octus’s journey with AWS

AWS Machine Learning Blog

Solution overview The following figure illustrates our system architecture for CreditAI on AWS, with two key paths: the document ingestion and content extraction workflow, and the Q&A workflow for live user query response. He specializes in generative AI, machine learning, and system design.

AWS 114
article thumbnail

Ask HN: Who wants to be hired? (July 2025)

Hacker News

I originally wanted to program numerical libraries for such systems, but I ended up doing AI/ML instead. I have about 3 YoE training PyTorch models on HPC clusters and 1 YoE optimizing PyTorch models, including with custom CUDA kernels. I'm a data scientist, I'm looking for hard problems to solve.

Python 55
article thumbnail

Ask HN: Who is hiring? (July 2025)

Hacker News

But if your issue is suffered by many but you don't all cluster together in latitude and longitude then that issue has less weight. The current team is very high functioning (MD + data scientist combos, former ASF board member, Google and Amazon engineers, Stanford LLM researchers, etc.) Where you live means something.

Python 78