The classic compute billing record inherits tags from the cluster definition, while serverless adheres to Serverless Budget Policies (AWS | Azure | GCP). Case 2: Only one task runs on serverless. In this case, Budget Policy tags would also propagate to system tables for the serverless compute usage, while the classic compute billing record inherits tags from the cluster definition.
This article covers eight practical methods in BigQuery designed to do exactly that, from using AI-powered agents to serving ML models straight from a spreadsheet. Machine Learning in Your Spreadsheets: BQML training and prediction from a Google Sheet. Many data conversations start and end in a spreadsheet.
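As a rough illustration of the spreadsheet-to-BQML idea, here is a minimal sketch using the official BigQuery Python client to train and score a model over a Sheets-backed table. The dataset, table, and column names are hypothetical placeholders, not the article's actual setup.

```python
# Minimal sketch: train a BQML model over a (hypothetical) Sheets-backed table
# and score it, using the google-cloud-bigquery client.
from google.cloud import bigquery

client = bigquery.Client()

create_model_sql = """
CREATE OR REPLACE MODEL `my_dataset.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT * FROM `my_dataset.sheet_backed_table`
"""

# Run the training statement and wait for it to finish.
client.query(create_model_sql).result()

# Score rows with ML.PREDICT and pull results into a DataFrame.
predictions = client.query(
    "SELECT * FROM ML.PREDICT(MODEL `my_dataset.churn_model`, "
    "TABLE `my_dataset.sheet_backed_table`)"
).to_dataframe()
print(predictions.head())
```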
Amazon SageMaker supports geospatial machine learning (ML) capabilities, allowing data scientists and ML engineers to build, train, and deploy ML models using geospatial data. Identify areas of interest: We begin by illustrating how SageMaker can be applied to analyze geospatial data at a global scale.
Machine learning (ML) helps organizations increase revenue, drive business growth, and reduce costs by optimizing core business functions such as supply and demand forecasting, customer churn prediction, credit risk scoring, pricing, predicting late shipments, and many others. Prerequisites include a provisioned or serverless Amazon Redshift data warehouse.
Why We Built Databricks One: At Databricks, our mission is to democratize data and AI. For years, we’ve focused on helping technical teams—data engineers, scientists, and analysts—build pipelines, develop advanced models, and deliver insights at scale.
Businesses are under pressure to show return on investment (ROI) from AI use cases, whether predictive machine learning (ML) or generative AI. Only 54% of ML prototypes make it to production, and only 5% of generative AI use cases make it to production. Using SageMaker, you can build, train and deploy ML models.
The new IDE for Data Engineering in Lakeflow Declarative Pipelines (preview coming soon). We also announced the General Availability of Lakeflow, Databricks’ unified solution for data ingestion, transformation, and orchestration on the Data Intelligence Platform. And in the weeks since DAIS, we’ve kept the momentum going.
Built directly on Spark’s engine, MLlib pipelines leverage DataFrame APIs (not RDDs anymore), declarative transformations across nodes, automatic memory and partition management, and built-in model tuning and evaluation tools. All of it runs natively on distributed clusters. No wrappers. But that’s the price you pay for scale.
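For concreteness, a generic PySpark sketch of such a pipeline (not the article's code) might combine DataFrame-based feature assembly, a classifier, and the built-in cross-validation tuner; the data path and column names are assumptions.

```python
# Illustrative DataFrame-based MLlib pipeline with built-in tuning.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("mllib-pipeline").getOrCreate()
df = spark.read.parquet("s3://my-bucket/training-data/")  # hypothetical path

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

# Grid search with cross-validation, executed across the cluster.
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(labelCol="label"),
    numFolds=3,
)
model = cv.fit(df)
```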
For deployment, we containerized Open WebUI and orchestrated it on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster, using automatic scaling to dynamically adjust resources based on demand while maintaining high availability.
Data exploration and model development were conducted using well-known machine learning (ML) tools such as Jupyter or Apache Zeppelin notebooks. Apache Hive was used to provide a tabular interface to data stored in HDFS and to integrate with Apache Spark SQL. This fragmented tooling made it challenging for data scientists to become productive.
On a lightweight four-node cluster, the TTR and TTS analyses completed in 5 and 40 minutes respectively on the network described above (1,700 nodes)—all for under $10 in cloud spend. This highlights the solution’s impressive speed and cost-effectiveness.
If you are an aspiring data scientist, or a working professional looking to better understand this critical step in the ML lifecycle, a Machine Learning Course could provide you with the foundation and practical experience to avoid these problems. Why does model deployment matter in machine learning? Here’s how:
Prior to that, I spent a couple of years at First Orion - a smaller data company - helping found & build out a data engineering team as one of the first engineers. We were focused on building data pipelines and models to protect our users from malicious phone calls.
The next step is to use a SageMaker Studio terminal instance to connect to the MSK cluster and create the test stream topic. Prepare the test data, then define a Python function to write data to the topic.
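A minimal sketch of that "write data to the topic" step, assuming the kafka-python package and a known MSK bootstrap broker string; the broker address, topic name, and record shape are placeholders, not the post's actual values.

```python
# Send test records to an MSK (Kafka) topic using kafka-python.
import json
from kafka import KafkaProducer

BOOTSTRAP_SERVERS = "b-1.my-msk-cluster.amazonaws.com:9092"  # hypothetical broker
TOPIC = "test-stream-topic"

producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP_SERVERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def put_records(records):
    """Send each test record to the stream topic and flush."""
    for record in records:
        producer.send(TOPIC, value=record)
    producer.flush()

put_records([{"transaction_id": 1, "amount": 42.5}])
```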
As organizations increasingly deploy foundation models (FMs) and other machine learning (ML) models to production, they face challenges related to resource utilization, cost-efficiency, and maintaining high availability during updates. Li held data science roles in the financial and retail industries.
Good at Go and Kubernetes (understanding how to manage stateful services in a multi-cloud environment). We have a Python service in our Recommendation pipeline, so some ML/data science knowledge would be good. You must be independent and self-organized. I wonder if we can move away from representation based purely on where you live.
Still on my sabbatical and continuing to build on things I enjoy rather than things that pay (for now). Appreciate it! ML runs locally, no cloud, no uploads. It was incredibly useful to me, thanks! Really cool idea.
Machine learning (ML) is the technology that automates tasks and provides insights. It allows data scientists to build models that can automate specific tasks. It comes in many forms, with a range of tools and platforms designed to make working with ML more efficient, from libraries that run on a single machine to platforms that manage large clusters.
Amazon SageMaker enables enterprises to build, train, and deploy machine learning (ML) models. Amazon SageMaker JumpStart provides pre-trained models and data to help you get started with ML. Set up a MongoDB cluster To create a free tier MongoDB Atlas cluster, follow the instructions in Create a Cluster.
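Once the Atlas cluster exists, a quick connectivity check with pymongo could look like the sketch below; the connection string, database, and collection names are placeholders rather than the post's actual configuration.

```python
# Verify connectivity to a (hypothetical) MongoDB Atlas free-tier cluster.
from pymongo import MongoClient

uri = "mongodb+srv://<user>:<password>@cluster0.example.mongodb.net/"  # placeholder URI
client = MongoClient(uri)

db = client["demo_db"]
db["documents"].insert_one({"status": "connected"})
print(db["documents"].count_documents({}))
```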
With over 50 connectors, an intuitive Chat for data prep interface, and petabyte support, SageMaker Canvas provides a scalable, low-code/no-code (LCNC) ML solution for handling real-world, enterprise use cases. Organizations often struggle to extract meaningful insights and value from their ever-growing volume of data.
Amazon SageMaker Feature Store provides an end-to-end solution to automate feature engineering for machine learning (ML). For many ML use cases, raw data like log files, sensor readings, or transaction records needs to be transformed into meaningful features that are optimized for model training. Prerequisite: SageMaker Studio set up.
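As a hedged sketch of the ingestion side of Feature Store using the SageMaker Python SDK: the feature group name, DataFrame, S3 location, and IAM role below are assumptions for illustration, not the post's setup.

```python
# Create a feature group from a DataFrame and ingest records into it.
import sagemaker
import pandas as pd
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
df = pd.DataFrame(
    {"record_id": [1, 2], "event_time": [1700000000.0, 1700000060.0], "amount": [10.5, 20.0]}
)

fg = FeatureGroup(name="transactions-features", sagemaker_session=session)
fg.load_feature_definitions(data_frame=df)  # infer feature types from the DataFrame
fg.create(
    s3_uri="s3://my-bucket/feature-store/",                     # hypothetical offline store
    record_identifier_name="record_id",
    event_time_feature_name="event_time",
    role_arn="arn:aws:iam::123456789012:role/SageMakerRole",    # hypothetical role
    enable_online_store=True,
)
fg.ingest(data_frame=df, max_workers=2, wait=True)
```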
The onset of the pandemic triggered a rapid increase in the demand for and adoption of ML technology. Building an ML team: Following the surge in ML use cases that have the potential to transform business, leaders are making significant investments in ML collaboration, building teams that can deliver the promise of machine learning.
With the introduction of EMR Serverless support for Apache Livy endpoints , SageMaker Studio users can now seamlessly integrate their Jupyter notebooks running sparkmagic kernels with the powerful data processing capabilities of EMR Serverless. This same interface is also used for provisioning EMR clusters.
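For intuition about what the sparkmagic kernel does behind the scenes, here is a bare-bones sketch that talks to an Apache Livy endpoint over its REST API with `requests`; the endpoint URL is a placeholder, and the Studio integration itself handles this for you.

```python
# Start a Livy PySpark session and submit one statement over REST.
import time
import requests

LIVY_URL = "https://my-emr-serverless-livy-endpoint:8998"  # hypothetical endpoint

# Create a PySpark session.
session = requests.post(f"{LIVY_URL}/sessions", json={"kind": "pyspark"}).json()
session_id = session["id"]

# Wait for the session to become idle, then submit code.
while requests.get(f"{LIVY_URL}/sessions/{session_id}").json()["state"] != "idle":
    time.sleep(5)

stmt = requests.post(
    f"{LIVY_URL}/sessions/{session_id}/statements",
    json={"code": "spark.range(10).count()"},
).json()
print(stmt["id"])
```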
SageMaker geospatial capabilities make it straightforward for data scientists and machine learning (ML) engineers to build, train, and deploy models using geospatial data. Now, with the specialized geospatial container in SageMaker, managing and running clusters for geospatial processing has become more straightforward.
With a range of role types available, how do you find the perfect balance of Data Scientists, Data Engineers, and Data Analysts to include in your team? The most common data science languages are Python and R — SQL is also a must-have skill for acquiring and manipulating data.
Amazon SageMaker HyperPod is designed to support large-scale machine learning (ML) operations, providing a robust environment for training foundation models (FMs) over extended periods. This blog post specifically applies to HyperPod clusters using Slurm as the orchestrator.
By using these capabilities, businesses can efficiently store, manage, and analyze time-series data, enabling data-driven decisions and gaining a competitive edge. If you need an automated workflow or direct ML model integration into apps, Canvas forecasting functions are accessible through APIs.
Since 2018, our team has been developing a variety of ML models to enable betting products for NFL and NCAA football. Then we needed to Dockerize the application, write a deployment YAML file, deploy the gRPC server to our Kubernetes cluster, and make sure it’s reliable and auto-scalable. We recently developed four more models.
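To make the gRPC serving step concrete, here is a generic Python sketch of a gRPC server entry point (not the team's actual service); the generated stubs and servicer wiring are assumptions and are shown commented out.

```python
# Skeleton of a gRPC model-serving process suitable for containerization.
from concurrent import futures
import grpc

# import model_pb2_grpc  # stubs generated from the service's .proto (hypothetical)


def serve(port: int = 50051) -> None:
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    # model_pb2_grpc.add_PredictorServicer_to_server(PredictorServicer(), server)
    server.add_insecure_port(f"[::]:{port}")
    server.start()
    server.wait_for_termination()


if __name__ == "__main__":
    serve()
```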
You can run Spark applications interactively from Amazon SageMaker Studio by connecting SageMaker Studio notebooks and AWS Glue Interactive Sessions to run Spark jobs with a serverless cluster. With interactive sessions, you can choose Apache Spark or Ray to easily process large datasets, without worrying about cluster management.
Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes in Amazon SageMaker Studio. Starting today, you can connect to Amazon EMR Hive as a big data query engine to bring in large datasets for ML.
Botnet Detection at Scale — Lessons Learned from Clustering Billions of Web Attacks into Botnets. ML Governance: A Lean Approach. Ryan Dawson | Principal Data Engineer | Thoughtworks; Meissane Chami | Senior ML Engineer | Thoughtworks. During this session, you’ll discuss the day-to-day realities of ML Governance.
This post, part of the Governing the ML lifecycle at scale series ( Part 1 , Part 2 , Part 3 ), explains how to set up and govern a multi-account ML platform that addresses these challenges. An enterprise might have the following roles involved in the ML lifecycles. This ML platform provides several key benefits.
You can quickly launch the familiar RStudio IDE and dial the underlying compute resources up and down without interrupting your work, making it easy to build machine learning (ML) and analytics solutions in R at scale. Data scientists and data engineers use Apache Spark, Hive, and Presto running on Amazon EMR for large-scale data processing.
We are excited to announce the launch of Amazon DocumentDB (with MongoDB compatibility) integration with Amazon SageMaker Canvas , allowing Amazon DocumentDB customers to build and use generative AI and machine learning (ML) solutions without writing code. On the Import data page, for Data Source , choose DocumentDB and Add Connection.
Botnet Detection at Scale — Lessons Learned From Clustering Billions of Web Attacks Into Botnets. Editor’s note: Ori Nakar is a speaker for ODSC Europe this June. Be sure to check out his talk, “Botnet detection at scale — Lesson learned from clustering billions of web attacks into botnets,” there!
Data engineering is a rapidly growing field that designs and develops systems to process and manage large amounts of data. There are various architectural design patterns in data engineering that are used to solve different data-related problems.
Amazon Lookout for Metrics is a fully managed service that uses machine learning (ML) to detect anomalies in virtually any time-series business or operational metrics—such as revenue performance, purchase transactions, and customer acquisition and retention rates—with no ML experience required. To learn more, see the documentation.
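A hedged boto3 sketch of creating a detector is shown below; the detector name and frequency are illustrative, and a complete setup also requires a metric set pointing at your data source, which is omitted here.

```python
# Create a Lookout for Metrics anomaly detector that runs hourly.
import boto3

lookout = boto3.client("lookoutmetrics", region_name="us-east-1")

response = lookout.create_anomaly_detector(
    AnomalyDetectorName="revenue-anomaly-detector",   # hypothetical name
    AnomalyDetectorDescription="Detect anomalies in hourly revenue metrics",
    AnomalyDetectorConfig={"AnomalyDetectorFrequency": "PT1H"},
)
print(response["AnomalyDetectorArn"])
```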
The ZMP analyzes billions of structured and unstructured data points to predict consumer intent by using sophisticated artificial intelligence (AI) to personalize experiences at scale. Hosted on Amazon ECS with tasks run on Fargate, this platform streamlines the end-to-end ML workflow, from data ingestion to model deployment.
Alignment to other tools in the organization’s tech stack: Consider how well the MLOps tool integrates with your existing tools and workflows, such as data sources, data engineering platforms, code repositories, CI/CD pipelines, monitoring systems, etc., as well as Pandas or Apache Spark DataFrames.
Cloud Computing, APIs, and Data Engineering: NLP experts don’t go straight into conducting sentiment analysis on their personal laptops. TensorFlow is desired for its flexibility for ML and neural networks, PyTorch for its ease of use and innate design for NLP, and scikit-learn for classification and clustering.
For any machine learning (ML) problem, the data scientist begins by working with data. This includes gathering, exploring, and understanding the business and technical aspects of the data, along with evaluation of any manipulations that may be needed for the model building process.
Organizations can search for PII using methods such as keyword searches, pattern matching, data loss prevention tools, machine learning (ML), metadata analysis, data classification software, optical character recognition (OCR), document fingerprinting, and encryption.
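To make the pattern-matching approach concrete, here is a minimal Python example using simple regular expressions for a few common PII formats; real scanners use far more robust detection, and the patterns below are illustrative only.

```python
# Toy PII scanner: regex pattern matching for a handful of common formats.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card_number": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}

def find_pii(text: str) -> dict:
    """Return every match for each PII pattern found in the text."""
    return {name: pattern.findall(text) for name, pattern in PII_PATTERNS.items()}

print(find_pii("Contact jane.doe@example.com, SSN 123-45-6789."))
```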