In this blog, we will explore the top 7 LLM, data science, and AI blogs of 2024 that have been instrumental in disseminating detailed and updated information in these dynamic fields. These blogs stand out as they make deep, complex topics easy to understand for a broader audience.
Data is the lifeblood of modern decision-making, and AI systems rely heavily on it. However, the quality and ethical implications of this data are paramount. The Importance of Ethical Data Preparation: Ethical data preparation is fundamental to the success of AI systems, and one of the most significant concerns is bias.
Snowflake excels in efficient data storage and governance, while Dataiku provides the tooling to operationalize advanced analytics and machine learning models. Together they create a powerful, flexible, and scalable foundation for modern data applications. One of the standout features of Dataiku is its focus on collaboration.
The workflow adapts automatically to any CSV structure, allowing you to quickly assess multiple datasets and prioritize your data preparation efforts. Next Steps: 1. Email Integration: Add a Send Email node to automatically deliver reports to stakeholders by connecting it after the HTML node.
Data preparation tools: Libraries such as Pandas, Scikit-learn pipelines, and Spark MLlib simplify data cleaning and transformation tasks. AutoML frameworks: Tools like Google AutoML and H2O.ai include automated feature engineering as part of their machine learning pipelines.
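As a quick illustration of the kind of cleanup these libraries handle, here is a minimal scikit-learn preprocessing sketch; the column names and values are made up for the example, not taken from any of the posts above.

# Minimal sketch: a scikit-learn preprocessing pipeline for mixed numeric/categorical data.
# Column names and values are hypothetical examples, not from the referenced posts.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [34, np.nan, 29, 51],
    "income": [48000, 52000, np.nan, 61000],
    "segment": ["a", "b", "a", np.nan],
})

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # fill missing numbers with the median
    ("scale", StandardScaler()),                    # standardize to zero mean, unit variance
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

prep = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["segment"]),
])

X = prep.fit_transform(df)
print(X.shape)  # cleaned, scaled, and encoded feature matrix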
One of the most effective methods to perform ANN search is to use KD-Trees (K-Dimensional Trees). KD-Trees are a type of binary search tree that partitions data points in k-dimensional space, allowing for efficient querying of nearest neighbors. We will start by setting up libraries and data preparation.
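For readers who want to see the idea in code before reading the full tutorial, here is a minimal nearest-neighbor query with a KD-Tree using scikit-learn; the random points are a placeholder for whatever dataset the post prepares.

# Minimal sketch: nearest-neighbor search with a KD-Tree (placeholder random data).
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
points = rng.random((10_000, 8))        # 10k points in 8-dimensional space
tree = KDTree(points, leaf_size=40)     # build the k-d tree once

query = rng.random((1, 8))
dist, idx = tree.query(query, k=5)      # 5 nearest neighbors of the query point
print(idx[0], dist[0])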
By creating microsegments, businesses can be alerted to surprises, such as sudden deviations or emerging trends, empowering them to respond proactively and make data-driven decisions. Choose Segment Column. Explanation: Segmenting column data prepares the system to generate SQL queries for distinct values.
This session covers the technical process, from data preparation to model customization techniques, training strategies, deployment considerations, and post-customization evaluation. Explore how this powerful tool streamlines the entire ML lifecycle, from data preparation to model deployment.
What is MLRun? It automates data preparation, model tuning, customization, validation, and optimization of LLMs, ML models, and live AI applications over elastic resources. Read the blog for more details, or go straight to the blueprint to try it out for yourself.
We discuss the important components of fine-tuning, including use case definition, data preparation, model customization, and performance evaluation. This post dives deep into key aspects such as hyperparameter optimization, data cleaning techniques, and the effectiveness of fine-tuning compared to base models.
Have an S3 bucket to store your data prepared for batch inference. Have an AWS Identity and Access Management (IAM) role for batch inference with a trust policy and Amazon S3 access (read access to the folder containing input data and write access to the folder storing output data).
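The role setup can be scripted. The sketch below is one way to create such a role with boto3; the role name, bucket, and prefixes are placeholders, and the bedrock.amazonaws.com trust principal and exact permissions should be checked against the current Bedrock batch inference documentation.

# Hypothetical sketch: create an IAM role that Bedrock batch inference can assume.
# Role name, bucket, and prefixes are placeholders; verify the trust principal and
# required permissions against the Bedrock documentation before use.
import json
import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "bedrock.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

role = iam.create_role(
    RoleName="bedrock-batch-inference-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

s3_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:GetObject", "s3:ListBucket"],
         "Resource": ["arn:aws:s3:::my-batch-bucket", "arn:aws:s3:::my-batch-bucket/input/*"]},
        {"Effect": "Allow", "Action": ["s3:PutObject"],
         "Resource": ["arn:aws:s3:::my-batch-bucket/output/*"]},
    ],
}

iam.put_role_policy(
    RoleName="bedrock-batch-inference-role",
    PolicyName="bedrock-batch-s3-access",
    PolicyDocument=json.dumps(s3_policy),
)
print(role["Role"]["Arn"])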
Data preparation: For this example, you will use the open source South German Credit dataset. After you have completed the data preparation step, it's time to train the classification model. An experiment collects multiple runs with the same objective.
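As a rough illustration of that train-after-prep step (a generic scikit-learn sketch, not the exact SageMaker workflow from the post), the snippet below fits a simple classifier; the CSV path and the target column name are assumptions about the prepared dataset's layout.

# Illustrative sketch only: train a simple credit-risk classifier after data preparation.
# The file path and the "credit_risk" target column name are assumptions about the dataset layout.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("south_german_credit_prepared.csv")   # output of the data preparation step
X = df.drop(columns=["credit_risk"])
y = df["credit_risk"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"test AUC: {auc:.3f}")   # in the post, each such run would be tracked under one experiment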
For this walkthrough, we use a straightforward generative AI lifecycle involving data preparation, fine-tuning, and a deployment of Meta's Llama-3-8B LLM. Data preparation: In this phase, prepare the training and test data for the LLM. We use the SageMaker Core SDK to execute all the steps.
In this blog post, we showcase how you can perform efficient supervised fine-tuning for a Meta Llama 3 model using PEFT on AWS Trainium with SageMaker HyperPod. Fine-tuning: Now that your SageMaker HyperPod cluster is deployed, you can start preparing to execute your fine-tuning job.
Several activities are performed in this phase, such as creating the model, data preparation, model training, evaluation, and model registration. Model lineage tracking captures and retains information about the stages of an ML workflow, from data preparation and training to model registration and deployment.
Amazon SageMaker is a comprehensive, fully managed machine learning (ML) platform that revolutionizes the entire ML workflow. It offers an unparalleled suite of tools that cater to every stage of the ML lifecycle, from data preparation to model deployment and monitoring.
Conventional ML development cycles take weeks to many months and require data science understanding and ML development skills that are in short supply. Business analysts' ideas to use ML models often sit in prolonged backlogs because of data engineering and data science teams' bandwidth and data preparation activities.
Preparing your data: Effective data preparation is crucial for successful distillation of agent function calling capabilities. Amazon Bedrock provides two primary methods for preparing your training data: uploading JSONL files to Amazon S3 or using historical invocation logs.
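For the S3 route, the general shape of the work is writing JSONL records and uploading them. The sketch below uses a generic prompt/completion record layout as a placeholder, since the exact schema expected by Bedrock distillation should be taken from the documentation; the bucket and key names are assumptions as well.

# Sketch: write training records as JSONL and upload to S3 for a distillation job.
# The record fields are placeholders; check the Bedrock distillation docs for the required schema.
import json
import boto3

records = [
    {"prompt": "List open invoices for customer 42", "completion": "get_invoices(customer_id=42, status='open')"},
    {"prompt": "What is the refund policy?", "completion": "search_kb(query='refund policy')"},
]

with open("train.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")   # one JSON object per line

s3 = boto3.client("s3")
s3.upload_file("train.jsonl", "my-distillation-bucket", "input/train.jsonl")  # placeholder bucket/key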
Organizations need a unified, streamlined approach that simplifies the entire process from data preparation to model deployment. To address these challenges, AWS has expanded Amazon SageMaker with a comprehensive set of data, analytics, and generative AI capabilities.
This minimizes the complexity and overhead associated with moving data between cloud environments, enabling organizations to access and utilize their disparate data assets for ML projects. You can use SageMaker Canvas to build the initial data preparation routine and generate accurate predictions without writing code.
Start a distillation job with S3 JSONL data using an API: To use an API to start a distillation job using training data stored in an S3 bucket, follow these steps. First, create and configure an Amazon Bedrock client:

import boto3
from datetime import datetime

bedrock_client = boto3.client(service_name="bedrock")
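The next step in the post is submitting the job itself. As a hedged sketch, the call would look roughly like the following; the model identifiers, role ARN, S3 URIs, and the distillation-specific configuration are assumptions to verify against the Bedrock model customization API reference.

# Rough sketch of submitting the distillation job with the client created above.
# Model IDs, ARNs, and S3 URIs are placeholders; a customizationConfig block naming the
# teacher model is also required for distillation and should be taken from the API reference.
job_name = f"agent-distillation-{datetime.now().strftime('%Y%m%d%H%M%S')}"

response = bedrock_client.create_model_customization_job(
    jobName=job_name,
    customModelName="distilled-agent-model",
    roleArn="arn:aws:iam::111122223333:role/bedrock-distillation-role",
    baseModelIdentifier="placeholder-student-model-id",
    customizationType="DISTILLATION",
    trainingDataConfig={"s3Uri": "s3://my-distillation-bucket/input/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-distillation-bucket/output/"},
)
print(response["jobArn"])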
Recommended Learning Resources: The Illustrated Transformer (Blog & Visual Guide), a must-read visual explanation of transformer models. It covers the entire process, from data preparation to model training and evaluation, enabling viewers to adapt LLMs for specific tasks or domains.
This blog post breaks down top data visualization interview questions into two categories: Beginner and Advanced. Whether you’re just starting or looking to step into a more senior role, these examples and expert answers will help you prepare and impress. The approach depends on the context and the amount of missing data.
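To make that last point concrete, here is a small pandas illustration of the usual trade-off between dropping and imputing missing values; the toy data is invented for the example.

# Toy example: two common ways to handle missing values, chosen by how much data is missing.
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "south", "east", "west", "north"],
    "sales": [120.0, None, 95.0, 110.0, None],
    "returns": [3, 1, None, 2, 4],
})

share_missing = df.isna().mean()          # fraction of missing values per column
print(share_missing)

# Few missing values: dropping the affected rows is often acceptable.
dropped = df.dropna(subset=["sales"])

# More missing values: imputing (here with the median) preserves the rows for analysis.
imputed = df.fillna({"sales": df["sales"].median(), "returns": df["returns"].median()})
print(dropped.shape, imputed.shape)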
Best practices for data preparation: The quality and structure of your training data fundamentally determine the success of fine-tuning. Our experiments revealed several critical insights for preparing effective multimodal datasets. Data structure: You should use a single image per example rather than multiple images.
In this blog post, we spotlight a leading player in the gen AI infrastructure ecosystem, NVIDIA, commonly known for the GPUs, software, and research that have helped drive gen AI implementation and research. We introduce their new solution for model deployment, NVIDIA NIM.
In our previous blog posts, we explored various techniques such as fine-tuning large language models (LLMs), prompt engineering, and Retrieval Augmented Generation (RAG) using Amazon Bedrock to generate impressions from the findings section in radiology reports using generative AI. Part 1 focused on model fine-tuning.
The team opted for fine-tuning on AWS. This strategic decision was driven by several factors. Efficient data preparation: Building a high-quality pre-training dataset is a complex task, involving assembling and preprocessing text data from various sources, including web sources and partner companies.
SageMaker Studio is an IDE that offers a web-based visual interface for performing the ML development steps, from data preparation to model building, training, and deployment. In this section, we cover how to discover these models in SageMaker Studio.
To address potential fairness concerns, it can be helpful to evaluate disparities and imbalances in training data or outcomes. Amazon SageMaker Clarify helps identify potential biases during data preparation without requiring code.
MLRun automates key processes such as data preparation, model tuning, customization, validation, and optimization for ML models, LLMs, and live AI applications across scalable, elastic infrastructure.
In this piece, we explore practical ways to define data standards, ethically scrape and clean your datasets, and cut out the noise, whether you're pretraining from scratch or fine-tuning a base model. If you're working on LLMs, this is one of those foundations that's easy to overlook but hard to ignore. 👉 Read the post here!
Before LLMs for text-to-SQL, user queries had to be preprocessed to match specific templates, which were then used to rephrase the queries. This approach was use case-specific and required data preparation and manual work. We would like to acknowledge Thomaz Silva and Saeed Elnaj for their contributions to this blog.
This approach enables centralized access and sharing while minimizing extract, transform and load (ETL) processes and data duplication. Integrated vectorized embedding capabilities streamline data preparation for various applications such as retrieval augmented generation (RAG) and other machine learning and generative AI use cases.
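At its core, the retrieval step those embeddings enable is a similarity search over stored vectors. The tiny numpy sketch below shows the idea in a generic form, with random vectors standing in for real embeddings and for whatever vector store the platform provides.

# Generic sketch of the retrieval step in RAG: cosine similarity over stored embedding vectors.
# The vectors are random placeholders; a real system would use an embedding model and a vector store.
import numpy as np

rng = np.random.default_rng(1)
doc_vectors = rng.normal(size=(1000, 384))                        # pretend document embeddings
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)

query = rng.normal(size=384)
query /= np.linalg.norm(query)

scores = doc_vectors @ query                # cosine similarity because everything is unit length
top_k = np.argsort(scores)[::-1][:5]        # indices of the 5 most similar documents
print(top_k, scores[top_k])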
Sigma offers powerful mapping capabilities that allow users to visualize geographic data effectively. Whether you’re analyzing regional trends, plotting locations, or visualizing complex geographical data, Sigma Maps can help you gain valuable insights. In this blog, we will cover how to use maps in Sigma Computing.
The architecture incorporates best practices in MLOps, making sure that the different stages of the ML lifecycle, from data preparation to production deployment, are optimized for performance and reliability. This new design accelerates model development and deployment, so Radial can respond faster to evolving fraud detection challenges.
Traditionally, answering this question would involve multiple data exports, complex extract, transform, and load (ETL) processes, and careful data synchronization across systems. Solution walkthrough (Scenario 1): The first step focuses on preparing the data for each data source for unified access.
The following sections further explain the main components of the solution: ETL pipelines to transform the log data, agentic RAG implementation, and the chat application. Creating ETL pipelines to transform log data: Preparing your data to provide quality results is the first step in an AI project.
Teaching Language Models Complex New Verbs: Fine-tuning large language models (LLMs) has become the default method for tailoring AI systems to specific tasks, yet it often comes with significant drawbacks: high computational costs, brittleness from overfitting, catastrophic forgetting, and substantial data preparation hurdles.
Allen Downey, PhD, Principal Data Scientist at PyMC Labs. Allen is the author of several books, including Think Python, Think Bayes, and Probably Overthinking It, and a blog about data science and Bayesian statistics. A prolific educator, Julien shares his knowledge through code demos, blogs, and YouTube, making complex AI accessible.
Businesses need to understand the trends in data preparation to adapt and succeed. If you input poor-quality data into an AI system, the results will be poor. This principle highlights the need for careful data preparation, ensuring that the input data is accurate, consistent, and relevant.
Data preparation is a crucial step in any machine learning (ML) workflow, yet it often involves tedious and time-consuming tasks. Amazon SageMaker Canvas now supports comprehensive data preparation capabilities powered by Amazon SageMaker Data Wrangler. Within the data flow, add an Amazon S3 destination node.
Amazon SageMaker Data Wrangler provides a visual interface to streamline and accelerate data preparation for machine learning (ML), which is often the most time-consuming and tedious task in ML projects. Charles holds an MS in Supply Chain Management and a PhD in Data Science. Huong Nguyen is a Sr.