Go vs. Python for Modern Data Workflows: Need Help Deciding?
Graceful External Termination: Handling Pod Deletions in Kubernetes Data Ingestion and Streaming Jobs When running big data pipelines in Kubernetes, especially streaming jobs, it's easy to overlook how these jobs deal with termination. If not handled correctly, this can lead to locks, data issues, and a negative user experience.
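A minimal sketch of the idea in Python, assuming a generic consumer loop (the stream client and batch logic are placeholders): trap the SIGTERM that Kubernetes sends before deleting the pod, finish the in-flight batch, commit progress, and exit cleanly within the grace period.

```python
import signal
import sys
import time

shutdown_requested = False

def handle_sigterm(signum, frame):
    # Kubernetes sends SIGTERM before deleting the pod; flag a graceful stop.
    global shutdown_requested
    shutdown_requested = True

signal.signal(signal.SIGTERM, handle_sigterm)

def consume_batch():
    # Placeholder for reading one batch from the stream (Kafka, Kinesis, etc.).
    time.sleep(1)
    return ["record"]

def commit_offsets():
    # Placeholder for committing progress so records are neither lost nor reprocessed.
    pass

while not shutdown_requested:
    batch = consume_batch()
    # ... process the batch ...
    commit_offsets()

# Drain: finish in-flight work and release any locks before exiting.
commit_offsets()
sys.exit(0)
```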
The solution offers two TM retrieval modes for users to choose from: vector and document search. When using the Amazon OpenSearch Service adapter (document search), translation unit groupings are parsed and stored into an index dedicated to the uploaded file. This is covered in detail later in the post.
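As a rough illustration of the document-search mode, a parsed translation unit could be written to a per-file index with the opensearch-py client. The host, index name, and field names below are assumptions for the sketch, not the solution's actual schema (which targets Amazon OpenSearch Service).

```python
from opensearchpy import OpenSearch

# Hypothetical connection details; the real solution uses Amazon OpenSearch Service.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

index_name = "tm-uploaded-file-123"  # one index per uploaded file (assumed naming)

# A parsed translation unit grouping (source/target segments are illustrative).
doc = {
    "source_text": "Hello, world",
    "target_text": "Hola, mundo",
    "source_lang": "en",
    "target_lang": "es",
}

client.indices.create(index=index_name, ignore=400)  # ignore "already exists"
client.index(index=index_name, body=doc)
```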
However, they can't generalize well to enterprise-specific questions because, to generate an answer, they rely on the public data they were exposed to during pre-training. At the same time, the popular RAG design pattern with semantic search can't answer every type of question that can be asked of a set of documents.
Better documentation with more examples, clearer explanations of the choices and tools, and a more modern look and feel. Find the latest at [link] (the old documentation will redirect here shortly). This better reflects the common Python practice of having your top-level module be the project name. CCDS tests: V1 has no tests.
As today's world keeps progressing toward data-driven decisions, organizations must have quality data created by efficient and effective data pipelines. For Snowflake customers, Snowpark is a powerful tool for building these effective and scalable data pipelines.
Provide connectors for data sources: Orchestration frameworks typically provide connectors for a variety of data sources, such as databases, cloud storage, and APIs. This makes it easy to connect your data pipeline to the data sources that you need. It is known for its ease of use and flexibility.
Use case In this example of an insurance assistance chatbot, the customer's generative AI application is designed with Amazon Bedrock Agents to automate tasks related to the processing of insurance claims and Amazon Bedrock Knowledge Bases to provide relevant documents. getOutstandingPaperwork: What are the missing documents from {{claim}}?
Let's say the task at hand is to predict the root cause categories (Customer Education, Feature Request, Software Defect, Documentation Improvement, Security Awareness, and Billing Inquiry) for customer support cases. We suggest consulting LLM prompt engineering documentation, such as Anthropic's prompt engineering guidance, when running these experiments.
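A hedged sketch of such a classification prompt using the Anthropic Python SDK; the model id, prompt wording, and example case are illustrative, not the post's actual implementation.

```python
import anthropic

# Hypothetical client setup; reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

CATEGORIES = [
    "Customer Education", "Feature Request", "Software Defect",
    "Documentation Improvement", "Security Awareness", "Billing Inquiry",
]

case_text = "Customer reports the export button returns a 500 error."

prompt = (
    "Classify the following support case into exactly one of these root cause "
    f"categories: {', '.join(CATEGORIES)}.\n\n"
    f"Case: {case_text}\n\n"
    "Respond with only the category name."
)

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative model id
    max_tokens=20,
    messages=[{"role": "user", "content": prompt}],
)
print(response.content[0].text)
```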
Automate and streamline our ML inference pipeline with SageMaker and Airflow Building an inference data pipeline on large datasets is a challenge many companies face. For example, a company may enrich documents in bulk to translate them, identify entities, and categorize those documents.
Photo by AltumCode on Unsplash As a data scientist, I used to struggle with experiments involving the training and fine-tuning of large deep-learning models. It facilitates the creation of various data pipelines, including tasks such as data transformation, model training, and the storage of all pipeline outputs.
You can easily store and process data using S3 and Redshift, create data pipelines with AWS Glue, deploy models through API Gateway, monitor performance with CloudWatch, and manage access control with IAM. This integrated ecosystem makes it easier to build end-to-end machine learning solutions.
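For a sense of how a few of these pieces are reached programmatically, here is a small Boto3 sketch; the bucket, file, and metric names are placeholders.

```python
import boto3

# Store data in S3 (bucket and key are placeholders).
s3 = boto3.client("s3")
s3.upload_file("features.csv", "my-ml-bucket", "raw/features.csv")

# Inspect existing AWS Glue jobs used for the data pipeline.
glue = boto3.client("glue")
for job in glue.get_jobs()["Jobs"]:
    print(job["Name"])

# Publish a custom training metric to CloudWatch for monitoring.
cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_data(
    Namespace="MLPipeline",
    MetricData=[{"MetricName": "TrainingAUC", "Value": 0.91}],
)
```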
Image Source: Pixel Production Inc In the previous article, you were introduced to the intricacies of data pipelines, including the two major types of existing data pipelines. You might be curious how a simple tool like Apache Airflow can be powerful for managing complex data pipelines.
Summary: Data engineering tools streamline data collection, storage, and processing. Tools like Python, SQL, Apache Spark, and Snowflake help engineers automate workflows and improve efficiency. Learning these tools is crucial for building scalable data pipelines.
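As a minimal illustration of the kind of workflow these tools automate, here is a PySpark sketch that aggregates raw CSV data into a Parquet table; the paths and column names are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-orders").getOrCreate()

# Read raw CSV data (path is a placeholder), aggregate, and write out Parquet.
orders = spark.read.csv("data/orders.csv", header=True, inferSchema=True)
daily_revenue = (
    orders.groupBy("order_date")
          .agg(F.sum("amount").alias("revenue"))
)
daily_revenue.write.mode("overwrite").parquet("data/daily_revenue")
```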
This personalized document helps the customer gain a deeper understanding of the vehicle and supports their decision-making process. In the application layer, the GUI for the solution is created using Streamlit in Python.
The following are sample user queries: Write a Python function to validate email address syntax. For building and designing software applications, you will use the existing knowledge base on the AWS Well-Architected Framework to generate a response with the most relevant design principles and links to related documents.
Visualizing the data is important because it reveals hidden insights and patterns in the dataset that we might otherwise miss. These insights are typically presented with BI tools (e.g., Power BI, Tableau) and programming languages like R and Python in the form of bar graphs, scatter and line plots, histograms, and much more.
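A small matplotlib sketch of the point, using synthetic data, showing how a histogram and a scatter plot surface patterns that raw tables hide.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data purely for illustration.
rng = np.random.default_rng(42)
values = rng.normal(loc=50, scale=10, size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram reveals the distribution and potential skew or outliers.
ax1.hist(values, bins=30)
ax1.set_title("Distribution of values")

# Scatter plot against an index can expose trends or drift over time.
ax2.scatter(range(len(values)), values, s=5)
ax2.set_title("Values over observation index")

plt.tight_layout()
plt.show()
```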
These tools will help make your initial data exploration process easy. ydata-profiling GitHub | Website The primary goal of ydata-profiling is to provide a one-line Exploratory Data Analysis (EDA) experience in a consistent and fast solution. The output is a fully self-contained HTML application.
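In practice the one-line EDA looks roughly like this; the CSV path is a placeholder.

```python
import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv("data.csv")  # placeholder dataset

# One call generates a full EDA report: types, missing values, correlations, etc.
profile = ProfileReport(df, title="Exploratory Data Analysis")

# The output is a self-contained HTML file you can open in any browser.
profile.to_file("eda_report.html")
```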
The agent knowledge base stores Amazon Bedrock service documentation, while the cache knowledge base contains curated and verified question-answer pairs. This setup uses the AWS SDK for Python (Boto3) to interact with AWS services. He specializes in designing, building, and optimizing large-scale data solutions.
For example, if your team is proficient in Python and R, you may want an MLOps tool that supports open data formats like Parquet, JSON, CSV, etc. User support arrangements: Consider the availability and quality of support from the provider or vendor, including documentation, tutorials, forums, customer service, etc.
As AI and data engineering continue to evolve at an unprecedented pace, the challenge isn't just building advanced models; it's integrating them efficiently, securely, and at scale. This session explores open-source tools and techniques for transforming unstructured documents into structured formats like JSON and Markdown.
Intuitive Workflow Design Workflows should be easy to follow and visually organized, much like clean, well-structured SQL or Python code. Comments and Notes: Documenting for Future You (or Someone Else) Good documentation makes life easier, not just for you but for anyone who might need to pick up your work later.
Snowflake AI Data Cloud is one of the most powerful platforms, including storage services that support complex data. Integrating Snowflake with dbt adds another layer of automation and control to the data pipeline. Snowflake stored procedures and dbt hooks are essential to modern data engineering and analytics workflows.
Right now, most deep learning frameworks are built for Python, but this neglects the large number of Java developers, and those with existing Java code bases, who want to integrate the increasingly powerful capabilities of deep learning. When we did our research online, the Deep Java Library showed up at the top. With v0.21.0
Build a Stock Price Prediction App powered by Snowflake, AWS, Python and Streamlit — Part 2 of 3 A comprehensive guide to developing machine learning applications from start to finish. Introduction Welcome back! Let's continue our Data Science journey to create the Stock Price Prediction web application.
Key Takeaways Big Data focuses on collecting, storing, and managing massive datasets. Data Science extracts insights and builds predictive models from processed data. Big Data technologies include Hadoop, Spark, and NoSQL databases. Data Science uses Python, R, and machine learning frameworks.
Putting the T for Transformation in ELT (ETL) is essential to any data pipeline. After extracting and loading your data into the Snowflake AI Data Cloud, you may wonder how best to transform it. Luckily, Snowflake answers this question with many features designed to transform your data for all your analytic use cases.
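One way the T can look with Snowpark for Python, transforming a table that has already been loaded into Snowflake; the connection parameters and table/column names are placeholders assumed for the sketch.

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, upper

# Connection parameters are placeholders; fill in your own account details.
session = Session.builder.configs({
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}).create()

# Transform data already loaded into a raw table (ELT: transform after load).
raw = session.table("RAW_ORDERS")
transformed = (
    raw.filter(col("ORDER_STATUS").is_not_null())
       .with_column("CUSTOMER_NAME", upper(col("CUSTOMER_NAME")))
)

# Persist the transformed result as an analytics-ready table.
transformed.write.save_as_table("ANALYTICS_ORDERS", mode="overwrite")
```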
Cortex Search: This feature provides a search solution that Snowflake fully manages, covering data ingestion, embedding, retrieval, reranking, and generation. Use cases for this feature include needle-in-a-haystack lookups and multi-document synthesis and reasoning. Furthermore, Snowflake Notebooks can also be run on a schedule.
Airflow for workflow orchestration Airflow schedules and manages complex workflows, defining tasks and dependencies in Python code. An example directed acyclic graph (DAG) might automate data ingestion, processing, model training, and deployment tasks, ensuring that each step runs in the correct order and at the right time.
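A minimal TaskFlow-style sketch of such a DAG, with placeholder task bodies and hypothetical S3 paths standing in for real ingestion, processing, training, and deployment logic.

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def ml_pipeline():
    @task
    def ingest():
        # Placeholder: pull raw data from its source.
        return "s3://bucket/raw/latest/"  # hypothetical path

    @task
    def process(raw_path: str):
        # Placeholder: clean and feature-engineer the raw data.
        return "s3://bucket/features/latest/"

    @task
    def train(feature_path: str):
        # Placeholder: train the model on the processed features.
        return "s3://bucket/models/latest/"

    @task
    def deploy(model_path: str):
        # Placeholder: push the trained model to the serving environment.
        print(f"Deploying {model_path}")

    # Dependencies are inferred from the data passed between tasks.
    deploy(train(process(ingest())))

ml_pipeline()
```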
Version control systems (VCS) play a key role in this area by offering a structured method to track changes made to models and handle versions of data and code used in these ML projects. Data Versioning In ML projects, keeping track of datasets is very important because the data can change over time.
Implementing proper version control in ML pipelines is essential for efficient management of code, data, and models by ensuring reproducibility and collaboration. Reproducibility means that experiments can be reliably re-run by tracking changes in code, data, and model hyperparameters.
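As a rough illustration of the idea (not any particular tool's API), a training run can record the exact code revision, a dataset fingerprint, and the hyperparameters so the experiment can be reproduced later.

```python
import hashlib
import json
import subprocess

def dataset_fingerprint(path: str) -> str:
    # Hash the raw bytes of the dataset so any change produces a new version id.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def current_git_commit() -> str:
    # Record the exact code revision used for this experiment.
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

run_manifest = {
    "code_commit": current_git_commit(),
    "data_sha256": dataset_fingerprint("train.csv"),  # placeholder dataset
    "hyperparameters": {"learning_rate": 0.01, "max_depth": 6},
}

# Persist the manifest next to the model artifact so the run is reproducible.
with open("run_manifest.json", "w") as f:
    json.dump(run_manifest, f, indent=2)
```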
In addition, MLOps practices like data pipelines, experiment tracking, versioning, artifact management, and others also need to be part of the GenAI productization process. For example, when indexing a new version of a document, it's important to take care of versioning in the ML pipeline. This helps cleanse the data.
Going Beyond with Keras Core The Power of Keras Core: Expanding Your Deep Learning Horizons Show Me Some Code JAX Harnessing model.fit() Imports and Setup Data Pipeline Build a Custom Model Build the Image Classification Model Train the Model Evaluation Summary References Citation Information What Is Keras Core? Enter Keras Core!
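A minimal sketch of the pattern the tutorial walks through: select the JAX backend before importing Keras, then use the familiar compile/fit workflow. The toy data and architecture here are placeholders, not the tutorial's actual model.

```python
import os
os.environ["KERAS_BACKEND"] = "jax"  # must be set before importing Keras

import numpy as np
import keras

# Toy image-classification data standing in for the tutorial's dataset.
x_train = np.random.rand(256, 28, 28, 1).astype("float32")
y_train = np.random.randint(0, 10, size=(256,))

model = keras.Sequential([
    keras.layers.Input(shape=(28, 28, 1)),
    keras.layers.Conv2D(16, 3, activation="relu"),
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(10, activation="softmax"),
])

# The same compile/fit workflow runs unchanged on the JAX backend.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2, batch_size=32)
```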
Introduction to LangChain for Including AI from Large Language Models (LLMs) Inside Data Applications and Data Pipelines This article will provide an overview of LangChain, the problems it addresses, its use cases, and some of its limitations. Memory: Storing and retrieving data while conversing.
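LangChain's API has changed quickly across versions, so the following is only an illustrative sketch of the memory concept, assuming a LangChain version that ships ConversationBufferMemory.

```python
from langchain.memory import ConversationBufferMemory

# Memory stores the running conversation so later prompts can reference it.
memory = ConversationBufferMemory()

memory.save_context(
    {"input": "Which table holds daily sales?"},
    {"output": "The FACT_DAILY_SALES table in the analytics schema."},
)

# The stored history can be injected into the next LLM prompt.
print(memory.load_memory_variables({})["history"])
```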
Hydra is a powerful Python-based configuration management framework designed to simplify the complexities of handling configurations in Machine Learning (ML) workflows and other projects. It also simplifies managing configuration dependencies in Deep Learning projects and large-scale data pipelines. What is Hydra?
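The canonical Hydra entry point looks roughly like this; the config directory, file name, and fields are placeholders assumed for the sketch.

```python
# train.py -- assumes a conf/config.yaml containing, e.g.:
#   model:
#     lr: 0.01
#   data:
#     path: data/train.csv
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # Hydra composes the config and hands it over as a structured object.
    print(OmegaConf.to_yaml(cfg))
    # Values can be overridden from the CLI, e.g.: python train.py model.lr=0.001
    print(cfg.model.lr, cfg.data.path)

if __name__ == "__main__":
    main()
```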
With proper unstructured data management, you can write validation checks to detect multiple entries of the same data. Continuous learning: In a properly managed unstructured data pipeline, you can use new entries to train a production ML model, keeping the model up to date.
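A simple sketch of such a validation check, grouping files by content hash to flag duplicate entries; the directory layout is an assumption.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicate_documents(root: str) -> dict[str, list[Path]]:
    """Group files under `root` by content hash; groups with >1 file are duplicates."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}

# Example usage against a (hypothetical) landing directory of raw documents.
for digest, paths in find_duplicate_documents("data/raw_documents").items():
    print(f"Duplicate content {digest[:12]}...: {[str(p) for p in paths]}")
```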
Although data scientists rightfully capture the spotlight, future-focused teams also include engineers building data pipelines, visualization experts, and project managers who integrate efforts across groups. Usability: Do interfaces and documentation enable business analysts and data scientists to leverage systems?
Image generated with Midjourney In today's fast-paced world of data science, building impactful machine learning models relies on much more than selecting the best algorithm for the job. Data scientists and machine learning engineers need to collaborate to make sure that, together with the model, they develop robust data pipelines.
Solution overview SageMaker algorithms have fixed input and output data formats. But customers often require specific formats that are compatible with their data pipelines. Option A In this option, we use the inference pipeline feature of SageMaker hosting. We use the SageMaker Python SDK for this purpose.
As a Data Analyst, you've honed your skills in data wrangling, analysis, and communication. But the allure of tackling large-scale projects, building robust models for complex problems, and orchestrating data pipelines might be pushing you to transition into Data Science architecture.
Functional and non-functional requirements need to be documented clearly; the architecture design will be based on them and must support them. GPT-4 Data Pipelines: Transform JSON to SQL Schema Instantly Blockstream's public Bitcoin API. Then the software development phases are planned to deliver the software.
or as narrow as a couple of Python projects developed by a small team thrown into a single repository. These pipelines might be tightly integrated with the ML code. Keeping the data pipelines and ML code in the same repo helps maintain this tight integration and streamline the workflow. Take testing.
Summary: In 2024, mastering essential Data Science tools will be pivotal for career growth and problem-solving prowess. Tools like Seaborn, R, Python, and PyTorch are integral for extracting actionable insights and enhancing career prospects. It offers a wide range of libraries and frameworks for various Data Science tasks.
This use case highlights how large language models (LLMs) are able to become a translator between human languages (English, Spanish, Arabic, and more) and machine interpretable languages (Python, Java, Scala, SQL, and so on) along with sophisticated internal reasoning.