Blog, Data Engineering and Data Lakes

Unifying Your Data Ecosystem with Delta Lake Integration

databricks

MAY 9, 2023

As organizations are maturing their data infrastructure and accumulating more data than ever before in their data lakes, Open and Reliable table formats.

Data Lakes

Data Lakes Data Engineer Data Engineering Data Engineering

Seamlessly Migrate Your Apache Parquet Data Lake to Delta Lake

databricks

JUNE 6, 2023

Apache Parquet is one of the most popular open source file formats in the big data world today. Being column-oriented, Apache Parquet allows.

Data Lakes

Data Lakes Big Data Big Data Data Engineer

CI/CD for Data Pipelines: A Game-Changer with AnalyticsCreator

Data Science Blog

MAY 20, 2024

Continuous Integration and Continuous Delivery (CI/CD) for Data Pipelines: It is a Game-Changer with AnalyticsCreator! The need for efficient and reliable data pipelines is paramount in data science and data engineering. It offers full BI-Stack Automation, from source to data warehouse through to frontend.

Data Pipeline

Data Pipeline Data Warehouse Azure Data Lakes

Webinars

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

MORE WEBINARS

The IKEA of Data: How to Bring Modular Thinking to Your Data Architecture (and Why It Works)

IBM Data Science in Practice

MAY 19, 2025

The Data Dilemma: From Chaos toClarity In the world of data management, weve all beenthere: A simple request spirals into a maze of thingslike: A CSV labeled final_v2_final_final.csv A Parquet file in a forgotten S3folder A table with brokenlineage An abandoned SQL notebook from yearsago What starts as a data lake often becomes a dataswamp!

Data Lakes

Data Lakes SQL Data Science Data Engineer

Build a financial research assistant using Amazon Q Business and Amazon QuickSight for generative AI–powered insights

Flipboard

MAY 14, 2025

Their information is split between two types of data: unstructured data (such as PDFs, HTML pages, and documents) and structured data (such as databases, data lakes, and real-time reports). Different types of data typically require different tools to access them. Traditionally, businesses face a challenge.

AWS

AWS AI AI Database

What Are the Best Data Modeling Methodologies & Processes for My Data Lake?

phData

SEPTEMBER 19, 2023

With the amount of data companies are using growing to unprecedented levels, organizations are grappling with the challenge of efficiently managing and deriving insights from these vast volumes of structured and unstructured data. What is a Data Lake? Consistency of data throughout the data lake.

Data Lakes

Data Lakes Data Modeling Data Models Data Warehouse

Data Cataloging in the Data Lake: Alation + Kylo

Alation

FEBRUARY 20, 2020

When it was no longer a hard requirement that a physical data model be created upon the ingestion of data, there was a resulting drop in richness of the description and consistency of the data stored in Hadoop. You did not have to understand or prepare the data to get it into Hadoop, so people rarely did.

Data Lakes

Data Lakes Hadoop Tableau Big Data

How Twilio generated SQL using Looker Modeling Language data with Amazon Bedrock

AWS Machine Learning Blog

AUGUST 8, 2024

Managing and retrieving the right information can be complex, especially for data analysts working with large data lakes and complex SQL queries. This post highlights how Twilio enabled natural language-driven data exploration of business intelligence (BI) data with RAG and Amazon Bedrock.

SQL

SQL Data Lakes Data Analyst AWS

Why Open Table Format Architecture is Essential for Modern Data Systems

phData

NOVEMBER 8, 2024

Open Table Format (OTF) architecture now provides a solution for efficient data storage, management, and processing while ensuring compatibility across different platforms. In this blog, we will discuss: What is the Open Table format (OTF)? It can also be integrated into major data platforms like Snowflake.

Data Lakes

Data Lakes Data Warehouse Database Azure

Azure Data Engineer Jobs

Pickl AI

APRIL 6, 2023

Accordingly, one of the most demanding roles is that of Azure Data Engineer Jobs that you might be interested in. The following blog will help you know about the Azure Data Engineering Job Description, salary, and certification course. How to Become an Azure Data Engineer?

Azure

Azure Data Engineer Data Engineering Data Engineering

An integrated experience for all your data and AI with Amazon SageMaker Unified Studio (preview)

Flipboard

DECEMBER 11, 2024

Organizations are building data-driven applications to guide business decisions, improve agility, and drive innovation. Many of these applications are complex to build because they require collaboration across teams and the integration of data, tools, and services. Big Data Architect. option("multiLine", "true").option("header",

SQL

SQL AWS Data Lakes AI

Governing the ML lifecycle at scale, Part 1: A framework for architecting ML workloads using Amazon SageMaker

AWS Machine Learning Blog

OCTOBER 20, 2023

Data and governance foundations – This function uses a data mesh architecture for setting up and operating the data lake, central feature store, and data governance foundations to enable fine-grained data access. This framework considers multiple personas and services to govern the ML lifecycle at scale.

ML

ML ML AWS Data Lakes

How to Shift from Data Science to Data Engineering

ODSC - Open Data Science

JANUARY 18, 2024

Data engineering is a rapidly growing field, and there is a high demand for skilled data engineers. If you are a data scientist, you may be wondering if you can transition into data engineering. In this blog post, we will discuss how you can become a data engineer if you are a data scientist.

Data Engineer

Data Engineer Data Engineering Data Engineering Data Engineering

Top Use Cases of Data Engineering in Financial Services

phData

SEPTEMBER 29, 2023

When you think of data engineering , what comes to mind? In reality, though, if you use data (read: any information), you are most likely practicing some form of data engineering every single day. Said differently, any tools or steps we use to help us utilize data can be considered data engineering.

Data Engineer

Data Engineer Data Engineering Data Engineering Data Engineering

Shaping the future: OMRON’s data-driven journey with AWS

AWS Machine Learning Blog

APRIL 3, 2025

Amazon AppFlow was used to facilitate the smooth and secure transfer of data from various sources into ODAP. Additionally, Amazon Simple Storage Service (Amazon S3) served as the central data lake, providing a scalable and cost-effective storage solution for the diverse data types collected from different systems.

AWS

AWS Data Governance Data Silos SQL

Introducing the Amazon Comprehend flywheel for MLOps

AWS Machine Learning Blog

MARCH 1, 2023

MLOps focuses on the intersection of data science and data engineering in combination with existing DevOps practices to streamline model delivery across the ML development lifecycle. MLOps requires the integration of software development, operations, data engineering, and data science.

Data Lakes

Data Lakes AWS ML ML

Media Mix Modeling, ML Safety Concerns with LLMs, and Data Engineering Cloud Options

ODSC - Open Data Science

APRIL 27, 2023

Unlock the Power of Media Mix Modeling for Effective Advertising In this blog, we will provide a quick overview of media mix modeling and how you can get started with it. 5 Data Engineering and Data Science Cloud Options for 2023 AI development is incredibly resource intensive.

Data Engineer

Data Engineer Data Engineering Data Engineering Data Engineering

What is the Snowflake Data Cloud and How Much Does it Cost?

phData

NOVEMBER 9, 2023

This blog was originally written by Keith Smith and updated for 2024 by Justin Delisi. Snowflake’s Data Cloud has emerged as a leader in cloud data warehousing. What is a Data Lake? A Data Lake is a location to store raw data that is in any format that an organization may produce or collect.

Data Warehouse

Data Warehouse Data Lakes Clustering Cloud Data

Reducing hallucinations in LLM agents with a verified semantic cache using Amazon Bedrock Knowledge Bases

AWS Machine Learning Blog

FEBRUARY 21, 2025

He specializes in large language models, cloud infrastructure, and scalable data systems, focusing on building intelligent solutions that enhance automation and data accessibility across Amazons operations. Chaithanya Maisagoni is a Senior Software Development Engineer (AI/ML) in Amazons Worldwide Returns and ReCommerce organization.

AWS

AWS Natural Language Processing Machine Learning Machine Learning

Use Amazon SageMaker Canvas to build machine learning models using Parquet data from Amazon Athena and AWS Lake Formation

AWS Machine Learning Blog

JUNE 5, 2023

Many teams are turning to Athena to enable interactive querying and analyze their data in the respective data stores without creating multiple data copies. Athena allows applications to use standard SQL to query massive amounts of data on an S3 data lake. Create a data lake with Lake Formation.

Machine Learning

Machine Learning Machine Learning AWS Data Lakes

5 Ways Data Engineers Can Support Data Governance

Alation

JANUARY 26, 2023

Governance can — and should — be the responsibility of every data user, though how that’s achieved will depend on the role within the organization. This article will focus on how data engineers can improve their approach to data governance. How can data engineers address these challenges directly?

Data Governance

Data Governance Data Engineer Data Engineering Data Engineering

What is Snowpark — and Why Does it Matter? A phData Perspective

phData

SEPTEMBER 20, 2023

This blog was originally written by Keith Smith and updated for 2023 by Nick Goble & Dominick Rocco. You’ve probably heard of the Snowflake Data Cloud , but did you know that Snowflake also offers a revolutionary set of libraries and runtimes called Snowpark? What is Snowflake’s Snowpark?

SQL

SQL Python Data Lakes Machine Learning

Munich Re Launches Enterprise-Wide Data-Driven Platform for Analytics

Alation

FEBRUARY 13, 2020

Andreas Kohlmaier, Head of Data Engineering at Munich Re 1. --> Ron Powell, independent analyst and industry expert for the BeyeNETWORK and executive producer of The World Transformed FastForward Series, interviews Andreas Kohlmaier, Head of Data Engineering at Munich Re. But it is a little hard to consume.

Data Lakes

Data Lakes Analytics Analytics Data Engineer

Data Mesh vs. Data Fabric: A Love Story

Alation

JANUARY 13, 2022

Thoughtworks says data mesh is key to moving beyond a monolithic data lake. Spoiler alert: data fabric and data mesh are independent design concepts that are, in fact, quite complementary. Thoughtworks says data mesh is key to moving beyond a monolithic data lake 2. Gartner on Data Fabric.

Data Lakes

Data Lakes Data Governance Data Quality Data Warehouse

How Rocket Companies modernized their data science solution on AWS

AWS Machine Learning Blog

FEBRUARY 21, 2025

Despite the benefits of this architecture, Rocket faced challenges that limited its effectiveness: Accessibility limitations: The data lake was stored in HDFS and only accessible from the Hadoop environment, hindering integration with other data sources. This also led to a backlog of data that needed to be ingested.

Data Science

Data Science AWS Hadoop Data Scientist

Improving air quality with generative AI

AWS Machine Learning Blog

JUNE 18, 2024

The solution addressed in this blog solves Afri-SET’s challenge and was ranked as the top 3 winning solutions. This post presents a solution that uses a generative artificial intelligence (AI) to standardize air quality data from low-cost sensors in Africa, specifically addressing the air quality data integration problem of low-cost sensors.

AWS

AWS Python AI AI

FMOps/LLMOps: Operationalize generative AI and differences with MLOps

AWS Machine Learning Blog

SEPTEMBER 1, 2023

These teams are as follows: Advanced analytics team (data lake and data mesh) – Data engineers are responsible for preparing and ingesting data from multiple sources, building ETL (extract, transform, and load) pipelines to curate and catalog the data, and prepare the necessary historical data for the ML use cases.

AI

AI AI ML ML

eSentire delivers private and secure generative AI interactions to customers with Amazon SageMaker

AWS Machine Learning Blog

JUNE 21, 2024

eSentire has over 2 TB of signal data stored in their Amazon Simple Storage Service (Amazon S3) data lake. This further step updates the FM by training with data labeled by security experts (such as Q&A pairs and investigation conclusions).

AWS

AWS AI AI Natural Language Processing

Perform generative AI-powered data prep and no-code ML over any size of data using Amazon SageMaker Canvas

AWS Machine Learning Blog

AUGUST 15, 2024

With over 50 connectors, an intuitive Chat for data prep interface, and petabyte support, SageMaker Canvas provides a scalable, low-code/no-code (LCNC) ML solution for handling real-world, enterprise use cases. Organizations often struggle to extract meaningful insights and value from their ever-growing volume of data.

ML

ML ML Data Preparation AWS

Podcast: Deciphering Data Architectures with James Serra

ODSC - Open Data Science

MAY 7, 2024

Beyond his technical achievements, James is a sought-after speaker and is a prolific voice in the data community through his blog, JamesSerra.com. James Serra discusses data lakehouses, which merge data lakes and data warehouses.

Data Warehouse

Data Warehouse Data Lakes Data Science Big Data

Why optimize your warehouse with a data lakehouse strategy

IBM Journey to AI blog

APRIL 25, 2023

In a prior blog , we pointed out that warehouses, known for high-performance data processing for business intelligence, can quickly become expensive for new data and evolving workloads. To do so, Presto and Spark need to readily work with existing and modern data warehouse infrastructures.

Data Warehouse

Data Warehouse Data Engineering Data Engineering Data Engineer

Data science vs data analytics: Unpacking the differences

IBM Journey to AI blog

SEPTEMBER 19, 2023

Data scientists will typically perform data analytics when collecting, cleaning and evaluating data. By analyzing datasets, data scientists can better understand their potential use in an algorithm or machine learning model. Watsonx comprises of three powerful components: the watsonx.ai

Data Science

Data Science Analytics Analytics Data Scientist

Imperva optimizes SQL generation from natural language using Amazon Bedrock

AWS Machine Learning Blog

JUNE 20, 2024

Our goal was to improve the user experience of an existing application used to explore the counters and insights data. The data is stored in a data lake and retrieved by SQL using Amazon Athena. Millions of counters are added daily, together with 20 million insights updated daily to spot threat patterns.

SQL

SQL Database AWS Machine Learning

Data architecture strategy for data quality

IBM Journey to AI blog

JANUARY 5, 2023

The first generation of data architectures represented by enterprise data warehouse and business intelligence platforms were characterized by thousands of ETL jobs, tables, and reports that only a small group of specialized data engineers understood, resulting in an under-realized positive impact on the business.

Data Quality

Data Quality Data Lakes Data Warehouse Big Data

Accelerating time-to-insight with MongoDB time series collections and Amazon SageMaker Canvas

AWS Machine Learning Blog

DECEMBER 18, 2023

Prior joining AWS, as a Data/Solution Architect he implemented many projects in Big Data domain, including several data lakes in Hadoop ecosystem. As a Data Engineer he was involved in applying AI/ML to fraud detection and office automation. Babu Srinivasan is a Senior Partner Solutions Architect at MongoDB.

Clustering

Clustering AWS Database ML

Getir end-to-end workforce management: Amazon Forecast and AWS Step Functions

AWS Machine Learning Blog

DECEMBER 7, 2023

He joined Getir in 2022 as a Data Scientist and started working on time-series forecasting and mathematical optimization projects. Mutlu Polatcan is a Staff Data Engineer at Getir, specializing in designing and building cloud-native data platforms. He loves combining open-source projects with cloud services.

AWS

AWS Algorithm Data Science Machine Learning

How to Version Control Data in ML for Various Data Sources

The MLOps Blog

JANUARY 23, 2023

However, there are some key differences that we need to consider: Size and complexity of the data In machine learning, we are often working with much larger data. Basically, every machine learning project needs data. Given the range of tools and data types, a separate data versioning logic will be necessary.

ML

ML ML Data Lakes Machine Learning

3 Major Trends at Strata New York 2017

DataRobot Blog

OCTOBER 3, 2017

Enterprise data architects, data engineers, and business leaders from around the globe gathered in New York last week for the 3-day Strata Data Conference , which featured new technologies, innovations, and many collaborative ideas. 2) When data becomes information, many (incremental) use cases surface.

Data Lakes

Data Lakes Azure Data Pipeline Hadoop

Amazon SageMaker Feature Store now supports cross-account sharing, discovery, and access

AWS Machine Learning Blog

FEBRUARY 13, 2024

Let’s demystify this using the following personas and a real-world analogy: Data and ML engineers (owners and producers) – They lay the groundwork by feeding data into the feature store Data scientists (consumers) – They extract and utilize this data to craft their models Data engineers serve as architects sketching the initial blueprint.

AWS

AWS ML ML Machine Learning

Scale knowledge management use cases with generative AI

IBM Journey to AI blog

JULY 27, 2023

Powering a knowledge management system with a data lakehouse Organizations need a data lakehouse to target data challenges that come with deploying an AI-powered knowledge management system. It provides the combination of data lake flexibility and data warehouse performance to help to scale AI.

AI

AI AI Data Scientist Data Quality

How to use foundation models and trusted governance to manage AI workflow risk

IBM Journey to AI blog

OCTOBER 16, 2023

How to scale AL and ML with built-in governance A fit-for-purpose data store built on an open lakehouse architecture allows you to scale AI and ML while providing built-in governance tools. A data store lets a business connect existing data with new data and discover new insights with real-time analytics and business intelligence.

AI

AI AI Data Warehouse ML

Exploring the AI and data capabilities of watsonx

IBM Journey to AI blog

JULY 17, 2023

In this blog, I will cover: What is watsonx.ai? sales conversation summaries, insurance coverage, meeting transcripts, contract information) Generate: Generate text content for a specific purpose, such as marketing campaigns, job descriptions, blogs or articles, and email drafting support. What capabilities are included in watsonx.ai?

AI

AI AI Machine Learning Machine Learning

How OLAP and AI can enable better business

IBM Journey to AI blog

DECEMBER 7, 2023

Automated data preparation and cleansing : AI-powered data preparation tools will automate data cleaning, transformation and normalization, reducing the time and effort required for manual data preparation and improving data quality.

Data Preparation

Data Preparation Database Data Analysis Data Analysis

Our Next Phase of Growth: Enterprise Data Catalogs

Alation

FEBRUARY 13, 2020

Expansion in our business model is driven by the number of users of the data catalog, which means that our average customer is virally successful relative to their initial investment. The Alation Data Catalog is taking years of data lake and self-service analytics investments and driving them from investments to insights.

Data Lakes

Data Lakes Analytics Analytics Machine Learning

Unifying Your Data Ecosystem with Delta Lake Integration

Seamlessly Migrate Your Apache Parquet Data Lake to Delta Lake

Webinars

Trending Sources

CI/CD for Data Pipelines: A Game-Changer with AnalyticsCreator

Webinars

The IKEA of Data: How to Bring Modular Thinking to Your Data Architecture (and Why It Works)

Build a financial research assistant using Amazon Q Business and Amazon QuickSight for generative AI–powered insights

What Are the Best Data Modeling Methodologies & Processes for My Data Lake?

Data Cataloging in the Data Lake: Alation + Kylo

How Twilio generated SQL using Looker Modeling Language data with Amazon Bedrock

Why Open Table Format Architecture is Essential for Modern Data Systems

Azure Data Engineer Jobs

An integrated experience for all your data and AI with Amazon SageMaker Unified Studio (preview)

Governing the ML lifecycle at scale, Part 1: A framework for architecting ML workloads using Amazon SageMaker

How to Shift from Data Science to Data Engineering

Top Use Cases of Data Engineering in Financial Services

Shaping the future: OMRON’s data-driven journey with AWS

Introducing the Amazon Comprehend flywheel for MLOps

Media Mix Modeling, ML Safety Concerns with LLMs, and Data Engineering Cloud Options

What is the Snowflake Data Cloud and How Much Does it Cost?

Reducing hallucinations in LLM agents with a verified semantic cache using Amazon Bedrock Knowledge Bases

Use Amazon SageMaker Canvas to build machine learning models using Parquet data from Amazon Athena and AWS Lake Formation

5 Ways Data Engineers Can Support Data Governance

What is Snowpark — and Why Does it Matter? A phData Perspective

Munich Re Launches Enterprise-Wide Data-Driven Platform for Analytics

Data Mesh vs. Data Fabric: A Love Story

How Rocket Companies modernized their data science solution on AWS

Improving air quality with generative AI

FMOps/LLMOps: Operationalize generative AI and differences with MLOps

eSentire delivers private and secure generative AI interactions to customers with Amazon SageMaker

Perform generative AI-powered data prep and no-code ML over any size of data using Amazon SageMaker Canvas

Podcast: Deciphering Data Architectures with James Serra

Why optimize your warehouse with a data lakehouse strategy

Data science vs data analytics: Unpacking the differences

Imperva optimizes SQL generation from natural language using Amazon Bedrock

Data architecture strategy for data quality

Accelerating time-to-insight with MongoDB time series collections and Amazon SageMaker Canvas

Getir end-to-end workforce management: Amazon Forecast and AWS Step Functions

How to Version Control Data in ML for Various Data Sources

3 Major Trends at Strata New York 2017

Amazon SageMaker Feature Store now supports cross-account sharing, discovery, and access

Scale knowledge management use cases with generative AI

How to use foundation models and trusted governance to manage AI workflow risk

Exploring the AI and data capabilities of watsonx

How OLAP and AI can enable better business

Our Next Phase of Growth: Enterprise Data Catalogs

Stay Connected