Top 10 Data Pipeline Interview Questions to Read in 2023

Analytics Vidhya

Data pipelines play a critical role in the processing and management of data in modern organizations. A well-designed data pipeline can help organizations extract valuable insights from their data, automate tedious manual processes, and ensure the accuracy of data processing.

Mainframe Technology Trends for 2023

Precisely

In 2023 and beyond, we expect the open source trend to continue, with steady growth in the adoption of tools like Feilong, Tessia, ConsoleZ, and Zowe. In 2023, expect to see broader adoption of streaming data pipelines that bring mainframe data to the cloud, offering a powerful tool for “modernizing in place.”

Navigating the World of Data Engineering: A Beginner’s Guide.

Towards AI

Last Updated on March 21, 2023 by Editorial Team. Author(s): Data Science meets Cyber Security. Originally published on Towards AI. Data or data? What are ETL and data pipelines?
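To make the excerpt’s question concrete, here is a minimal ETL sketch in Python; the CSV source, the cleaning rule, and the SQLite target are illustrative assumptions, not details from the article:

# Extract raw rows from a CSV, transform them, load them into SQLite.
import csv
import sqlite3

def extract(path):
    # Extract: read raw records from the source file
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: normalize names, cast amounts, drop incomplete rows
    return [
        {"name": r["name"].strip().title(), "amount": float(r["amount"])}
        for r in rows
        if r.get("amount")
    ]

def load(rows, db="warehouse.db"):
    # Load: write the cleaned rows into the target store
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (:name, :amount)", rows)
    con.commit()
    con.close()

load(transform(extract("sales.csv")))  # the pipeline: extract -> transform -> load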

phData Toolkit July 2023 Update

phData

Operational Risks: identify operational risks such as data loss or failures in the event of an unforeseen outage or disaster. Performance Optimization: identify and fix bottlenecks in your data pipelines so that you can get the most out of your Snowflake investment.
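As a generic illustration of bottleneck hunting (a Python sketch, not phData’s tooling; the stage names are made up), timing each stage of a pipeline shows where the slow step is:

# Time each pipeline stage to see which one dominates the run.
import time
from contextlib import contextmanager

@contextmanager
def timed(stage):
    start = time.perf_counter()
    yield
    print(f"{stage}: {time.perf_counter() - start:.2f}s")

with timed("extract"):
    rows = list(range(1_000_000))   # stand-in for a source read
with timed("transform"):
    rows = [i * 2 for i in rows]    # stand-in for business logic
with timed("load"):
    total = sum(rows)               # stand-in for a warehouse write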

phData Toolkit March 2023 Update

phData

For the Data Source Tool, we’ve addressed the following:
Fixed an issue where view filters wouldn’t be disabled when using enabled = false.
Fixed an issue when filtering tables in a database where only the first table listed would be scanned.

How to Unlock Real-Time Analytics with Snowflake?

phData

What is Apache Kafka, and how is it used in building real-time data pipelines? Apache Kafka is an open-source event distribution platform capable of handling high-volume, high-velocity data. It is highly scalable, highly available, and low-latency. Example: openssl rsa -in C:\tmp\new_rsa_key_v1.p8
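A minimal sketch of Kafka in a real-time pipeline, assuming a broker on localhost:9092 and the kafka-python client; the topic name and payload are illustrative:

# Produce a JSON event to a topic, then consume it downstream.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("clickstream", {"user_id": 42, "event": "page_view"})
producer.flush()

consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # hand off to the next stage, e.g. a warehouse load
    break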

Meet the Seattle-area startups that just graduated from Y Combinator

Flipboard

Seattle-area startups that just graduated from Y Combinator’s summer 2023 batch are tackling a wide range of problems, with plenty of help from artificial intelligence. Neum AI at its core is an enabler for generative AI applications, helping connect data into vector databases and making it accessible for RAG.
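To illustrate what connecting data into vector databases for RAG means at its simplest, here is a toy Python sketch; it is not Neum AI’s API, and a real system would use a learned embedding model and a dedicated vector store:

# Embed documents, index the vectors, retrieve the closest match for a query.
import numpy as np

def embed(text, dim=64):
    # Stand-in embedding: hashed bag-of-words, normalized to unit length
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

docs = [
    "Kafka streams events between services",
    "Snowflake stores analytical data in the cloud",
    "Vector databases power retrieval-augmented generation",
]
index = np.stack([embed(d) for d in docs])  # the "vector database"

query = embed("what powers retrieval-augmented generation?")
scores = index @ query                      # cosine similarity on unit vectors
print(docs[int(np.argmax(scores))])         # context passed to the LLM for RAG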