Data Science Current

Automate PDF pre-labeling for Amazon Comprehend

AWS Machine Learning Blog

DECEMBER 14, 2023

Amazon Comprehend is a natural-language processing (NLP) service that provides pre-trained and custom APIs to derive insights from textual data. To train a custom model, you first prepare training data by manually annotating entities in documents. For the demo, we use simulated bank statements like the following example.

AWS, Natural Language Processing, Machine Learning

Kangas: The Pandas of Computer Vision

Heartbeat

APRIL 30, 2023

Introduction: In the field of computer vision, Kangas is one of the tools becoming increasingly popular for image data processing and analysis. Similar to how Pandas revolutionized the way data analysts work with tabular data, Kangas is doing the same for computer vision tasks.

Data Analysis, ML

Simplify data prep for generative AI with Amazon SageMaker Data Wrangler

AWS Machine Learning Blog

NOVEMBER 27, 2023

However, these models require massive amounts of clean, structured training data to reach their full potential. Most real-world data exists in unstructured formats like PDFs and requires preprocessing before it can be used effectively. According to IDC, unstructured data accounts for over 80% of all business data today.

Data Preparation, AI, Database

The Tradeoff Between Complexity and Ground Truth in AI: What You Need to Know

ODSC - Open Data Science

OCTOBER 31, 2023

For data scientists, ground truth is the holy grail. If we think of AI as software that is taught with examples, instead of instructions, then selecting the right examples is critical to building a system that performs well. This is the data of record that reflects verified examples of the correct outcome.

AI, ML

Implementing MLOps practices with Amazon SageMaker JumpStart pre-trained models

Flipboard

FEBRUARY 15, 2023

We show how to build an end-to-end CI/CD pipeline for data preprocessing and fine-tuning ML models, registering model artifacts to the SageMaker model registry, and automating model deployment with a manual approval to stage and production. We demonstrate a customer churn classification example using the LightGBM model from JumpStart.

ML, AWS, Natural Language Processing

How to Practice Data-Centric AI and Have AI Improve its Own Dataset

ODSC - Open Data Science

OCTOBER 11, 2023

Be sure to check out his talk, “How to Practice Data-Centric AI and Have AI Improve its Own Dataset,” there! Machine learning models are only as good as the data they are trained on. Even with the most advanced neural network architectures, if the training data is flawed, the model will suffer.

AI, ML

Churn prediction using multimodality of text and tabular features with Amazon SageMaker Jumpstart

AWS Machine Learning Blog

JANUARY 17, 2023

In addition to textual inputs, this model uses traditional structured data inputs such as numerical and categorical fields. This post aims to build a model that can process and relate information from multiple modalities such as tabular and textual features. Extract and analyze data from documents. JumpStart solution templates.

AWS, Machine Learning, Natural Language Processing

Constructing and Visualizing Datagrids in Kangas

Heartbeat

FEBRUARY 21, 2023

It’s defined as a tool for exploring, analyzing, and visualizing large-scale multimedia data. Kangas DataGrid, the fundamental class for representing datasets, can easily store millions of rows of data. Group, sort, and filter across millions of data points in seconds with a simple, fast UI. Any data, any environment.

Deep Learning, ML

Promote search content using Featured Results for Amazon Kendra

AWS Machine Learning Blog

APRIL 5, 2023

For example, you can specify that if your users enter the query “new products 2023,” then select the documents titled “What’s new” and “Coming soon” to feature at the top of the search results page. Choose Add data source. Under Available data sources, select Sample AWS documentation and choose Add dataset. Choose Next.

AWS, Machine Learning, ML

Predictive Health Data: A New Dataset in the Medical Domain

Defined.ai blog

APRIL 20, 2023

Health data is an extremely personal topic for many individuals, and the laws and regulations reflect this sensitivity. These screens are linked over an individual’s lifetime, making them a valuable resource for tracking life-cycle health and building medical AI models, especially when it comes to predictive health data.

Data Modeling, Data Models, AI

Fast and cost-effective LLaMA 2 fine-tuning with AWS Trainium

AWS Machine Learning Blog

OCTOBER 5, 2023

Distributed training of Llama 2: To accommodate Llama 2 with 2,000 and 4,000 sequence length, we implemented the script using NeMo Megatron for Trainium that supports data parallelism (DP), tensor parallelism (TP), and pipeline parallelism (PP). Lastly, we monitor TensorBoard to keep track of training progress: tensorboard --logdir ./

AWS, Machine Learning, Deep Learning

Present and future of data cubes: an European EO perspective

Mlearning.ai

JANUARY 26, 2023

Prepared by: Carson Ross (OpenGeoHub), Tom Hengl (OpenGeoHub), Leandro Parente (OpenGeoHub), Vasile Crăciunescu (TerraSigna). Data Cubes are highly organised data infrastructures enabling users to run new analyses and generate insights into processes, patterns and trends.

AWS, Database, Clean Data, Data Science

What is Power BI Report Builder

phData

JUNE 8, 2023

In this blog, we will provide an introduction to Power BI Report Builder, explore its purpose, interface, and available data connections, and discuss certain limitations. It is typically a tabular (table-based) report designed to fit well on a page that follows the exact format that the developer defines.

Power BI, SQL, Azure, Database

Schneider Electric leverages Retrieval Augmented LLMs on SageMaker to ensure real-time updates in their ERP systems

AWS Machine Learning Blog

OCTOBER 31, 2023

Enterprise Resource Planning (ERP) systems are used by companies to manage several business functions such as accounting, sales or order management in one system. An example of account linking would be to identify the relationship between Amazon and its subsidiary, Whole Foods Market.

ML, AWS, Machine Learning

Retrieval Part 1: Document loaders, Document Transformers

Heartbeat

NOVEMBER 24, 2023

Retrieval in LangChain refers to fetching and retrieving relevant data or documents from external sources. Retrieval is useful because it allows you to incorporate external data into your language model, providing additional context and information that may not be present in the model’s training data.

Deep Learning, ML

Meet the winners of the Tick Tick Bloom: Harmful Algal Bloom Detection Challenge

DrivenData Labs

APRIL 13, 2023

We need creative solutions that combine methods with satellite and other data to make these satellites help us. The goal in the Tick Tick Bloom: Harmful Algal Bloom Detection Challenge was to detect and classify the severity of cyanobacteria blooms in small, inland water bodies using publicly available satellite, climate, and elevation data.

Data Scientist, Decision Trees, Algorithm, Data Quality

Meet the finalists of the Pushback to the Future Challenge

DrivenData Labs

MAY 24, 2023

The NAS is investing in new ways to bring vast amounts of data together with state-of-the-art machine learning to improve air travel for everyone. In this post, we'll share the results of Phase 1 of this challenge, in which participants were given access to two years of data from 10 U.S.

Machine Learning, Decision Trees, Data Science

Schedule your notebooks from any JupyterLab environment using the Amazon SageMaker JupyterLab extension

AWS Machine Learning Blog

MAY 10, 2023

Jupyter notebooks are highly favored by data scientists for their ability to interactively process data, build ML models, and test these models by making inferences on data. However, there are scenarios in which data scientists may prefer to transition from interactive development on notebooks to batch jobs.

AWS, Data Scientist, ML

The Ultimate Guide to Data Preparation for Machine Learning

DagsHub

FEBRUARY 29, 2024

Introduction: Machine learning models learn patterns from data and leverage the learning, captured in the model weights, to make predictions on new, unseen data. Data is, therefore, essential to the quality and performance of machine learning models. In this blog, I will describe how to prepare data for machine learning in depth.

Data Preparation, Machine Learning, Data Governance

10 Essential Topics to Master LLMs and Generative AI

ODSC - Open Data Science

NOVEMBER 8, 2023

Over the past year, new terms, developments, algorithms, tools, and frameworks have emerged to help data scientists and those working with AI develop whatever they desire. This involves providing the LLM with a dataset of labeled data, where each data point is a pair of input and output. Generative AI is a new field.

AI, Natural Language Processing, Data Science

Data labeling: a practical guide (2023)

Snorkel AI

SEPTEMBER 29, 2023

Data labeling remains a core requirement for any organization looking to use machine learning to solve tangible business problems, especially with the increased development and adoption of LLMs. This is where data labeling fits in. What is data labeling? This approach applies across all data modalities.

Machine Learning, ML

Managing Dataset Versions in Long-Term ML Projects

The MLOps Blog

MARCH 20, 2023

An example of a long-term ML project is a bank fraud detection system powered by ML models and algorithms for pattern recognition. In such ML projects, you can expect to find large volumes of data accumulated over time, complex algorithms, and increasing compute resources; all of these are characteristics of a maturing ML project.

ML, Machine Learning

Fine-tuning YOLOv8 for Image Segmentation

Heartbeat

JULY 20, 2023

Fine-tuning a model involves taking a pre-trained model and adapting it to perform well on a new, specific task or data set. This process helps enhance model performance on previously unseen data. This helps deliver an optimized model without going through the entire training process, saving time and computing resources.

Machine Learning, Deep Learning

Generative AI and multi-modal agents in AWS: The key to unlocking new value in financial markets

AWS Machine Learning Blog

SEPTEMBER 19, 2023

Multi-modal data is a valuable component of the financial industry, encompassing market, economic, customer, news and social media, and risk data. Financial organizations generate, collect, and use this data to gain insights into financial operations, make better decisions, and improve performance.

AWS, AI, ML

Use foundation models to improve model accuracy with Amazon SageMaker

AWS Machine Learning Blog

NOVEMBER 16, 2023

Determining the value of housing is a classic example of using machine learning (ML). Almost 50 years later, the estimation of housing prices has become an important teaching tool for students and professionals interested in using data and ML in business decision-making.

AWS, ML, Machine Learning

Build a powerful question answering bot with Amazon SageMaker, Amazon OpenSearch Service, Streamlit, and LangChain

AWS Machine Learning Blog

MAY 25, 2023

Furthermore, FMs are trained with a point in time snapshot of data and have no inherent ability to access fresh data at inference time; without this ability they might provide responses that are potentially incorrect or inadequate. Amazon SageMaker Processing jobs for large scale data ingestion into OpenSearch.

AWS, Clustering, Python, ML

Search for answers accurately using Amazon Kendra S3 Connector with VPC support

AWS Machine Learning Blog

MARCH 2, 2023

Using Amazon Kendra connectors enables you to synchronize data from multiple content repositories with your Amazon Kendra index. The post also demonstrates how to configure your connector for Amazon S3 and configure how your index syncs with your data source when your data source content changes.

AWS, Database, Machine Learning

Predictive Maintenance using Azure Machine Learning AutoML and Inference using Managed Online…

Mlearning.ai

FEBRUARY 18, 2023

with SDK v2

Import libraries:

import tqdm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_style("whitegrid")

Import the data set:

# Import required libraries
from azure.identity import DefaultAzureCredential
from azure.identity import AzureCliCredential
from azure.ai.ml

Azure, Machine Learning, Clustering

Implementing Agents in LangChain

Heartbeat

DECEMBER 8, 2023

Examples of end-to-end agents. Agents can be used for applications such as personal assistants, question answering, chatbots, querying tabular data, interacting with APIs, extraction, summarization, and evaluation. This enables the agent to gather relevant data and use it for decision-making.

Deep Learning, AI

The Data Cards Playbook: A Toolkit for Transparency in Dataset Documentation

Google Research AI blog

NOVEMBER 17, 2022

Earlier this year at the ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT), we published Data Cards, a dataset documentation framework aimed at increasing transparency across dataset lifecycles. The Data Cards Playbook incorporates the latest in fairness, accountability, and transparency research.

ML, Data Governance, Data Scientist

Real-World MLOps Examples: End-To-End MLOps Pipeline for Visual Search at Brainly

The MLOps Blog

MARCH 28, 2023

In this second installment of the series “Real-world MLOps Examples,” Paweł Pęczek, Machine Learning Engineer at Brainly, will walk you through the end-to-end Machine Learning Operations (MLOps) process in the Visual Search team at Brainly. Say the teams have the data and can immediately start the labeling.

Machine Learning, ML

Managing Computer Vision Projects with Michał Tadeusiak

The MLOps Blog

FEBRUARY 27, 2023

He has led several data science projects spanning multiple industries like manufacturing, retail, healthcare, insurance, safety, et cetera. Then, what’s usually the first thing to do after defining the goal and the scope is to see the data. They were yet to build the entire device to collect the data, et cetera.

ML, Data Scientist, Machine Learning

Zero-shot and few-shot prompting for the BloomZ 176B foundation model with the simplified Amazon SageMaker JumpStart SDK

AWS Machine Learning Blog

AUGUST 14, 2023

This is useful where limited labeled data is available for training. BLOOM is an autoregressive LLM trained to continue text from a prompt on vast amounts of text data using industrial-scale computational resources. As such, it is able to output coherent text that is hardly distinguishable from text written by humans.

AWS, Natural Language Processing, Machine Learning

Unlocking Tabular Data’s Hidden Potential

ODSC - Open Data Science

MAY 10, 2023

Experience shows that tabular data is a highly valuable data source for sales, marketing, churn management, operations, and risk management, among other business use cases. Tabular Data Has a Popularity Problem: So why the bias against tabular data? It’s an innate part of our evolutionary journey.

Data Scientist, Data Science, Deep Learning

Top 5 Machine Learning Model Testing Tools in 2024

DagsHub

MAY 7, 2024

In machine learning, model evaluation focuses on performance metrics and plots to summarize the correctness of a model on an unseen holdout test data set. Benefits of Testing ML Models: ML model testing is crucial to creating a robust production-ready model for diverse real-world data. Testing: Are They Different?

Machine Learning, ML

Log and visualize tabular data using Comet data panel

Heartbeat

MAY 10, 2023

Do you want to quickly log your data and visualize it in Comet with the new built-in data panel tool? In this article, we will talk about how to quickly log tabular data (data displayed in columns or tables), such as generic tabular data (“.dat”), “.csv”, or “.tsv”

Deep Learning, Machine Learning

Databricks DBRX is now available in Amazon SageMaker JumpStart

AWS Machine Learning Blog

APRIL 26, 2024

The DBRX LLM employs a fine-grained mixture-of-experts (MoE) architecture, pre-trained on 12 trillion tokens of carefully curated data with a maximum context length of 32,000 tokens. The model is deployed in an AWS secure environment and under your VPC controls, helping provide data security.

ML, AWS, Python

On Privacy and Personalization in Federated Learning: A Retrospective on the US/UK PETs Challenge

ML @ CMU

MAY 12, 2023

Patient data collected by groups such as hospitals and health agencies is a critical tool for monitoring and preventing the spread of disease. Unfortunately, while this data contains a wealth of useful information for disease forecasting, the data itself may be highly sensitive and stored in disparate locations (e.g.,

Data Silos, Algorithm, ML

Edge Computing vs. Cloud Computing: Pros, Cons, and Future Trends

Pickl AI

AUGUST 24, 2023

These innovative approaches have revolutionised the way we manage data. Cloud computing is the practice of storing and accessing data and applications over the internet. Businesses and individual users can use remote servers maintained by cloud service providers to store data. This minimizes the risk of data loss and downtime.

Cloud Computing, Big Data Analytics, Machine Learning

Automate chatbot for document and data retrieval using Agents and Knowledge Bases for Amazon Bedrock

AWS Machine Learning Blog

MAY 1, 2024

Numerous customers face challenges in managing diverse data sources and seek a chatbot solution capable of orchestrating these sources to offer comprehensive answers. It allows you to retrieve data from sources beyond the foundation model, enhancing prompts by integrating contextually relevant retrieved data.

AWS, Machine Learning, SQL

Responsible AI at Google Research: PAIR

Google Research AI blog

MAY 18, 2023

For example, they may visit architectural blogs to learn what domain-specific vocabulary they can adopt to help produce distinctive images of buildings. As an example, we developed new methods for extracting semantically meaningful structure from natural language prompts.

AI, ML

Watch all Future of Data-Centric AI 2023 videos now!

Snorkel AI

OCTOBER 12, 2023

Snorkel AI hosted the 2023 installment of its Future of Data-Centric AI virtual conference in June. The two-day event brought together researchers, practitioners, and industry leaders to discuss the latest trends and advances in data-centric AI, and we recorded each session as a video.

AI, ML
