GETTING STARTED | DATA SCIENCE TOOLS | KNIME ANALYTICS PLATFORM

The Best-Kept Secret in Data Science is KNIME

The ultimate guide to unlocking the secrets of the KNIME ecosystem

Dennis Ganzaroli
Low Code for Data Science
14 min read · Feb 2, 2023


Discover KNIME, the best kept secret in data science. This powerful and versatile open source platform offers a visual interface and a wide range of built-in algorithms to unlock the full potential of your data. Whether you’re an experienced data scientist or just starting to get into the field, KNIME is a game-changer for anyone working with data.

Fig. 1: The Secret Door to Success (image by author, created with bluewillow.ai).

Why KNIME?

It happens more and more often that Business Analysts or BI Specialists ask me:

“Why KNIME? Can you show me a demo? And how do you use it in your company?”

Depending on their level of knowledge, they also ask me:

“I have heard that there is a free open source version. But the premium version is surely not for free, right?”

“I can do this and that in SPSS, SAS, Alteryx, Python,… can you do the same with KNIME?”

“I have heard that it is a No-Code tool. But why don’t you code?”

All these people have heard or seen that we have been successfully using KNIME in our company for almost a decade.

Some of those who have evaluated the software say: “it’s a No-Code tool that can be used to create data pipelines. Our XYZ tool can do that too!”.

In the past, it was very difficult to find Data Engineers or Business Analysts interested in KNIME. They were happy with their (expensive) tools. Today they come to me on their own.

But why this sudden change of mind?

In fact, if you search Google Trends for the keyword KNIME, you will see how queries have steadily increased since July 2021.

Fig. 2.: Google trends for the search “KNIME” over the last 2 years (image by Google).

Two reasons seem to be gaining more and more importance:

Companies are starting to save on licensing costs as well

Not long ago, if you worked for a large company, renewing the expensive licenses for data tools such as SAS or IBM SPSS Modeler was not a big problem.
These days it's: "We're not paying anymore. Find a cheaper tool!"

Collaboration between Business Analysts, Developers and Data Scientists

Business analysts used to work only with Excel. Today, more and more of them are realizing that the work can be done more efficiently with a Low-Code or No-Code tool.
Developers, on the other hand, would prefer to code only. Cooperation on this basis is very difficult: misunderstandings and duplicated work are inevitable.

The solution would be a cheaper or open source tool that provides a common platform and shared principles on which both business analysts and developers can work together.

And for us, that solution is KNIME.

KNIME covers all these aspects:

  • it is and will remain open source, as this is the basis of the KNIME philosophy
  • it provides a Low-Code platform suitable for business analysts as well as developers, data engineers, and data scientists
  • it is cross-platform software, with versions for Windows, macOS, and Linux.
Fig. 3.: KNIME Homepage (image by KNIME).

At KNIME, they believe in openness and the power of the community. Their philosophy is to maintain and develop an open source platform that contains all the functionality any individual might require, and to continue adding functionality through both their own work and that of the community.

Unlike other open source products, KNIME is not a cut-down version, and there are no artificial limitations on execution environment or data size: if you have enough local or cloud-based storage and compute power, you can run projects with billions of rows, as many KNIME users currently do.

The second point covers the Low-Code approach. Because…

the best representation of an ETL pipeline is a visual workflow.

ETL (extract, transform, load) is a type of data integration that refers to the three phases used to combine data from various sources. In this process, data is extracted from one or more source systems, transformed into an analyzable format, and loaded into a data warehouse or another target system.

Fig. 4: The ETL-Process (image by author).
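The three phases can be sketched in a few lines of Python (a hand-rolled illustration with made-up sample data, not KNIME code; in a KNIME workflow each phase would be a node instead):

```python
import csv
import io
import sqlite3

# Extract: read raw rows from a CSV source (an in-memory sample here).
raw = io.StringIO("name,amount\nalice,10\nbob,-3\ncarol,7\n")
rows = list(csv.DictReader(raw))

# Transform: keep only positive amounts and normalize the names.
clean = [(r["name"].title(), int(r["amount"]))
         for r in rows if int(r["amount"]) > 0]

# Load: write the result into a target table (a stand-in for a warehouse).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)", clean)
con.commit()
```

In KNIME, the same pipeline would be a reader node, one or two manipulation nodes, and a writer node connected left to right, with no code at all.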

Judge it for yourself: Which of the following two representations is closest to the process above?

Fig. 5.: Python Code vs KNIME Workflow (image by author).

The visual environment provides just the right amount of abstraction to build and share your work. Yes, because you will always be working with others and therefore need to be able to document and share your work.

Nowadays, no data scientist or data engineer works alone anymore. We are all part of teams, and we all need to communicate. Discussing tasks, sharing best practices, and writing documentation are all part of the daily work.

Starting with KNIME

Installation

You can download KNIME here.

Fig. 6.: Download the latest KNIME for Windows, Linux, and MacOS (image by KNIME).

The following video shows how easy the installation is.

Fig. 7.: Installation of KNIME (image by KNIME).

Once it’s been installed, locate your instance of KNIME — from the appropriate folder, desktop link, application, or link in the start menu — and start it.

Before we get started with a concrete example, we need to familiarize ourselves with the elementary concepts in KNIME.

The Workspace

When the splash screen appears, a window will ask for the location of your workspace. This workspace is a folder on your machine or on a cloud server that will host all your work. The default workspace folder is called knime-workspace:

Fig. 8: Selecting the KNIME Workspace (image by KNIME).

Be careful! Not every cloud provider is suitable for hosting a workspace. So far, I have been able to use OneDrive and iCloud successfully, but other clouds have failed.
If in doubt, save the workspace locally.

After clicking “Launch”, the workbench for KNIME will open.

The Workbench

The workbench is the place where you will be building your workflows.
It’s also where you’ll find all the resources you need to help you build your workflows.
The Workflow Editor is what you’ll be using to build your workflows. Workflows are made up of individual tasks, which we refer to as “nodes”.

They perform all kinds of operations, for example reading or writing files, transforming data, training models, creating visualizations etc.

Fig. 9: The KNIME Workbench (image by KNIME).

You build your workflow by dragging nodes from the Node Repository to the Workflow Editor, then connecting, configuring, and executing them (see video below).

Fig. 10.: Intro to the KNIME Workbench (image by KNIME).

Nodes and Workflows

In KNIME, individual tasks are represented by nodes. They are the smallest possible unit in KNIME and have been created to perform all sorts of tasks, including reading/writing files, transforming data, training models, creating visualizations, and so on.

Fig. 11.: Different nodes for every data transformation step (image by author).

A sequence of nodes makes up a workflow; a workflow is the graphical equivalent of a script or a series of instructions (see video below).

Fig. 12.: Intro to the Nodes and Workflows (image by KNIME).

The Node Repository

You’ll find it in your KNIME workbench in the bottom left-hand corner. It contains the nodes that can be used in your workflow.
The nodes are organized in categories. Each category represents a specific functionality in data analytics.

Let’s have a look at the different categories:
The IO category contains the nodes you need to access data: file reading and writing in a number of formats, such as CSV, Excel, PMML, images, tables, and more.

Fig. 13: The Node Repository — the “Read” subcategory (image by author).

The Manipulation category contains nodes for filtering, aggregating, and transforming data tables. For column operations, for example, there are a number of column filters, conversions, joining, splitting, and transformation nodes.

To name two of the most frequently used nodes here: the Joiner joins two tables, and the String Manipulation node modifies the content of string-type column cells.

Fig. 14.: The “String Manipulation” and the “Joiner” Node (image by author).
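Conceptually, these two nodes do what the following Python sketch does by hand (hypothetical sample tables; in KNIME you configure all of this in dialogs instead of writing code):

```python
# Two small tables as lists of dicts, one dict per row.
customers = [{"id": 1, "name": "anna"}, {"id": 2, "name": "ben"}]
orders = [{"id": 1, "total": 50}, {"id": 2, "total": 20}]

# Joiner: inner-join the two tables on the "id" column.
orders_by_id = {o["id"]: o for o in orders}
joined = [{**c, **orders_by_id[c["id"]]}
          for c in customers if c["id"] in orders_by_id]

# String Manipulation: modify a string column, e.g. capitalize "name".
for row in joined:
    row["name"] = row["name"].capitalize()
```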

In the Row subcategory there are lots of filter nodes, as well as nodes for aggregation, partitioning, and sampling (see video below).

Fig. 15: Intro to the Node Repository (image by KNIME).

Learning Resources

The Learning Curve

Visual programming also makes the learning curve much gentler than for code-based tools.

If you want to know "Why every Data Engineer should learn a Visual Programming Language", read my following article:

A GUI-based tool can be learned and applied in less time than a code-based tool, freeing up precious time and resources for more important investigations.

Too often I have seen entire months dedicated to learning coding practice before even approaching any data analysis technique. With KNIME, within a few weeks you can already assemble quite complex workflows for data transformation and for training machine learning algorithms.

KNIME Self-Paced Courses

These courses are free and take you from the basics all the way to machine learning. Courses are organized by level: L1 basic, L2 advanced, L3 deployment, L4 specialized.

In each course, you go through lessons with ~5-minute videos, hands-on exercises, and knowledge-check questions.

Fig. 16: Roadmap of Self-Paced Courses (image by KNIME).

KNIME Cheat Sheets

Cheat sheets are very helpful for beginners and make it quick to find the desired information about a node's function.

KNIME Community Hub

The KNIME Community Hub is the public repository for the KNIME community. Here, you can share your workflows and download workflows by other KNIME users. Just type in the keywords and you will get a list of related workflows, components, extensions, and more. It is a great place to start with plenty of examples!

For example, just type "basic" or "beginners" into the search box and you will get a list of example workflows illustrating basic concepts in KNIME Analytics Platform; type "read file" and you will get a list of example workflows illustrating how to read CSV files, .table files, Excel files, etc. Note that a subset of these example workflows is also available on the EXAMPLES server in the KNIME Explorer panel in the top left corner of the KNIME workbench.

Fig. 17.: search the topic “read file” on the KNIME Community Hub (image by KNIME).

Once you have found the workflow of interest, click on it to open its page, then download it or open it in your own KNIME Analytics Platform. Once it is in your local workspace, you can start adapting it to your data and your needs. Following the popular trend in programming of searching for ready-to-use pieces of code, you can simply download, reuse, and adapt workflows or pieces of workflows from the KNIME Hub to your own problem.

Books about KNIME

Books for becoming successful and efficient with KNIME. They cover beginner and advanced topics, plus how to transition from Alteryx, Excel, SAS, and SPSS for users who already have experience with a similar platform or tool.

Fig. 18: Transition Booklets for experienced users with similar tools (image by KNIME).

KNIME TV Channel on YouTube

Be sure to check out also the KNIME TV Channel on YouTube. With a wide range of tutorials, webinars, and other resources, this channel is an invaluable resource for anyone looking to master KNIME. Whether you’re a beginner just getting started with the platform or an experienced data scientist, the KNIME TV Channel has something for you.

KNIME on Medium

On Medium, the Low Code for Data Science publication features successful data stories, data science theory, tips & tricks to get you started with KNIME, and more. Best of all, it collects articles written by the community for the community.

Find out how to contribute and share your stories here.

Data Access with KNIME

In short, it is possible to read all kinds of data sources into KNIME.
I have extensively explained this topic in the following article:

Import flat files

Whether you need to import flat files, like:
- text or CSV files
- Excel files
- SAS files
- SPSS files

Fig. 19: The ‘CSV Reader’ and the ‘Excel Reader’ data access nodes in KNIME (image by author).

Access relational Databases

or need to query any relational database, there are special DB connectors for:

- MySQL
- Oracle
- SQLite
- Snowflake
- PostgreSQL
- H2
- Microsoft Access
- Microsoft SQL Server

Fig. 20: DB connectors in KNIME (image by KNIME-Cheat Sheets).

It is also possible to connect to cloud-based databases like Google BigQuery and Amazon Redshift.

Fig. 21: Cloud-DB connectors in KNIME (image by KNIME-Cheat Sheets).

To query a database, you have two options. Either you write the SQL code directly in a node, as usual, …

Fig. 22: Querying a database in KNIME with SQL (image by author).

or, for those less experienced in SQL, the query can be built directly with the appropriate DB nodes.

Fig. 23: Querying a database in KNIME with the DB Nodes (image by author).
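Either way, the database ultimately receives SQL; the DB nodes simply assemble the statement for you step by step. A rough standalone sketch of the two routes against SQLite (table and column names invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (region TEXT, total INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("EU", 10), ("EU", 5), ("US", 7)])

# Route 1: write the SQL yourself, as you would in a query node.
manual = con.execute(
    "SELECT region, SUM(total) FROM orders "
    "GROUP BY region ORDER BY region").fetchall()

# Route 2: let small helper steps assemble the statement, the way the
# DB nodes compose a query incrementally without hand-written SQL.
def group_by(table, key, agg_col):
    return (f"SELECT {key}, SUM({agg_col}) FROM {table} "
            f"GROUP BY {key} ORDER BY {key}")

assembled = con.execute(group_by("orders", "region", "total")).fetchall()
```

Both routes produce the same result set; the choice is purely about whether you prefer writing SQL or clicking nodes together.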

Access NoSQL Databases

Finally, NoSQL databases such as MongoDB can also be accessed in KNIME via a corresponding node.

Fig. 24: Loading JSON dataset in MongoDB (image by author).

And the best part is that you can combine all data sources together, as Rosaria Silipo and Lada Rudnitckaia show in their example workflow in the e-book “Will they blend?”.

The six databases are: MySQL, MongoDB, MS SQL Server, MariaDB, Oracle and PostgreSQL, and the corresponding workflow can be seen in the following image.

Fig. 25: Database Jam Session in KNIME (image by Rosaria Silipo).

Machine Learning

KNIME provides a wide range of machine learning algorithms. Some of the key machine learning algorithms available in KNIME include:

  1. Supervised learning algorithms: popular algorithms such as Linear, Polynomial, and Logistic Regression, Generalized Linear Models (GLM), Regression Trees, Random Forest, Support Vector Machines (SVMs), and Neural Networks.
  2. Unsupervised learning algorithms: such as K-Means Clustering, Principal Component Analysis (PCA), and Hierarchical Clustering.
  3. Semi-supervised learning algorithms: such as Self-Organizing Maps (SOMs) and Expectation Maximization (EM).
  4. Ensemble learning algorithms: such as Bagging, Boosting, and Random Subspace.
  5. Deep Learning algorithms: such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).

All of these algorithms can be easily integrated into KNIME workflows via the corresponding nodes, allowing users to build complex machine learning models quickly and easily.

Fig. 26: KNIME cheat sheet for Machine Learning (image by KNIME).

Additionally, KNIME provides many pre-built nodes to simplify the process of building models, such as feature selection, normalization, and cross-validation.
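To illustrate what a cross-validation node does under the hood, here is a minimal k-fold index split in plain Python (a conceptual sketch only; in KNIME this is a configurable node and requires no coding):

```python
def kfold_indices(n_samples, k):
    """Split the indices 0..n_samples-1 into k (train, test) folds."""
    indices = list(range(n_samples))
    # Distribute any remainder over the first folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds

# Each of the 5 folds holds out a different fifth of the 10 samples.
folds = kfold_indices(10, 5)
```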

In the following short video you can see an example of a regression tree learner.

Fig 27.: Example of Regression tree learner (image by KNIME).

With KNIME, users can quickly and easily build, evaluate, and deploy machine learning models with minimal coding required.

Extensions for Python, R and more

KNIME is a highly flexible and extensible platform that integrates with a variety of programming languages, including Python and R, to provide users with even greater power and versatility.

With the integration of Python and R, data scientists can use their preferred programming language to build and deploy models directly within the KNIME environment. This means that they can take advantage of the vast array of packages and libraries available in these languages, including TensorFlow, scikit-learn, and others, to create complex models and workflows.

Coding in Python

Starting with KNIME Analytics Platform 4.7, the Python Integration comes pre-installed with a selection of packages (i.e., Python libraries) as a bundled environment. This means you can use the Python Script node without needing to install, configure, or even know about environments.

Fig. 28: Python Script Node in KNIME (image by KNIME).
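Below is a minimal sketch of what a Python Script node body might look like. The commented lines show the `knime.scripting.io` API that KNIME 4.7+ exposes to the script; the transformation itself runs on a plain list here so the sketch also works outside KNIME, and the VAT rate and column names are made up for illustration:

```python
# Inside KNIME, the node's input and output ports are accessed like this:
#   import knime.scripting.io as knio
#   df = knio.input_tables[0].to_pandas()
#   ... transform df ...
#   knio.output_tables[0] = knio.Table.from_pandas(df)

# Stand-in for the input table: (product, price) rows.
df_rows = [("apple", 1.2), ("banana", 0.5), ("cherry", 3.4)]

# The script logic itself: add a gross-price column (8% VAT, assumed).
VAT = 0.08
out_rows = [(name, price, round(price * (1 + VAT), 2))
            for name, price in df_rows]
```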

Add Additional Custom Packages

If you need a Python package that is not available in the bundled environment, there is a solution for this as well.
The Conda Environment Propagation node enables you to snapshot your Python environment, whether that means the installed packages or simply the environment name, and "propagate" that environment to any new execution location where the Conda tool is also installed (any system with Anaconda installed certainly qualifies).

Fig. 29: Use custom Python packages using the Conda Environment Propagation node (image by KNIME).

In the node configuration dialog, you can select the required environment and the packages to be available at the execution locations.

Fig 30: Configuration Dialog for Conda Environment Propagation node.

Get Started with the Python Script Space

On the KNIME Community Hub, you will find the Python Script Space, which contains example workflows to help you quickly learn how to use the Python Script node in your own workflows. It is aimed especially at KNIME users who are keen to use Python scripts inside KNIME.

Fig 31: The Python Script space in the KNIME Hub (image by KNIME).

For more details on coding with Python in KNIME, I recommend the following articles.

Conclusion

KNIME is a highly powerful and versatile tool for data science that has become increasingly popular in recent years.

With its user-friendly interface, extensive library of algorithms and extensions, and ability to integrate with programming languages like Python and R, KNIME is a complete solution for data scientists of all levels of expertise.

Whether you are working on a simple data preprocessing task or building complex deep learning models, KNIME has the tools you need to get the job done.

Thanks for reading and may the Data Force be with you!
Please feel free to share your thoughts or reading tips in the comments.

If you enjoy reading stories like these and want to support me as a writer, consider signing up to become a Medium member.

It’s $5 a month, giving you unlimited access to thousands of Data science articles. If you sign up using my link, I’ll earn a small commission with no extra cost to you.

Follow me on Medium, LinkedIn, or Twitter, and follow my Facebook Group "Data Science with Yodime".


Dennis Ganzaroli
Low Code for Data Science

Data Scientist with over 20 years of experience. Degree in Psychology and Computer Science. KNIME COTM 2021 and Winner of KNIME Best blog post 2020.