
SAP's recent announcement of a strategic partnership with Databricks has generated significant excitement among SAP customers. Databricks, the data and AI company, presents a compelling opportunity to add analytics and ML/AI capabilities by integrating SAP HANA with Databricks. Given the immense interest in this collaboration, we are thrilled to embark on a deep-dive blog series.

In many customer scenarios, an SAP HANA system serves as the central data foundation, consolidating data from various source systems, including SAP CRM, SAP ERP/ECC, and SAP BW. Now the exciting possibility arises to seamlessly integrate this robust SAP HANA analytical sidecar system with Databricks, further enhancing the organization's data capabilities. By connecting SAP HANA (with an SAP HANA Enterprise Edition license) to Databricks, businesses can leverage the advanced analytics and machine learning capabilities of Databricks (such as MLflow, AutoML, and MLOps) while harnessing the rich, consolidated data stored in SAP HANA. This integration opens up a world of possibilities for organizations to unlock valuable insights and drive data-driven decision-making across their SAP systems.

Multiple approaches are available to federate SAP HANA tables, SQL views, and calculation views into Databricks. The quickest, however, is SparkJDBC, whose most significant advantage is support for parallel JDBC connections from Spark worker nodes to the remote HANA endpoint.
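To actually parallelize the read, Spark's JDBC source needs the standard partitioning options (partitionColumn, lowerBound, upperBound, numPartitions); without them, the read runs over a single connection. A minimal sketch that assembles such options as a plain dictionary — the helper itself, along with the column name and bounds in the usage comment, is illustrative rather than part of the Spark API:

```python
def partitioned_jdbc_options(url, table, partition_column,
                             lower_bound, upper_bound, num_partitions):
    """Build Spark JDBC options for a parallel read against SAP HANA.

    Each of the num_partitions tasks opens its own JDBC connection and
    reads one slice of [lower_bound, upper_bound] on partition_column
    (which must be a numeric, date, or timestamp column).
    """
    return {
        "driver": "com.sap.db.jdbc.Driver",
        "url": url,
        "dbtable": table,
        "partitionColumn": partition_column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
    }

# On a Databricks cluster (column and bounds are hypothetical):
# df = (spark.read.format("jdbc")
#       .options(**partitioned_jdbc_options(
#           "jdbc:sap://<host>:39015/?autocommit=false",
#           "ECC_DATA.VBAP", "POSNR", 1, 999999, 8))
#       .load())
```

Choosing a partition column with evenly distributed values keeps the per-connection slices balanced.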

Architecture
Figure 1: Architecture

Let us start with SAP HANA and Databricks integration

First, we installed SAP HANA 2.0 in the Azure cloud and tested the integration with Databricks.

Installed SAP HANA info in Azure:

Version: 2.00.061.00.1644229038
Branch: fa/hana2sp06
Operating System: SUSE Linux Enterprise Server 15 SP1

Here is the high-level workflow depicting the different steps of this integration.

Please see the attached notebook for more detailed instructions on extracting data from SAP HANA calculation views and tables into Databricks using SparkJDBC.

SAP HANA
Figure 2: Steps to access SAP HANA data into Databricks

Configure the SAP HANA JDBC driver jar (ngdbc.jar) on the cluster as shown in the image below.

SAP HANA JDBC Jar
Figure 3: SAP HANA JDBC Jar

Once the above steps are complete, perform a Spark read using the SAP HANA server address and JDBC port.

SAP HANA table
Figure 4: SAP HANA table

Start creating the DataFrames using the snippet shown below, specifying the schema and table name.

df_vbap_table = (spark.read
    .format("jdbc")
    .option("driver", "com.sap.db.jdbc.Driver")
    .option("url", "jdbc:sap://20.XX.XXX.XXX:39015/?autocommit=false")
    .option("dbtable", "ECC_DATA.VBAP")
    .option("user", "******")      # use Databricks secrets instead of hardcoding credentials
    .option("password", "******")  # use Databricks secrets instead of hardcoding credentials
    .load()
)
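As the comments note, credentials belong in Databricks secrets rather than in code. A minimal sketch of pulling them from a secret store; the scope and key names ("sap-hana", "username", "password") are hypothetical, and the secret getter is passed in as a parameter so the helper can also be exercised outside Databricks:

```python
def hana_connection_options(get_secret, host, port=39015):
    """Build Spark JDBC connection options for SAP HANA, fetching
    credentials via the supplied secret getter instead of literals.

    get_secret should have the signature of dbutils.secrets.get,
    i.e. get_secret(scope, key) -> str.
    """
    return {
        "driver": "com.sap.db.jdbc.Driver",
        "url": f"jdbc:sap://{host}:{port}/?autocommit=false",
        "user": get_secret("sap-hana", "username"),
        "password": get_secret("sap-hana", "password"),
    }

# On Databricks, pass the real secrets API:
# opts = hana_connection_options(dbutils.secrets.get, "20.XX.XXX.XXX")
# df = spark.read.format("jdbc").options(**opts, dbtable="ECC_DATA.VBAP").load()
```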

We can also push filters down to SAP HANA by passing a SQL statement in the dbtable option.

.option("dbtable", "(select country, sales from ECC_DATA.VBAP where country = 'USA')")
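As a sketch, the pushdown subquery string can be assembled from a column list, table name, and WHERE clause; the helper below is illustrative (not part of the Spark API), and the column and filter values mirror the example above:

```python
def pushdown_query(columns, table, where_clause):
    """Wrap a filtered SELECT in parentheses so Spark's JDBC source
    ships the projection and WHERE clause to SAP HANA, instead of
    fetching the whole table and filtering in Spark."""
    return f"(select {', '.join(columns)} from {table} where {where_clause})"

# .option("dbtable", pushdown_query(["country", "sales"],
#                                   "ECC_DATA.VBAP", "country = 'USA'"))
```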

To get data from a calculation view, we do the following.
For example, this XS classic calculation view is created in the internal schema "_SYS_BIC".

Calculation View
Figure 5: Calculation View

This code snippet creates a PySpark DataFrame named "df_sap_ecc_hana_cv_vbap" and populates it from a calculation view in the SAP HANA system (in this case, CV_VBAP).

calc_view = '"_SYS_BIC"."ecc-data-cv/CV_VBAP"'

df_sap_ecc_hana_cv_vbap = (spark.read
    .format("jdbc")
    .option("driver", "com.sap.db.jdbc.Driver")
    .option("url", "jdbc:sap://20.XXX.XXX.XXX:39015/?autocommit=false")
    .option("dbtable", calc_view)
    .option("user", "*******")      # use Databricks secrets instead of hardcoding credentials
    .option("password", "*******")  # use Databricks secrets instead of hardcoding credentials
    .load()
)
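Because calculation views live in the _SYS_BIC schema under a package path, the identifier needs double quotes around both the schema and the package/view part. A small hypothetical helper for building that string, matching the calc_view value used above:

```python
def calc_view_identifier(package, view_name, schema="_SYS_BIC"):
    """Build the fully quoted SAP HANA calculation view identifier,
    e.g. '"_SYS_BIC"."ecc-data-cv/CV_VBAP"'. Quoting both parts keeps
    case and special characters like '-' and '/' intact."""
    return f'"{schema}"."{package}/{view_name}"'

# calc_view = calc_view_identifier("ecc-data-cv", "CV_VBAP")
```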

After generating the PySpark DataFrames, you can leverage Databricks' capabilities for exploratory data analysis (EDA) and machine learning/artificial intelligence (ML/AI).

Summarizing the above DataFrames:

Dataframes from SAP HANA
Figure 6: Summary of DataFrames from SAP HANA

This blog focuses on SparkJDBC for SAP HANA, but alternative methods such as FedML, hdbcli, and hana_ml are available for similar purposes.

Download Notebook
