How to Connect Pentaho Data Integration to SAP HANA


Recently I had to connect Pentaho Data Integration to SAP HANA and I made some notes along the way:

The first step is to get the SAP HANA JDBC driver, a file called ngdbc.jar. The quickest way is to download and install the SAP HANA client from the SAP Software Download Center, and then extract the file from your installation directory (C:\Program Files\sap\hdbclient\ on Windows, /usr/sap/hdbclient/ on Linux).

If you don’t have a user account for the SAP Software Download Center, you’ll need to get the driver from some other source. For example:

  1. Install the SAP HANA tools into Eclipse (oh, yeah, you may need to install Eclipse first).
  2. Look for the Eclipse plugin files that were installed with the SAP HANA tools (mine were in <home_folder>/.p2/pool/plugins).
  3. Look for any jar file with jdbc in its name. I found one called com.sap.ndb.studio.jdbc_2.3.5.jar.
  4. Change the jar extension to zip and unpack it. Cross your fingers.
  5. Look for a lib directory and see if the ngdbc.jar file is in it.
  6. Repeat steps 4 and 5 with the next candidate jar until you find the file.
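
If renaming and unpacking jars by hand gets tedious, steps 3 to 6 can be scripted. What follows is only a rough sketch, assuming Python 3 is available: the function names are mine, and it looks just one level deep inside each jar, so it may miss a driver that is nested further down. A jar is simply a zip archive, so the standard zipfile module can read it without renaming anything.

```python
import zipfile
from pathlib import Path


def find_jars_containing(plugins_dir, target="ngdbc.jar"):
    """Return (outer_jar, entry_name) pairs for jars under plugins_dir holding target.

    Hypothetical helper: only looks one level deep inside each jar.
    """
    hits = []
    for jar in sorted(Path(plugins_dir).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                for name in archive.namelist():
                    # Compare the file name only, ignoring folders such as lib/
                    if name.rsplit("/", 1)[-1] == target:
                        hits.append((jar, name))
        except zipfile.BadZipFile:
            # Not a readable archive; skip it and keep searching
            continue
    return hits


def extract_entry(outer_jar, entry_name, dest_dir):
    """Copy one entry out of outer_jar into dest_dir and return the new path."""
    dest = Path(dest_dir) / Path(entry_name).name
    with zipfile.ZipFile(outer_jar) as archive:
        dest.write_bytes(archive.read(entry_name))
    return dest
```

Point find_jars_containing at your plugins folder (mine was <home_folder>/.p2/pool/plugins), then pass one of the hits to extract_entry along with the folder where you want ngdbc.jar to land.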

(I don’t get why SAP wants to hinder development for their platform by making it almost impossible to get their JDBC driver… but anyways…)

Go to the folder where you installed Pentaho Data Integration and copy your recently obtained ngdbc.jar file into the lib directory.

Restart (or start) Pentaho Data Integration.

Pentaho Enterprise Edition: If you have the Enterprise Edition of Pentaho Data Integration, doing a bulk load in SAP HANA is pretty straightforward. Just follow the instructions here.

Pentaho Community Edition: I believe Pentaho doesn’t provide the SAP HANA bulk load plugin for you. In that case, you need to set up a Generic Database connection and use a regular table output step.

Create a new transformation, and then create a new connection.


Per this page, we learn that the connection string for a SAP HANA connection is

jdbc:sap://<server>:<port>

Where the port is

3<instance_number>15

So if your instance number is 10, the port would be 31015.
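
To make the port rule concrete, here is a small sketch that builds the connection string from a server name and an instance number. The function names and the example host are made up for illustration, and it assumes the instance number is always written with two digits (so instance 0 becomes 00), which the rule above does not state explicitly.

```python
def hana_sql_port(instance_number):
    """Apply the 3<instance_number>15 rule from the post.

    Assumption: the instance number is zero-padded to two digits.
    """
    if not 0 <= instance_number <= 99:
        raise ValueError("instance number must be between 0 and 99")
    return int(f"3{instance_number:02d}15")


def hana_jdbc_url(server, instance_number):
    """Build a jdbc:sap://<server>:<port> connection string."""
    return f"jdbc:sap://{server}:{hana_sql_port(instance_number)}"


# hana.example.com is a placeholder host name
print(hana_jdbc_url("hana.example.com", 10))  # jdbc:sap://hana.example.com:31015
```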

The custom driver class name is com.sap.db.jdbc.Driver


Fill in the form, click Test, then OK, and you’re good to go!

You may now use this connection to read/write data from/to SAP HANA.

I tested this with PDI 4.4, 5.4 and 6.1. Help me keep this post updated by pinging me on Twitter if any of this is no longer valid.

 

Nate Silver And The Age of Data Journalism


A few days ago, the new version of Nate Silver’s FiveThirtyEight went live, backed by ESPN.

According to Silver’s observations, explained in his site’s manifesto, the market is ripe for data-oriented journalism.

I totally agree. A day doesn’t go by in which I don’t hear or read an argument that painfully drags along because of its lack of data. Facebook, Twitter and the traditional media are full of these poorly documented ideas and debates.

I don’t know how it is in other countries, but where I come from, we engineers chose our careers because we were socially awkward and/or bad with words, whereas others chose mass communications because they were poor at math. Turns out that basic math and the arithmetic rule of three are insufficient tools for explaining complex phenomena in the real world.

One of the first posts on FiveThirtyEight was precisely about my home country, and it coldly and matter-of-factly explains the current political crisis using nothing but numbers.

I think that you either believe in science (and the importance of science) or you don’t. And it’s a dogma. People from one camp just cannot have arguments with the other because they live in different realities. “Do you believe in science? Do you know how to spot a tainted poll?” should be the starting questions for any debate.

And if you believe in science and want to have intelligent discussions or want to report the issues, then you’d better take a look at the numbers and take the time to understand them. We are living in a world with a fantastic overabundance of data, in which public databases are just a click away, tools like Excel and R allow you to do statistical analysis, and sites like FiveThirtyEight digest and explain what the data says in (somewhat) easier terms. There’s no excuse, besides trolling, for being an ill-prepared journalist or discussing well-documented issues using false premises.

 

Moneyballing criminal justice

One of the problems of justice systems everywhere is that they depend on subjectivity and have near-zero data-mining expertise. Because of that, tons of money is wasted keeping low-risk offenders in jail.

As the attorney general for New Jersey, Anne Milgram changed the landscape of her state’s criminal justice system. By applying statistics to create projections, she devised a dashboard to single out the worst offenders and make sure that they were prosecuted. By applying Moneyball concepts, her methods minimized subjective decisions, lowered costs and optimized the justice system.