Exploratory v6.2 Released!

Kan Nishida
Published in learn data science · Oct 13, 2020

I’m super excited to announce Exploratory v6.2! 🎉🎉🎉

As always, we have a bunch of new features and enhancements. Here’s a quick overview of the following areas:

  • Summary View
  • Analytics
  • Chart
  • Data Wrangling
  • Dashboard
  • Parameter

Summary View

Reference lines for Mean & Median

Now you can see the mean and the median values as reference lines on top of the histogram charts for numerical columns.
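Under the hood, Exploratory runs on R, so if you ever want to reproduce this kind of view in plain R, here is a minimal sketch with ggplot2, assuming a hypothetical data frame df with a numeric Income column:

    library(ggplot2)

    # Histogram of a numeric column with mean and median reference lines
    ggplot(df, aes(Income)) +
      geom_histogram(bins = 30) +
      geom_vline(xintercept = mean(df$Income, na.rm = TRUE), linetype = "dashed") +   # mean
      geom_vline(xintercept = median(df$Income, na.rm = TRUE), linetype = "dotted")   # median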

Analytics

XGBoost

Finally, we have added XGBoost to the Analytics view. 🔥

We already had XGBoost available as a step, but having it under the Analytics view makes it not only easier to build models with XGBoost but also much easier to get insights from them, thanks to the grammar-based tabs with rich visualizations that are common across all the machine learning / statistical learning models.

For example, here is the ‘Prediction’ tab, which shows how the predicted values would change as the values of a given variable change.

Learning

One thing that is unique to XGBoost is the ‘Learning’ tab, which shows how the prediction model improves as it adds more trees.

For example, the chart above shows that the model quality, judged by ‘Negative Log Likelihood’ (smaller is better), keeps improving as more trees are added with the training data (Blue line).

However, you can also see that it stops improving after about the 10th tree with the validation data (Orange line).

You can configure how you want the learning to stop from the property.
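For reference, the behavior described above is roughly what you would get by fitting the same kind of model directly with the xgboost R package, watching a training set and a validation set and stopping early once the validation metric stalls. A minimal sketch, assuming hypothetical numeric matrices X_train / X_valid and labels y_train / y_valid:

    library(xgboost)

    dtrain <- xgb.DMatrix(data = X_train, label = y_train)
    dvalid <- xgb.DMatrix(data = X_valid, label = y_valid)

    model <- xgb.train(
      params = list(objective = "binary:logistic", eta = 0.3, max_depth = 6),
      data = dtrain,
      nrounds = 200,
      watchlist = list(train = dtrain, validation = dvalid),   # the two lines in the Learning tab
      early_stopping_rounds = 10,   # stop once validation hasn't improved for 10 rounds
      verbose = 0
    )

    model$evaluation_log   # per-tree training / validation metric (logloss here)
    model$best_iteration   # the tree count where the validation quality peaked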

Time Series Forecasting — ARIMA

We have also finally added ARIMA to our Time Series Forecasting family. 🔥

Here, I have run ARIMA (Auto ARIMA) to forecast the electricity price for the next 12 months for each of four states: California, Florida, New York, and Texas.

Unlike Prophet, which is another Time Series Forecasting algorithm, ARIMA requires more knowledge about how it builds the model.

But ARIMA has an automatic model building capability called ‘Auto ARIMA’, which is turned on by default. So you can start building ARIMA-based models and gaining insights from the Seasonality and Trend tabs even without knowing the details.

If you know more about ARIMA, you can manually fine-tune the model from the property, with a series of diagnostics presented under the various tabs.

In the picture above, you can see that it does a pretty good job for California (top left) and Texas (bottom right), but not so good for the other two states.

Just for reference, here is the same forecast done by Prophet. As you can see, it does a pretty good job of fitting the model for all the states.
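If you are curious about the plain-R equivalents, the same kinds of models can be fit with the forecast and prophet packages. A minimal sketch, assuming a hypothetical data frame df with a monthly date column ds and a price column y for a single state:

    library(forecast)
    library(prophet)

    # Auto ARIMA: search the (p, d, q) orders automatically, then forecast 12 months ahead
    ts_price <- ts(df$y, frequency = 12)
    fit <- auto.arima(ts_price)
    fc  <- forecast(fit, h = 12)

    # Prophet: expects the columns to be named 'ds' (date) and 'y' (value)
    m      <- prophet(df)
    future <- make_future_dataframe(m, periods = 12, freq = "month")
    pred   <- predict(m, future)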

Hypothesis Test — t Test

The t Test is used to see whether the difference between the means of two groups is significant. Typically, you judge this by comparing the P-Value against a threshold.

Here, I’ve run the t Test to see if the difference in the mean Income between female employees and male employees is significant for each Job Role.

By looking at the P-Value column, we can see that it is significant only for the Research Director (at the bottom) if we take 5% (0.05) as the threshold.

But often, the P-Value alone is not enough to determine the significance because it depends on both the sample size and the size of the difference. And what the threshold value of the P-Value should be can become a never-ending argument.

So there are two ways to address this problem.

One is to report the Effect size and the Power value, which can be found under the same Summary tab.

Another is to use the confidence interval of the difference, which can also be found under the same Summary tab.

With the v6.2 release, you can see the confidence interval visualized with an Error Bar chart under the new ‘Difference’ tab.

If the line includes 0, it means that the real difference could be 0 (at the 95% confidence level), hence there might not be any difference at all.

Technically, the confidence interval (range) becomes narrower when you have more data (more employee data). So, if you wanted to and could, you could keep adding data until the result becomes significant. But that is called ‘P hacking’.

You want to make sure that you have enough data (or enough power) for your test, but you don’t want to cheat by adding a crazy amount of data.

This is why the confidence interval is useful. It helps you visualize how big the difference is and gives you more context than just whether it is significant or not.
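If you want to check the same numbers in plain R, here is a minimal sketch for a single Job Role, assuming a hypothetical data frame df with a numeric Income column and a two-level Gender column; the effect size here is a simple hand-rolled Cohen’s d, and the power comes from the pwr package, as stand-ins for what the Summary tab reports:

    # Welch two sample t-test
    tt <- t.test(Income ~ Gender, data = df)
    tt$p.value    # the P-Value, to compare against the 0.05 threshold
    tt$conf.int   # the 95% confidence interval of the difference, shown in the 'Difference' tab

    # A rough effect size (Cohen's d) and the corresponding power
    grp <- split(df$Income, df$Gender)
    d   <- abs(mean(grp[[1]]) - mean(grp[[2]])) / sqrt((var(grp[[1]]) + var(grp[[2]])) / 2)

    library(pwr)
    pwr.t.test(n = min(lengths(grp)), d = d, sig.level = 0.05)   # solves for the power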

Chart

Confidence Interval for Line Chart

Now you can enable ‘Confidence Interval’ to show the range along with the line when the Y-Axis is either Mean or Ratio.

Mean

Ratio

When you have multiple lines with Color or Repeat By, it will show the confidence interval for each line.
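The confidence interval around a mean line is essentially the familiar standard-error band. A minimal sketch of the same calculation with dplyr, assuming a hypothetical data frame df with Month and Income columns:

    library(dplyr)

    # Mean Income per Month with an approximate 95% confidence interval
    df %>%
      group_by(Month) %>%
      summarize(
        avg   = mean(Income, na.rm = TRUE),
        se    = sd(Income, na.rm = TRUE) / sqrt(n()),
        lower = avg - 1.96 * se,
        upper = avg + 1.96 * se
      )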

Error Bar — Ratio

We are introducing a ‘Ratio by’ control for when the Calculation Type is ‘Ratio’, which you can use to control the denominator of the ratio.

For example, here I’m assigning a logical data type column ‘Attrition’ and selecting ‘Number of True’ as the calculation.

Since I’ve selected ‘X-Axis’ as the denominator, each error bar is showing the ratio of TRUE for each Job Role, which is in fact the attrition rate.
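In plain R terms, this ratio is just the proportion of TRUE values per group, and the error bar is its confidence interval. A rough sketch with dplyr, assuming hypothetical JobRole and logical Attrition columns:

    library(dplyr)

    # Attrition rate (ratio of TRUE) per Job Role with an approximate 95% confidence interval
    df %>%
      group_by(JobRole) %>%
      summarize(
        n     = n(),                                # the denominator: all rows for the X-Axis value
        rate  = sum(Attrition, na.rm = TRUE) / n,   # 'Number of TRUE' divided by the total
        se    = sqrt(rate * (1 - rate) / n),
        lower = rate - 1.96 * se,
        upper = rate + 1.96 * se
      )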

Remove NA Level

The Factor data type is a special data type that looks like the Character data type but can carry order (or level) information. For example, month names are better set as Factor rather than Character because they have a natural order (Jan, Feb, Mar, etc.).

When you assign such a Factor column to the X-Axis, not only does it preserve the order so that it starts with ‘Jan’ and ends with ‘Dec’, but it also keeps all 12 bars for all the month names even when some months have no values at all.

But sometimes you might not want to display the levels (or values) that have no data. Now you can hide such levels from the ‘Missing Value Handling’ dialog.

You can control whether to show or hide such empty levels.
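This maps directly to how factors work in R: a factor keeps its defined levels even when no rows use them, and the empty levels can be dropped on demand. A minimal sketch, assuming a hypothetical character Month column:

    # Turn a character Month column into a Factor with the full Jan..Dec order
    df$Month <- factor(df$Month, levels = month.abb)

    table(df$Month)               # keeps all 12 levels, even the months with no rows
    table(droplevels(df$Month))   # drops the empty levels, like hiding them in the chart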

Data Wrangling

Disable Tokens inside Step

When I get errors or unexpected results while cleaning or transforming data, I often get confused because I don’t know where the problem starts. This happens especially when there are many tokens (operations) within the same step, such as a ‘Create Calculation’ step.

To address this problem, we are now supporting ‘Disable’ for each token.

So you can now disable any of the tokens and see whether the step works as expected without them.

As you might have guessed, this can be useful for the Filter step as well.

Duplicate Tokens inside Step

You can duplicate (copy) existing tokens by clicking the Duplicate icon, for example to create similar calculations.

Assigning Values based on Multiple Conditions

Sometimes you want to assign values based on conditions. You can use the ‘case_when’ function to do this, but the syntax might look a bit too complicated.

So we’ve built a new UI for this.

Select ‘Replace Values’ and ‘By Setting Conditions’ from the column header menu.

Then you can start constructing the conditions and the value assignments.

To construct the conditions you can click the button to open the same dialog that is used for the Filter.
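For reference, this is roughly the kind of case_when expression the dialog saves you from writing by hand. A minimal sketch with dplyr, assuming a hypothetical Income column being banded into labels:

    library(dplyr)

    # Assign values based on multiple conditions; the first matching condition wins
    df <- df %>%
      mutate(Income_Level = case_when(
        Income >= 10000 ~ "High",
        Income >= 5000  ~ "Middle",
        TRUE            ~ "Low"      # fallback when no condition matches
      ))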

Dashboard

You can now adjust the height of each row section when ‘Fit to Screen’ is not checked.

For example, I have the 2nd row of the dashboard showing commentary text, but the default height is too tall for the amount of text I have.

Now you can change the height in the Edit mode.

And you can set a different height for each row.

Parameter

Dynamic Values for Dropdown and Slider

You can now get the values for the List of Values (Dropdown) and the Slider parameter types dynamically from any data frame.

This used to be ‘Copy’ only, which means that once you had copied the values, they stayed the same.

With this new ‘Get Values from Data Frame’ option, the values are dynamically extracted from a specified data frame.

And it supports not only the List of Values but also the Slider, where numerical values set the min and the max.

The great thing about this ‘dynamic’ way of getting the values is that they will always reflect the latest data, even after the underlying data has been updated by re-importing new data.

And this means that when you publish to the server and schedule your dashboard (or chart, analytics, data, etc.), the underlying data referenced by the parameters will also get scheduled (if necessary) behind the scenes, so that you will always see up-to-date values for the dropdown and the slider.
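Conceptually, the dynamic option re-evaluates simple lookups against the referenced data frame each time, instead of storing a frozen copy. A rough sketch of the equivalent lookups in R, assuming hypothetical JobRole and Income columns:

    sort(unique(df$JobRole))          # what a dynamic List of Values (Dropdown) would pick up
    range(df$Income, na.rm = TRUE)    # the min and max a dynamic Slider would use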

That’s all!

But, we have many more enhancements and bug fixes in this release. Don’t forget to check out the release note for the full list!

And, download Exploratory v6.2 from the download page today!

Cheers,

Kan, CEO/Exploratory

Try Exploratory!

If you don’t have an Exploratory account yet, sign up from our website for a 30-day free trial, no credit card required!

If you happen to be a current student or teacher at a school, it’s free!
