Key Driver Analysis (KDA)
The built-in key driver analysis (KDA) feature was removed in version 5.0.0. All existing KDA cells remain in your existing canvases as read-only cells. You can no longer re-run those cells or create new KDA cells, but you can still export the .csv file. We will update this page with an alternative to KDA in the new release.
From KDA to PopDelta
We've made the code for our previously built-in KDA feature available as the
PopDelta Python library.
PopDelta is a data mining library for Python that processes and analyzes pandas DataFrames, which this page also calls populations. Its primary objective is to identify underlying patterns in the data. The library can characterize frequent patterns in a single DataFrame or compare two DataFrames, which is useful for tasks such as contrasting customer groups or analyzing how values shift over time by comparing cohorts.
!pip install popdelta
from popdelta.pop_delta import PopDelta
# Create two datasets from a DataFrame, df
over_40 = df[df["age"] > 40]
under_40 = df[df["age"] <= 40]
# Set target variable for weighting to "age"
# Set 3 bins for discretizing numerical attributes
# No predefined string attributes
popDeltaW2 = PopDelta(target="age", num_bins=3, string_attributes=[])
# Display results
for result in popDeltaW2.process_batch([over_40, under_40]):
    print(result)
The KDA cell allows you to automatically discover fundamental patterns/drivers in your data.
You can use it to characterize a single population, or to find the differences between two populations.
- 1 Pop: Find frequent co-occurring patterns in H1B applicants.
- 2 Pop: Find which patterns distinguish applicants who have been certified from those who have been denied an H1B.
Some common use cases include quickly comparing two groups of customers, or comparing how values change over time by comparing two time-based cohorts.
1 Population: Select a dataset or dataframe output that represents the population of interest to profile.
2 Population: Select a "Test" dataframe that represents your focus population, and a second, separate "Control" dataframe that is the reference population to contrast it with. You will see which characteristics are over- or under-represented in your "Test" population relative to your "Control."
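To make the Test versus Control contrast concrete, here is a minimal sketch in plain Python. It is not PopDelta's or the KDA cell's implementation; the function names and the toy city data are hypothetical. It compares the share of each feature value in the two populations and reports the gap.

```python
from collections import Counter

def value_shares(rows, feature):
    """Return {value: fraction of rows holding that value} for one feature."""
    counts = Counter(row[feature] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def representation_gap(test_rows, control_rows, feature):
    """Share in Test minus share in Control, per feature value."""
    test = value_shares(test_rows, feature)
    control = value_shares(control_rows, feature)
    values = sorted(set(test) | set(control))
    return {v: test.get(v, 0.0) - control.get(v, 0.0) for v in values}

# Hypothetical toy populations (illustration only)
test_rows = [{"city": "Berlin"}, {"city": "Berlin"}, {"city": "Paris"}, {"city": "Rome"}]
control_rows = [{"city": "Berlin"}, {"city": "Paris"}, {"city": "Paris"}, {"city": "Paris"}]

gaps = representation_gap(test_rows, control_rows, "city")
for value, gap in gaps.items():
    # Positive gap: over-represented in Test; negative: under-represented
    print(f"{value}: {gap:+.2f}")
```

A positive gap means the value is more common in "Test" than in "Control"; a negative gap means the opposite.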
These are the different ways that your two populations might be different, or the characteristics you want to understand about your one population.
For example, if I am comparing customers from January to customers from February, I might include features such as "Age", "City", "Loyalty Status", and "Is first time?".
Target = Weighting
In some cases, the number of rows in a population (or the difference in row counts across two populations) is not what matters most, and you want to measure the populations by some other quantity.
For instance, I have customers from January compared to customers from February. I want to see how the two populations differ, and I might see first that 30% of January and 25% of February customers are first-time customers.
But if I weight by "$ revenue" instead of row count (which is implicitly "# of transactions"), I might see that 25% of January revenue and 25% of February revenue come from first-time customers.
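The difference between counting rows and weighting by a target can be sketched as follows. This is an illustration in plain Python rather than the KDA implementation; the `share` helper, the field names, and the January figures are made up so that the numbers match the example above (30% of rows, 25% of revenue).

```python
def share(rows, is_member, weight=None):
    """Fraction of a population that matches is_member.

    weight=None counts rows; otherwise the named numeric field is summed.
    """
    def w(row):
        return 1 if weight is None else row[weight]

    total = sum(w(r) for r in rows)
    matched = sum(w(r) for r in rows if is_member(r))
    return matched / total

# Hypothetical January transactions: 3 of 10 rows are first-time customers
january = (
    [{"first_time": True, "revenue": r} for r in (5, 10, 10)]
    + [{"first_time": False, "revenue": r} for r in (10, 10, 10, 10, 10, 10, 15)]
)

def is_first_time(row):
    return row["first_time"]

by_rows = share(january, is_first_time)                       # unweighted
by_revenue = share(january, is_first_time, weight="revenue")  # weighted by target
print(f"first-time share by rows: {by_rows:.0%}, by revenue: {by_revenue:.0%}")
```

The same rows produce a different share once each row contributes its revenue instead of a count of one.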
You can toggle between natural, equiheight, and equiwidth bins.
- Natural Bins: Creates bins by using k-means clustering along a single dimension
- Equiwidth Bins: Each bin is the same width, but may not have the same number of observations
- Equiheight Bins: Each bin contains the same number of observations, regardless of how wide the bins are
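The contrast between equiwidth and equiheight bins can be seen on a small skewed sample. The sketch below is plain Python with hypothetical helper names and made-up ages; it does not reproduce how the KDA cell computes its edges, and natural (k-means) bins are left out.

```python
from bisect import bisect_right

def equiwidth_edges(values, num_bins):
    """Edges that split [min, max] into num_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    return [lo + i * width for i in range(num_bins + 1)]

def equiheight_edges(values, num_bins):
    """Edges taken at evenly spaced positions in the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    inner = [ordered[(i * n) // num_bins] for i in range(1, num_bins)]
    return [ordered[0]] + inner + [ordered[-1]]

def bin_counts(values, edges):
    """Count values per bin. Bin i covers [edges[i], edges[i+1]);
    the last bin also includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    return counts

# Hypothetical ages with one outlier
ages = [18, 19, 20, 21, 22, 23, 25, 30, 60]

ew_edges = equiwidth_edges(ages, 3)
eh_edges = equiheight_edges(ages, 3)
print("equiwidth ", ew_edges, bin_counts(ages, ew_edges))
print("equiheight", eh_edges, bin_counts(ages, eh_edges))
```

With an outlier in the data, equiwidth bins put almost every observation into the first bin, while equiheight bins stay balanced at the cost of uneven widths. Heavily repeated values can still make equiheight bins uneven in this simple version.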
Number of Buckets
Change the number of bins that exist for each feature included.
Choose how many drivers (one or more) are intersected to define each bin.
For instance, the bins for:
- 1 Driver: Age 1-20, Age 21-40, Age 41-60, Height 150-175cm, Height 176-200cm
- 2 Driver: Age 1-20 & Height 150-175cm, Age 1-20 & Height 176-200cm, Age 21-40 & Height 150-175cm...
- 3 Driver: Age 1-20 & Height 150-175cm & Weight 0-60KG, ...
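One way to picture how intersected bins arise is to enumerate every combination of drivers for each row and count the resulting patterns. The sketch below uses `itertools.combinations` from the standard library; `count_patterns` and the sample rows (including the "61-120KG" weight bin) are hypothetical and are not taken from the KDA cell.

```python
from collections import Counter
from itertools import combinations

def count_patterns(rows, features, num_drivers):
    """Count how often each intersection of num_drivers binned features occurs."""
    counts = Counter()
    for row in rows:
        for combo in combinations(features, num_drivers):
            # A pattern is a tuple of (feature, bin) pairs, e.g.
            # (("age", "1-20"), ("height", "150-175cm"))
            pattern = tuple((f, row[f]) for f in combo)
            counts[pattern] += 1
    return counts

# Hypothetical rows whose numeric values have already been binned
rows = [
    {"age": "1-20", "height": "150-175cm", "weight": "0-60KG"},
    {"age": "1-20", "height": "176-200cm", "weight": "61-120KG"},
    {"age": "21-40", "height": "150-175cm", "weight": "0-60KG"},
    {"age": "1-20", "height": "150-175cm", "weight": "0-60KG"},
]
features = ["age", "height", "weight"]

one = count_patterns(rows, features, 1)
two = count_patterns(rows, features, 2)
three = count_patterns(rows, features, 3)
print(len(one), len(two), len(three))  # number of distinct patterns per setting
```

The number of candidate patterns grows quickly with the number of drivers, so higher settings describe narrower groups of rows.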
- Relative: The % of rows represented by the bin in overall population
- Absolute: The # of rows represented by the bin in overall population
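The two views report the same quantity in different units. A minimal sketch, using a hypothetical `bin_support` helper and made-up rows:

```python
def bin_support(rows, matches):
    """Return (absolute row count, relative share in percent) for one bin."""
    absolute = sum(1 for row in rows if matches(row))
    relative = 100.0 * absolute / len(rows)
    return absolute, relative

# Hypothetical population of 8 rows, 2 of which fall in the "1-20" age bin
population = [{"age": "1-20"}] * 2 + [{"age": "21-40"}] * 6

absolute, relative = bin_support(population, lambda row: row["age"] == "1-20")
print(f"absolute: {absolute} rows, relative: {relative:.1f}%")
```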