This got me thinking about baseball statistics and specifically, could I apply a common analysis seen in Financial Services reporting & analytics to America's pastime? In this case, I want to line up the best Home Run hitters to compare their path to greatness. I planned on applying a technique that is commonly found in both Banking and Insurance firms.
Vintage Loss Curves
In banking the analysis is the cumulative losses in portfolios with similar characteristics/dimensions. This is very common way to compare the impact of changes in credit policy or changes in price (spread).
Here is an example of a dynamic Vintage Loss Curve on the Lending Club data.
This is a picture of the
existing solution that replicates the type of complex analysis we are seeing
customer implement with the Qlik platform.
Loss Triangle/Loss Development:
In insurance, a claims loss triangle or a loss development table provides the cumulative claims experienced and allows for comparison of different portfolios (or vintages) of policies. Here is a succinct definition: https://www.irmi.com/term/insurance-definitions/loss-triangle.
Here is a video of me building a Loss Triangle using one of Qlik’s demo apps:
In both cases, the cumulative totals (loan charge off’s and claims experience) is calculated and tranches of the portfolio are re-aligned for comparative & predictive purposes. This is very straight forward to create in the Qlik platform and once it is created, new data can be added with relative ease.
How can you apply this to other use cases?
The comparison of cumulative results over a period of time, whether it be a three year term loan or the lifetime of an insurance policy holder, allows firms to measure, interrogate and react to important decisions in pricing/underwriting, reserve balances and credit policies. The key elements of the comparison are time and occurrence, if you have those key factors, this type of analysis can be applied to data with a broad set of dimensions and an even broader set of use cases. Some examples:
- Loan Losses/Charge-off’s & Vintage Loss Curves (discussed above)
- Claims experience & Loss Triangles (discussed above)
- Portfolio returns
- Customer attrition
- Employee turnover
- Equipment maintenance
- Compliance breaches
- Cyber Security breaches
- Marketing responses
Traditionally, this type of analysis is created using manual, pre-aggregated and static solutions that do not allow a full exploration of what is really going on within each portfolio (or set of data points). This is in direct contrast with key differentiators of the Qlik platform: dynamic nature allows for any combination of characteristics/dimensions selected; specific combinations can be found with ‘google-like’ search; and, drill to detail is always part of the value proposition.
About the Data:
With the help of Chuck Bannon (Director of the Qlik Demo Team, sports analytics nut and fellow blogger) I was able to source data from the Top 50 Home Run hitters in Major League Baseball’s history from Baseball-Reference.com. This data was captured very easily with Qlik as the source data was in the form of html tables, also known as web files.
First pass at applying to baseball statistics – some housekeeping
When first looking at the data, I was quickly reminded of why I stepped away from baseball (as a fan) for many years. The “Steroid Era” left a bad taste in my mouth that remains today. I wanted the ability to remove players that have been implicated or have admitted to using Performance Enhancing Drugs (PED’s)…a simple footnote with an asterisk (*) is not enough, they need to be completely isolated so I can re-rank the portfolio of players that were not implicated.
Note: this is an extremely common scenario in business analytics. A significant event occurred, the organization needs to understand two things:
1. What do the numbers look like with that event
2. What do the numbers look like without that event
Fortunately, this was very easy to do with the Qlik platform. The main fact table with details on 26,281 home runs hit by the top 50 has a PlayerName, all I have to do is load an additional table into Qlik with two columns (PlayerName & SteroidsImplicated). The data model looks like this:
Here is a “walk” to my personal view of the ‘Greatest of all
Time’ Home Run hitters (just showing the Top 20 here for demonstration of my
initial review of the data):
Chart A below shows how close Albert Pujols (656) really is to Willie Mays (660) in career home run totals. According to this chart, Mays and Pujols are ranked 5th and 6th on the list (which may explain why you are not seeing a lot about this in the news).
However, while building the data model and diving in, I was reminded of the impact of the “Steroid Era” on that numbers.
Chart B looks at the same chart but this time with players that were implicated or admitted to using Performance Enhancing Drugs (PED’s) highlighted in red. These players were sourced from two articles:
"Baseball's most notorious steroid users" - Fox Sports
"Notable players linked to performance enhancing drugs" - ESPN.com
This chart was really eye opening to just how many players need to be removed from the list to get my “clean” list.
And finally, Chart C is the “clean” list providing me the ability to be in complete control of the analysis. I don’t care about the statistics of the players implicated or admitted, I only care about the players on Chart C.
Notice now, that the implications of what Albert Pujols is about to accomplish reveals itself, this is big news folks!
Let’s get started
Now that I have removed the dark shadow of the “Steroid Era” from my analysis, we can start to dive in on comparing different players path to greatness side by side by applying the same techniques described in Vintage Loss Analysis and Claims Triangles above.
A simple line chart with 'Cumulative HR’s’ on the y-axis and the player’s ‘sequential year played’ on the x-axis does the trick. Let’s take a look at Albert Pujols’ numbers:
As you can see, the top left line chart shows the cumulative # of HR’s hit over Albert’s 19 year career, the bar chart on the bottom right show his HR totals by year and the table indicates that I have selected ‘Albert Pujols’ and also shows the others included in my analysis.
This is interesting but not too insightful, as I want to really understand the “Home Run Curve” or path to achieving the accumulated career HR’s. Let’s take a look at the chart with both Willie Mays and Albert Pujols selected:
Now this becomes exponentially more valuable in comparing the two sets of information:
- Willie Mays career lasted 22 years, while Albert Pujols’ career is only in his 19th year.
- If Albert hangs on for three more years and simply matches Willie Mays totals for his last three years of 18, 8 and 6 HR’s, then 3 years from now we are talking about not only overtaking Willie Mays but actually eclipsing Babe Ruth’s HR Total (714) for the number 2 HR hitter of all time!
- Willie Mays had a very slow start to his home run hitting career. The first two years set him back, he has never been able to make up the difference.
- Notice the change in slope of the curves.
- Pujols started off stronger, averaging 41 HRs per Year in his first 9 seasons compared to Mays (31).
- In Mays next 9 seasons (10-18), he turn on the afterburners, averaging 38.5 HRs per Year compared to Pujols (31).
- Another KPI was also added to the table for comparative purposes, ‘# of Grand Slams’, and comparing these two players by this stat made me sit back in my chair. I am blown away by the difference here (Pujols 14 v Mays 8).
Three important findings with this exercise:
- Barring injury, Albert Pujols is on a trajectory to surpass BOTH Willie Mays AND Babe Ruth to become the number 2 greatest HR hitter of all time.
- Complex analysis used in the Financial Services industry can be (and should be) applied to other use cases and types of data models.
- The ease of segmenting my data by simply adding a table highlights a key differentiator for Qlik. I did not have to worry about the pitfalls of traditional data wrangling, I simply added the table and a full outer join on both of these tables were created within the Qlik engine. For spreadsheet junkies, what I am saying is that I didn’t have to run a sumif() or vlookup() process which has the effect of streamlining my work flow and reducing the number of opportunities for error.
Final note, any dashboarding tool can make a chart. The challenge to the market is to look beyond the pretty pictures and understand the transformative power of the Qlik associative engine. This engine the foundation of the modern, web based, end-to-end Qlik platform which solves complex business problems throughout the entire Data, Analytics and Insights supply chain.
When customers realize this, they realize they have the option of utilizing all, or any part of the platform to begin their own organization’s transformation.
Explore for yourself
If you want to explore this data and find some of your own insights, we have put the app on a server and can be explored here.
If you find an insight that is interesting, consider taking
a screenshot and share it on twitter, tagging it with #Qlik and #DingerDash.