Clustering Starting Pitchers

My favorite machine learning algorithms are unsupervised clustering methods. I think there is elegance in finding patterns in data that we didn’t know were there. Clustering groups similar data points together when the groupings are not known in advance. It is based on how close data points are to the other points in their own group and how far they are from the points in other groups. In layman’s terms, you are like your group but different from other groups. As an example, think about cats and dogs. There are many different types of dogs, but they are similar to one another because they are all dogs. The same goes for cats: many types, but all cats. Cats, however, are different from dogs, so they form a separate group. There is much more technical detail behind clustering, but this is the basic idea. Because the groupings are unknown going in, once we have clusters we rely on a domain expert to determine what each group represents.

I may be wrong because I’ve never worked in a front office, but I believe clustering is underutilized in the baseball industry. It is a powerful tool that can lead to many discoveries. From a team perspective, we can identify undervalued players who resemble the game’s superstars but come at a cheaper price. From an agency perspective, we can also identify undervalued players to target, as well as potential superstars at the high school, college, and minor league levels. Of course, as with any machine learning algorithm, nothing is guaranteed, but clustering can surface insights we may not have been aware of. There are many questions to consider, such as which variables to use and what time period the data should cover, but I will keep it simple.

Methodology

For this analysis I took starting pitchers with 100+ innings pitched in 2021, which resulted in 115 starting pitchers. Each variable used in clustering adds a dimension to the data. For example, if we have three variables, then the data lives in three dimensions. With four or more variables we can no longer visualize the data directly, and points also tend to spread farther apart as dimensions are added, which makes the clusters less distinct. I decided on three metrics: WAR, Swinging Strike %, and Barrel %. It is up for debate whether these are the best to use, and there could very well be a better combination, but I chose them because they capture a pitcher’s overall value, swing-and-miss ability, and how well they limit hard contact.
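To make the setup concrete, here is a minimal sketch of the filtering step, assuming each pitcher season is available as a simple record. The `PitcherSeason` type, the `build_features` helper, and the sample rows are all hypothetical and only illustrate the idea: apply the innings cutoff, then keep the three metrics as one 3-D point per pitcher.

```python
from typing import NamedTuple


class PitcherSeason(NamedTuple):
    # Hypothetical record layout; the field names are assumptions.
    name: str
    innings: float
    war: float
    swstr_pct: float
    barrel_pct: float


def build_features(rows, min_innings=100.0):
    """Keep pitchers at or above the innings cutoff; return names and 3-D points."""
    kept = [r for r in rows if r.innings >= min_innings]
    names = [r.name for r in kept]
    points = [(r.war, r.swstr_pct, r.barrel_pct) for r in kept]
    return names, points


# Made-up rows purely for illustration.
rows = [
    PitcherSeason("Pitcher A", 167.0, 7.5, 0.166, 0.031),
    PitcherSeason("Pitcher B", 62.0, 1.0, 0.120, 0.080),
    PitcherSeason("Pitcher C", 101.0, 0.3, 0.120, 0.088),
]
names, points = build_features(rows)
```

Each tuple in `points` is one pitcher in the three-dimensional space described above.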

The clustering technique I will be using is called k-means. This algorithm clusters the data into k groups. The difficulty is that we must choose a value for k before running it. So how do we choose k? Sometimes it is chosen with domain knowledge, if the number of groups is already known, and other times we use a metric to determine the best value. In this case I will use a mix of both. The metric I will use is the silhouette score, where a value closer to 1 indicates better clustering. A silhouette score considers how close data points are to their own cluster (cohesion) and how far they are from other clusters (separation). For each k from 2 through 10 I will calculate the silhouette score, repeating this 10 times and averaging the results to reduce the variation caused by random initialization. Because k-means is distance based, I scale the data before clustering so that no single metric dominates simply because of its units.
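For readers who want to see the moving parts, here is a rough sketch of scaling followed by k-means, written with only the Python standard library. It is not the code behind this post. The `standardize` and `kmeans` helpers are my own illustrative names, the implementation is the plain Lloyd's algorithm with random starting centroids, and the six sample points are (WAR, Swinging Strike %, Barrel %) values taken from the tables in this post. In practice a library such as scikit-learn would usually be used instead.

```python
import math
import random


def standardize(points):
    """Z-score each column so no single metric dominates the distances."""
    cols = list(zip(*points))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 when a column has zero spread.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(p, means, stds))
            for p in points]


def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm: assign to nearest centroid, recompute, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return labels, centroids


# (WAR, Swinging Strike %, Barrel %) for six pitchers from this post.
points = [
    (7.5, 0.166, 0.031), (5.4, 0.159, 0.080), (7.3, 0.124, 0.046),
    (-0.1, 0.089, 0.099), (-0.2, 0.105, 0.096), (-0.3, 0.105, 0.095),
]
scaled = standardize(points)
labels, centroids = kmeans(scaled, k=2, seed=0)
```

Because the starting centroids are random, different seeds can give different groupings, which is the reason for repeating the run and averaging the scores.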

We have all the tools, let’s cluster!

Clustering

From the silhouette scores, the best option is two clusters, with a value of 0.396. However, there are more than two types of pitchers in the league, so I will go with three clusters, which had a value of 0.32. Neither of these scores is great, as both are far from 1, but real data never behaves like textbook data. Here are the averages of the metrics per cluster.
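To show where numbers like these come from, here is a small sketch of the silhouette calculation itself, using only the standard library. The `silhouette_score` function below is my own illustrative version of the standard definition (for each point, compare the average distance to its own cluster with the average distance to the nearest other cluster), and the four toy points are made up.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette: (b - a) / max(a, b), averaged over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: cohesion, mean distance to the other points in the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: separation, mean distance to the nearest other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy data: two tight pairs that sit far apart.
toy = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(toy, [0, 0, 1, 1])  # pairs grouped correctly
bad = silhouette_score(toy, [0, 1, 0, 1])   # pairs split across clusters
```

The sensible grouping scores close to 1 and the scrambled one scores below 0. To pick k, the same function would be applied to the labels from each k-means run and averaged over the repeated runs for each k.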

| Cluster | WAR | Swinging Strike % | Barrel % |
| --- | --- | --- | --- |
| 1 | 3.67 | 0.131 | 0.068 |
| 2 | 0.989 | 0.101 | 0.094 |
| 3 | 2.19 | 0.099 | 0.077 |
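Per-cluster averages like these are a simple group-and-mean step. Here is a minimal sketch, assuming each row is a (cluster, WAR, Swinging Strike %, Barrel %) tuple. The `cluster_means` helper is a hypothetical name, and the six rows are a small subset of the pitchers listed in this post, so the output will not match the full-sample averages.

```python
from collections import defaultdict


def cluster_means(rows):
    """rows: iterable of (cluster, war, swstr_pct, barrel_pct) tuples."""
    grouped = defaultdict(list)
    for cluster, *metrics in rows:
        grouped[cluster].append(metrics)
    # Average each metric column within each cluster.
    return {c: tuple(sum(col) / len(col) for col in zip(*vals))
            for c, vals in sorted(grouped.items())}


# Two pitchers per cluster, taken from the tables in this post.
rows = [
    (1, 7.5, 0.166, 0.031),   # Corbin Burnes
    (1, 7.3, 0.124, 0.046),   # Zack Wheeler
    (2, -0.1, 0.089, 0.099),  # Kris Bubic
    (2, -0.2, 0.105, 0.096),  # Jordan Lyles
    (3, 5.0, 0.112, 0.053),   # Julio Urias
    (3, 3.8, 0.111, 0.063),   # Max Fried
]
means = cluster_means(rows)
```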

Results

Since we have only three variables, we can visualize the data in a 3D plot and see the separation of clusters.

x: WAR, y: Swinging Strike %, z: Barrel %

Another neat way of visualizing and analyzing the groups is a radar plot. It shows which variables are the strongest signals for each group.
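A radar plot only reads well if every axis is on a comparable scale, since WAR and the two percentages live in very different ranges. One common way to handle that, and I am only assuming it here for illustration, is to rescale each metric to the 0 to 1 range across clusters before plotting. The `min_max_by_metric` helper is a hypothetical name, and the inputs are the cluster averages reported earlier in this post.

```python
def min_max_by_metric(cluster_means):
    """Rescale each metric to [0, 1] across clusters for comparable axes."""
    metrics = list(next(iter(cluster_means.values())).keys())
    scaled = {c: {} for c in cluster_means}
    for m in metrics:
        values = [row[m] for row in cluster_means.values()]
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0  # avoid dividing by zero if all clusters tie
        for c, row in cluster_means.items():
            scaled[c][m] = (row[m] - lo) / span
    return scaled


# Cluster averages as reported in this post.
cluster_means = {
    1: {"WAR": 3.67, "Swinging Strike %": 0.131, "Barrel %": 0.068},
    2: {"WAR": 0.989, "Swinging Strike %": 0.101, "Barrel %": 0.094},
    3: {"WAR": 2.19, "Swinging Strike %": 0.099, "Barrel %": 0.077},
}
scaled = min_max_by_metric(cluster_means)
```

Keep in mind that a high value on the Barrel % axis is bad for a pitcher, so some people flip that axis before plotting.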

Cluster 1

Cluster 1 (in red) is the group of elite pitchers. They have a high WAR and Swinging Strike % while limiting hard contact with a low Barrel %. Here are the pitchers in Cluster 1.

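| Name | WAR | Swinging Strike % | Barrel % | Cluster |
| --- | --- | --- | --- | --- |
| Carlos Rodon | 4.9 | 0.15 | 0.066 | 1 |
| Corbin Burnes | 7.5 | 0.166 | 0.031 | 1 |
| Max Scherzer | 5.4 | 0.159 | 0.08 | 1 |
| Walker Buehler | 5.5 | 0.116 | 0.068 | 1 |
| Brandon Woodruff | 4.7 | 0.129 | 0.058 | 1 |
| Trevor Rogers | 4.2 | 0.141 | 0.05 | 1 |
| Lance Lynn | 4.2 | 0.12 | 0.057 | 1 |
| Zack Wheeler | 7.3 | 0.124 | 0.046 | 1 |
| Kevin Gausman | 4.8 | 0.153 | 0.071 | 1 |
| Robbie Ray | 3.9 | 0.155 | 0.098 | 1 |
| Freddy Peralta | 3.9 | 0.144 | 0.057 | 1 |
| Marcus Stroman | 3.4 | 0.116 | 0.065 | 1 |
| Logan Webb | 4 | 0.124 | 0.056 | 1 |
| Pablo Lopez | 2.3 | 0.118 | 0.074 | 1 |
| Lance McCullers Jr. | 3.3 | 0.116 | 0.053 | 1 |
| Shohei Ohtani | 3 | 0.129 | 0.071 | 1 |
| Sandy Alcantara | 4.2 | 0.133 | 0.061 | 1 |
| Alek Manoah | 2 | 0.126 | 0.058 | 1 |
| Gerrit Cole | 5.3 | 0.145 | 0.098 | 1 |
| Joe Musgrove | 3.2 | 0.127 | 0.074 | 1 |
| Charlie Morton | 4.6 | 0.124 | 0.049 | 1 |
| Frankie Montas | 4.1 | 0.137 | 0.087 | 1 |
| Shane McClanahan | 2.5 | 0.148 | 0.107 | 1 |
| Kyle Gibson | 3.1 | 0.104 | 0.045 | 1 |
| Lucas Giolito | 4 | 0.153 | 0.067 | 1 |
| Clayton Kershaw | 3.4 | 0.167 | 0.069 | 1 |
| Luis Garcia | 3.1 | 0.132 | 0.07 | 1 |
| Tyler Mahle | 3.8 | 0.114 | 0.065 | 1 |
| Nathan Eovaldi | 5.6 | 0.126 | 0.063 | 1 |
| Alex Wood | 2.5 | 0.125 | 0.053 | 1 |
| Jordan Montgomery | 3.3 | 0.137 | 0.074 | 1 |
| Dylan Cease | 4.4 | 0.148 | 0.099 | 1 |
| Sean Manaea | 3.3 | 0.123 | 0.08 | 1 |
| Luis Castillo | 3.7 | 0.131 | 0.045 | 1 |
| Sonny Gray | 2.4 | 0.106 | 0.047 | 1 |
| Yu Darvish | 2.9 | 0.121 | 0.088 | 1 |
| Jameson Taillon | 2 | 0.122 | 0.082 | 1 |
| German Marquez | 3.4 | 0.121 | 0.053 | 1 |
| Aaron Nola | 4.5 | 0.128 | 0.071 | 1 |
| Kenta Maeda | 1.7 | 0.136 | 0.063 | 1 |
| Logan Gilbert | 2.2 | 0.125 | 0.088 | 1 |
| Eduardo Rodriguez | 3.8 | 0.117 | 0.068 | 1 |
| Brady Singer | 2 | 0.102 | 0.056 | 1 |
| JT Brubaker | 0.3 | 0.12 | 0.088 | 1 |
| Andrew Heaney | 1.2 | 0.127 | 0.093 | 1 |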

We see the top starting pitchers here, like NL Cy Young award winner Corbin Burnes, AL Cy Young award winner Robbie Ray, Max Scherzer, Kevin Gausman, Shohei Ohtani, and others. Some up-and-coming talents like Alek Manoah, Logan Webb, and Logan Gilbert are also part of this group. There are many interesting names in this group, but two that stand out as surprising are JT Brubaker and Andrew Heaney. These players are not known as elite, yet they are in this group. Does this mean they will be elite? Not necessarily, but the metrics indicate they could be. This is a big reason why the Yankees last season and the Dodgers this season took a chance on Heaney. Pitchers like these have the tools to be elite and could be a steal for a team.
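I spotted those two names by eye, but the same idea can be sketched in code: within a cluster, flag anyone whose WAR sits well below the cluster's average. The `flag_low_war` helper and the one-standard-deviation cutoff are assumptions made for illustration, and the eight pitchers are a subset of Cluster 1 from the table above.

```python
from statistics import mean, pstdev


def flag_low_war(members, n_std=1.0):
    """Return names whose WAR is more than n_std deviations below the mean."""
    wars = [war for _, war in members]
    cutoff = mean(wars) - n_std * pstdev(wars)
    return [name for name, war in members if war < cutoff]


# (name, WAR) for a subset of Cluster 1.
cluster_1 = [
    ("Corbin Burnes", 7.5),
    ("Max Scherzer", 5.4),
    ("Zack Wheeler", 7.3),
    ("Kevin Gausman", 4.8),
    ("Robbie Ray", 3.9),
    ("Logan Webb", 4.0),
    ("JT Brubaker", 0.3),
    ("Andrew Heaney", 1.2),
]
flagged = flag_low_war(cluster_1)
```

On this subset the check picks out Brubaker and Heaney: their swing-and-miss and contact numbers put them with the elite group even though their WAR lags well behind it.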

Cluster 2

Cluster 2 (in green) is the group of underachievers. They have a low WAR and Swinging Strike % and allow a lot of hard contact, as shown by a high Barrel %. Here are the pitchers in Cluster 2.

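| Name | WAR | Swinging Strike % | Barrel % | Cluster |
| --- | --- | --- | --- | --- |
| Trevor Bauer | 1.8 | 0.126 | 0.106 | 2 |
| Framber Valdez | 1.9 | 0.102 | 0.058 | 2 |
| Adrian Houser | 1.5 | 0.07 | 0.05 | 2 |
| Ian Anderson | 1.9 | 0.119 | 0.095 | 2 |
| Casey Mize | 1.3 | 0.094 | 0.1 | 2 |
| Rich Hill | 1.6 | 0.098 | 0.088 | 2 |
| James Kaprielian | 1.3 | 0.11 | 0.094 | 2 |
| Marco Gonzales | 0.6 | 0.091 | 0.114 | 2 |
| Joe Ross | 1.3 | 0.111 | 0.09 | 2 |
| Blake Snell | 2.1 | 0.129 | 0.11 | 2 |
| Zac Gallen | 1.5 | 0.091 | 0.079 | 2 |
| Jake Odorizzi | 1 | 0.094 | 0.096 | 2 |
| Tarik Skubal | 0.8 | 0.111 | 0.143 | 2 |
| Yusei Kikuchi | 1.1 | 0.125 | 0.11 | 2 |
| Taijuan Walker | 1.2 | 0.095 | 0.102 | 2 |
| Austin Gomber | 1.3 | 0.113 | 0.094 | 2 |
| Dane Dunning | 1.8 | 0.1 | 0.08 | 2 |
| Nick Pivetta | 2.1 | 0.106 | 0.082 | 2 |
| Jon Gray | 2.3 | 0.11 | 0.069 | 2 |
| Jon Lester | 0 | 0.087 | 0.077 | 2 |
| Vladimir Gutierrez | 0.6 | 0.096 | 0.084 | 2 |
| Drew Smyly | 0.2 | 0.118 | 0.108 | 2 |
| Martin Perez | 0.5 | 0.08 | 0.094 | 2 |
| Kris Bubic | -0.1 | 0.089 | 0.099 | 2 |
| Triston McKenzie | 1.1 | 0.124 | 0.1 | 2 |
| Adbert Alzolay | 0.4 | 0.115 | 0.11 | 2 |
| Dallas Keuchel | 0.7 | 0.086 | 0.091 | 2 |
| Garrett Richards | 0.4 | 0.094 | 0.093 | 2 |
| Erick Fedde | 1.1 | 0.089 | 0.09 | 2 |
| Brad Keller | 1.1 | 0.091 | 0.109 | 2 |
| Jordan Lyles | -0.2 | 0.105 | 0.096 | 2 |
| Wil Crowe | -0.3 | 0.105 | 0.095 | 2 |
| Zach Davies | 0.1 | 0.09 | 0.091 | 2 |
| J.A. Happ | 0.5 | 0.081 | 0.116 | 2 |
| Patrick Corbin | 0.2 | 0.112 | 0.092 | 2 |
| Mitch Keller | 1.1 | 0.082 | 0.068 | 2 |
| Jorge Lopez | 0.8 | 0.082 | 0.093 | 2 |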

We see some names that aren’t surprising, like Dane Dunning, Drew Smyly, Wil Crowe, and others. There are also pitchers who took a step back, like Yusei Kikuchi, Jake Odorizzi, and Blake Snell. There are several surprises in here, but two that stand out are Martin Perez and Trevor Bauer. Martin Perez got a long-term deal with the Tigers but was in this group. Bauer’s WAR was affected by his not playing the entire season, but his Barrel % was also high, which contributed to him landing in this group.

Cluster 3

Cluster 3 (in blue) is the group of average performers for the season: not the best, but not the worst, in WAR, Swinging Strike %, and Barrel %. Here are the pitchers in Cluster 3.

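| Name | WAR | Swinging Strike % | Barrel % | Cluster |
| --- | --- | --- | --- | --- |
| Julio Urias | 5 | 0.112 | 0.053 | 3 |
| Eric Lauer | 1.8 | 0.105 | 0.07 | 3 |
| Max Fried | 3.8 | 0.111 | 0.063 | 3 |
| Adam Wainwright | 3.8 | 0.081 | 0.062 | 3 |
| Cal Quantrill | 1.8 | 0.089 | 0.07 | 3 |
| Chris Bassitt | 3.3 | 0.101 | 0.065 | 3 |
| Anthony DeSclafani | 3 | 0.11 | 0.081 | 3 |
| Wade Miley | 2.9 | 0.101 | 0.061 | 3 |
| Jose Berrios | 4.1 | 0.099 | 0.091 | 3 |
| Chris Flexen | 3 | 0.086 | 0.063 | 3 |
| Jose Urquidy | 1.8 | 0.118 | 0.093 | 3 |
| John Means | 2.5 | 0.119 | 0.101 | 3 |
| Michael Pineda | 1.3 | 0.104 | 0.091 | 3 |
| Steven Matz | 2.8 | 0.094 | 0.07 | 3 |
| Aaron Civale | 0.8 | 0.094 | 0.082 | 3 |
| Johnny Cueto | 1.5 | 0.097 | 0.066 | 3 |
| Zack Greinke | 1.4 | 0.091 | 0.066 | 3 |
| Zach Eflin | 2.2 | 0.102 | 0.068 | 3 |
| Cole Irvin | 2.1 | 0.089 | 0.073 | 3 |
| Kyle Freeland | 1.5 | 0.085 | 0.082 | 3 |
| Hyun-Jin Ryu | 2.5 | 0.097 | 0.085 | 3 |
| Antonio Senzatela | 3.5 | 0.086 | 0.059 | 3 |
| Merrill Kelly | 2.4 | 0.088 | 0.063 | 3 |
| Tyler Anderson | 2.1 | 0.115 | 0.085 | 3 |
| Michael Wacha | 1.5 | 0.115 | 0.097 | 3 |
| Zach Plesac | 1.1 | 0.112 | 0.094 | 3 |
| Madison Bumgarner | 1.5 | 0.096 | 0.075 | 3 |
| Kyle Hendricks | 1.3 | 0.089 | 0.084 | 3 |
| Mike Minor | 2.3 | 0.107 | 0.093 | 3 |
| Chris Paddack | 1.8 | 0.112 | 0.085 | 3 |
| Ryan Yarbrough | 0.9 | 0.096 | 0.078 | 3 |
| Mike Foltynewicz | -0.8 | 0.08 | 0.096 | 3 |
| Matt Harvey | 1.9 | 0.08 | 0.079 | 3 |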

There are some players who make sense here, like Johnny Cueto, Zack Greinke, and Hyun-Jin Ryu. There are also a few surprises, like Matt Harvey and Julio Urias, for different reasons. We’d expect Harvey in Cluster 2, while Urias would have been expected in Cluster 1.

Conclusion

We can clearly see differences between the groups, and we identified players who are undervalued and possibly even overvalued. There could very well be a better combination of metrics to use, and other clustering techniques could be employed as well. The beauty of machine learning is that there is always another way to do something that could be an improvement, and the fun part is finding it. Clustering is not perfect, and the silhouette scores showed that it definitely wasn’t perfect in this case, but it is still a very useful way to group players and see what insights there could be.
