Datasets too big? CS Discoveries, AI Unit, Lesson 15

My students are trying to train their models for the Mini-Project (Lesson 15), but it seems like the datasets have too much information. When they try to pick a column to be the label or a feature, the box on the right side says “Categorical columns with more than 50 unique values cannot be selected as the label or a feature”. This limitation really restricts the predictions students can make. For example, in the movie dataset, students cannot use Director, Name (movie name), Star, or Writer as a label or a feature.

Am I missing something here? This is the first time I’ve taught the AI unit, but I did do the self-paced teacher training.

If I didn’t miss something and this is a real restriction, does anyone have an idea of how to get around it (a way to use just a subset of the data so those columns can be used as a feature or label)? Thanks!!

Hey @claudetteguy16 ! Great question - happy to help illuminate!

It’s true that when a column has more than 50 unique values, it can’t be used as a label or a feature. This has to do with how AI Bot would use that column to find patterns and make decisions behind the scenes.
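To make the rule itself concrete, here is a minimal Python sketch that counts the unique values in each column and applies the 50-value limit quoted in the message above. It is only an illustration, not how AI Lab is actually implemented: the function name `selectable_columns` and the sample rows are made up, and it assumes every column is categorical.

```python
MAX_UNIQUE = 50  # limit quoted in the AI Lab message


def selectable_columns(rows, max_unique=MAX_UNIQUE):
    """Return {column: (unique_count, selectable)} for a list of dict rows.

    Hypothetical helper: a column counts as selectable when it has
    at most `max_unique` distinct values.
    """
    columns = rows[0].keys() if rows else []
    report = {}
    for col in columns:
        unique_count = len({row[col] for row in rows})
        report[col] = (unique_count, unique_count <= max_unique)
    return report


# Made-up data: 60 movies, each with a different director, and 3 ratings.
rows = [
    {"director": f"Director {i}", "rating": ["PG", "PG-13", "R"][i % 3]}
    for i in range(60)
]

report = selectable_columns(rows)
for col, (count, ok) in report.items():
    print(f"{col}: {count} unique values, selectable={ok}")
```

In this made-up data the director column is over the limit while the rating column is well under it, which mirrors what students see with the movie dataset.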

Even though AI Lab doesn’t let you select these columns for labels or features, it’s worth asking how reasonable it is to choose those particular columns as labels or features in the first place. For example, you might try to predict who directed a particular movie based on the rating that the movie received. Even if we showed the data for this particular scenario, it would be a bit overwhelming. Here’s just a subset of that data:

With so many people in this column, you’d inevitably have some collisions - for example, Aaron Schneider and Aaron Seltzer each directed just one PG-13 movie. If AI Bot is trying to predict a director and all it knows is that the movie is PG-13, which of those two directors should it pick for its recommendation? With just this data, it would have to pick one of them and it wouldn’t know which one to choose. To help make that decision, AI Bot would probably need another column of information - maybe rating and studio. But again - with so many directors to choose from, you’d need a lot of additional features for AI Bot to find enough patterns to make reasonable predictions about each person in this list.

At this point, we’re essentially building the kind of large AI models that exist in the world. For example, in order for facial recognition to accurately predict who you are, it needs a wide collection of data points on your face, and it needs those points to be inter-related in such a way that collisions, where two people have the same set of features, are unlikely. When collisions do happen, that’s when you hear stories of people being misidentified based on their features, just like Aaron Schneider and Aaron Seltzer being mistaken for each other in my example above since they both directed only one PG-13 movie.
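The collision idea can be sketched in a few lines of Python. This is only an illustration of the reasoning, not AI Bot’s actual algorithm: the helper `find_collisions` is hypothetical, and the studio names are placeholders invented for the example rather than values from the real movie dataset.

```python
from collections import defaultdict


def find_collisions(rows, feature_cols, label_col):
    """Group rows by their feature values and return the groups
    that map to more than one label (a "collision")."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in feature_cols)
        groups[key].add(row[label_col])
    return {key: sorted(labels) for key, labels in groups.items() if len(labels) > 1}


# Two rows based on the example above; the studio values are made up.
rows = [
    {"director": "Aaron Schneider", "rating": "PG-13", "studio": "Studio A"},
    {"director": "Aaron Seltzer", "rating": "PG-13", "studio": "Studio B"},
]

# With rating as the only feature, both directors look identical.
by_rating = find_collisions(rows, ["rating"], "director")

# Adding a second feature separates them in this tiny example.
by_rating_and_studio = find_collisions(rows, ["rating", "studio"], "director")

print(by_rating)
print(by_rating_and_studio)
```

With only the rating as a feature, both directors land in the same group, so there is no basis for choosing between them. Adding the second feature removes the collision here, but with hundreds of directors you would need many more features to keep every group down to a single person.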

All of this is a long way of saying: maybe choosing a column with mostly unique values isn’t actually the best decision to make when training a model, which we hope is one possible realization in this unit. It’d be great to predict a director based on these other features, but do we have enough features to do that uniquely? And if we don’t, is it reasonable (and ethical) to train a model when we know it can make mistakes in identifying people? These are some powerful conversations that can come out of considering which columns to use when training a model.

Hope that helps clarify!
-Dan Schneider
Code.org Curriculum Writer
