Datasets too big? CS Discoveries, AI Unit, Lesson 15

claudetteguy16 · June 9, 2022, 1:54pm

My students are trying to train their models for the Mini-Project (Lesson 15), but it seems like the datasets have too much information. When they try to pick a column to be the label or a feature, the box on the right side says “Categorical columns with more than 50 unique values cannot be selected as the label or a feature”. This limitation really restricts the predictions students can make. For example, in the movie dataset, students cannot use Director, Name (movie name), Star, or Writer as a label or a feature.

Am I missing something here? This is the first time I’ve taught the AI unit, but I did do the self-paced teacher training.

If I didn’t miss something and this is a real restriction, does anyone have an idea of how to get around it (a way to use just a subset of the data so those columns can be used as a feature or label)? Thanks!!

Dan · June 9, 2022, 2:28pm

Hey @claudetteguy16 ! Great question - happy to help illuminate!

It’s true that when a column has more than 50 unique values, it can’t be used as a label or a feature. This has to do with how AI Bot would use that column to find patterns and make decisions behind the scenes.
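If you want to see ahead of time which columns would hit that limit, here is a minimal sketch in plain Python, run outside AI Lab. The rows are made up for illustration (they are not the real movie dataset), the function name `columns_over_limit` is hypothetical, and the sketch treats every column as categorical, which may not match how AI Lab classifies columns.

```python
# Threshold quoted in the AI Lab message ("more than 50 unique values").
MAX_UNIQUE = 50


def columns_over_limit(rows, limit=MAX_UNIQUE):
    """Return {column: unique_count} for columns with more than `limit` unique values."""
    unique_values = {}
    for row in rows:
        for column, value in row.items():
            unique_values.setdefault(column, set()).add(value)
    return {
        column: len(values)
        for column, values in unique_values.items()
        if len(values) > limit
    }


# Made-up rows: 60 different directors, but only 4 different ratings.
ratings = ["G", "PG", "PG-13", "R"]
rows = [
    {"director": f"Director {i}", "rating": ratings[i % 4]}
    for i in range(60)
]

over = columns_over_limit(rows)
print(over)  # {'director': 60}
```

In this toy data, the director column would be blocked while the rating column would stay selectable.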

Even though AI Lab doesn’t let you select these columns for labels or features, it’s worth asking how reasonable it is to choose those particular columns as labels or features in the first place. For example, you might try to predict who directed a particular movie based on the rating that the movie received. Even if we showed the data for this particular scenario, it’s a bit overwhelming. Here’s just a subset of that data:

With so many people in this column, you’d inevitably have some collisions - for example, Aaron Schneider and Aaron Seltzer each directed just one PG-13 movie. If AI Bot is trying to predict a director and all it knows is that it’s a PG-13 movie, which of those two directors should it pick? With just this data, it would have to choose one of them and would have no basis for the choice. To help make that decision, AI Bot would probably need another column of information - maybe rating and studio. But again - with so many directors to choose from, you’d need a lot of additional features for AI Bot to find enough patterns to make reasonable predictions about each person in this list.

At this point, we’re essentially creating the kind of large AI models that exist in the world. For example, in order for facial recognition to accurately predict who you are, it needs a wide collection of data points on your face, and it needs those points to be inter-related in such a way that it’s unlikely to form collisions where two people have the same set of features. When collisions do happen, that’s when you hear stories of people being misidentified based on their features, just as Aaron Schneider and Aaron Seltzer could be mistaken for each other in my example above since they each directed only one PG-13 movie.
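The director collision described above can be sketched in a few lines of plain Python. This is only an illustration, not how AI Bot works internally: the two rows echo the example in this post, the studio names are hypothetical, and `find_collisions` is a made-up helper.

```python
from collections import defaultdict

# Two rows echoing the example; the studio values are hypothetical.
movies = [
    {"director": "Aaron Schneider", "rating": "PG-13", "studio": "Studio A"},
    {"director": "Aaron Seltzer", "rating": "PG-13", "studio": "Studio B"},
]


def find_collisions(rows, features, label):
    """Return feature combinations that are shared by two or more labels."""
    labels_by_key = defaultdict(set)
    for row in rows:
        key = tuple(row[f] for f in features)
        labels_by_key[key].add(row[label])
    return {
        key: sorted(labels)
        for key, labels in labels_by_key.items()
        if len(labels) > 1
    }


rating_only = find_collisions(movies, ["rating"], "director")
rating_and_studio = find_collisions(movies, ["rating", "studio"], "director")

print(rating_only)        # {('PG-13',): ['Aaron Schneider', 'Aaron Seltzer']}
print(rating_and_studio)  # {}
```

With rating alone, both directors land on the same key, so there is nothing to tell them apart. Adding a second feature separates them in this tiny example, but with hundreds of directors you would need many more features before the collisions disappear.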

All of this is a long way of saying: maybe choosing a column with mostly unique values isn’t actually the best decision to make when training a model, which we hope is one possible realization in this unit. It’d be great to predict a director based on these other features, but do we have enough features to do that uniquely? And if we don’t, is it reasonable (and ethical) to train a model when we know it can make mistakes in identifying people? These are some powerful conversations that can come out of considering which columns to use when training a model.

Hope that helps clarify!
-Dan Schneider
Code.org Curriculum Writer
