Has anyone found any websites that offer different datasets for students to look at?
I did this unit with my 8th grade students last year and we created a survey for the entire school to take. While I was happy with it, I wanted to see if there is anything out there that they could use as a guide.
We decided on a topic as a class, I sent the survey out to the school and then they created the app from that information.
Kaggle.com is a good site with a lot of relevant datasets. I found a fun one on Harry Potter characters that makes a good lesson on data cleansing. It’s got lots of relatable data (with some errors!).
In realistic situations, the “testing” dataset should be a random sample of the original dataset. This means the model is created using 90% of the data, and then validated and tested against the other, randomly selected 10% of the data.
However, as the teaching tip notes, this can lead to slight changes in accuracy, since a different set of testing data is drawn each time the model is trained. In classroom settings, this can produce curious situations where two students use the same dataset with the same features but get different accuracy.
To avoid this confusion throughout the unit and create a consistent, predictable teaching experience, most of the lessons use the last 10% of the dataset as the testing data, so it’s the same every single time. This is really a pedagogical consideration, though: by the time we get to these projects, we use the more realistic setting of choosing a random 10% of the data instead.
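If it helps to see the difference between the two approaches, here is a minimal sketch in Python. It is only an illustration, not how the curriculum's tool is actually implemented; the function names `fixed_split` and `random_split`, the `seed` parameter, and the toy list of 100 rows are all made up for this example.

```python
import random


def fixed_split(rows, test_fraction=0.1):
    """Hold out the LAST test_fraction of rows (hypothetical helper).

    The result is identical on every call, so every student sees the same
    testing data.
    """
    n_test = max(1, round(len(rows) * test_fraction))
    return rows[:-n_test], rows[-n_test:]


def random_split(rows, test_fraction=0.1, seed=None):
    """Hold out a RANDOM test_fraction of rows (hypothetical helper).

    With seed=None the held-out rows can change from one call to the next,
    which is why accuracy can differ between two training runs.
    """
    n_test = max(1, round(len(rows) * test_fraction))
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


# Toy stand-in for a dataset: 100 rows numbered 0..99 (made-up data).
rows = list(range(100))

train_fixed, test_fixed = fixed_split(rows)
train_random, test_random = random_split(rows)

print("Fixed testing rows: ", test_fixed)
print("Random testing rows:", sorted(test_random))
```

Running the script a few times should show the fixed testing rows staying the same while the random ones change. Passing the same `seed` to `random_split` makes the random version repeatable too, which is one way to compare results between students while still using a random sample.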