Educational Data Mining for Technology-Aided Formative Assessment
Liveblog notes from a CALRG seminar given by Ilya Goldin, from the HCI Institute at Carnegie Mellon University, at the Open University, on 12 Feb 2013, entitled “Educational Data Mining for Technology-Aided Formative Assessment”.
His PhD research produced a system to support peer review. Mike Sharples met him at the Alpine Rendez-Vous.
Postdoc at Carnegie Mellon; PhD at Pittsburgh.
His dissertation, ‘Peering into Peer Review with Bayesian Models’, sits in the space where humans assess work for which the correct answer isn’t known (rubric-based peer assessment); his current postdoc work sits there too, but also covers ‘learner differences in hint processing’, where computers assess work for which the correct answer is known.
Feedback from peer learners – multiple peers give feedback on each paper. Want to help students who are learning to analyse open-ended problems. Also want to keep instructors in the loop.
Domain is open-ended problems in legal education. Problems in law are open-ended! Students may note different facts, invoke different concepts and theories. Multiple solutions may be OK. The answer is a free-text argument justifying one choice over others. Want to provide formative assessment to explain current performance, target performance, and how to improve it. Instructors need to know this too. (AI in Education conference this summer, CfP for workshop on peer review may be coming soon.)
Want to make inferences about hidden, latent artefacts – e.g. true student proficiencies, rather than how a few peers assess a given piece of work. EDM/statistical modelling is the approach to do this.
Research Questions – Is formative peer assessment valid, reliable; are the dimensions distinct; is the feedback helpful; can it be mined for latent information.
Rubric – like a scoring or marking guide; informs assessors of criteria (e.g. insight, logic, flow) and supports them in evaluating (by defining criteria – e.g. what is good, better, best). May structure feedback. The claim is that any kind of writing can be evaluated for these criteria (insight, logic, flow) as a minimum default – history, physics, whatever. For open-ended problems it’s less clear. Two strategies: one, focus on the argument per se, staying closer to the domain – this leads to a domain-independent or domain-relevant rubric. Two, focus on the content of the argument – which leads to a problem-specific rubric, which can use concept-oriented criteria.
Study: N=58 law students, midterm essays, uploaded into Comrade (a system he built). Each essay was sent to 4 random peers; reviews were submitted back to the author, and authors gave back-reviews to their peer reviewers. Also a multiple-choice quiz (concept-oriented). Students were assigned to one of two rubrics – one problem-specific and concept-oriented; the other domain-relevant and argument-oriented.
Question was hypothetical fact situation, with parties with legal claims against each other; task is to assess the factual strengths and weaknesses for each claim. Peer assessment was (quasi?) anonymous, with users identified by self-chosen nicknames (usernames).
Domain-relevant rubric: identifying issues, developing the argument; justified conclusion; writing quality. For each of these four dimensions, a rating 1-7 and an explanatory comment – rating defined at 1, 3 and 7 – e.g. “7 = essay identifies and clearly explains all relevant IP issues; does not raise irrelevant issues.”
Problem-specific rubric: Breach of non-disclosure, trade-secret misappropriation, violation of right of publicity, idea misappropriation x 2. Same rating scale for all dimensions. e.g. “3 = essay identifies claim, but neglects arguments pro/con and supporting facts; some irrelevant facts or arguments.” Different assignment with different rubric would need different key issues, but could re-use the rating scale.
Ratings alone often not well-justified, but getting reasons improves validity.
Built models of peer review and validated them. For each essay: the instructor’s score, plus each peer assessor’s mark. Choose the best-fitting model for each of the two rubrics.
Multilevel Bayesian Models – three. First is no-pooling: treat each student independently, as their own fixed effect. Second is partial-pooling: what we know about one student’s proficiency tells us something about their classmates. Third is distinct dimensions: not only do students differ, but the rubric dimensions differ too.
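The difference between no-pooling and partial-pooling can be sketched with a simple empirical-Bayes shrinkage estimate – a stand-in for the full Bayesian models in the talk, with made-up ratings data:

```python
# Minimal sketch of no-pooling vs partial-pooling estimates of student
# proficiency. Illustrative data, not from the study; shrinkage here is
# empirical-Bayes rather than full MCMC.
import numpy as np

# Hypothetical: 5 students, each rated by 4 peers on a 1-7 scale.
ratings = np.array([
    [6, 5, 7, 6],
    [3, 4, 2, 3],
    [5, 5, 4, 6],
    [2, 3, 3, 2],
    [7, 6, 6, 7],
], dtype=float)

# No-pooling: each student's estimate is just their own mean rating.
no_pool = ratings.mean(axis=1)

# Partial pooling: shrink each student's mean toward the grand mean,
# weighted by within-student noise vs between-student spread.
grand_mean = ratings.mean()
within_var = ratings.var(axis=1, ddof=1).mean()   # noise in peer ratings
between_var = no_pool.var(ddof=1)                 # spread of proficiencies
n_raters = ratings.shape[1]
shrinkage = between_var / (between_var + within_var / n_raters)
partial_pool = grand_mean + shrinkage * (no_pool - grand_mean)
```

Under partial pooling each student’s estimate is pulled toward the class mean in proportion to how noisy the peer ratings are relative to the spread of proficiencies – which is why it can outperform a naive average.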
Compared three models, for both types of rubric, using DIC (lower is better, penalty for number of parameters and reward for accuracy of prediction).
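DIC itself is simple to compute from posterior samples of the deviance; a minimal sketch (the deviance values are invented for illustration, not from the talk):

```python
# Sketch of the DIC computation from posterior samples.
import numpy as np

# Deviance D = -2 * log-likelihood, evaluated at each posterior sample
# (made-up values for illustration).
deviance_samples = np.array([210.3, 208.7, 211.9, 209.4, 210.8])
d_bar = deviance_samples.mean()   # mean posterior deviance

# Deviance at the posterior mean of the parameters (would normally be
# recomputed from the model; a plausible illustrative value here).
d_at_mean = 207.1

p_d = d_bar - d_at_mean   # effective number of parameters (the penalty)
dic = d_bar + p_d         # lower DIC = better accuracy/complexity trade-off
```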
For domain-relevant rubric, partial pool model best; for problem-specific rubric, distinct dimensions best. (Side joke about statistical significance for Bayesians.)
Correlation of mean rating with instructor score: domain-relevant was 0.46; problem-specific was 0.73, but 95% confidence intervals overlap.
Under no-pooling, mean peer ratings were not significant predictors of instructor scores; but! under partial pooling, mean peer ratings were significant predictors of instructor scores for both datasets. If we pretend the students’ scores are independent, we can’t predict the instructor scores – e.g. if you do a naive average of them! As done by many bits of software. Much better to model it better to get a summary estimate for the quality of the work.
Reliability. Intra-Class Correlation Coefficient ICC(C,4). Neither rubric was very reliable – most dimensions from each rubric were not reliable; this may simply indicate varied perspectives on an open-ended problem. Maybe reliability is not so relevant for the instructor – depending on context.
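ICC(C,4) – consistency of the average of 4 raters under a two-way model – can be computed from the usual ANOVA mean squares; a minimal sketch with illustrative data:

```python
# Minimal sketch of ICC(C,k): consistency of the average of k raters,
# two-way model. The ratings matrix is illustrative, not from the study.
import numpy as np

# Rows = essays, columns = the 4 peer raters (1-7 scale).
x = np.array([
    [6, 5, 7, 6],
    [3, 4, 2, 3],
    [5, 5, 4, 6],
    [2, 3, 3, 2],
    [7, 6, 6, 7],
], dtype=float)
n, k = x.shape
grand = x.mean()

# Two-way ANOVA sums of squares.
ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between essays
ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between raters
ss_total = ((x - grand) ** 2).sum()
ss_err = ss_total - ss_rows - ss_cols

ms_rows = ss_rows / (n - 1)
ms_err = ss_err / ((n - 1) * (k - 1))

icc_c_k = (ms_rows - ms_err) / ms_rows   # ICC(C,4) here, since k = 4
```

Low ICC values mean the 4 raters rank the essays inconsistently – which, as noted above, may reflect genuinely varied perspectives rather than noise.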
Were the dimensions distinct? The problem-specific rubric did have distinct dimensions (only 2 of 10 pairs of dimensions correlated); less clear for the domain-relevant rubric (all 6 pairs of dimensions correlated). In this problem, you can’t talk about a claim of non-disclosure agreements without also talking about trade-secret misappropriation – they could be separate in a broader context, but not in this example. So the correlation reflects that – which this modelling taught the instructor (they didn’t know it beforehand).
Inter-dimension relationships – did some studies about whether students found all of the dimensions, vs a trained grader; the student peers did a good job.
Under the distinct dimensions model: domain-relevant: only one dimension, Argument Development, was a good predictor of the instructor’s score (probably because of co-linearity of the dimensions). Problem-specific: three of five dimensions were good predictors.
The model lets you answer all sorts of interesting questions about the assessments, the students, and the assignment.
Quiz was correlated with nothing at all! Not valid. It’s hard to put together really valid multiple-choice tests.
Summary: peers can provide valid feedback. Problem-specific rubrics seem preferable. Feedback not always reliable, but we may not care about that. Rubrics can encourage students to treat dimensions as distinct. And valuable information from model parameters.
Future work: multiple exercises, further comparisons; wider evaluation of models.
The Bayesian models and the Comrade software may be re-usable; the useful ideas about rubrics should transfer very easily.
Learners differ; skills are distinct. We should accommodate these differences.
Q: When were the students and instructors introduced to the rubrics? Also, surprised/interested about the correlation – students are not independent. If that’s outside the Bayesian network, what is the nature of that inter-connection? Pre/post scores don’t correlate at all – are the students ‘aping’ what they assume the tutor will do, but not learning?
A: Instructor created the exercise; that allowed me to have students engage in peer assessment. Then worked on the rubrics; had no expectation that one would be better than the other. Expected his dissertation to be about reviewer learning, from pre-test to post-test. Students saw the rubrics after they wrote their essays. Instructor helped create the rubrics before marking the exams. Could the peer assessors be trying to replicate the instructor’s marking? Sure, but that’s a good thing. Marking is a teaching act. Would be interesting to get students involved in rubric creation, to help buy-in to the assignment task and what it’s about, before they do the task.
Denise Whitelock: Thanks. Question – did you just run this through with one set of students?
DW: The Pearson correlation – 0.4 for domain-relevant, 0.7 for problem-specific. My prediction is, if you ran this again with a different essay and the same students, the domain-relevant correlation would go up. The reason I say that is that the 0.4 couldn’t be bigger: asking students to mark insightfulness is asking them to judge qualities they understand differently from an expert. They have to learn those terms, and only through practice; the problem-specific rubric was easier for them to follow. Important to differentiate those. It’s about how we move people from novices to experts. Experts giving expert advice to novices often means the novices don’t understand the advice. Understanding what we can give, and what’s timely – that’s the holy grail.
A: Interesting – and an empirically testable idea! The American law school curriculum is three years; these are 2nd and 3rd years, who have been around these ideas and should have a good grasp of them. Not sprung on them. They have been evaluated on these criteria before.
DW: Instructor’s conception of insightfulness will be different and difficult for the students.
A: Also not well-defined. Not straightforward.
DW: Can teach people for 3y and they perhaps don’t get it until the end.
Q2: I wonder whether instructor score of the peer assessors would be predictive of how accurate those assessors were in grading their peers?
A: That work is ongoing! Have a model distinguishing peer assessors in terms of their bias – are they strict/lenient. Some tentative, early evidence that that bias is not a mere personality trait, it seems to be correlated to the instructor’s score of the assessor’s essay.
Q2: You could explore Denise’s point this way. May differ between the rubrics.
A: They are different by dimension, and by rubric.
This work by Doug Clow is copyright but licensed under a Creative Commons BY Licence.
No further permission needed to reuse or remix (with attribution), but it’s nice to be notified if you do use it.