Liveblog notes from Friday morning at LAK16 [full list of blogging]: Robert Mislevy keynote – A Dispatch from the Psychometric Front.
Dragan welcomes everyone. We want to make the conference better every year, so fill in the survey when you get it. Thanks to all the helpers.
Carolyn Rosé takes over to introduce Robert Mislevy. It was apparently colder here yesterday than in Siberia.
Keynote: A Dispatch from the Psychometric Front
Robert Mislevy, Educational Testing Service
There are a lot of overlaps between psychometrics and learning analytics. Noisy data about people’s learning capabilities. LA is about 100 years behind psychometrics (pm). In other ways, pm is 100 years behind LA.
Most people have never heard of the word ‘psychometrics’. The psychic impression gained from holding an object belonging to them – that’s much cooler.
Test items, item statistics, generate scores, calculate reliabilities. Almost all applications of large-scale test theory – the most visible part of pm, with the most impact – operate under the standard psychometric (pm) paradigm. Measure a construct, framed in trait or behavioural psych. One single measure. Each task/item is self-contained. Each response provides an item score, then the test score accumulates evidence; sometimes just summing, sometimes a latent-variable model like item response theory (IRT).
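As a rough illustration of the latent-variable approach mentioned here: the one-parameter (Rasch) IRT model links a single ability theta to the probability of each item response. This sketch is illustrative, not from the talk; the names and numbers are invented.

```python
import math

def rasch_p_correct(theta, b):
    """Rasch (1PL) IRT model: probability that a person with ability
    theta answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# The same examinee has a higher chance on an easy item than a hard one.
p_easy = rasch_p_correct(theta=1.0, b=-1.0)  # able person, easy item
p_hard = rasch_p_correct(theta=1.0, b=2.0)   # same person, harder item
```

When theta equals the item difficulty, the probability is exactly 0.5 – that is the sense in which person and item sit on the same scale.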
Over 100y of trying to tackle how you get evidence, how you use it, and what its qualities are, some key insights have emerged. The big problem: at the beginning of pm, they were at the forefront of psychology, data, inference and statistics. For large-scale uses, a lot of progress on statistical analysis and reasoning from evidence, but much less on the psychology and understanding of how people learn – or on the technologies and the data you can get from them. That's where pm is trying to catch up. At ETS, trying to bridge that gap. Will talk about our work taking the insights – the implicit wisdom in pm that's frozen inside particular methods and processes still hooked back to the psych and data pm evolved in. Richer data is now possible: continuous interactions, heart rate, eye focus.
Example of assessment more recently: SimCityEDU: Pollution Challenge. Methods from pm don’t apply when assessments look like this. Formative assessment, in the area of systems thinking. Students work on challenges that stepwise get them to think of more advanced problems in what systems are doing. Events having multiple effects, or causes; unintended consequences. Target is middle school students and policymakers.
The data looks like this [SGML/XML data, interaction log]. We have teachers, sci ed researchers, systems programmers, psychometricians – our job was to help figure out in this complex environment where students are working interactively, what do we see there of evidence of understanding and their strategies for these underlying concepts, so we can evaluate and give feedback?
We had a number of people working in the team. Working out how to talk to each other was hard. Another thing that was important, like in LA articles, to make this work we needed lots of iterations, link design to discovery. Try it out with kids, guess what, it doesn’t work like you expected. Game designers. Seek clues about their thinking.
Insights in the development of psychometrics / educational measurement.
Want to bust free from this paradigm and work in things like SimCity.
Probability-based reasoning: how to move from a test to something that applies in more fluid contexts. Three words – four: probability-based knowledge construction. One word in German. Seeing how these worked with simple problems, you can see the hooks to do a lot more.
Building models about what your data is supposed to be about. Data is not the same as evidence. How you build models to structure your thinking about data. This can only happen through some lens, some psychological theory about learning, and some kind of data. Psychometrics did really well with trait/behavioural lenses. Our challenge now is building models for psych theories. Interplay between theory, data, modelling. People are unhappy when someone else generates the data and just throws it over the wall.
Seeing reliability, validity, comparability, generalisability and fairness not just as measurement issues, but social values that have meaning and force outside of measurement wherever evaluative judgements and decisions are made. If you’re neglectful you’ll get sued. Large-scale assessment, some advances in fairness, awareness of alternatives, have come from lawsuits. I’ve had some legal advice about that joke, but I made it anyway. I’d be happy if you LA folks take these seriously. And also if you don’t – I’ll be an expert witness to the people who sue you.
The standard Ed Measurement Paradigm
A more scientific framework, from the last 150y, layered over the examination paradigm that's about 2000 years old, where the judgement was all implicit. Layering measurement over that: what we see may not be what we're most interested in. You find that out if you do it twice and don't get the same answer. Most people only think about measurement error when their score is lower than they think it ought to be.
Not much focus on underlying cognition or the learning processes. More about performance at some decision point.
The data can be quite complex, but human judges come out with something like a set of numbers – much complexity hidden, and they don't scale. To automate, we can't just get 20,000 judges; you have to get into the detail. Objective scoring, e.g. MCQs, scales better, but at serious cost in constraining the environment and the richness of performance you can capture.
Like in SimCity, looking for a blend: something that’ll scale, and something that’s complex, that captures what’s important to know.
Models – Galton, Cattell, Spearman, Thurstone. Lots of statistical advances. Early data mining – regression, correlation, cluster analysis, factor analysis, path diagrams. This was then the biggest data anyone had seen.
Probability-based reasoning. Probability isn’t really about numbers; it’s about the structure of reasoning (Shafer in Pearl, 1988). You can build probability models to structure your thinking. If you had thought that original test theory had no psych in it, or your models have no assumptions about the nature of learning, well, you’re wrong. You’re just not aware of them. One big advance is being aware that the models you use are like fire – advantages, but you can burn yourself. No matter how fancy the model, it’s based on a structure of reasoning. Important to make it more explicit. Important in SimCityEDU, so we can talk about it together.
Classical test theory.
X_j = theta + e.
X_j ~ N(theta, sigma^2_e).
What you see, the X, isn't what you're interested in; you're interested in theta. Previously I said it was in the head of the examinee. You can get pretty far with that, but I no longer think that's true. If you don't believe it, you'll do a better job of using models 'as if' it were true: you know you're reasoning with a model that may lead to improper inferences.
Anyway, a simple structure: the theta, what you really care about, drives the probabilities of what you might see. You take different looks and get different answers. One observation alone isn't much use; the idea becomes useful when you have multiple observations. You can build a probability model for the relationship between the Xs (that you see) and the thetas.
You can have more complex models, with latent variables, where what you can't see is suggested by what you can, but in a more complex and noisy way. Using model-fit tools, you can see where your nice theories are wrong. Then you can get better data, or a better model.
From what you see, having estimated the structure, make inferences back. Bayesian inference: what would you be likely to see for X if you knew theta? Ask what you know before seeing any response at all – an initial probability distribution. That might be flat, but in some situations, e.g. Algebra 2, they're only there because they passed Algebra 1. In an ITS at step 57, you know something about what they know from steps 1–56. Make some observations, use Bayes' theorem, get an updated belief about theta. More data, sharper distribution over theta.
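The update cycle described here can be sketched with a conjugate normal model: a prior belief about theta, a normal likelihood for each observation, and Bayes' theorem to sharpen the posterior. The numbers are invented for illustration.

```python
def update_normal(prior_mean, prior_var, obs, obs_var):
    """Conjugate normal update: combine a N(prior_mean, prior_var)
    belief about theta with one observation x ~ N(theta, obs_var)."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mean = post_var * (prior_mean / prior_var + obs / obs_var)
    return post_mean, post_var

mean, var = 0.0, 100.0            # near-flat initial belief about theta
for x in [1.2, 0.8, 1.1, 0.9]:    # four noisy observations
    mean, var = update_normal(mean, var, x, obs_var=1.0)
# Each observation shrinks var: the belief about theta sharpens.
```

A prior that isn't flat – e.g. what steps 1–56 of an ITS tell you – just means starting with a smaller `prior_var` and a non-zero `prior_mean`.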
A nice thing about this, important for SimCity-type situations: you can modularise your link functions. Start with expert opinion – make them up, from your theory of what those link functions might look like. As the situation changes, update them, and change the world to get better information – e.g. if they're good at reading, give more challenging material. This modularity is really important. It was hidden inside pm work in the 1940s, just inventing IRT to do a better job of making tests. That machinery for doing a little bit better had the seeds for doing things they couldn't dream of.
Real-time inference, decisions. Probability inference gives you a metric for quantifying evidence, a common framework.
Reliability. Replicate observations. The classic definition: the quality of your evidence for comparing different people. Locked into the population – comparing one person against another.
Comparability – then they did parallel tests, similar conditions from the outside.
Generalisability – what if there were different tasks, different raters? Another important aspect: you have this model, the conclusions look wonderful for this class; let's go to the same class next year. Sometimes it holds up, sometimes it doesn't.
Validity – originally, the correlation between your theta estimate and what you're trying to predict. Good, but not good enough. What is the underlying body of evidence and theory that supports your inference? Correlation is not causation!
Fairness … [skipped]
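The classic definition of reliability above – the quality of evidence for comparing people – can be made concrete as the ratio of true-score variance to observed-score variance, which the correlation between two parallel forms estimates. A small simulation, with invented variances, not from the talk:

```python
import random

random.seed(0)
n = 20000
var_theta, var_e = 1.0, 1.0                      # illustrative variances
reliability = var_theta / (var_theta + var_e)    # = 0.5 by definition

# Two parallel forms: same theta per person, independent errors.
thetas = [random.gauss(0, var_theta ** 0.5) for _ in range(n)]
x1 = [t + random.gauss(0, var_e ** 0.5) for t in thetas]
x2 = [t + random.gauss(0, var_e ** 0.5) for t in thetas]

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)
    sa = (sum((u - ma) ** 2 for u in a) / len(a)) ** 0.5
    sb = (sum((v - mb) ** 2 for v in b) / len(b)) ** 0.5
    return cov / (sa * sb)

r = corr(x1, x2)   # estimates the reliability of one form
```

With a sample this size, `r` lands close to the theoretical 0.5 – the parallel-forms idea in one line of arithmetic.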
One place in testing we’ve fallen behind is in psychology. Jim Pellegrino said it too.
More situative, sociocognitive view of psychology. Many disciplines, learning sciences, domain-based learning, sociolinguistics, net literacy, anthropology, psych branches.
Person Acting in Situation
Things happen, people learn, in the moment. Now cast in more socio-cognitive domain.
There’s a layer of human-level events. The social psych literature, linguistics, tells you we experience only the tip of an iceberg, all kinds of things are happening at levels above – socio-cognitive. Regularities in interactions between people. Language. LCS patterns. How we’re supposed to behave. We’re assembling pieces from many cultural extra-person patterns, shared among people who interact personally, through books, TV shows – that’s what we make meaning of, and make complex things. Becoming competent means becoming attuned to those patterns.
How does that happen? That's the cognitive part, below the waterline of the iceberg. Within-person processes; resources assembled into patterns to understand and act. KLI, CI theory, ACT-R, Lave, Hutchins, Engeström. If you infer poor reading comprehension from a test in English, and they don't speak English, you'll make a big mistake.
Data live in the middle. We're trying to make sense of them in terms of what's happening cognitively, and of what patterns are important to learn about. That's how we learn our first language.
These are important because as people move through the world, the patterns they become attuned to depend on the situations they move through. We need to be aware, what are we assuming that is going on that will influence performance.
From that pov, what do we need to know? What are the targeted patterns? What can I do next, etc. For assessment, what capabilities are about aren’t just knowing stuff. You have to know some stuff, but what’s important is using that in situations to do things. At least some of the time we have observations where people are doing things in the environment. Another project, troubleshooting of networks – just knowing the terminology isn’t enough, have to interact with the systems to get things done.
What do we think constructs are? I no longer think constructs are in the head of the student. If you polled pms, I'd guess about 1/3 would say they're in the head of the analyst, as I do; 2/3 would say the head of the student. Not bad – you can do useful things with that model, you just shouldn't believe it. You can work 'as if' it were true.
Each person is idiosyncratic. Patterns change over time, they’re contingent, that will affect how far the models you have travel.
The tasks will be more like SimCity, a lot less like those standard repackaged tests. Continuous evidence – we must characterise evidence, not ‘score responses’.
To make a claim about a student, you need data concerning their performance, but also a warrant, and data concerning the task situation. That data comes from the student acting in the assessment situation. And all that takes place in a social/cultural contextualisation: the student acts in a particular LCS milieu. So you need alternative explanations. Sometimes you have other information – e.g. you don't speak English, here's a Tagalog version. That's a shift from 'assessment is fair if it's identical' to 'fair if different on the surface, but in ways that still let us see what we really care about'.
That’s fine for a static test. When it’s something like SimCity, it’s continuous. You keep moving to a new situation, you have moving arguments.
New forms of assessment – SimCityEDU; Packet Tracer, Cisco; tetralogues, like Art Graesser's tutors; Sao Pedro, Ryan Baker.
I’ll skip some things here and go through to the cooler slides.
Computational Psychometrics. Very low-level data; features in it have syntactical meanings. Then sentences, then what those mean takes place in discourses [up to high-level interpretations]. Interaction back and forth between the theory and the design of the situation.
In SimCity, at the top, high-level interpretation: a construct that is level on a systems-thinking learning-progression variable. What do you get at the lowest level? Screen interactions, locations, times, clicks – meaning nothing at all on their own. First level up, you can interpret them as game-world verbs: e.g. "bulldoze", "rezone". Then look at patterns in those; they tell you about strategies, or lack of strategies – e.g. did you build a new low-pollution power plant before bulldozing the old high-pollution one? Strategic action sequences. It's at this level we have psychologically-relevant things we can hook into the probability level. So here I have my thetas, maintained as the kids work through, updated by different processing beneath. In SimCityEDU, summary functions of counts of these actions and system-state variables are the input variables to a dynamic Bayes net – a hidden Markov model with respect to level on the learning progression. End states map onto a 2D map of pollution and jobs; the aim is low pollution and jobs. With the Bayes net model, we can hook things up as we go along.
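A toy sketch of the kind of dynamic-Bayes-net update described here – a hidden learning-progression level tracked from observed action patterns. The levels, probabilities and feature names are all invented for illustration, not from SimCityEDU.

```python
# Hidden learning-progression level (low/mid/high) updated from
# observed strategic-action features, HMM-style. Numbers invented.
LEVELS = ["low", "mid", "high"]

# P(level_t | level_{t-1}): learners mostly stay put or move up.
TRANS = {
    "low":  {"low": 0.7,  "mid": 0.25, "high": 0.05},
    "mid":  {"low": 0.05, "mid": 0.7,  "high": 0.25},
    "high": {"low": 0.02, "mid": 0.08, "high": 0.9},
}

# P(observed action pattern | level), e.g. building the new plant
# before bulldozing the old one is likelier at higher levels.
EMIT = {
    "strategic":   {"low": 0.1, "mid": 0.4, "high": 0.8},
    "unstrategic": {"low": 0.9, "mid": 0.6, "high": 0.2},
}

def forward_step(belief, observation):
    """One HMM forward step: push the belief through the transition
    model, then weight by the likelihood of the observed pattern."""
    predicted = {
        lv: sum(belief[prev] * TRANS[prev][lv] for prev in LEVELS)
        for lv in LEVELS
    }
    unnorm = {lv: predicted[lv] * EMIT[observation][lv] for lv in LEVELS}
    z = sum(unnorm.values())
    return {lv: p / z for lv, p in unnorm.items()}

belief = {lv: 1 / 3 for lv in LEVELS}  # flat prior over levels
for obs in ["strategic", "strategic", "unstrategic", "strategic"]:
    belief = forward_step(belief, obs)
```

The belief is maintained and updated as the kid works through – the same modularity point as the link functions above: the transition and emission tables can be swapped out per situation.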
We have a library of fragments of link functions, so in recurring situations – each unique, but coming up in a recurring way, requiring some way of thinking – we can reuse them. We have a flow of activity in a game. SimCity has a state machine, a state vector of a couple of thousand variables. Some variables they already have; many psychologically-relevant ones they don't, so we add them. We get evidence about how a kid is thinking about a system. A layer of evidence-bearing opportunity detectors: agents, feature-detectors, monitor the state vector for opportunities, and when they see an evidence-bearing opportunity, we hook up. You can extend your library as you get more feedback.
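A minimal sketch of what such opportunity detectors might look like – small agents watching a state vector for evidence-bearing transitions. The state-variable names and detector logic are invented, not from the actual system.

```python
# Invented state variables: counts of high-pollution ("coal") and
# low-pollution ("clean") power plants in the game state vector.

def bulldoze_before_replace(prev_state, state):
    """Fires when a high-pollution plant goes before any replacement
    exists -- evidence about (lack of) systems-thinking strategy."""
    return (prev_state["coal_plants"] > state["coal_plants"]
            and state["clean_plants"] == 0)

def replace_before_bulldoze(prev_state, state):
    """Fires when a clean plant is built while old plants still run."""
    return (state["clean_plants"] > prev_state["clean_plants"]
            and prev_state["coal_plants"] > 0)

DETECTORS = {
    "bulldoze_before_replace": bulldoze_before_replace,
    "replace_before_bulldoze": replace_before_bulldoze,
}

def scan(prev_state, state):
    """Run every detector over one state transition; return the
    names of the evidence-bearing events that fired."""
    return [name for name, fires in DETECTORS.items()
            if fires(prev_state, state)]

events = scan({"coal_plants": 2, "clean_plants": 0},
              {"coal_plants": 1, "clean_plants": 0})
```

Events emitted here are exactly the "strategic action sequence" features that feed the probability layer above; growing the library is just registering more detector functions.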
Other examples. Using network theory to improve task design and scoring – Zhu, Shu and von Davier 2016. Characteristics of networks related to what judges came up with implicitly in their heads. Can move to automated scoring, scale up, see ambiguities and improve.
Other models – Hidden Markov models.
Social values. Reliability. With a Bayes net, or structured IRT, you can characterise the amount of evidence you're getting. I've had interesting discussions about whether reliability is important in formative assessment. With quick opportunities to change, reliability matters less than when you have less info and high-stakes decisions. But why would you want a framework that can't quantify the evidence you have? If you're A/B testing designs to get better evidence, you can't even decide which is better without it. It might make you sad if things turn out not to be what you believed, but from the pov of the community it's a good thing.
Validity too. I'm happy to see pressing on where my beliefs and models might be wrong. Replicability, different structures and techniques. Triangulation: if they agree, you're happier – some robustness. If you get different answers, better not believe any one of them is true.
The value of probability-based reasoning.
Situative/sociocognitive psychological perspective – necessary for the kinds of moment-by-moment interactions that simulations and interactive situations involve.
Assessment as measurement – is nested inside – model-based reasoning – is nested inside – assessment as argument – is nested inside – social context. You need to think about fairness and validity at that level.
Validity, reliability, comparability, generalisability, fairness – they’re important, and probability models help address them rigorously.
Dialectic between design and discovery is very important to improve the work we do, the products we have and the effect they have on kids and education.
Finally, computational psychometrics. Our first job is to figure out what that means. Synergy of psychometrics, LA, data mining.
Jeff: PM has a lot to offer the LA community. What transition can educators make to better assessment design? They focus on grades.
If I could do just one thing to improve assessment, it wouldn't be models or probability-based reasoning, but assessment situations more attuned to what we have in mind. Some things in LA are more like SimCityEDU, a challenge to the type of capabilities being developed – great. Others are the same items used 100y ago; not happy about that. Doing that more efficiently just gets us faster to where we wanted to be in 1945. We need to do better than that. LA: move to a richer understanding of our targets of learning. How do you improve practice in the classroom? Read Grant Wiggins, not my stuff. Reverse design: what would you like the kids to be able to do? Don't make a test for something else; work backwards. You can use classical test theory, but the right information about the right stuff matters more than the models. Policymakers' view of assessment is still stuck in 1945. The biggest successes I've had are outside the educational machine, e.g. Cisco Networking Academy – much richer experience, much better assessment.
Q: Your views about non-standard data? I work on a literature on predicting decisions, human decision-making – neuromarketers interested in whether seeing your facial movements can predict purchase. There is predictive power there, but nothing additional to pm. Is the state of the art here similar – is there a demonstration that adding this data improves on models built with more standard data?
Yes, it's different from marketing. In marketing, you look for predictors – who cares why they predict, they buy more; whatever information you can use, the better. But in assessment, our own desire, and fairness, require a rationale. Paul talked about this. You can go with lower prediction and higher validity. Prediction based on association is prediction based on correlations; those change across communities and over time – better overall, but worse for individuals. Even where we can't extract the rationale – e.g. a neural net – you ought to have a story for why the features have some relevance to what you care about. If not, I say don't use it. Non-standard data has useful impact – one project in computer vision, projects in cultural understanding. Non-standard data in that world would be e.g. gaze, body motion, proximity. All of those have real relevance, from ethnomethodological studies of communication. It's non-standard, but we have a very good rationale to use it.
Simon: Despite the emphasis on social and cultural dimensions and their impact, I didn't hear about self-report. Do you have a strong view about self-report data?
In SimCityEDU, some self-report was essential. We wanted to give opportunity for learning and evidence for assessment, but also to be engaging as an experience. Kids self-report: would you play again, would you tell your friends, what parts were fun, where did you get stuck? Those were really useful. "What were you thinking when you did this?" Those were less useful. Human events are the tip of the iceberg. Some things we can give accurate self-reports on; lots of other things we don't understand consciously – we're happy to give answers, but we're just making them up. I don't believe those. Higher-order skills, grit – very popular, people self-report on those. A recent article said: slow down. A study on persistence and grit is getting some usefulness out of self-reports; they're very noisy, but there are some useful patterns. But slow down where policymakers say let's assess grit, measure teachers on how they have improved grit, with self-reports from kids at the start and end of the year. That's a really terrible idea. Self-reports for high-stakes decisions, I would take off the table.
This work by Doug Clow is copyright but licenced under a Creative Commons BY Licence.
No further permission needed to reuse or remix (with attribution), but it’s nice to be notified if you do use it.