LAK16 Tue pm – L@S Ken Koedinger Plenary

Liveblog notes from Tuesday afternoon at LAK16 – the closing plenary of the Third Annual Conference on Learning At Scale, by Professor Ken Koedinger of Carnegie Mellon University

Starts with a pre-plenary survey of the audience at:

Arthur's Seat Edinburgh

There were 102 responses to the survey … 103, 104, as it’s going. 66% say they know when they’re learning. Do you know if your students are learning? 88% no. To improve a course, 50% say apply learning science. If they answer a video Q right, did they learn it? 85% say no.

That’s your pre-assessment.

Practical Learning Research at Scale

Making a difference in educational practice. I have two daughters, reminding ourselves it’s highly personal. We want to make a difference for students of the world, not just children but adult learners too.

Preview of overall argument.

You can’t see learning. You should beware of illusions of learning. As a student, you might think you’re learning from a lecture, and you may or may not be right. Instructors watching: you may be convinced, but you can’t tell if they’re learning.

What do we do? Data is a huge opportunity. Students need data, formative assessment feedback. Instructors need data too. Not just to adapt, but to improve instruction. Scaling it up, two big challenges. First, specifically, we don’t know what we know. Experts only aware of 30% of what they know. Second, the design space is huge, and not good evidence of transfer between domains. These are opportunities. Education science and tech can be really revolutionary. Not just spreading access and personalising, but the notion that we can have data to drive iterative improvement. We know a lot, but we’re maybe at the Model T stage. We could have jet airplanes in education by pushing the science further. I’m a big fan of data. Iterating on that data, gaining insights, not black boxes, but understanding what your models tell you.

A bit about what we know and don’t know, then what to do. Examples of scaling content analytics and iterative course improvement.

What do we know and don’t know?

Do students know when they’re learning? Can we see it? No. Students are poor at judging it. Their predictions of their assessment scores correlate pretty poorly with the actual scores. Study, Eva et al 2004: students got questions but had to predict how well they’d do, which was then related to their actual score. Correlations all over the place from -1 to +1. The ability to predict how much you’re learning is not associated with expertise, in this dataset, though some debate about that.

Furthermore, liking is not learning. If you ask what they think of the course, that’s important. Does that correlate with learning? The evidence is not good. There are efforts looking at observations of engagement or confusion; these are not strongly predictive of learning outcomes. Confusion can be a good thing for long-term learning. That’s all hard to.

Why is this so hard? They’re using their mental resources to learn, so have no capacity to monitor it, and they have no comparator.

What do we know about what works in supporting learning?

In general, the public dialogue sounds like there’s two ways to learn. Basics vs understanding. The research elaborated many others. We have some indications of progress. Spaced practice is better than massed for long term recall, but there are sometimes circumstances where massed is better. Conflicting evidence on direct instruction vs discovery learning. Active learning > lecture. Explaining good vs bad, explaining can be bad for learning. A lot comes down to an emphasis on reducing cognitive load vs forms of learning support that reduce assistance, putting students in situations that are more difficult but in which they construct their own knowledge. Assistance dilemma.

Designing instruction requires combining all this, which gets really hairy.

How big is the design space for learning support?

Two possibilities: More help/basics to more challenge/understanding. Except … there’s a spectrum from focused practice, gradually widen, distributed practice. Then another dimension – study examples, 50/50 mix, test on problems. That’s just two dimensions. We articulated 30 dimensions! When you combine them (concrete/abstract/mix, immediate/delayed/no feedback, … many others), they blossom in to a complex space, even with a subset of the 30 dimensions. All these options, and research suggesting what works for novices vs advanced … the design space is way more than 2 ways to teach. More like 3^(15*2) = about 205 trillion.
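The arithmetic behind that figure is easy to check. A minimal sketch, assuming (as the formula implies) 15 × 2 = 30 independent dimensions with three options each; the function name is just illustrative:

```python
def design_space_size(options_per_dimension: int, dimensions: int) -> int:
    """Count distinct designs when every dimension is chosen independently."""
    return options_per_dimension ** dimensions


size = design_space_size(3, 15 * 2)
print(f"{size:,}")  # 205,891,132,094,649, i.e. roughly 205 trillion
```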

If cross-course expts on the 30 dimensions were consistent, we could just apply learning science. Not just cognitive tutors but others.

Cognitive tutors

Learning by doing with adaptive support. Problem-solving activity, analysing authentic problems – e.g. analysing cell phone plans. As they work through it, an ITS tracks each move, not just individual moves, but the plan as it unfolds. The context matters – one response depends on previous answers the student made, tracking dependencies. Notice that e.g. they’ve missed the fixed cost, and there’s an available feedback message. Challenging questions, like where’s the cross-over. Context-specific instruction available when you ask for help. When they do this, or make an error, the system updates its model of what they do and don’t know, individualising it.

Lots of evaluations over many years. Recent study, 2nd year of implementation, doubling of student achievement of using the full cog tutor course, including teacher professional development. Also taking these ideas in college online courses, CMU Open Learning Initiative. Study comparing intro stats course taught by CMU instructor vs blended course with OLI materials plus instructor, half a semester. Outcomes tested in a test. Trad course 3% learning gain, blended course 18% gain in half the time.

With these technologies, that big space, there’s an opportunity to do A/B test – we did this from 2004, Pittsburgh Science of Learning Center. Committed to idea of using ed techs – not just cog tutors and OLI, but language learning, educational games – to do basic research. Social infrastructure to get researchers out of labs in to the course contexts. More than 360 in vivo experiments, now 820 datasets in DataShop.

What did we learn from that? Does one size fit all?

Table, reflecting on different studies in different content domains vs different instructional principles. Some of the rows – e.g. worked examples – they’re not all plus signs. Came to believe that there are different instructional goals, and what works depends on those knowledge goals. Furthermore, trying to understand the learning processes that are at work. We see that, for facts, memory processes are critical. When goals are about inducing a rule across contexts, need support for induction, refinement with practice and negative examples. If goals involve deeper understanding and rationales, need engagement in understanding and sense making. They don’t work if misapplied – e.g. long discussion about why the Chinese word for something is what it is. Competing recommendations from psych, ed psych, ed – but because they’re coming from different goals.

Delving in to worked examples, or reverse and call it the testing effect. Practice guide, from 2007. Recommendation about retrieval practice, use quizzing to promote learning. Worked example … looks like two opposite things.

Learn by doing or by studying?

Tests enhance later retention more than additional study of the material – testing effect. Vs worked example effect – worked example constitutes the epitome of strongly guided [good] instruction. Both stories have plausibility.

Testing ‘produces desirable difficulties’. Vs worked examples reduce ‘cognitive load’.  What’s right? What’s the mechanism?

Consider a typical content in a testing effect study. Study lao3shi1 = teacher, test just shows lao3shi1, looking for teacher. Give three study trials, then test trial. Vs 1 study vs 3 tests. On long term, better for more testing.

In contrast, worked example study, solve (a+b)/c = d, with steps. Then problem with the prompt. Similar, but the typical results are flipped, and stronger – example/test is better for long term than just test.

Did studies across domains. One in geometry. One-step example, but got more complex. Practice – solve problems step-by-step and explain. In both conditions, have to explain. Tutor feeds back. In worked example condition, half the steps show you how, tell you to study it, then explain. Results, worked examples improve efficiency – less time to study worked example than to do it – 20% less time. Greater conceptual transfer in 2/3 studies. Similar studies in chemistry, algebra. Always reduction in time, often better long term retention. Theoretical work on SimStudent. Found, if we give it examples and problems it learns more than just examples. Problems induce the patterns they need to generalise across contexts, but are often over-general, source of misconception. Feedback from formative assessment important.

Use KLI to explain discrepancy – Knowledge, Learning process needed, optimal Instruction.

Testing recall, the cog tutor studies, are mostly on factual content. But the others are on general skills in STEM domains, get the opposite effect. That explains the difference.

What should we do to scale practice?

Take theories and apply learning science? We should do at least this. But this is not enough.

US Dept of Education, spent millions of $ on big randomised field trials. About 10% positive results. We need to do better, have a scientific pipeline with a hit rate better than that. The science has been ignoring external validity issues. If probe more deeply, “learning scientists” don’t agree on what’s best.

Go beyond applying learning science, to DO learning science. Start with the existing theories, build models, but also collect data, use insights from data to redesign. Go through loops. Might also tell the world about the results and contribute to theories of learning.

This process through better social infrastructure … LearnLab is one of those, but our NSF funding is over. Every school and university should be a LearnLab! Global Learning Council report: Not just loop of continuous improvement, but administrators to change culture. Credit not just for research, but for improvement in our courses. We should draw upon other universities and contribute to that.

Three avenues for scaling practical learning research

  1. Every university a learn lab

Cognitive Task Analysis

Do this if nothing else! Expert systems, designers thought just ask experts, but they can’t articulate what they do. Clark has done many of these studies, 70% of expertise is tacit, outside conscious awareness. Automaticity, but also we learn implicitly. Courses modified by CTA produce much better learning, 1.5 sd effect size, even in a medical school context.

Transfer – paper-based post assessments, follow up surgeons in to the field. Catheter insertion. The CTA-based instruction students were much better, including how frequently they had to re-insert catheters.

Costly in time and effort, though.

Use data from our online systems, do a more quantitative approach.

Quantitative Cognitive Task Analysis

If you have a good cognitive model, you should see a smooth learning curve.

Graph average error rate as they get successive opportunities to exercise some area of competence. You want to see the error rate go down. Example from geometry cog tutor. They’re assessments, but also learning opportunities because they get feedback.
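A learning curve of this kind can be computed from tutor logs in a few lines. This is only a sketch: the `(student, skill, correct)` log format and the sample data below are made up for illustration, not taken from the talk or from DataShop.

```python
from collections import defaultdict


def learning_curve(attempts):
    """Average error rate at each successive practice opportunity.

    attempts: iterable of (student, skill, correct) tuples in time order.
    Returns a list whose index k holds the error rate over every
    student's (k+1)-th opportunity on a skill.
    """
    seen = defaultdict(int)    # (student, skill) -> opportunities so far
    errors = defaultdict(int)  # opportunity index -> number of errors
    totals = defaultdict(int)  # opportunity index -> number of attempts
    for student, skill, correct in attempts:
        k = seen[(student, skill)]
        seen[(student, skill)] += 1
        totals[k] += 1
        if not correct:
            errors[k] += 1
    return [errors[k] / totals[k] for k in sorted(totals)]


# Hypothetical log: two students practising one skill three times each.
log = [
    ("s1", "circle-area", False), ("s2", "circle-area", False),
    ("s1", "circle-area", False), ("s2", "circle-area", True),
    ("s1", "circle-area", True), ("s2", "circle-area", True),
]
curve = learning_curve(log)
print(curve)  # [1.0, 0.5, 0.0]
```

Which label goes in the skill column is the cognitive model: recoding the same attempts with different skill labels gives different curves.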

One example – up and down all over the place, not smooth, looks random. Not as simple as do more learn more. That’s a model with only one component to categorise all of the domain. If you code it with finer-grained skills, knowledge components, then you get a smooth(er!) learning curve, suggests the more detailed model is better.

Upshot: distinction between rough and smooth curves can produce better cognitive models. Generalisation of item response theory. Student’s general proficiency, plus how easy/hard each knowledge component is, plus for each opportunity to practice, what’s the increase in log odds of getting it right. Can see which model works better.
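That description appears to correspond to what the literature usually calls the Additive Factors Model, though the name was not noted here. A rough sketch of the prediction step, with invented parameter values and hypothetical names; in practice the parameters would be fitted to log data by logistic regression:

```python
import math


def p_correct(proficiency, skills, easiness, learning_rate, opportunities):
    """Predicted probability that a student gets a step right.

    log-odds = student proficiency
               + sum over the step's skills of
                 (skill easiness + learning rate * prior opportunities)
    """
    logit = proficiency
    for skill in skills:
        logit += easiness[skill] + learning_rate[skill] * opportunities.get(skill, 0)
    return 1.0 / (1.0 + math.exp(-logit))


# Invented values, purely for illustration.
easiness = {"plan-combination": 0.0}
learning_rate = {"plan-combination": 0.5}

p_first = p_correct(0.0, ["plan-combination"], easiness, learning_rate, {})
p_later = p_correct(0.0, ["plan-combination"], easiness, learning_rate,
                    {"plan-combination": 3})
print(round(p_first, 2), round(p_later, 2))  # 0.5 0.82
```

Two candidate skill codings can then be compared by how well each one's fitted predictions match the observed right/wrong data.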

A lot of individual curves look good, but some are flat, so are opportunities for improvement. Error rates – some are easy, some are hard, suggests difference in cognitive demands, so obviously not the same cognitive skill, must be doing something harder if the error rate has gone up.

Example where it’s not just applying a formula, but combining them. That’s particularly hard when asked to do it without any prompting. e.g. find area of square, or of circle. If same question but different values, very easy, just arithmetic. Better with three skill components, fits the data better.

Important thing is to interpret those results to improve the course. Realised not giving enough practice at key planning skills.

New problem type – to focus instruction on greatest need, so have a practice planning step only. That decreases busy-work on formulas, reduces overall time by 25%, and spend more time on the planning problems, and they do better on effectiveness too.

Students learn what they spend time on … but both groups spent time on problems involving combining formulas – and control spent more time on those problems. A well-engineered MCQ is better than intuitively-designed active/hands-on activity.

Scale CTA

Using theory and computational modelling. Is it obvious what the cognitive components are? Four algebra problems: 2x = 12, 6 = 3x, -x=5, -24=-4x.

SimStudent works on this. Give it a problem, it says I’m stuck. Show it, it learns a production rule. Builds a representation of the domain. Then, needs more demonstrations because hasn’t seen them. Next problem, does things itself. But doesn’t quite get the induction right, but the feedback helps them do it. Also learning fraction arithmetic. Here it gets really good at it. Uses (also) a representation learning through perceptual chunking. Shows better representation learning leads to better skills learning, generates more accurate models that fit the data.

Of those ones, SimStudent found -x=5 hard. No number in front of the x! As human students have problems too.

Scaling iterative course improvement

Learning by doing is 6x better than learning by watching. Does that generalise? Come along on Friday to our talk! Pathway to do this across courses. These are correlational, we need to do the experiment. These found data, analytics, can do them across domains and get a sense for generalisation. Lose internal validity, but gain external validity.

Overall argument

You can’t see learning; beware illusions. Data breaks illusions, for students and instructors.

Two big challenges – we don’t know what we know, and one size does not fit all. Need more A/B testing in this huge design space.

Ed science and tech is revolutionary. We can make these courses more widely accessible, more personalised, and much more effective, through iterative improvement.



Post-survey, same questions.

Joseph Williams: iterative improvement, sharing data. Mentioned expts with the $. Focus more microscopically, what if we built better randomised experiments. Ask instructor, which is the explanations, which the examples. Not just expts, but build it like Google.

Ken: Agree. We need a socio-technical infrastructure. Make it easier to construct A/B tests. Great examples, in educational games space. 7000 conditions in one study, varying features of a game along several dimensions. Gets 10k kids a day to play it, so can do that. That’s where scale can allow us to do something we couldn’t before. Even randomisation for features where we don’t know what’s going on is something we could do more effectively. People worry about ethics, that’s important, but we’re doing experiments on learning all the time but we don’t [check it].

Q: Many illusions about learning. Breaking them is important, we should do learning science. There are only so many learning scientists. How can we have more? Is there a way every practitioner could be one?

Ken: Yeah. Instructors already look at data. They’re not necessarily looking at pre/post results, because no pre tests. But part of that is to change the culture. Most of our instructors are scientists, should be Ok to do experimentation. Most universities have teaching centres that could support faculty to apply these techniques. Maybe we should be a model for K12 instruction. [!]

Q2: I feel depressed by the whole story. You said, we have to keep trying things, different variables, situations. Will there be 3, 5, 7 principles you could extract? If you want to give advice, predictive theory for someone in practice. 470-variable model is not intuitive or useful. Are there big questions we could aim for?

Ken: Definitely. KLI framework. Work through what these principles function to do, what are they learning. There will be meta-principles. It’s more about the if part than the then part. It’s not, do this, but if – under these circumstances. If focus on learning vocab, do optimised scheduling. Those are not the kind we’re producing, we leave the conditions off. We don’t have the science or theory yet. I don’t mean to be depressing. A/B testing lit, it is the case that even there, simple HCI things, we get them wrong. They’re often surprising. We need this iterative engineering process on top. The science needs to get better, and we need to check it’s working in our particular context.

Q3: I’m confused, good for learning. Started from theory. Once build distributed learnlab, it’s the case study world. My own design, improve what I do, it works in this case. It’s the death of theory in favour of data. Something is missing to bring it back?

Ken: Yes. [laughter] Interaction has to be there. Can get a long way with data and experimentation, but without interpretation, you don’t know what they’re telling you. Understanding, and actionable. Even local iterative changes, have to think about interpretation and using that to make changes. It’s not just the data.

Q4: There are people doing meta-analyses and reviews of the lit indicating what is good and what is bad. One problem is that it is not theoretically driven. If not a good theory, not just facts or memory, but comprehension and problem-solving and deeper learning, will have simplistic findings that don’t generalise. We’ll confuse practitioners. Testing effects are the answer! They’re not. We need to take in to account what the type of learning is, and have a theory as a basis. It may be wrong. Your approach did start with a theory.

Ken: These meta-analyses are useful. Most of them give the average effect size, but there are negative effects on every one. There’s random variation, but sometimes it’s because of these mismatches. Let’s dig in.

Piotr: I’m a fan. A lot of content at university level is vague, ill-defined. What is the cog task analysis on a presentation like this? That’s hard. Graduate level courses especially. Does this apply in contexts like that? Is it a different set of techniques? Is it economical to do there?

Ken: Important question. Not the only strategy, but one strategy on those ill-defined domains is to define them better. That’s what CTA tries to do.

Piotr: More like, research here, cites from last 5y, it’s very new, distinguishes it from K12, when you have it well-defined, the field has moved on.

Ken: We do want to get to the point, we’re close, to have a pretty good first shot at a course. If only taught a few semesters and it changes, maybe that’s the best we can do. But I don’t think it’s [quite like that]. We might start teaching it to younger students, change bits. Opportunities to iterate there. We should continue to do new stuff. When you have to, you have to use your intuition.

Looking at the responses to the post-test: only 58 responses. Come on folks!

I don’t know when I’m learning, that flipped. We should do learning science has increased. Answering in-video question means they have learned, [the answers are] even more right. Good job!

This work by Doug Clow is copyright but licenced under a Creative Commons BY Licence.
No further permission needed to reuse or remix (with attribution), but it’s nice to be notified if you do use it.


Author: dougclow

Data scientist, tutor, project leader, researcher, analyst, teacher, developer, educational technologist, online learning expert, and manager. I particularly enjoy rapidly appraising new-to-me contexts, and mediating between highly technical specialisms and others, from ordinary users to senior management. After 20 years at the OU as an academic, I am now a self-employed consultant, building on my skills and experience in working with people, technology, data science, and artificial intelligence, in a wide range of contexts and industries.
