LAK14 Tuesday am (3): Machine learning workshop

Liveblog from the second day of workshops at LAK14 – morning session. Agenda and timetable for the day.

Learning analytics and machine learning: George Siemens, Dragan Gasevic, Annika Wolff, Carolyn Rosé.

Industrial architecture

George kicks off, with an introduction. Today aimed to be introductory, an opportunity to onboard a next generation of people in the field. At the University of Texas at Arlington, previously at Athabasca. Introductions from people round the room; a fun diversity of experience, from no idea about machine learning (ML) to world experts.

Current state of learning analytics research through methodological lens: George Siemens

Bringing together the technical communities and the social/pedagogical communities was the aim of the first LAK. Interested in bridging those two, danger that middle road would exclude both. Have been quite successful in getting that balance reasonably articulated. Seeing representation of backgrounds in this session reflects that people who are interested in the educational process are interested in technical improvements, and vice versa. You may find conversation is above your comfort level, whether about ML or educational theory. That’s OK. Not so much the zone of proximal development but zone of maximal confusion. Growing process.

Definition – LAK11 one. Specifically to improve learning process and the environment. We’re interested in the learner, critically. If something around analytics doesn’t benefit the learner, it’s not learning analytics. It might be analytics in education. Adam Cooper (2012) definition, developing actionable insights. Big interest has been the centrality of the learner, and need for the process to improve the learning experience.

Historical influences. “I didn’t realise I was in to learning analytics” said someone. Broadly speaking, LA is a bricolage field, draws from a variety of specific disciplines. Not unusual.

1. Citation analysis. Garfield (1955) – used as model for Google – Page et al (1999). Just looks at how things are linking, not all nodes are equal. Today far beyond that, knowledge graph models, contextual factors. But general idea about analysing structure.

2. Social network analysis. Milgram (1967), Granovetter (1973), Wellman (1999), Haythornthwaite (2002). Configuration of social networks and how they generate experience.

3. User modeling. Rich (1979), Fischer (2001). Understand users by profiles, models.

4. Education/cognitive models. Anderson et al (1995), but also back to Ausubel etc, about how concepts formed/explored. Then in to ITS, and CMU became the epicentre of activity.

6. Knowledge discovery in databases – Fayyad (1996)

7. Adaptive hypermedia, Brusilovsky (2001). Understanding users, and content, bringing profile of student and structure of content, to improve learning experience.

Could also throw in stats models, AI.

8. Digital learning, online, elearning, MOOCs – because digital trails are available, big driver. Treating that data as a mediating agent, or a process, a way to gain insight in to a process. Many students. This raises the process, many universities have lots of data but not evaluated meaningfully, still at early stages.

LAK is more focused on HE, EDM a bit more at K-12, with overlaps.

Related fields – business intelligence, academic analytics – about how to better allocate resources, e.g. rooms for lectures. Important and relevant, but if learner doesn’t benefit, it’s outside learning analytics.

Two papers that articulate this: Baker and Yacef (2009) – EDM – five primary areas: prediction, clustering, relationship mining, distillation of data for human judgement, discovery with models.   And Bienkowski, Feng and Means (2012) five areas of LA/EDM application – modeling user knowledge, profiles of users, modeling knowledge domains, trend analysis, personalisation and adaptation – latter is a bit of a holy grail.

Why machine learning?

As a field, LA is young. Looking at a field that has a 4-5y history. Some Educause publications from 2000-2004, Oblinger and Campbell, more academic analytics. But as a research community we’re young. How do we increase the sophistication of the analysis we’re doing? The prevalence of techniques like SNA, discourse analysis, recommenders. Very healthy for a new community. To mature, make a distinct mark and impact, need to begin to advance discipline-specific approaches. The ones that are central in my mind are: neuroscience, and machine learning. Neuroscience isn’t quite ready for broad consumption; hear ‘brain-based learning’, it’s a bit like ‘butt-based sitting’. What else are you going to use? It’s going to be critical, but not in the next 5y timeline. Right now, the highest potential is machine learning. Fortunate to collaborate with machine learning people to help LA as a discipline mature. LA can provide benefits back to the ML community too. High potential to engage. Another session coming at the EMNLP meeting, George and Carolyn.

Also opportunities with wearable/ambient computing, but that’s really a data source not a technique of analysis.

Learning analytics literature analysis: Dragan Gasevic

Looked at previous 3 conferences, see the common themes, level of maturity, temporal trends, and see what the quality is. Validated evidence!

Method – ML! Concept graph extraction – OntoCMaps. Clustering. Manual content analysis. Research methods, author background, categories of papers. Interpretation. So ID the key phrases, then the relationships between them, then build conceptual graphs. A graph for each paper, feed them in to the clustering algorithm (same method, K-means). Also did manual analysis of the content – what are the research methods? Qual/quant/mixed. Also many that couldn’t be allocated, so tagged as other. What type of research – many papers were conceptual or position papers, and panel papers. Hope to see papers that are validation research, validating techniques with evaluation, applied in practice and evaluated in a real context. Then finally the background of the authors.

Found six clusters.

Always interpretative, subject to researcher bias.

Cluster 1: Learning behaviour and feedback. Dominant theme for LAK13 (N=24). Ways of crafting feedback, visualisations, PLEs, data mining. Most papers were proposal of solution; roughly 50:50 comp sci/educ; dominant category of method was ‘other’. Not strong research methods. This is a dominant trend. But some validation and evaluation research papers.

Cluster 2: Learning success and retention. Retention, at-risk students, adaptive learning systems, metacognition. Big focus in LAK12 (N=22), only a couple in LAK11 and LAK13. Similar types of papers – proposal of solutions, mostly ‘other’ methods, and balance comp sci/ed.

Cluster 3: Data mining and learning enhancement. LAK12/LAK13. Different approaches to improve T&L, enhancing learning, opinion papers – e.g. George and Ryan Baker LA/EDM one. Also pedagogical models using SNA. Mostly opinion papers, very little actual research, and 2:1 educ:comp sci.

Cluster 4: Recommender systems. Mostly in LAK11 (N=13). But not there in LAK13. Here more hard research, more quant. 2:1 comp sci:educ.

Cluster 5: Social network analysis. Dominant. LAK11 had N=11, LAK13 big too. Mainly proposal of solution.

[These cluster names are a bit arbitrary – some are exactly as labelled, but some are not – this cluster includes predictive analytics as well as SNA work.]

Main contrast LAK vs EDM is social learning – specific to LAK. EDM papers more focused on ITS.

Cluster 6: Social learning. LAK12 =17, LAK13=16. A fair amount of validation/evaluation research, more quant. Again 2:1 comp sci:ed.

Paper on trends in SNA and citation networks in LAK conference and special issues, paper on Friday with Shane. Dominated by computer scientists, but cultural thing to deal with. Conference papers more important to computer scientists than educational researchers. Need to make sure we get their voices here too. That’s a dangerous trend we’re seeing here, that conference will be taken over by computer scientists, and we don’t want that to happen.


Chris: At early LAKs, Banff, many people from provosts’ offices, we have some here. Missing in Leuven. Touch on whether you saw that wrt administrators of HE, within the institution, or Educause, and how that’s changed, or not prevalent in paper analysis?

Dragan: Wrt representation of authors, couldn’t identify their roles within their institutions. Most papers that are commonly cited are published by Educause. The other highly-cited papers are from SNA. There are one or two from educational data mining, the Baker/Siemens one. Also as a trend, more focused on different learning techniques.

George: Shane’s paper on Friday. Thinking about how to onboard new researchers, faculty members who’ve worked in related field for 20y. We have inefficient processes for that. We want to give newcomers to the field, at whatever level, a way to understand the structure of the field. Maybe look at the role in an organisation. Leuven didn’t have strong univ leadership presence; this year we do. Want to let the community see itself, newcomers ID their role. Interest in maturation of the field.

Q: Comment, ML and neuroscience weren’t a cluster in your work – why not?

Dragan: Not an independent cluster, but within each one there were 20-30% papers that used ML. We don’t have a ML cluster, but each cluster has cross-cutting concern with ML. That’s one view. Or we could tag each paper by the use of ML algorithms.

What can machine learning contribute to learning analytics? Carolyn Rosé

EMNLP workshop, at a ML conference, making prediction of dropouts from MOOCs, interdisciplinary exchange. Looking for good invited speakers, want to get them to understand that more goes on beyond the models.

Picked a topic not predictive modelling – probabilistic graphical models.

What does the graph mean? Structural equation modeling – diagram of it. Aim here to help you have stepping stones to good conversations with ML researchers. They’re more exploratory than predictive models, but help you find structure in data, can be used as features in a predictive model. Or to find connections between people in an emerging community, use that for recommendation. Good for exploratory data analysis – a very sophisticated factor analysis.

Model. This model is nearly in a semi-sharable state. The idea is of a conversational role. The role has a connection to not only what you talk about, but what aspects you focus on, and how you reflect your stance irt other people. People have a way they tend to behave in groups, we get a reputation over time. In this model, there’s an overall tendency people have (alpha), and each person fits within that space (Pi_p, Pi_q). But that’s modified by who they’re talking to at a particular time. Maybe you’re usually a leader, but here you’re not an expert, so perhaps you take more of a learner role. So your role (Z_q->p and Y), and that’s reflected in how you talk – Z’ and W. It’s a way of identifying in a large corpus the roles people take on. Model for each person what roles they take on typically, and in a particular conversation.

Everything is conversation to me! It’s important in learning.

How does it actually work? I started 25y ago, after my PhD I went off for 6y when my field shifted to large scale ML, big data. But I was learning education, had to come back and catch up. IME, it’s important for you to feel comfortable that you have your own expertise. The goal here isn’t for you to do your own ML, but to be able to have useful conversations with ML specialists.

Structural Equation Model (SEM) – look superficially different. But are about IDing latent factors, want to see more structure than what we see in a flat linear regression model. Arrows in a plate diagram are same as in an SEM. The boxes are because usually we have huge number of observed variables that come together in sets; there are loops, and the boxes represent iterations.

Fundamentally, it’s a Bayesian kind of modelling. We’re not making a hard and fast prediction, but modeling tendencies. Simplest aspect is Bayes Theorem. The probability of a category given an observation equals the probability of that observation in that class, times the prob of that class of events in general, divided by the prob of that observation. We’re reasoning, is it more like this category or that, so can throw away the denominator since it’s the same for all categories because we keep the same observation. Naive Bayes is very simple to build. In every likelihood formulation, always have a term that’s the prior prob of the category – how often do we see this category. The other terms come from observations. What’s naive is we assume that each term is independent, and sometimes that’s not true. Easy to compute.
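Written out, the rule as described looks like the following. The symbols are my own labelling rather than anything from the slides: c is a category, x is an observation, and x_i are the individual observed features that Naive Bayes treats as independent.

```latex
P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}
\qquad\Longrightarrow\qquad
\hat{c} = \arg\max_{c}\; P(c)\,\prod_{i} P(x_i \mid c)
```

The denominator P(x) has dropped out of the second expression because it is the same for every category being compared.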

Simple example – two bowls, A and B, with red balls in A and blue in B. Randomly pick one bowl, and one ball from it. Bowls=categories, balls=observation. It’s obvious here – if red, from A; if blue, from B. If assume it’s independent, easy to calculate probabilities.

Usually, it’s more complex and the bowls are mixed, they’re not homogeneous. Have to think harder about which bowl it was drawn from. First think about probability of each bowl – it’s 1/2, since they’re the same size. Also prob of red ball from bowl A, and red given bowl B. Simple Bayes multiplication, gives you the prob of where you drew it from. Likelihoods don’t usually sum to one (although they did here), so normalise them by summing them and dividing each by the total. [I’m pretty sure nobody who doesn’t already understand this is following it at all here. Alas.]
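A minimal sketch of the mixed-bowls calculation, to make the multiply-then-normalise step concrete. The bowl contents are assumed for illustration (A holds 3 red and 1 blue, B holds 1 red and 3 blue); they are not the numbers from the slide, and `posterior` is a hypothetical helper name.

```python
def posterior(priors, likelihoods):
    """Return P(bowl | observation) for each bowl via Bayes' rule."""
    # Unnormalised score: prior of the bowl times prob of the observation in that bowl.
    joint = {bowl: priors[bowl] * likelihoods[bowl] for bowl in priors}
    # The scores need not sum to one, so divide each by the total.
    total = sum(joint.values())
    return {bowl: score / total for bowl, score in joint.items()}

# Assumed contents, not the slide's: A = 3 red + 1 blue, B = 1 red + 3 blue.
priors = {"A": 0.5, "B": 0.5}      # bowls are equally likely to be picked
p_red = {"A": 3 / 4, "B": 1 / 4}   # prob of drawing red from each bowl

red_posterior = posterior(priors, p_red)
print(red_posterior)
```

With these assumed contents, a red ball points to bowl A three times out of four, which is just the ratio of red balls across the two bowls because the priors are equal.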

Now we ask, given the observations, what were the bowls? They’re like latent variables, clusters of tendencies – those are our roles. Generate them so that the observations make the most sense. Most complex ML is developing techniques for figuring out what those bowls are. ML people interested in how to scale their model, but they don’t care about what it means, how to interpret – it’s not fun for them. So it’s important to have a partnership and conversation.

If we have drawn two balls from mixed bowls, we combine by multiplying. Thinking about proportions, we scale, and we can evaluate what we think happened to give the combinations we saw.
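Combining several draws by multiplying can be sketched like this. It assumes the draws are independent (each ball goes back in the bowl), and the bowl proportions are again made up for illustration; `posterior_for_draws` is a hypothetical name.

```python
from math import prod

def posterior_for_draws(priors, ball_probs, draws):
    """P(bowl | sequence of draws), assuming independent draws with replacement."""
    scores = {}
    for bowl, prior in priors.items():
        # Multiply the prior by the prob of each observed colour in this bowl.
        scores[bowl] = prior * prod(ball_probs[bowl][colour] for colour in draws)
    total = sum(scores.values())
    return {bowl: score / total for bowl, score in scores.items()}

# Assumed proportions, for illustration only.
priors = {"A": 0.5, "B": 0.5}
ball_probs = {
    "A": {"red": 0.75, "blue": 0.25},
    "B": {"red": 0.25, "blue": 0.75},
}

two_reds = posterior_for_draws(priors, ball_probs, ["red", "red"])
red_then_blue = posterior_for_draws(priors, ball_probs, ["red", "blue"])
print(two_reds, red_then_blue)
```

Two reds make bowl A much more plausible than a single red does, while one red and one blue leave the two bowls level under these symmetric proportions.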

Given that insight, doesn’t tell us how to design own graphical model, build our own. Starting places, books – Latent Class and Latent Transition Analysis (Linda Collins and Stephanie someone). Also Multilevel and Longitudinal Modeling Using Stata, Volumes I & II. Probabilistic graphical models, don’t use the same words, but much the same. Written for social sciences, written with concern for methodology with ideas about interpretation.

Growth models. Simple form of graphical model. The idea is first, not everything fits a line. Usually growth is non-linear. It’s not that at every place on the x axis the effect has the same impact on your y variable. So fit to e.g. a logistic function, get a curve that has a steeper trend at one point than another. Our datasets have more structure, people going at different rates. So each user’s trajectory is a combination of personal trajectory (intercept and slope), trajectory of their subcommunity, and effect of their community. So growth of knowledge over time needs this hierarchical structure. It’s really very simple. (!) First you calculate a general trend. Next, adjust for each user’s intercept – they come in with different background. Then also adjust for slope, the trajectory – weight/coefficient on that term, which also varies. So see pattern for each particular individual or subcommunity, the shape is consistent, but the weight and slope are adjusted.
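A sketch of the shape of such a model, not of how it is fitted. Each level (general trend, community, user) is an assumed (intercept, slope) adjustment that is added up before the logistic curve is applied; all the numbers and the name `predicted_growth` are invented for illustration.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def predicted_growth(t, *levels):
    """Each level is an (intercept, slope) pair; levels add up, then the curve is applied."""
    intercept = sum(level[0] for level in levels)
    slope = sum(level[1] for level in levels)
    return logistic(intercept + slope * t)

general = (-2.0, 0.5)       # overall trend shared by everyone (assumed values)
novice = (-1.0, 0.0)        # lower starting point, same rate
fast_learner = (0.0, 0.3)   # same starting point, steeper slope

for label, user in [("average", (0.0, 0.0)), ("novice", novice), ("fast", fast_learner)]:
    curve = [round(predicted_growth(t, general, user), 2) for t in range(0, 9, 2)]
    print(label, curve)
```

Every trajectory has the same S-shape; only where it starts and how quickly it rises differ between users. Estimating those adjustments from data is the job of a multilevel modelling package and is not shown here.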

Also looked at how online support changes over time. Group trajectories and find the average. [Baffled by graph here of disease progression over time. Suspect this was an analogy I didn’t get, a quick domain jump.]

Can do this as an exploratory technique too.

EM Clustering – to find the clusters. Two kinds of probability distribution. Naive Bayes has a closed-form solution, can just figure it out from the data. But for more complex approaches, you have to explore it, don’t try every combination. So one class is expectation maximisation (EM), does it iteratively. Start with more or less random state. Each dot is a vector of features (randomly spread over an x-y graph). Randomly assign instances to bowls (clusters). That’s a starting place. Then based on that assignment, compute a central tendency for each bowl, then posit that those places in the multidimensional space are where the bowls are. Then compute how well associated each instance is with each bowl. It’s not a hard and fast clustering, just each instance belongs kind of to one and kind of to another, want to make the fit make sense. So reassign instances to clusters, reassign the clusters, iteratively. Again, it’s not hard and fast at the end.
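The loop described here can be sketched for the simplest case I can think of: one-dimensional points and Gaussian-shaped bowls. That choice of distribution, the function names, and the data are all assumptions for illustration; the talk did not specify them.

```python
import math
import random

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_cluster(points, k=2, iterations=200, seed=0):
    """Soft-cluster 1-D points into k Gaussian 'bowls' by expectation maximisation."""
    rng = random.Random(seed)
    # Start from a more or less random state: random soft memberships per point.
    memberships = []
    for _ in points:
        row = [rng.random() for _ in range(k)]
        memberships.append([r / sum(row) for r in row])
    for _ in range(iterations):
        # M-step: from the current assignment, compute each bowl's size, centre and spread.
        weights, means, variances = [], [], []
        for j in range(k):
            nj = sum(row[j] for row in memberships)
            mean = sum(row[j] * x for row, x in zip(memberships, points)) / nj
            var = sum(row[j] * (x - mean) ** 2 for row, x in zip(memberships, points)) / nj
            weights.append(nj / len(points))
            means.append(mean)
            variances.append(max(var, 1e-6))  # floor keeps a bowl from collapsing to zero width
        # E-step: recompute how strongly each point is associated with each bowl.
        memberships = []
        for x in points:
            row = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            memberships.append([r / sum(row) for r in row])
    return means, memberships

points = [0.8, 1.0, 1.2, 4.7, 5.0, 5.3]  # made-up data with two obvious groups
means, memberships = em_cluster(points)
print(sorted(means))
print([[round(m, 2) for m in row] for row in memberships])
```

The memberships stay soft throughout: each row is a set of proportions summing to one, not a single cluster label.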

Latent Dirichlet Allocation (LDA) is a way of doing this with text. Not just which bowl does each word sit in. Every one has that in. Each word comes from a bowl, each bowl comes from a text, prior distribution on words to take in to account as well. It’s like a SEM, but the box represents iteration.

Can compare LDA with extra machinery for the network structure – and that’s what her initial model came from, but underneath is a soft graph partitioning model. Soft assignment of people to subcommunities. Sometimes I talk to work colleagues, other times to other parents, sometimes to people who like to knit socks; and sometimes there’s overlaps of those communities. So her model ties this together. These models are ways of bringing in structure. Builds on methodology we already know. Building these required your buddies.

Then Daphne Koller’s MOOC on graphical models. Cotton, Learning R. – predictive model thing.

George: It’s not that LA folk who have more social inclination should become experts in this. It’s important to look at what’s being said about articulating the idea, and what the implications are. The complex technical details are connected specialisation. The ideas are complex compared to what we encounter in the social learning space. The emphasis is critical in LA progressing.

Predictive modeling in Learning Analytics: Annika Wolff, Zdenek Zdrahal

Draw on OU work, presented in more technical detail later. What are predictive models, how to choose a model, the importance of data, and what to do with it.

Predicting from student data some learning outcomes. So in a course, hit milestones – e.g. assessment, or final result. Goal is to use the predicted information to intervene with the student and change the outcome. At OU, ID students at risk of dropping out, and target early intervention. And advise student of best learning steps – will touch on this briefly.

The approach is using historical datasets, where you already know what the outcome was. Apply the model to some data as it’s running to predict the outcome based on what happened in the past.

Support vector machine example – it never works well for prediction, but it’s useful for finding out what’s going on in a predictive model. So example of octagons that show training instances – ones we know the outcome for. The model finds line between students to find pass/fail. Then take some data where we test instances, see if predicted result is right – so can test the accuracy of our model. We can refine the model, and when it’s pretty good, can run it on real student data. Drawback is we don’t know why it’s predicting what it’s predicting. Can say here’s the data, we predict this student will pass, that one will fail, don’t know why.
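The train-then-test workflow can be sketched with a much simpler line-finder than an SVM. What follows is a perceptron standing in for the SVM, purely to show the steps of fitting on instances with known outcomes and checking accuracy on held-out ones. The two features and every data point are invented.

```python
def train_linear_separator(rows, labels, epochs=1000):
    """Perceptron (a stand-in for an SVM): nudge a line until it separates the training data."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(rows, labels):
            score = sum(w * v for w, v in zip(weights, x)) + bias
            if y * score <= 0:  # instance is on the wrong side of the line (or on it)
                weights = [w + y * v for w, v in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:  # every training instance is now on the right side
            break
    return weights, bias

def predict(model, x):
    weights, bias = model
    return 1 if sum(w * v for w, v in zip(weights, x)) + bias > 0 else -1

def accuracy(model, rows, labels):
    return sum(predict(model, x) == y for x, y in zip(rows, labels)) / len(rows)

# Invented data: (share of days active on the VLE, mean assessment score); +1 pass, -1 fail.
train_x = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.7), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3)]
train_y = [1, 1, 1, -1, -1, -1]
test_x = [(0.8, 0.75), (0.2, 0.25)]  # held-out instances whose outcome we also know
test_y = [1, -1]

model = train_linear_separator(train_x, train_y)
print("training accuracy:", accuracy(model, train_x, train_y))
print("held-out accuracy:", accuracy(model, test_x, test_y))
```

The held-out accuracy is the number to trust; accuracy on the training instances only says the line fits the data it was drawn through.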

Another example – decision tree. Series of questions – parents visiting, weather, money. First one is the most important aspect, and that can be quite informative, so building decision trees from data is a fun first step. But tree on OU data is more complex and harder to interpret! We can prune it, and it makes it friendlier. Predicts several scores, based on activity in different weeks prior to the assessment, each decision node represents degree of clicking. Shows clicking on one activity is quite good, and if also high in next week, predicts high outcome. Tells you which activities are predictive of a good outcome.
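One common way a tree-building algorithm decides which question goes first is information gain: ask the question whose answer most reduces uncertainty about the outcome. The sketch below assumes that criterion (the talk did not say which one the OU trees use) and runs it on a tiny invented table of weekly click bands.

```python
import math
from collections import Counter

def entropy(labels):
    """Uncertainty (in bits) about the outcome within a group of students."""
    counts = Counter(labels)
    return -sum((n / len(labels)) * math.log2(n / len(labels)) for n in counts.values())

def information_gain(rows, labels, feature):
    """How much asking about `feature` reduces uncertainty about the outcome."""
    gain = entropy(labels)
    for value in {row[feature] for row in rows}:
        subset = [label for row, label in zip(rows, labels) if row[feature] == value]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

def best_first_question(rows, labels):
    """The feature a tree would put at the root under this criterion."""
    return max(rows[0], key=lambda feature: information_gain(rows, labels, feature))

# Invented data: degree of clicking in two weeks before an assessment.
rows = [
    {"week1_clicks": "high", "week2_clicks": "high"},
    {"week1_clicks": "high", "week2_clicks": "high"},
    {"week1_clicks": "low", "week2_clicks": "high"},
    {"week1_clicks": "high", "week2_clicks": "low"},
    {"week1_clicks": "low", "week2_clicks": "low"},
    {"week1_clicks": "low", "week2_clicks": "low"},
]
labels = ["pass", "pass", "pass", "fail", "fail", "fail"]

root = best_first_question(rows, labels)
print(root, {f: round(information_gain(rows, labels, f), 3) for f in rows[0]})
```

In this rigged table week 2 activity separates pass from fail completely, so it becomes the first question; week 1 activity tells you very little.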

Another example – simple Bayes. Simple example of Bayesian network – dependencies within your data. Graphical way of thinking. Prob of pavement slippery depends on prob that it’s wet, which could be because it’s raining or the sprinkler’s running, which depends on the season. Bayesian models are nice for talking about data. OU example of Naive Bayes – Gender, New vs continuing, Education, and VLE all influence prob of failing at first assessment. You can build this quickly, find out impact of different attributes, can add/take away to see how it affects your prediction. So if based on just basic demographics, then add in VLE and see what’s the impact in improving the prediction. We did this, found VLE data very important in improving it. Which is good – people wary of using demographics, and they are not the important thing – activity on the course quickly overtakes tendencies from demographics.
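The add-an-attribute, take-one-away experiment can be sketched with a small count-based Naive Bayes. The eight students below are invented and deliberately rigged so that VLE activity is the informative attribute; this only shows the mechanics and does not reproduce the OU result. Attribute names and function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels, features):
    """Count how often each attribute value occurs within each outcome class."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature, class) -> counts of each value
    for row, label in zip(rows, labels):
        for feature in features:
            value_counts[(feature, label)][row[feature]] += 1
    return class_counts, value_counts, features

def predict_proba(model, row):
    class_counts, value_counts, features = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior: how often we see this outcome at all
        for feature in features:
            # Add-one smoothing, assuming each attribute has two possible values.
            score *= (value_counts[(feature, label)][row[feature]] + 1) / (n + 2)
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}

# Invented records: is the student new, and were they active on the VLE?
rows = [
    {"new": True, "vle_active": False}, {"new": True, "vle_active": False},
    {"new": False, "vle_active": False}, {"new": False, "vle_active": False},
    {"new": True, "vle_active": True}, {"new": False, "vle_active": True},
    {"new": False, "vle_active": True}, {"new": True, "vle_active": True},
]
labels = ["fail"] * 4 + ["pass"] * 4

student = {"new": True, "vle_active": True}
demographics_only = predict_proba(train_naive_bayes(rows, labels, ["new"]), student)
with_vle = predict_proba(train_naive_bayes(rows, labels, ["new", "vle_active"]), student)
print(demographics_only["fail"], with_vle["fail"])
```

Comparing the two predictions for the same student is the whole experiment: retrain with a different attribute list and see how far the probability moves.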

Each model has different strengths and weaknesses. Some give you similar results. If you want to find out something about your data vs accuracy. More important than choosing a model is what to do with your DATA.

It’s important, comes from students, who are all different. They interact with each other and the environment.

Where does it come from? Demographic data – age, gender, previous education, where they live. Other static data – e.g. at OU, how they did on previous courses. Can be useful. Handy to understand how demographics change from one module to the next. Can’t assume all are the same; impact can be strong. E.g. course in the humanities, typically older students – we have students in their 80s. They don’t feel the need to do the exam, since it’s not their goal. So missing this out would miss out something important. Personal learning goal achieved, but got a fail result; have to understand the whole context.

Also assessment data. Often intermediate assessment, as well as final outcome – pass/fail/distinction. Often want to predict this.

Third set – learning data, interaction, learning activities. E.g online actions on a virtual learning environment (VLE). Also from lecture attendance, library access. I’m focusing on VLE data.

Example of it – e.g. total click counts over the whole course, plotted over course – can see drop off of student activity over time, but more constant from tutor. Shows increased efficiency, but perhaps also loss of motivation. Also peaks, which occur just before the assessments are due. It’s always the same, look at 100 modules and it’s all the same.  Asking people important.

Not all students click exactly the same. Total number of clicks, categorise in to bands. Interesting – no activity at all, N=317 – not all fail! and top category, N=355 – not all pass! So need more understanding about what it’s telling us about student engagement.

Categories of data – quiz, forum, glossary, ‘oucontent’ – learning pages, wiki, page, resource, subpage. Not all used in the same way, some courses may use forums a lot, others not; but may be idle chat or actual learning. Hard to know, and what impact they have, without extra investigation. We need to understand what we are doing to interpret our model, and know it’s doing what we want it to do.

Example – inactivity in a forum has different impact in different modules. In Module A, students who didn’t use forum were likely to fail. But in Module B, students who didn’t interact didn’t make much difference at all. Need to know what’s important and when, by testing, and talking to the module designers. Sometimes not clear cut.

Probabilistic model – Markov chain – working from the data up. More detail presented later. Different forum activities leading up to an assignment, then to final outcomes – pass, fail, withdraw.

Finally – what and when to predict. Not always clear cut. Predicting far in to the future is often not very accurate. If half way through the module, and there’s been three assessments – you know their demographics, you know their outcome from a couple of assessments, and the activity for the periods up to now. From history, can predict; in our experience, from about half way through, can easily predict outcome at the end. HOWEVER, people drop out around the first assessment. (I’m not sure about that – the survival curves I’ve been looking at don’t look like that – must pursue this with Annika.)

So: ML methods can predict student outcomes, and even discover which are critical milestones to predict. They can discover what students are doing and when, and how that impacts on final outcome – like with the decision tree (useful first step in modeling). And can directly test assumptions, e.g. about what students are doing when. Can ask module team what students should be doing that’s critical to their learning, and then test whether that really does matter.

Based examples here on OU, but same principles apply in classroom learning and MOOCs. Main difference is sources of data and size.


Q: When you look at number of clicks, and final outcome. Look at what kinds of activity. Have looked at correlation?

Annika: Yes, in some respects. Tried to remove some data, find out which are more informative.

Q: E.g. accessing content more important?

A: It really varies from module to module since they’re all different. So that’s why we use Markov model, it’s so different from one to the other. Not all modules use the VLE, some learn offline. The answer is no, because of variations. Within course yes, but not across.

Dragan: I want to confirm that. Some courses, forums not used at all. Others, Turnitin used a lot, others not at all. So multiple linear regressions, sometimes these explain 70% of variability of final grade; but in overall model only get 20%, and not even the same predictors. But wrt the interpretation of the data – one big general model, e.g. number of logins and resource views. But no good when the course level model has the fine-grained detail. We don’t account for individual agency of students, different metacognitive skills – e.g. effective social learners and not.

Carolyn: Structural equation modeling – e.g. logins, video views, they stand for something, but that’s not really the variable you care about, so construct the latent variables you do care about.

Phil: In predicting questions about drop out, I understand some of the data. As an ed psych, interested in focusing on what they know. In the courses you’re working on, is it sensible to count each item of knowledge as having the same importance?

A: Speak to people who structure the modules to understand more.

P: Maybe a grade is not the right thing to look at; a profile of competencies.

A: Yes, focused on retention, but this more focused modeling is very interesting, and how we can help individual learners to improve.

Chris: Your perspectives on what’s appropriate, not just research, but innovative LA for institutions. To be a leader, want to be innovative, what teams do we need, what competencies. Work with techie and non-techie, policy people.

Carolyn: Different people mean different things by ‘innovative’.

A: Have to bring in people from all walks of life in the institution to have input in to the process. Speak to people who are doing all sorts of things. We need interpretation at every level, without that view, risk of missing important things. It’s about bringing people together.

George: Relationship between speaking across boundaries, vs specific details that allow us to achieve something. Intent of the workshop, to become comfortable at the abstract level of what this means, what the behaviours are that contribute to student success. Trying to emphasise, be clear on what we’re trying to articulate, so we can move between boundaries. The technical skills do require a diverse team. Issue is losing the big picture in the fine details. Workshop this w/e, if conversation turns too quickly to technical details, lose capacity to be innovative. Concepts vs practicalities. Conceiving broad applications is different, separable from detailed discussion.

Coffee break

Towards A General Method for Building Predictive Models of Learner Success using Educational Time Series Data
Christopher Brooks, Craig Thompson and Stephanie Teasley

Chris speaking. Wants to get feedback on this work.

Interested in interaction patterns. How to find them in diverse TEL datasets? Data from variety of sources (different platforms), and variety of contexts (different pedagogical contexts). Towards a ‘1-click’ model of course – don’t have to be a tech to build a model for your course, just say ‘tell me about my course’.

The approach is building features from time series data. Activity and interactions looked at as time series data – all the prior knowledge, demographics and so on put to one side, focus just on the time series data. Create access features in the data mining model – binary values for daily, 3d. Then N-grams of patterns of access across the interaction corpus (parallel to text analysis), considering patterns from n=2 to n=5 along daily, 3d, weekly, monthly scales. (0,0,0,0,0)_day: 23 is five days of no access – and this person had 23 of those. (1,1,1,1,1)_3day: 15 – did you access at any point in 3d period. Regularity measure. (1,0,1)_month: 1 – was in first month, not second, but was in third.
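A sketch of how features like these might be generated. It assumes access is already reduced to one 0/1 flag per day, and that patterns are counted with an overlapping sliding window as with text n-grams; whether the authors count overlaps this way is my assumption, and the function names are hypothetical.

```python
from collections import Counter

def bin_access(daily_access, window):
    """Collapse 0/1 daily flags into 0/1 flags per `window`-day period (any access counts)."""
    return [int(any(daily_access[i:i + window])) for i in range(0, len(daily_access), window)]

def access_ngrams(flags, n_min=2, n_max=5):
    """Count every run of n consecutive flags, for n from n_min to n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(flags) - n + 1):
            counts[tuple(flags[i:i + n])] += 1
    return counts

# Invented learner: 12 days of access flags.
daily = [1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
three_day = bin_access(daily, 3)

daily_features = access_ngrams(daily)
three_day_features = access_ngrams(three_day)
print(three_day)
print(daily_features[(0, 0, 0, 0, 0)], three_day_features[(1, 0, 1)])
```

Each (pattern, scale) pair with its count becomes one column in the data mining table, so the feature space grows quickly as scales and pattern lengths are added.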

Dataset – Coursera MOOC data from U Michigan. 5k users who received non-zero grade. (distinction vs non-distinction) 87K forum accesses, 130K quiz accesses, 2.8M lecture video accesses. No semantics about what they wrote (e.g. sentiment analysis, SNA) – all things to bring in later. What can we do with just access data.

Used same method – decision trees – very interpretable. Built and evaluated using 1y. 10-fold crossvalidation; normal but not very rigorous. Classification accuracy of 91%, kappa=0.8199. Scaled from 0-1, 1 is perfect categorisation, 0 is chance. Training dataset is test dataset too here, i.e. not built on this class and tested on the next class.
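For anyone unsure why kappa is reported alongside accuracy, here is a minimal sketch of Cohen's kappa on invented labels (D for distinction, N for non-distinction). It corrects raw agreement for the agreement expected by chance. Strictly it can dip below 0 when a classifier does worse than chance, though 0 to 1 is the range that matters in practice.

```python
def cohens_kappa(actual, predicted):
    """Agreement between actual and predicted labels, corrected for chance agreement."""
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    classes = set(actual) | set(predicted)
    # Chance agreement: how often the two label lists would match if they were independent.
    expected = sum((actual.count(c) / n) * (predicted.count(c) / n) for c in classes)
    return (observed - expected) / (1 - expected)  # undefined when expected == 1

# Invented outcomes: 8 non-distinction, 2 distinction.
actual = ["N"] * 8 + ["D"] * 2
always_n = ["N"] * 10             # lazy classifier: 80% accurate, but no skill
one_miss = ["N"] * 8 + ["D", "N"]  # catches one of the two distinctions

print(cohens_kappa(actual, always_n), cohens_kappa(actual, one_miss))
```

The lazy classifier scores 80% accuracy but a kappa of zero, which is the point of reporting both numbers on an unbalanced outcome.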

Tried with a second class, failed miserably, because only had lecture view data, not all the others (corruption).

Trained & tested on sequential course offerings – built on this year, tested on next. More rigorous! Classification accuracy 78.1%, k=0.563 when just trained on lecture view data. Still pretty strong.

Trained and tested on sequential course offerings, using only first 5w. Got kappa of about 0.3.

Single weeks and single days are predictive, but multiple N-grams of 3d patterns is the most. Some patterns are more complex, and are also valuable.

Time series data approaches have actionable levels of accuracy. No input from subject matter experts or learning designers; neutral wrt source of logs – Blackboard, lecture capture, whatever – exciting since lots of tools do that, and no one vended solution that works with it; assumes similarity between course offerings.

Question primings: How much data do we need? Can we scale down to an LCMS? Can we identify and use more sophisticated temporal patterns? Skew/kurtosis in distributions? How do we make models more understandable? What should be the role of pedagogical and educational experts in the data mining process?


Carolyn: Since you know there are course-specific features, try multi-level models? How much weight is general vs specific, I can pass on resources. The features, I like them, did you just generate all patterns in a space, or some data-driven way to pick them out?

Both great. One thing we’re excited about is using this as digital learning styles, can we cluster courses by successful patterns – e.g. learners in STEM classes study in this way. Then change the boundaries. Secondly, we just brute-forced it. We need to examine that and understand it. Put in calendar day. Also in the MOOC data, timezone issue which we ignored. The process has become time-intensive to generate all these features, first time I haven’t been able to just click the button and it spits it out. Interested in scaling up, would love insight.

Phil: Actionable. What might be actionable wrt the instructor, and student?

I had good, passionate discussion in Michigan, LA fellows group, showing models to students. Comes down to latent variables and proxies using to measure these. One is really hard to take clear action on – gender as a demographic variable. What do you do? Do you make that available, what actions do you take? That’s true for a lot of demographic variables. Also see it in a lot of the activity variables too. Not always clear what the pattern means at a pedagogical level. That’s challenging. Then there’s the temporal dimension. We know the best predictor is previous grades, want to do better.

Josh: We’ve done some work on course averages on clickstream, comparing students to it, is a way to normalise it. In terms of comp sci/ML experts and educational folks working on this. The educational world does have a lot of learning science and theory background, sometimes that butts up against hard data – e.g. the theory I believe says this, the data says the other. Have you encountered that?

Definitely. One thing is technical folks, comp sci or engineers, don’t want to step on toes of educational researchers, this theory is written on tablets. We’re moving in to a world of data-driven approaches. We really need to inform theory with that data. That’s what’s exciting. I’m a comp scientist, now I can understand these theories like social constructivism, show evidence for it, see where it doesn’t work.

V: Growing the patterns, can you use association rule mining, to see what patterns make sense? It’s tricky here, you have days, typically you don’t have this constraint, not the same 1st day or 5th day.

Really interesting, fits with bag of words/bigram approach, don’t have that clearly understood. Weblogs is what I often think of. Here we’re doing associations at a different granularity, don’t grok how all that goes together.

Inference and Analytics: Visualizing results from an adaptive MOOC simulation: Zach Pardos

From Berkeley. Fall in to intersection of education and computer science.

Looking at students moving through pathways. Worked most recently on MOOCs, but applies in other environments too. What are students learning and what’s effective?

They learn from answering problems on tutoring systems, receiving feedback, and help. They also learn by interacting with each other. Every now and then they learn by going to class (slide of students sleeping in lecture theatre!).

RQ: How effective are the objects in a pathway at bringing the learner closer to her objectives?

Approaches – RCTs (expensive, policy barriers FERPA/IRB) – pre/post tests, look for gains. Data mining – often difficult to draw causal conclusions, even post-hoc. So blend RCT and data mining – Aist and Mostow ITS ML workshop 2000. Pardos and Heffernan, Journal AIED, 2011.

Zynga do rapid A/B testing, harder in education.

One particular example of a pathway. Learning objective of answering a homework problem correctly. Path a sample student took through MITx course on circuits and electronics. There’s a problem figure, two parts asking about the figure. Participant answers first part incorrectly, goes back to video for a minute, answers both parts incorrectly. Goes to book page 28, gets answer to the first part correctly, then the next.

Which resource was most effective? (Audience suggestion – use data!)

Look across thousands of students. (Pardos et al, EDM 2013). Find which was the most common resource before getting the question right, is a good baseline. But to score all of them, can take a probabilistic graphical model approach.
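The baseline mentioned here can be sketched roughly as follows, assuming a time-ordered event log of (student, kind, item) tuples. That log layout, the event kinds and the resource names are all hypothetical, and this is only the simple counting baseline, not the probabilistic graphical model.

```python
from collections import Counter

def last_resource_before_correct(events):
    """Tally the most recent resource each student saw before a correct answer.

    events: iterable of (student, kind, item) in time order, where kind is
    'resource', 'correct' or 'incorrect' (an assumed log format).
    """
    last_seen = {}     # student -> most recently viewed resource
    tally = Counter()  # resource -> number of correct answers it preceded
    for student, kind, item in events:
        if kind == "resource":
            last_seen[student] = item
        elif kind == "correct" and student in last_seen:
            tally[last_seen[student]] += 1
    return tally

# Hypothetical log modelled loosely on the example pathway above.
log = [
    ("s1", "incorrect", "q1a"),
    ("s1", "resource", "video"),
    ("s1", "incorrect", "q1a"),
    ("s1", "resource", "book_p28"),
    ("s1", "correct", "q1a"),
    ("s1", "correct", "q1b"),
    ("s2", "resource", "video"),
    ("s2", "correct", "q1a"),
]
print(last_resource_before_correct(log).most_common())
```

Note that this credits only the final resource, so anything earlier in the path gets no score at all, which is the limitation the model-based approach is meant to address.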

Under what conditions will this work? Difficult to do this. Simulate different conditions where we know the ground-truth efficacy, see how well we can infer the resources. So run a simulation – restart EM at different locations, different sub parts, sequence lengths, number of resources, number of users – gives 50k simulated experiments combinatorially. Ran each one 20 times with different parameters.

Sort the inferred list, vs the true list by efficacy, look for Pearson correlation. Correlation by number of resources – if <20, average correlation is about 0.6, 0.7 or so. Is that enough? Maybe not. View by number of participants, quite significant effects – 5k about 0.65, 30k about 0.8, 100k at about 0.9.
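The evaluation step can be sketched like this: compare the efficacy values inferred by the model with the ground-truth values set in the simulation, using Pearson correlation. The efficacy numbers below are made up for illustration and are not results from the talk.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and each sequence's deviation norm.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)

# Hypothetical efficacies for five resources, listed in the same order.
true_efficacy = [0.10, 0.25, 0.40, 0.55, 0.70]
inferred_efficacy = [0.15, 0.20, 0.45, 0.50, 0.65]
print(pearson(true_efficacy, inferred_efficacy))
```

A value near 1 means the model ranks and scales the resources much as the simulation set them. Values around 0.6 leave a lot of room for individual resources to be misjudged.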

Dynamic plot using D3, gives you a view of a histogram. Can select what you have in terms of parameters, and get the output. Interestingly, EM restarts didn’t have a big effect. Different user sizes, can look at how the distribution changes, rather than just the average number. Gives a better idea of what to expect in the real world. Good visualisation for getting closer to the data.

Scaling is popular for ML. Balance scaling vs paying attention to the domain. Worked only on this particular model, expectation maximisation step has been scaled, parallelised, other tricks too. 100k user simulations 20 times, with 200-length sequences takes a long time if not optimised.

Needs: dichotomously scored responses, a learning objective associated with the assessments, event log of student actions and assessments.

Limitations: diversity of pathways – students have to do different things. Becomes effective with 5k students. Doesn’t (yet) support multiskill composition. Needs manual coding for new dataset, begins to slow with 500k datapoints.

Applications – recommendation – for e.g. struggling students, go here because other students have benefitted. For instructors, they don’t know what’s working in their MOOC, this gives them feedback on efficacy.

Interested in datasets where there’s a large social aspect, and potential collaborations there.


Carolyn: More social recommendation, we’ve been thinking about this too. Resource recommendation vs social recommendation – can’t recommend the same person to everyone because they don’t have the capacity. How to do the load balancing?

One thought is, recording the interaction between that person and the one you’re suggesting it to, offer that to the next person. Also, treating students as resources, recommend based on where they are and how they responded, so not always the optimal student, but maybe someone who struggled and completed.

Malcolm: Sarah and her steps (example above). Is it the constellation of steps rather than a single resource? The lightbulb set off at the last one, but maybe required the previous ones.

That’s a good point. One reason why I like the broader view of resource activity, scoring the whole path not just the most recent before correct. You might see the same four resources in the lead up to correct. It should still have some benefit. But simulation did assume all were independent. Did look at ordering effects, at least in ITS, have found effects. So good reason to look at pairs, triplets or more – not just follow this resource, but following a sequence – essentially replacing the course layout with an adaptive one.

Q: (Couldn’t hear.)

Interested to measure benefit of iteration, if put back in to the environment, does its score increase, and by how much?



This work by Doug Clow is copyright but licenced under a Creative Commons BY Licence.
No further permission needed to reuse or remix (with attribution), but it’s nice to be notified if you do use it.

Author: dougclow

Tutor, project leader, data scientist, researcher, analyst, teacher, developer, educational technologist, online learning expert, and manager. I particularly enjoy rapidly appraising new-to-me contexts, and mediating between highly technical specialisms and others, from ordinary users to senior management. After 20 years at the OU as an academic, I am now a self-employed consultant, building on my skills and experience in working with people, technology, data science, and artificial intelligence, in the education field and beyond.
