Workshop: Machine learning
Chair: Ryan Baker
Speakers: Ian Witten (Waikato, WEKA system), Tom (Chuong) Do (analytics lead at Coursera)
Ryan welcomes everyone.
Tom Do: Practical data mining at MOOC scale
Talking about machine learning at Coursera. Bit light on machine learning. Fair amount of data analysis and tools building.
What is MOOC scale? Buzzword. Distribution of course sizes on Coursera – some small, some very large – 200,000 to 10,000. Not had hundreds of thousands of students before.
Promise and reality – global classroom: true! Quantifying learning – hard! Identifying subgroups – possible! Not always for all classes. 100k class may be 5k at the end. Split testing – it’s useful! A/B, multivariate testing.
Talking about tools used to ‘get things done’ in the Analytics team at Coursera.
Structure: each course has its own MySQL database. Course dbs distributed across multiple Amazon RDS instances. Student log data kept there. Clickstream data logged using separate eventing server. Like an in-house Google Analytics. What they make available to researchers: pretty much all of it. First, general course data – one db per class, in three parts: forum data, general course data, assignments. Identifiability risk based on contents – public forum data as well, linking identifiers would let you connect to grades. Have separated them out, with a de-anonymization mapping only provided when there are clear needs. Separately, clickstream logs. Large files contain one row per line, each row has a user action on a website. Finally, there’s an interaction between Coursera – CourseOps team – working with institutional data coordinator.
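A minimal sketch of working with a one-action-per-line clickstream file. The JSON layout and the field names (`user`, `action`, `video_id`) are assumptions for illustration; the talk did not give the real log format.

```python
import json
from collections import Counter

# Hypothetical log lines: one JSON object per line, one user action each.
SAMPLE_LOG = """\
{"user": "u1", "action": "play", "video_id": 2}
{"user": "u1", "action": "pause", "video_id": 2}
{"user": "u2", "action": "play", "video_id": 1}
{"user": "u1", "action": "play", "video_id": 2}
"""

def count_plays(lines):
    """Count 'play' events per (user, video_id), skipping blank lines."""
    plays = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event["action"] == "play":
            plays[(event["user"], event["video_id"])] += 1
    return plays

plays = count_plays(SAMPLE_LOG.splitlines())
for (user, video_id), n in sorted(plays.items()):
    print(user, video_id, n)
```

Reading line by line keeps memory use flat, which matters when the real files run to hundreds of gigabytes.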
Working with data – basic is writing big SQL query at the command line. Direct access to MySQL. Quite painful. But not missing anything, all the data is there. But hard to work with it. A second level is writing scripts (Python) that collect together SQL commands, can interface directly with DB but can also do post-processing outside SQL. Most people find themselves here once they’ve sussed the basic SQL commands. Then 3rd level is internal tools. Data access layer. Make certain things more efficient. Special-purpose data access object – object-relational mapping (ORM) designed specifically for Coursera. Data structures aim for natural syntax, avoiding writing raw SQL queries.
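A rough sketch of the "second level": a Python script that wraps a SQL query and does the post-processing outside SQL. An in-memory `sqlite3` database stands in for the per-course MySQL database, and the table and column names are hypothetical.

```python
import sqlite3

# sqlite3 (in memory) stands in for the per-course MySQL database;
# the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lecture_views (user_id INTEGER, lecture_id INTEGER);
    INSERT INTO lecture_views VALUES (1, 10), (1, 10), (1, 11), (2, 10);
""")

def views_per_student(conn):
    """Run the SQL, then reshape the rows in Python."""
    rows = conn.execute(
        "SELECT user_id, lecture_id, COUNT(*) FROM lecture_views "
        "GROUP BY user_id, lecture_id"
    ).fetchall()
    summary = {}
    for user_id, lecture_id, n in rows:
        summary.setdefault(user_id, {})[lecture_id] = n
    return summary

summary = views_per_student(conn)
print(summary)
```

A data access layer like the one described would presumably hide the query behind a method call, but its actual interface was not shown.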
Using IP[y]. Demo on his laptop, tunneling through via ssh to their db. Can write small snippets of code, e.g. find all sessions on cryptography. Documentation too. Can get data on particular session. (Connection didn’t quite work, though, alas.)
Can get quick access, get down to detailed info like how many times for individual students do they watch individual lectures.
Ken: How do you deal with labelling issues – text fields in logs. Videos too. How do I figure out which video has which label.
Work with pandas data frames (?). Have an index field that IDs the resource. Has simple id numbers for the course – e.g. video id 2, id 1.
Ken: Labels on text fields people have filled out?
We don’t assume people will parse a URL. There’s a set pattern about how to access each resource.
For questions embedded in the videos – they’re associated with an in-video quiz – each question has a quiz_id.
There are analogous access objects for other elements of the courses, like forums etc.
Don’t have to write SQL queries the whole time. It’s more like Matlab, Octave, R. Interactive development, cell-based worksheets. Can work on small bits and iteratively develop in the browser.
This is currently our internal tool, we’re trying to figure out how to make it available.
Q: R or Qualtrics similar reporting tool?
Not sure what you mean. This is a development environment. You could create dashboards and reports through this.
Q: RStudio vs iPython?
Largely chose on familiarity with the tools. Lots of us are used to these, NumPy, SciPy etc.
Q: Python – Udacity, EdX, Khan Academy all using. Central programming language.
Beyond that, some people are interested in eventing data. It’s a lot of data to look at. Usually several hundred gigs. Google BigQuery. Turn our data in to CSV, upload it, can do queries online. BigQuery is a superfast online database you can do queries on. Because massive parallelisation, lots of queries that’d take a long time you’d get done in seconds. So for us it’s totally worth the cost to do.
Example 1 – mixture model analysis.
Looking at retention in various classes. (Same as the EDUCAUSE paper by Andrew Ng, Daphne Koller and others) – a 2-component mixture of exponential distributions is an excellent fit to the observed data, much better than a single exponential. Not clear about interpretation. But this type of analysis, and rate of drop-off in the high retention group, gives us a good example.
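For illustration only, a sketch of fitting a two-component mixture of exponentials with EM. The talk did not say how the fit was done, so the EM procedure, the starting guesses and the synthetic data below are all assumptions.

```python
import math
import random

def fit_two_exponentials(xs, iters=200):
    """EM for a two-component exponential mixture.

    Returns (w, slow_rate, fast_rate, log_likelihoods), where w is the
    weight of the slow-decay (high retention) component.
    """
    mean = sum(xs) / len(xs)
    w, a, b = 0.5, 0.5 / mean, 2.0 / mean  # crude starting guesses
    history = []
    for _ in range(iters):
        # E-step: responsibility of the slow component for each point.
        resp = []
        loglik = 0.0
        for x in xs:
            pa = w * a * math.exp(-a * x)
            pb = (1.0 - w) * b * math.exp(-b * x)
            resp.append(pa / (pa + pb))
            loglik += math.log(pa + pb)
        history.append(loglik)
        # M-step: weighted maximum-likelihood updates.
        ra = sum(resp)
        rb = len(xs) - ra
        w = ra / len(xs)
        a = ra / sum(r * x for r, x in zip(resp, xs))
        b = rb / sum((1.0 - r) * x for r, x in zip(resp, xs))
    return w, a, b, history

# Synthetic drop-out times: 40% slow decay, 60% fast decay (made-up rates).
random.seed(0)
data = [
    random.expovariate(0.1) if random.random() < 0.4 else random.expovariate(2.0)
    for _ in range(2000)
]
w, slow, fast, history = fit_two_exponentials(data)
print(round(w, 2), round(slow, 3), round(fast, 2))
```

The log-likelihood recorded in `history` should never decrease from one iteration to the next, which is a handy sanity check on an EM implementation.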
So find – retention = 0.4 * 0.92 ^(number of hours of lecture video).
In programming classes – high retention. Peer grading more variability. To predict retention for a given class, this is a good point: 40% belong to high retention group. 60% unlikely to. High retention is defined by the mixture model, it fits two models, one high one low retention. The higher group has about 92% retention per hour of video watched. The fit is not improved for any classes, it’s consistent across them. These models let you make predictions – if have class with 21h video, likely 7% retention in terms of watching videos to the end. But if shorten to 10h, likely 17% retention.
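A quick check of the arithmetic, using only the figures quoted in the talk (40% in the high retention group, 92% retention per hour of video). The function and parameter names are made up for this sketch.

```python
def predicted_retention(video_hours, high_group_share=0.4, hourly_retention=0.92):
    """retention = 0.4 * 0.92 ** hours, using the figures quoted in the talk."""
    return high_group_share * hourly_retention ** video_hours

for hours in (21, 10):
    print(hours, "h ->", round(100 * predicted_retention(hours)), "%")
# 21 h -> 7 %
# 10 h -> 17 %
```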
Ken: Thought about responses to these models – actions that can be taken. Is there anything you can do? If you ramp up enrollment with advertising, do you just get it in the high-drop group?
A lot of speculation, not much time to do much experimentation. Model suggests things you could do – the rate in the higher group, the rate in the lower, and the split between them. It’s probably not a great idea to try to boost the lower retention group, hard to really move them to completion. But if can shift them from lower group to higher group, ID the triggers that move them – could study maybe the classes with high proportion in the high retention group. Right now no confident results, but some rough things. Classes that have a very focused and narrow audience tend to have more in the high retention group because those who come in have a good idea of what they’re going to get.
Q: Any features about the design, rather than the topic?
I have no idea. We don’t have a good catalogue about the design of our classes. Need to go back and categorise. (Note to self: talk to them about this! Links to our experience at the OU with the Learning Delivery Data.)
Example 2 – course co-registration.
Model is t-distributed stochastic neighbour embedding.
Visualisation of courses mapped on to a 2D space. Look at co-registration data, see interesting bits and pieces of structure. See clusters there. Horse nutrition and dog breeding map together. (Note to self: this’d be a cool way to explore/visualise OU co-registration data too.)
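This does not show t-SNE itself. It is a sketch of one possible input to such an embedding: a course-by-course similarity computed from co-registration. The Jaccard measure and the registration data are assumptions; the talk did not say how similarity was defined.

```python
from itertools import combinations

# Hypothetical registrations: student id -> set of course names.
registrations = {
    "s1": {"Horse Nutrition", "Dog Breeding"},
    "s2": {"Horse Nutrition", "Dog Breeding", "Cryptography"},
    "s3": {"Cryptography", "Machine Learning"},
    "s4": {"Machine Learning", "Cryptography"},
}

def course_similarity(registrations):
    """Jaccard similarity between courses, based on shared students."""
    students_by_course = {}
    for student, courses in registrations.items():
        for course in courses:
            students_by_course.setdefault(course, set()).add(student)
    sims = {}
    for a, b in combinations(sorted(students_by_course), 2):
        sa, sb = students_by_course[a], students_by_course[b]
        sims[(a, b)] = len(sa & sb) / len(sa | sb)
    return sims

sims = course_similarity(registrations)
for pair, value in sorted(sims.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 2))
```

Taking `1 - similarity` gives a distance matrix, which is the kind of input a neighbour-embedding method can place on a 2D map.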
Q: How do you gather data across courses?
In the structure, have coursera.org that shuffles to individual course sites, they have their own db. But main machine also has centralised registration data. Need to build out a data warehousing framework. For now, mainly use the data as is.
Example 3 – peer grading bias
Piech and Huang (EDM 2013)
Some graders too picky, some too lenient. How do I know as a student which is which, am I being penalised? Implemented a latent factor model, using an algorithm taking the grades, and adjust them to control for grader bias.
Q: Factor analysis? (inaudible)
The initial model is completely blind to stratifying variables. It’s interesting about peer grading. So far we don’t have a great way of dealing with this.
We can see that by applying these models, using Python toolkits, about 30 lines, can significantly improve grades towards true grades – fsvo ‘true’ grades.
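To give a feel for what "about 30 lines" might look like, here is a much simplified additive bias model: observed score = true score + grader bias, fitted by alternating averages. This is an assumption for illustration, not the Piech and Huang model, and the grades are invented.

```python
def estimate_bias(grades, iters=500):
    """grades: list of (submission, grader, score) triples.

    Fits score = true_score[submission] + bias[grader] by alternating
    averages, then centres the biases on zero so the split is unique.
    """
    subs = {s for s, _, _ in grades}
    graders = {g for _, g, _ in grades}
    true_score = {s: 0.0 for s in subs}
    bias = {g: 0.0 for g in graders}
    for _ in range(iters):
        for s in subs:
            vals = [score - bias[g] for s2, g, score in grades if s2 == s]
            true_score[s] = sum(vals) / len(vals)
        for g in graders:
            vals = [score - true_score[s] for s, g2, score in grades if g2 == g]
            bias[g] = sum(vals) / len(vals)
    shift = sum(bias.values()) / len(bias)
    bias = {g: b - shift for g, b in bias.items()}
    true_score = {s: t + shift for s, t in true_score.items()}
    return true_score, bias

# Invented data: grader A is 5 points lenient, B is 5 points harsh, C is fair.
grades = [
    ("s0", "A", 75), ("s0", "B", 65),
    ("s1", "B", 75), ("s1", "C", 80),
    ("s2", "A", 65), ("s2", "C", 60),
    ("s3", "A", 95), ("s3", "B", 85),
]
true_score, bias = estimate_bias(grades)
print({g: round(b, 2) for g, b in sorted(bias.items())})
print({s: round(t, 2) for s, t in sorted(true_score.items())})
```

The split between score and bias is only defined up to a constant, hence the centring step. It also needs graders to overlap on submissions, otherwise biases cannot be compared.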
This isn’t the total of what we do, but gives you a flavour.
George: We’re always looking at being able to give learners access to data. DataShop done in K12 space. Any commitment from Coursera to make things available to researchers?
We’re pretty liberal with sharing data with institutional partners. If they’re interested, contact CourseOps team, more or less flat SQL exports. Done less in the case of researchers who aren’t partner institutions – we refer them to partners for approval. In terms of getting data to students, haven’t done so much about that.
George: Don’t have public database where people can grab anonymised data? All requires a formal request?
Yes. The issue is data permission – whether it’s Coursera’s, the institution’s. Taken view that to share data we work with our institutions and what they’re comfortable with.
Q: Do you believe it can be anonymised?
That’s a tough one. There is some amount of hypocrisy in saying ‘deanonymisation mapping’. You’re right. We think of this in terms of level of risk. Certain data, if it got out, say the general course data associated with anonymised IDs, if it was just MCQ answers, grades – perhaps low risk. But when broadened, add short-answer questions (e.g. name!). The best we can do is ID level of risk, leave it up to combination of institution and their IRB about risk and what’s acceptable.
Q: Mostly about the data, alluded to split experiments. I’m interested in different ways you architecturally have the ability to do RCTs or A/B trials. Under the hood, how do you do it?
It’s rather primitive. Many types of A/B experiments you could do. Some are easy. We do a lot of A/B experimentation on emails – because no dependencies with rest of the codebase. Tried swapping in and out lecture videos, quiz questions – for those, a bit more hackish work that’s done. Someone on Engineering takes it on as a special case. We don’t have a full comprehensive A/B testing system. We do for the main website, but not surfaced for instructors. Main thing is email testing.
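One common way to do the assignment step for something like an email split test is to hash the user id with the experiment name. This is a generic sketch, not a description of Coursera's system; the function name and experiment name are hypothetical.

```python
import hashlib

def assign_variant(user_id, experiment, variants=("A", "B")):
    """Deterministically map a user to a variant of a named experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]

counts = {"A": 0, "B": 0}
for user_id in range(10000):
    counts[assign_variant(user_id, "welcome-email-subject")] += 1
print(counts)
```

Because the assignment depends only on the ids, no lookup table is needed and a user always sees the same variant, which suits emails that have no dependencies on the rest of the codebase.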
Q: Paths through courses – take this course, then that course – suggest courses for them. Can you combine data?
Sure. It tends to do that in a wrapper-ish way, it’ll access data from different dbs. You can write scripts that’ll do it, running them will take say an hour.
Ken: Looking at learning outcomes, do you have outcomes with good post-tests of achievement? To look at say what behaviours are predictive of good outcomes.
Haven’t done that much looking at it. Small select cases where we’ve taken an interest in the pedagogy. Concept of allowing repeated submissions of assignments.
Ken: Independent variables – but what about the dependent one – anything with trusted final exam?
Not always the case. Leave it up to the instructor. Some have final exams. Some go beyond and offer the signature track program, students authenticate prior to every assignment. A subset of those are available for accreditation with external company, 3h proctored test. For those you have assessment beyond whatever happens in the class. The data is available, but not much.
Q: Also has many confounding factors. Hard to go from final score to here’s how students have learned.
Bodong: Can you talk about experience data, clickstream data.
Not done much about that. Anecdotally see cases where people zoom to parts of the video that are confusing or interesting. Most common because it shows something visually for the first time. Say in modern poetry, it was where (inaudible). Not a ton of actionable insight for instructors.
Ken: Learning progress issue. Do you have courses with assessments of learning objectives repeated throughout the course?
Q: Individual faculty at other institutions develop the courses. Many don’t do this at all.
Ken: We need to fix that somehow.
Q: Sentiment analysis (inaudible)
Haven’t done anything on those lines. Don’t have resources in house, but some partners interested.
Ian Witten – The WEKA Open Source Data Mining System
I started the WEKA project, but that was 21 years ago. It’s been stable for 5 y. I stopped working on it 10y ago, but it’s been successful so I can’t deny knowledge of it. Any detailed questions about whether it can do that, I’m the wrong person – Google knows that.
Half the room have used WEKA. Why are you here?
A: We were here for the Coursera talk.
Someone: I’m here for WEKA!
I call it butterscotch ripple. The Willy Wonka quote. WEKA does this little thing at the core of data mining. Gathering your data, deciding your question, amalgamating data, cleaning it, how you’ll deploy it – that’s 93% perspiration, and it’s outside WEKA. A lot of people say I have data, I have WEKA – it doesn’t work like that. Have to work out what you’re trying to do. Maxim to get down and dirty with it, really understand it. It’s your data, you need to understand it and operate the tool. That’s why we push it out.
6% electricity – that’s feature engineering. The secret sauce. WEKA helps a bit with this. 4% evaporation, that’s using WEKA to do classification, deploy the results. Tiny part of overall data mining. 2% butterscotch ripple. Other projects – interested in working with text. More recent projects. It’s all open and you can download it.
New Zealand consists of three islands – north island, south island, and the west island, that the inhabitants call Australia. Waikato is in Hamilton (not Auckland). A weka is like the kiwi, flightless bird, can’t see, can’t fly, runs like hell.
WEKA is Waikato Environment for Knowledge Analysis. Don’t pronounce it like ‘weaker’! State of the art machine learning algorithms, runs in Java, very stable, under the GPL. Complements ‘Data Mining’ by Witten, Frank & Hall. Supports process of experimental data mining. Helps you prepare input data, transforming it, creating new features, combining them, doing principal component transforms, filtering – supervised filtering. Can evaluate learning schemes, and some visualisation. It’s old, been around, so stable, but not sexy UI. But people like it. Used a lot. Used for education, research and application. Data mining and machine learning mean the same – but data mining means shelved with database books, not old school AI. The book is great and you should buy it.
MOOC coming Sept 2013 – follow @WekaMOOC on Twitter. Very elementary. Email firstname.lastname@example.org. Using Google platform, very simple.
Lots of data preprocessing tools, classification/regression algorithms, some clustering, attribute evaluation, feature selection. But not suited for large-scale association mining. Three UIs – Explorer, Experimenter, KnowledgeFlow. Takes a minute to download. Explorer for exploratory data analysis. Experimenter for large scale comparisons, can map across multiple computers. KnowledgeFlow is graphical processes.
Most people use Explorer most of the time. I do. Tabs across the top – Preprocess (filters), Classify, Cluster, Associate, Select attributes (=feature selection), Visualise. You probably don’t want to use WEKA for large-scale association. Visualise lets you do it in 2D.
Q: Why would you use it?
I don’t know. If you’re a computer scientist, you’d use R instead of WEKA. Might use WEKA quickly for interactive view, look at it from multiple perspectives, exploration. Experimenter for large-scale comparisons with t-tests comparing algorithms. Didn’t think about the target audience.
Ken: Data miners!
Everyone wants to mine data. You should be doing the data mining. Using WEKA if you don’t want to get technical. You need to be in control. Even managerial level people can use WEKA. It’s simple.
Q: I don’t have programming background, it’s nice. R requires that.
Some people don’t know Python! That’s fine.
Ryan: It’s stable. Will there be a new release? Feature request list?
Typically new version when new version of book. Just mail the WEKA people.
We’re not paid to support the world in their data mining operations. We move on to other things. There’s a company who is now doing maintenance of WEKA – Pentaho – anxious to put in features people want. Mark Hall is person to contact.
Huge amount of stuff WEKA can do. In general terms, if it’s sensible and do-able it’s probably there in some form.
Knowledge Flow lets you set up your structure as a flow diagram. WEKA loads all the data in to main memory and it runs out – need to go in to Java and increase heap size. But Knowledge Flow lets them work incrementally and don’t load whole dataset at once. I don’t like it. A lot of people use it. Plot learning curves, fundamentally incremental things, Knowledge Flow will do it for you.
Experimenter is a bit more complex. Set it up to run experiments, store results in databases, pause, restart. Does serious statistical testing. Just press go and it’ll run for a couple of weeks.
What’s it like to have long-standing successful OS project? Started with funding bid, in 1991/2. Small dept, wanted to invigorate research. Wanted big group project on machine learning. NZ is agricultural, much scope for data mining there. We feed cows on grass. Not so much in winter as summer. In winter have to cull the herd. It’s killing really. You have to decide which cows to send to the abattoir. Big economic decision. Heaps of data on cows – parents and heritage, milk yield measured every day, how many kids, how difficult the births were, whether they kick you in the shed, etc. Big application of data mining.
Let’s say you want to have a baby, but turns out you can’t. Artificial insemination is taking some cells from the ovaries, fertilising them with sperm, growing embryos, have to select which ones to implant in the uterus. You want live birth but not too many. Heaps of measurements that embryologists do – whole bunch of features. And historical data about which led to live births. Use that data to select best embryos for implantation.
Uses ARFF format. Internal release in 1994, Tcl/Tk, C. Rewrote in Java in 1999. 1st edition of book. Factor of 3 slower in Java – no graphics. 2nd edition 2003, 275kloc. Deal with Pentaho Corp in 2006. 2008 was 500kloc. 2011 3rd ed of book. WEKA 3.6 in 2008 is last stable release. Was very exciting initially, after that, we’re not paid to maintain this code, had to find a way to get that out, and Pentaho is a way out.
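For anyone who has not seen ARFF: it is a plain-text format with an `@relation` line, one `@attribute` line per column, then `@data` followed by comma-separated rows. A small sketch that writes one from Python; the helper name and the cow attributes are invented for illustration.

```python
def to_arff(relation, attributes, rows):
    """attributes: list of (name, type), where type is 'numeric' or a list
    of nominal values. rows: list of value tuples in attribute order."""
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append(f"@attribute {name} {kind}")
        else:
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines) + "\n"

# Invented example in the spirit of the cow-culling application.
attributes = [
    ("milk_yield", "numeric"),
    ("kicks", ["yes", "no"]),
    ("keep", ["yes", "no"]),
]
rows = [(21.5, "no", "yes"), (18.0, "yes", "no")]
arff_text = to_arff("cows", attributes, rows)
print(arff_text)
```

This sketch does no quoting, so names or values containing spaces or commas would need extra handling.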
There’s a lot of interoperable stuff, can export. If it’s sensible they’ve probably thought about it. Spreadsheets, CSV, everything. R has interface to it via RWeka. (Note to self: Check this out.) Lots of other stuff. Kea – automatic keyphrase extraction. Has extensive filters for documents for textmining.
Lots of people use it. 2009 1.5m downloads, 20,000 downloads per month.
Many other interesting things too.
Concept-based text representation: Wikipedia Miner. Taking plain text documents, represent in terms of concepts. Where do we get them from? Wikipedia. Every article addresses a single concept (supposedly). Every concept you can think of, there’s a Wikipedia article. (If there isn’t you should write it.) 500y ago, knowledge in Europe, the gatekeepers were the church. To publish a book, write out by hand, church had scribes. Also teachers. Between Renaissance and Enlightenment, moved in to universities. Right now, academics are in charge. Publish a book or a paper, someone reviews it, probably a professor. We’re in charge. We like it like that. Wikipedia and that movement towards collaborative mass text production is moving gatekeepers of knowledge to we the people. Much more democratised. People like me resist it, tell them not to use Wikipedia. MOOCs are even worse. Poor old beleaguered professors not happy about it. Took 300y last time, won’t happen overnight. In 50y it’ll be different. Wikipedia just the beginning. Collaborative production and editing, by ordinary people. This was predicted 120y ago by American philosopher Charles Peirce. Committed community of enquiry – now visible on Wikipedia talk pages. Wikipedia is also machine-readable. So define a concept as a Wikipedia article. Look at 4-grams, 5-grams in the text, use manual redirects and anchor text on internal hyperlinks in Wikipedia – synonyms for this concept.
Took news article (Snowden) – use wikipediaminer JS link in browser – gives links to WP articles embedded in the text. Can get some failures – here ‘colonial independence fighter Paul Revere’ where fighter was link to fighter aircraft. Can control link density too. Even when ramped up high, still generally good.
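A toy sketch of the first step, matching word n-grams against a dictionary of anchor texts. The dictionary here is invented and tiny; the real system builds it from Wikipedia redirects and link anchors, and then disambiguates, which this sketch does not do.

```python
# Hypothetical anchor dictionary: surface phrase -> Wikipedia article title.
ANCHORS = {
    "paul revere": "Paul Revere",
    "new zealand": "New Zealand",
    "kiwi": "Kiwi (bird)",
    "fighter": "Fighter aircraft",
}

def link_concepts(text, anchors=ANCHORS, max_n=5):
    """Greedy longest-match of word n-grams against the anchor dictionary."""
    words = text.lower().split()
    found = []
    i = 0
    while i < len(words):
        for n in range(min(max_n, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in anchors:
                found.append((phrase, anchors[phrase]))
                i += n
                break
        else:
            i += 1  # no n-gram starting here matched
    return found

links = link_concepts(
    "Colonial independence fighter Paul Revere never saw a kiwi in New Zealand"
)
print(links)
```

With no disambiguation, "fighter" gets linked to the aircraft article, the same kind of failure seen in the demo.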
Can download this, it’s open source.
Demo of wikipediaminer, live on website.
Take e.g. kiwi. All sorts of things – minor character in obscure French cartoon. And many other things. Put in to the search, gives you all the articles. Can compare. Compare e.g. kiwi with bird. Disambiguation algorithm, based on semantic distance. Database of manual judgements. So kiwi is 54% related to bird. Kiwi and weka are 73%. Kiwi and west coast – 68%. What about LA and west coast? There it is disambiguated to west coast of the US (not west coast of NZ as in the first e.g.). Kiwi is related to shark 41%, because of the kiwi shark. Kiwis and sharks is 61% – but there Kiwis=NZ rugby team, and Sharks is a South African rugby team. It chooses the closest pair.
To get relatedness, look at the links between them, how many targets in common and how many not. Look at overlap. Not just outlinks but inlinks. Combine these things, machine learning scheme, that gets the semantic relatedness. Not exact. Really fast though. Just works on the hyperlinks. At least 10k calcs a second. Preprocess whole Wikipedia, takes about a week, Hadoop on 30 computers. Pretty simple. Looks cool because all the knowledge is in Wikipedia defining the redirects.
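A sketch of an overlap-based relatedness score on inlinks, using a normalised-distance form, plus the "choose the closest pair" step for ambiguous terms. The exact formula, the article titles and the link sets are assumptions; the real system combines several such features with machine learning.

```python
import math

def link_relatedness(inlinks_a, inlinks_b, total_articles):
    """Overlap-based relatedness in [0, 1] from two sets of inlinking articles.

    Assumes total_articles is much larger than either set.
    """
    common = len(inlinks_a & inlinks_b)
    if common == 0:
        return 0.0
    big = max(len(inlinks_a), len(inlinks_b))
    small = min(len(inlinks_a), len(inlinks_b))
    distance = (math.log(big) - math.log(common)) / (
        math.log(total_articles) - math.log(small)
    )
    return max(0.0, 1.0 - distance)

def closest_senses(senses_a, senses_b, inlinks, total_articles):
    """Pick the pair of candidate senses with the highest relatedness."""
    return max(
        (link_relatedness(inlinks[a], inlinks[b], total_articles), a, b)
        for a in senses_a
        for b in senses_b
    )

# Invented inlink sets (ids of linking articles) for four candidate senses.
inlinks = {
    "Kiwi (bird)": {1, 2, 3, 4},
    "Kiwis (rugby team)": {10, 11, 12},
    "Shark": {3, 20, 21, 22},
    "Sharks (rugby team)": {10, 11, 30},
}
score, sense_a, sense_b = closest_senses(
    ["Kiwi (bird)", "Kiwis (rugby team)"],
    ["Shark", "Sharks (rugby team)"],
    inlinks,
    total_articles=1000,
)
print(sense_a, "/", sense_b, round(score, 2))
```

Only set sizes and intersections are needed, which fits the claim that the calculation is fast and works purely on hyperlinks.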
Annotate. Can do pasted-in text and wikipedia-link-ifies it.
Problems with this demo, not always up because of hack attacks. But can download the code and run it yourself.
That was text as a bag of concepts, rather than bag of words. Not perfect. WP isn’t perfect, has holes. Homer Simpson has long article, Homer has short article. It reflects popular culture. Women are under-represented. Techies over-represented. Verbs are under-represented. Also abstract concepts e.g. courage. (Didn’t used to be one, but there is right now.)
The aboutness of documents. Have concept representation, look at the semantic relatedness, see what concept is most closely related to other concepts in the document, get some sense of what document is about. Used to set up distance metric between documents. Not quite so simple. Use centrality measure, concepts in each document compared to each other, computed cross-centrality, got nuanced distance measure between documents. Training data from Australia, students quantified what they saw as relatedness of documents. Tested it using it for text clustering, more accurate.
If you’re dealing with text, you should deal with concepts instead. Don’t represent as words, or n-grams. That’s passe. Use Wikipedia Miner and turn it in to concepts.
Third thing – better than human performance. Key phrase identification from academic documents. KEA. (Another NZ bird.) Much better version called MAUI (More Accurate Topic Indexing?). Got professional indexers to independently tag documents. Then our system. Took as correctness measure, the consistency. E.g. human 1 with the others, and so on. Average those. Then take algorithm and get consistency with the human annotators. We got better than human performance in terms of consistency with other humans. (!) (Note to self: link to older stuff on metadata and indexing and how computers do it better.)
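A sketch of the evaluation idea: score each indexer by average consistency with the others, then score the algorithm the same way against the humans. The 2C/(A+B) consistency formula (often attributed to Rolling) is an assumption, since the talk did not name one, and the tag sets are invented.

```python
def consistency(tags_a, tags_b):
    """Consistency of two tag sets: 2 * shared / (size of a + size of b)."""
    if not tags_a and not tags_b:
        return 1.0
    return 2 * len(tags_a & tags_b) / (len(tags_a) + len(tags_b))

def mean_consistency_with_others(name, taggers):
    """Average consistency of one tagger with every other tagger."""
    others = [tags for other, tags in taggers.items() if other != name]
    return sum(consistency(taggers[name], tags) for tags in others) / len(others)

# Invented tags for a single document.
humans = {
    "h1": {"data mining", "weka", "java"},
    "h2": {"data mining", "machine learning", "weka"},
    "h3": {"data mining", "java", "classification"},
}
algorithm = {"data mining", "weka", "classification"}

human_score = sum(mean_consistency_with_others(h, humans) for h in humans) / len(humans)
algo_score = sum(consistency(algorithm, tags) for tags in humans.values()) / len(humans)
print(round(human_score, 3), round(algo_score, 3))
```

"Better than human" then means `algo_score` exceeds `human_score`: the algorithm agrees with the indexers more than they agree with each other.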
Another project MOA – Massive Online Analysis. MOA deals with bigger datasets than WEKA. Never stores anything in memory. Algorithms tailored for online learning. So if worried about big data … WEKA deals with pretty big data. If really big, then MOA and the workbench.
Further project – ADAMS – advanced data mining and machine learning system. Flexible workflow engine to build real-world complex knowledge workflows. Not just feature extraction and classification.
John Behrens: Thanks for explaining how to pronounce the tool. Any tools outside your lab you’re excited about?
No. My students do interesting things. I don’t spend a lot of time looking for other things. I don’t know. I’ve never used R, trying to find a day. I don’t use other tools. But there must be. We were surprised with our wikification algorithm. There were others. Can test it by stripping Wikipedia article of its links and try re-inserting. We were way better than other systems. This stuff is pretty good.
Open source software, you’re giving it away. We don’t know who’s using it. At conferences, say physiology – they’re using WEKA. It’s infiltrated other disciplines. That’s the answer to who it’s aimed for – it’s non computer scientists. Everyone’s using machine learning. It’s great technology. It’s simple, it’s unpretentious. It does a simple job and does it well.
Ken: I want to know if WEKA can do something, I hesitate to ask. If you have a multivalued discrete variable, say 30 values, want to search for different subsets of those values, split in half 30 different ways. Kind of like multiple regression.
There is a tree induction algorithm, looks for subsets to branch on, not on single values.
Ken: Searches through whole powerset?
Mumble. Probably options to do that. It’s an algorithm. Multivariate decision tree. There’s an algorithm for that.
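To make the question concrete, a brute-force sketch that tries every two-way split of a nominal attribute's values and keeps the one with the lowest weighted Gini impurity. This is not WEKA's algorithm, just an illustration; the data and function names are invented.

```python
from itertools import combinations

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_subset_split(rows):
    """rows: list of (value, label). Try every two-way split of the values."""
    values = sorted({v for v, _ in rows})
    first, rest = values[0], values[1:]
    best = None
    # Keep `first` on the left so mirror-image splits are not tested twice.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left_values = {first, *extra}
            left = [lab for v, lab in rows if v in left_values]
            right = [lab for v, lab in rows if v not in left_values]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, left_values)
    return best

rows = [("a", "yes"), ("b", "no"), ("c", "yes"),
        ("d", "no"), ("a", "yes"), ("d", "no")]
best = best_subset_split(rows)
print(best)
```

With m distinct values there are 2^(m-1) - 1 such splits, so 30 values would mean over 500 million candidates, which is why searching the whole powerset is usually replaced by a heuristic.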
This work by Doug Clow is copyright but licenced under a Creative Commons BY Licence.
No further permission needed to reuse or remix (with attribution), but it’s nice to be notified if you do use it.