LAK16: Tues am – EDM with Python and Apache Spark

Liveblog notes from Tuesday 26 April at LAK16.

Educational Data Mining With Python and Apache Spark

Lalitha Agnihotri, Shirin Mojarad, Nicholas Lewkow and Alfred Essa, McGraw Hill Education

There was an outline for getting set up for the workshop: Anaconda with Python 2.7, plus the dataset, plus the Jupyter notebooks on their GitHub repo.

Nick and Shirin presenting. The dataset is from an MIT/HarvardX MOOC.

First part is introducing Python and exploratory analysis. Then more advanced. It’s all in interactive Jupyter notebooks.

[Very light liveblogging from this entire session, sorry – I got sucked in to working on the Python code.]

First notebook is 1-Educational_Data_Mining.ipynb… Basic data exploration, IPython/Jupyter notebook work.

Second was looking at classification and clustering: logistic regression, then k-means. (Slight problem with the last exercise in that one; will send through afterwards.)

Spark (3-Introduction_to_Spark)

This is really cool but much harder to install, so we didn’t run that as an example.

Like Hadoop – a way to do parallel computing on commodity hardware.

Resilient Distributed Dataset (RDD) – a list of objects. Spark divvies it up among the computers in a cluster. When you do an operation, they do the work and communicate the result back to the driver. So you have a Spark cluster (a set of machines) and a driver program, often on your laptop, that ships the work out.

If computers in the cluster go down, it’s resilient and can cope.

Two ways to create an RDD. One is to parallelize a collection – take a list and parallelize it. The other is to read data from an external source – e.g. S3, C* (Cassandra), HDFS (Hadoop file system) – it’s like one line. For HDFS, the data is stored in parallel, so the read happens in parallel and not through your laptop, which makes it fast with no bottleneck.

Trivial example – fish, cats, dogs. collect() method brings it all back to a single Python object. Then Moby Dick data.
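A minimal plain-Python sketch of the parallelize-then-collect idea, with no Spark install needed. The helper names partition and collect are made up for illustration, and Spark's real split across machines may differ. In PySpark the corresponding calls would be sc.parallelize(...) and rdd.collect(), with sc.textFile(...) for reading from external storage.

```python
def partition(items, num_slices):
    """Split a list into num_slices roughly equal chunks, one per 'worker'."""
    size, extra = divmod(len(items), num_slices)
    chunks, start = [], 0
    for i in range(num_slices):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def collect(chunks):
    """Bring every chunk back into one ordinary Python list on the 'driver'."""
    return [item for chunk in chunks for item in chunk]


animals = ["fish", "cats", "dogs"]
chunks = partition(animals, 2)   # PySpark: rdd = sc.parallelize(animals, 2)
result = collect(chunks)         # PySpark: rdd.collect()
```

After collect, result is a single Python list again, in the original order.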

Operations similar to Hadoop.

You can think of Spark operations as a DAG (directed acyclic graph) [not sure I got what he meant there]


There’s Core Spark, Hadoop type functionality, plus four packages that work on top of it –

  • Spark SQL (easy to parallelise dataframes),
  • Spark Streaming (where data’s coming in constantly, live data updated periodically, e.g. every second, parallelised and chunked – often use the free Twitter API, set up a Spark job that grabs those, does sentiment or geographic analysis, do it on the streaming output),
  • MLlib (machine learning algorithms),
  • GraphX (for network analysis/DAG analysis – nodes/edges etc).

Example of Moby Dick again. Going through basic actions.

Two types of maps – flat maps, regular maps. Regular maps are 1:1 – every element of the input has one output. A flat map can have 0 to many outputs for each element of input. So, for instance, expanding the RDD from 22k lines to each word – 221k words.

Regular maps – just map – are easier to conceive of, and useful for cleaning.
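The difference between the two kinds of map can be sketched in plain Python, without Spark. The sample lines are made up. The comments give the PySpark calls I'd expect to correspond (rdd.map and rdd.flatMap).

```python
from itertools import chain

lines = ["Call me Ishmael", "", "Some years ago"]

# Regular map: exactly one output per input line (here, its word count).
# PySpark: rdd.map(lambda line: len(line.split()))
words_per_line = [len(line.split()) for line in lines]

# Flat map: each line yields zero or more words, flattened into one list.
# PySpark: rdd.flatMap(lambda line: line.split())
words = list(chain.from_iterable(line.split() for line in lines))
```

The empty line still produces one output under the regular map (a count of zero) but contributes nothing under the flat map.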

Filter – select particular elements from the RDD.

Map reduce. Map each word to a (word, 1) tuple, then reduce: aggregate the tuples to count how many times each word appears.
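A rough plain-Python sketch of filter, then map, then reduce for a word count. The word list is invented, and the loop stands in for what Spark would do across the cluster. The comments show the PySpark methods that I believe correspond (filter, map, reduceByKey).

```python
from collections import defaultdict
from operator import add

words = ["the", "whale", "the", "sea", "", "whale", "the"]

# Filter: keep only the elements we want (here, non-empty strings).
# PySpark: rdd.filter(lambda w: w != "")
kept = [w for w in words if w != ""]

# Map: turn each word into a (word, 1) tuple.
# PySpark: .map(lambda w: (w, 1))
pairs = [(w, 1) for w in kept]

# Reduce by key: combine the values for each key pairwise with `add`.
# PySpark: .reduceByKey(add)
counts = defaultdict(int)
for word, n in pairs:
    counts[word] = add(counts[word], n)
counts = dict(counts)
```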

Many other handy functions … can combine into a single line of Python code.

Fun analysis – average Scrabble score per book.
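A sketch of how a Scrabble score analysis might look, assuming the average is taken per word in a book's word list (the notes don't say exactly how it was computed). The function names are my own. The letter values are the standard English Scrabble ones.

```python
# Standard English Scrabble letter values.
SCRABBLE_POINTS = {}
for letters, points in [("aeilnorstu", 1), ("dg", 2), ("bcmp", 3),
                        ("fhvwy", 4), ("k", 5), ("jx", 8), ("qz", 10)]:
    for letter in letters:
        SCRABBLE_POINTS[letter] = points


def scrabble_score(word):
    """Sum the letter values; characters that are not a-z score zero."""
    return sum(SCRABBLE_POINTS.get(ch, 0) for ch in word.lower())


def average_score(words):
    """Mean Scrabble score per word for one (non-empty) word list."""
    scores = [scrabble_score(w) for w in words]   # PySpark: rdd.map(scrabble_score)
    return sum(scores) / float(len(scores))       # PySpark: .mean()
```

For example, scrabble_score("fish") adds 4 + 1 + 1 + 4.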

These examples you could easily do without Spark. But on student data, with gigs of data, it gets really sluggish. It might take 3h to parallelise code, but it’ll run in 5h rather than 5 days. It’s easy to take Python code and then parallelise it.

Q: Spark vs Hadoop

Hadoop is very rigid, very key-value pair oriented. With Spark, especially with Python, any Python object can go in an RDD. Say you have 1000 Pandas dataframes: you can parallelise those. Hadoop has nothing like that. There’s more developer activity around Spark than Hadoop at the moment.

Spark is significantly faster because it works in memory. It’s also lazy – it doesn’t do anything until you call an action such as collect(). Hadoop runs each step as a separate function, reading and writing to disk in between. Working in memory, Spark broke the record for sorting a petabyte of data, with half the computers, in very little time.
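The laziness point can be illustrated by analogy with Python generators. This is not Spark itself, just a sketch of the same idea: describing a pipeline does no work, and only consuming it does. The function shout and the calls log are invented for the illustration.

```python
calls = []  # records when the mapped function actually runs


def shout(word):
    calls.append(word)
    return word.upper()


words = ["fish", "cats", "dogs"]

# Building the pipeline does no work yet; a generator only describes it.
# PySpark analogue: rdd.map(shout).filter(lambda w: w.startswith("C"))
pipeline = (w for w in (shout(x) for x in words) if w.startswith("C"))
calls_before = list(calls)   # still empty: nothing has been computed

# Consuming the pipeline triggers the computation, like an action in Spark.
result = list(pipeline)      # PySpark analogue: .collect()
calls_after = list(calls)
```

Only once the pipeline is consumed does shout get called on each word.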

Really the comparison is MapReduce vs Spark. Spark is much faster because it works in memory. You can use Hadoop and Spark together.

Here, they used Vagrant, set up to run things locally as a headless, single-node Spark. You can also do a cluster, and run locally or elsewhere.

Nobody’s done much on GPUs … except there is something: a fork of Spark that uses GPUs.

Q: Integration of pandas with Spark. The machine learning library in Spark isn’t as good as scikit-learn. What are the options?

Create an RDD of dataframes. With maps and flatmaps, every element is a pandas dataframe; say you’re finding an average or something, you do that as a map operation. Any Python data structure can go in a Spark RDD. For time series, you could have an RDD with every time step as an element.
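A standard-library sketch of the "each element is a whole table" idea. Plain dicts stand in for pandas dataframes and a thread pool stands in for the cluster, so none of this is Spark or pandas code. The table contents and the mean_grade helper are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-course tables; with pandas each would be a DataFrame.
tables = [
    {"course": "A", "grades": [0.5, 0.75, 1.0]},
    {"course": "B", "grades": [0.25, 0.5]},
]


def mean_grade(table):
    """Summarise one whole table; this is the function handed to map."""
    grades = table["grades"]
    return table["course"], sum(grades) / len(grades)


# Each element is an entire table, and workers process tables independently.
# PySpark analogue: sc.parallelize(frames).map(mean_grade).collect()
with ThreadPoolExecutor(max_workers=2) as pool:
    means = list(pool.map(mean_grade, tables))
```

The map function only ever sees one table at a time, which is what lets the work be spread out.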

Can also call external libraries within Spark – if they are parallelised/serialisable.

Most classes are, but some in Stanford CoreNLP are not.

This work by Doug Clow is copyright but licenced under a Creative Commons BY Licence.
No further permission needed to reuse or remix (with attribution), but it’s nice to be notified if you do use it.


Author: dougclow

Academic in the Institute of Educational Technology, the Open University, UK. Interested in technology-enhanced learning and learning analytics.