Apache Hadoop: The Big Data operating system
Chair: Stephen Coller
Speaker: Doug Cutting, Chief Architect, Cloudera and Director, Apache Software Foundation
Will describe where it came from, where it’s going. Is the central component of Big Data.
32 years ago, was a freshman nearby, could see this building. On the ground floor was where undergrads learned to program, and that’s where I did – spent a lot of time in this building. The relational database was being invented and the first ones shipped. Some amazing progress since then. Moore’s Law – the number of transistors on a chip would double every 18 months/2y – grow exponentially. Same for other components – memory, storage devices, network speed. Just about every parameter has seen this improvement for 30 or 40 years, which is pretty phenomenal. Much bigger difference than between walking and jet travel, or mud hut and Versailles.
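To put rough numbers on that exponential growth, here is a minimal sketch. The function name and the exact 30-year span are assumptions for illustration; only the doubling periods (18 months to 2 years) come from the talk.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Multiplicative growth after `years` if capacity doubles every period."""
    return 2 ** (years / doubling_period_years)


# Doubling every 2 years vs. every 18 months, over an assumed 30-year span.
slow = growth_factor(30, 2.0)
fast = growth_factor(30, 1.5)

print(f"Doubling every 2 years:   x{slow:,.0f}")
print(f"Doubling every 18 months: x{fast:,.0f}")
```

Either way the factor runs from tens of thousands to about a million, far more than the roughly hundredfold speed-up from walking to jet travel.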
The technologies, software approaches to use these technologies might change too. We’re at a phase transition in how we process data, and Big Data is a new way. It’s a fundamental change and we’ve not seen a lot of those. Derives from cost fall. Many, many devices collecting, generating data. Phenomenal amount of data generated, and it’s fundamentally valuable because it gives a picture to businesses of what is happening. Would be true in education to the degree that we can use technology in education.
‘Big Data’ – I’m indifferent to it; it seems to be the term but don’t get upset about it. It’s like Big Hair. You could deny that big hair exists, but it’d be a hard thing to prove.
What is Big Data?
It’s scalable. You can store a lot of data. Relational databases have fallen short. It runs on commodity hardware – cheap, mass-produced. Relational dbs evolved to run on very specialised hardware, and you paid a big premium for that. If you want to scale, if hardware costs 1/10 as much, you can store and process 10x as much for the same price. Key element. Another key idea: distributed. No matter how big you can build a single computer, at some point you need a couple in tandem, and have a distributed software system. It’s really tricky, it’s a rat’s nest, you don’t want to go there if you don’t have to – but we have to. It’s tricky to develop. Why is it complicated? If you have one computer, and it crashes once a week, that’s Ok, you reboot and you’re running. If 1000 computers and one crashes, and you have to restart, it’ll happen every few minutes if they crash at the same rate. You can’t afford having one computer fail stopping progress, so have to be able to proceed regardless of any individual stopping, just rely on the majority continuing. Very tricky way to program. When you talk to other computers you have to consider failure scenarios you just don’t have to if it’s just on a single machine. But once you’ve got it and debugged it and it works, it’s incredible. You can just add hardware to add capacity. Two elements – commodity hardware, distributed reliable computing. Third element – open source software. Through the last 30y, people built systems on proprietary techs, come to be a bit wary about the risks. If your platform is controlled by a commercial venture, your future becomes dependent on that company who could go out of business or change pricing. Linux has taken off, especially in most new platforms. Independent of any one institution.
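The failure arithmetic in the paragraph above can be checked with a short sketch. It assumes every machine crashes independently at the same average rate of once a week; the function name is made up for illustration.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes


def minutes_between_failures(nodes: int, crashes_per_node_per_week: float = 1.0) -> float:
    """Average gap between crashes somewhere in the cluster.

    Assumes independent machines that all crash at the same average rate.
    """
    return MINUTES_PER_WEEK / (nodes * crashes_per_node_per_week)


for n in (1, 10, 1000):
    print(f"{n:>5} machines: a crash about every {minutes_between_failures(n):,.1f} minutes")
```

With 1000 machines a crash turns up roughly every ten minutes, which is why the software has to keep going when any single machine stops.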
Open source’s time has come. It provides software for the commons, it can be shared, everyone can depend on it without fear of change in pricing or availability. And can get involved in development. In Big Data, through Hadoop, see demand for an open source platform.
At Apache, quality is emergent. It’s a nonprofit, hosts about 140 different projects. Doesn’t care about theme. It’s themed around how the communities run, so they’re independent, open to newcomers, welcome contributions. Communities of individuals rather than organisations. I’ve been involved with Apache through 5 employers. A lot of social rules about how projects are run, few about the technology – is the nature of Apache. Apache has no strategic agenda. The goal is to build active, diverse communities. High quality software products emerge from that.
Big Data is flexible. People miss this. One component is, in classic relational dbs, spend a lot of time up front figuring out the model, what the tables, columns are going to be. There’s rules about how to structure your data up front. You presuppose the questions you’re going to ask. In Big Data, we talk about schema on read. You load it in, don’t presuppose analyses, you can afford to save a lot more – the price per byte stored is much smaller. The stack is designed to accept and deal with raw data in the form it arrives. As you process it you create new schemes – you might want to store those as intermediate steps. You want to share the data once it’s stored between systems. If it’s voluminous, heavy, hard to move, even though networks are faster they’re still a bottleneck. So operate on data where it lives – send the computation to the data, rather than pulling the data to the computation as done in relational systems. What kind of systems are going to operate well? Ones that don’t need data to be moved to operate on it. Live together on the same cluster of machines. Some systems, big data systems, don’t share storage with other big data systems, they’re missing out – have scalability but not necessarily flexibility. NoSQL stores, want them shareable, interoperable.
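A minimal sketch of the schema-on-read idea described above: raw records are kept exactly as they arrived, and each analysis picks its own fields when it reads them. The sample events, field names and helper function are all made up for illustration, and this runs in memory on one machine rather than on a Hadoop cluster.

```python
import json

# Raw events stored as they arrived; no table design was decided up front.
RAW_EVENTS = [
    '{"user": "a", "url": "/home", "ms": 120}',
    '{"user": "b", "url": "/search", "ms": 340, "query": "hadoop"}',
    '{"user": "a", "url": "/search", "ms": 200, "query": "nutch"}',
]


def read_with_schema(raw_lines, fields):
    """Apply a schema at read time: pull out only the fields this analysis needs."""
    for line in raw_lines:
        record = json.loads(line)
        # Fields missing from a record come back as None instead of failing the load.
        yield tuple(record.get(field) for field in fields)


# Two different questions asked of the same raw data, each with its own schema.
latency_view = list(read_with_schema(RAW_EVENTS, ["url", "ms"]))
query_view = [
    row for row in read_with_schema(RAW_EVENTS, ["user", "query"]) if row[1] is not None
]

print(latency_view)
print(query_view)
```

The second question (who searched for what) did not have to be anticipated when the data was stored, which is the flexibility the talk is pointing at.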
Apache Hadoop. Open source, came out of work on web search engine open source, building on what Google or Bing does but open source. That’s still going on but it’s tough to match what they can do. But out of it came distributed computing methods. I was working on this, it had to be distributed. Google published a few papers on how they were working internally, my buddy and I read them and thought that was a better way of doing it – general purpose platform. So we reimplemented file system, and computing layer called MapReduce. Took us 6 months-1y to implement that and move Nutch on top of it. Didn’t think about big data, were scratching our own itch. As Google were. Yahoo started backing this work, we renamed it Hadoop. Similarly not looking to change data processing, it was a single application driving it. MapReduce is a batch processing system. Store across 1000s of nodes, want to do some computation over it, and collate some results from it. Jobs take minutes to hours to run. Yahoo’s web search on Hadoop took about a week to do the biggest computation over the whole web. Really need reliability, will have a lot of nodes fail – roughly a petabyte of data. Hadoop – 2006 was when Yahoo started backing it. 2007 Apache project, had other collaborators pitching in. Other companies saw the utility, Facebook, Twitter, LinkedIn, became involved, contributing to Hadoop or building things on top. It took off, other industries became interested. Cloudera founded to take to enterprises beyond the web where industries want tech hand-holding.
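For readers who have not met the MapReduce model mentioned above, here is a toy, single-machine sketch of its three steps (map, group by key, reduce) using the usual word-count example. It is not the Hadoop API: the function names are made up, and a real job would spread the map and reduce work across many nodes and cope with their failures.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "big hair"]))
```

Because each map call sees only one document and each reduce call sees only one key, the work splits naturally across machines, which is what makes the batch model scale.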
Hadoop ecosystem has a set of tools. It’s continuing to grow. Apache Pig (Dataflow language, emit MapReduce jobs), Hive (SQL engine like with relational db but operate over big flat data files, still schema on read), Mahout (machine learning algorithms, primarily MapReduce). Another thread shows the promise: Apache HBase (NoSQL store in Hadoop, key/value store in huge huge tables, incredible rates of insert and lookup, after paper from Google on BigTable, their NoSQL store), Cloudera Impala (not a batch system, it’s an SQL engine but doesn’t do MapReduce in back, but runs more interactively, 20x faster than Hive for most queries), Apache Solr (search, based on Lucene, built on Hadoop, indexed and searchable as it arrives). Change to more interactive systems. MapReduce things are run for applications but run as batch jobs on hourly basis – rankings, recommendations, machine learning. Still very useful. But also lots of cases to do things online with similar datasets, and those things are arriving. Probably 10, 20 more tools as well.
Hadoop has become the Big Data operating system. It manages storage, abstraction of hardware, controls access to resources – sits in the layer between hardware and application. Split that layer in to underlying system (e.g. Linux) and distributed system (Hadoop). It’s a platform, a useful metaphor. Some use MS Office, hard to co-write with people on different platforms. Easier if all based on a common platform. So build big data platforms that share a platform, have better interoperability, facilitates more rapid exploration of datasets. I expected we’d see some proprietary or alternate open source come forward as real competitors but that’s not happened. Microsoft had one, but they elected not to productise that and promote Hadoop. Oracle has endorsed it, IBM too. The IT industry endorsed it as a big data platform. Nice to not see fragmentation for applications. Still see techs that are operating-system free – a NoSQL store that people build systems to, but it doesn’t interoperate easily, so pay penalties in management for authentication, backups. Operating big clusters is not trivial, so simplifying is big. And moving the data is expensive. It’s a good thing to have this single system.
Big data fuels innovation. Can get a picture of what you’re doing that you don’t have otherwise. A higher resolution picture – like a satellite picture, you’re looking for e.g. missiles – a higher resolution lets you see more. Many other cases – in education lots of long tail effects, different abilities, if you put everyone in one box won’t capture everyone accurately. You need more data, get more accurate a picture. Peter Norvig – the unreasonable effectiveness of data – article and YouTube talk – mostly in natural language processing where doing e.g. machine translation. Different systems trained on the same dataset, you evaluate them, have a ranking, train them on a larger dataset, and the ranking is totally different. They all do better on the larger dataset than any did on the smaller one. In that domain, translation, more data is much better and more important than the algorithm you’re using. (More data usually beats better algorithms.) Use simple, dumb algorithms that you can understand, then huge amounts of data. More data beats better algorithms – blog post that someone from ?LinkedIn did. Not just the quantity, but the variety – data that complement each other. Netflix contest, people’s movie ratings, try to predict which movies they will like, score compared to their actual ratings. Offered $1m to anyone who can do 10% better than internally. Lots of people play with that data. The students who did better – with the same dataset – people who grabbed the IMDb dataset too and included that, did much better. Similar in other places – Yahoo tries to not just log what pages someone sees (web server logs – this user visited this url at this time using this browser) – but every single link on the page when the person viewed it, and any other resources. That would let them, when someone clicks a link, see what were the links they didn’t click, the ones above and below. Not quite the whole page, but logging lots of context around the click.
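A small sketch of what logging context around a click might look like. The record layout, the function name and the window size are assumptions for illustration, not a description of Yahoo's actual logging.

```python
def click_context(links_on_page, clicked, window=2):
    """Record the clicked link plus the unclicked links shown just above and below it.

    `links_on_page` is the ordered list of links shown to the user;
    `window` is how many neighbours to keep on each side (an assumed parameter).
    """
    position = links_on_page.index(clicked)
    return {
        "clicked": clicked,
        "above": links_on_page[max(0, position - window):position],
        "below": links_on_page[position + 1:position + 1 + window],
    }


page = ["/news", "/sport", "/weather", "/finance", "/mail"]
print(click_context(page, "/weather", window=1))
```

Keeping the neighbours that were shown but not clicked is what lets a later analysis ask why this link was chosen over the ones around it.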
A lot of people don’t think about that – e.g. retailers getting more context about where on the shelf a product was. Same in education – if measuring something about a student, not sure it’s relevant, but if you can afford to gather it, why not – could be an important factor. Oftentimes pay off and provide huge benefit, this contextual information. Plays back to flexibility, need platform you can stuff data in to, and rich set of tools to explore. Hadoop lets you do that. Rather than having preconceived notion of the problem, you save the data and then you explore it. Data scientists and how you find them – they’re people steeped in a problem who are numerically competent, and that’s the combination you need. They care about the problem, familiar with what’s been tried and not – previously spreadsheet jockeys. Now spreadsheets on steroids in big data. Give them better tools to explore the data relevant to their problems.
Use of Hadoop snowballs. What happens again and again – at Google, Yahoo, many institutions. Have one problem can only solve with Big Data, forces them to build a cluster and move things in, get things running. Then people see other things they can do on it. People jumping on the bandwagon just to see, not so good. Come with a problem to get you over the initial hump.
How can you use Hadoop? Think about what datasets you have, might have, what other people have, how they can be combined. Find those problems. It was just two of us working on it just 7 y ago. Now it’s a worldwide trend. You can get involved too. Love to hear questions.
Jason, WGU: Future of the relational database? Are apps just going to write to Hadoop clusters?
Some things uniquely suited – transactional systems. Selling items, tracking inventory. Technologies in big data aren’t quite appropriate for those yet. But that’s coming. Look at Google’s last Fall paper on Spanner, have figured out how to do transactions in large scale distributed systems. Pretty neat computer science work. We will see fewer things that can only be done in relational dbs. That’s the trend.
John Behrens: Thanks. Talked about commodity computing at the hardware level. For education, the hardware as a service, the cloud, Amazon, Google, IBM – allows us not to build large data centers. Can you say something about Hadoop and those technologies and how we can leverage that?
Cloud is really neat trend. Cloudera started thinking that’d be part of the business but hasn’t worked out that way. Some people continue doing everything in the cloud – e.g. Netflix all in Amazon. Most find cost premium to not be what they want long term, if you have a cluster with a lot of data, it’s cheaper to house in your own than to outsource it. Maybe Amazon can shave margins over time. We’re still seeing most serious people running in their own datacenter. Most start in the cloud, very convenient – project, or research or student – silly to build a cluster. But if you have a cluster for other things and can share the resource, in a big institution, you can combine it. But one off experiments, don’t buy a big asset. It has its place, it’s not the way the biggest players are going except a very few.
Piotr, EdX: Where do you think the role for Hadoop starts? If I took every CS student in world, all their problem results, 15 MB of data, can do it on a single computer. Complete dataset, it’s about 1 TB, starting to be 1 computer maybe 2. Can do subsampling. When do I switch over from MySQL to Hadoop?
Rule of thumb, if it’s hard to do on one computer, use multiple ones. Most don’t start with cluster of 2, go from 1 to 10, then they run maybe 5x as fast, some cost of moving to this model over a single machine. It’s a threshold that’s not hard and fast. Get less convenient on one computer, then when first make the jump it’s overkill. But then later as it grows you want to add more.
Marie, SRI: In K12, broadband is the real disenabler. Visualisation software. Anything besides, say, Tableau?
I’m not the best person to answer. I spend my time down in the plumbing trenches rather than up in the applications. Tableau is popular, commercial, does run on top of Hadoop well. Some open source visualisation systems, I can’t remember names. Some built specifically on top of Hadoop, both commercial and open source.
Chris Brooks: Foundation side, Apache. Talked about didn’t have a focus on an outcome, but on getting communities together and give those communities methods of governance. In open source for education, Sakai, Moodle, many different ones. Is Apache appropriate for open source higher ed projects? A lot of people are interested in a project around learning analytics. Make sense inside Apache? Or stick in this higher ed side?
Bar for becoming an Apache project is roughly, round figures, you have multiple institutions – at least 3 who are involved. Or independently. And probably 5 or so relatively active decision-makers – at least 3, but you want a few more than that. Beyond that, not a lot. Use the Apache license, so have to accept that. It’s pretty liberal. Other than that, no. If one guy has written something and wants to give it out, maybe accept a patch now and then but not really a community. It’s like the borderline of when you need a cluster. A lot of people enter incubation at small size because they have ambition to get big. Have an intuition that the thing you’re building is something that will reach critical mass, say 10 or so people working on it, and over time will be long-lived. Also, if a project is going to be done, e.g. convert one format to another – not interesting to build a community round it. But if has a theme that will develop and grow over time, that does need a community. Has to sustain a community for some years, big enough that it’s not just one guy or gal.
Playing Nicely in the Sandbox: Ed Tech, Education Researchers, & Business
Chair: Taylor Martin
Speakers: Alan Louie (K12 startup incubator), Steve Schoettler (LA startup company), Nicole Forsgren Velasquez (Business Sch at Utah State)
Taylor introduces panel.
Imagine K12, incubator for startups who want to build tech for K12 education. Few years ago. Startup typically two guys and a dog, or two gals and a cat. Get advice, funding. Were 10 teams per 6 months. Stepped back from day to day at end of last year. Idea was to take consumer web knowledge about startups and apply to K12 education. Lean startup hits K12 education. There’s a lot of activity in the ed tech space for startups. I see that there needs to be more collaboration between researchers and startups. I’d leave the larger companies out of it, they have researchers on payroll. The founders are typically business and technical, not research, maybe education. I would suggest if you’re interested, startups are really great place to work with. They have a lot of data. There are startups beyond the seed stage, have a few million bucks. Example, Class Dojo, 5m registered users. Collecting data from the beginning, so can look back to e.g. 2011, study what happened in those classrooms, post-priori RCTs, take teachers who signed up and those who didn’t. [That’s not randomised! And barely controlled.] Good, fast, cheap – pick two. Generally startup picks fast and cheap. Opposite on the academic side, they’re all on the good, pinned to 11 on that scale, so fast is really uncomfortable. You gotta turn down the dial on fast (to companies). Example win-win. Steve Schneider (?) researcher at West Ed, connected to companies. Applied for an SDIR ? grant and two won. Instagrok, better ed search engine, and another building dashboards for principals. Now eligible for a phase 2 grant. The odds are stacked against that happening. On business side, apprehensive about working with researchers, because of impact on the fast dial. VC friends, OMG I keep researchers as far away as possible from my startups. But for education you absolutely have to work together.
In response to Alan, business is here, startups are here. Felt like an alien at AERA because in business. People are going to do this, we can sacrifice good for fast, give someone bullet point and lists. We can craft gorgeous paper no one can read. People are going to monetise things. Education field would benefit from acknowledging that’ll happen. We can have impact on influential things, doing something that’s important. For people doing research, find someone in a business school. Ed tech startups have data, that’s valuable. Business have hard time getting hold of data. We want data that’s grounded, contextually-relevant, we’re fine with quasi-experiments if they’re real. Decision support research systems research, dashboards. Established practice. Organisational studies, get purchases used and implemented. HCI research. And analytics. See what methods are being used in other fields. Sit with Taylor, things that are traditional from our backgrounds, can come up with insightful things from beyond formal training – interdisciplinary research is exciting, adventures.
Been in tech companies. Was at Zynga, co-founder. Saw rapid growth. Majority due to data, tens of TB every day to understand users. Wanted to do something more rewarding than selling virtual tractors. Education had a lot of appeal for me, I’m a sucker for hard problems. Also thought I could do something. Chose for-profit company to make impact, best way to make scalable sustainable impact. Saw a lot of opportunity in meeting other entrepreneurs, big companies e.g. publishers, big online schools – not just the startups, it’s e.g. University of Phoenix. We can do a lot of this. Can add more instrumentation to collect more data without getting people to fill out forms. Working with other folks, requires multidisciplinary approach. At Stanford, talk about lean startups, great book – Startup owners manual – start cheap and quick. Lean startup doesn’t really work in education. Needs many domains – tech, academic expertise. A lot of projects started by different groups are missing key elements. LA work group, interesting collection of people from academia, industry. Very few people have a sense – all are passionate. People in nonprofits work on long timelines. See great ideas for how to address a problem, but don’t think about e.g. how to get a teacher to change to use something. Might be world’s best, but if teacher doesn’t use it, never gets used and doesn’t solve the problem. Need for scalability, reliability. Schools at a level where laptops only have 50% chance of working, so reliability of your cloud service isn’t the block in the classroom. Bring together disciplines, bring end to end, what are the rules about the data, how you’ll use it to feed back to students. Scalable, sustainable. Exciting opportunity, takes a lot of people to come together.
Marie, SRI: Want to address a couple of things. Standards of evidence. Plug in for report on US Dep of Ed office of educational technology website – the evidence framework – looking at new standards of evidence to bring to bear in the right context when working with ed tech context. Lot of e.g.s from companies. Looked by the Inst of Ed Sciences, had heartburn because didn’t meet their gold standard. Hoping to start a conversation about what’s a good standard of evidence. Look at it and give feedback. Request, new project for the dept, looking at info ed tech startups need to know about education – teachers need to be able to adopt it. Stuff like what’s an IRB [other examples I didn’t catch]. Other ideas would be welcome.
Steve: Interesting coming from a background where we believe in scientific method, schools don’t like to talk about experimenting on kids. There is a method key to innovation, really rapid cycle at web companies. At Amazon, put two web layouts, try out and get real-time feedback. A/B testing. Learn, iterate, in 5 minute intervals. We need more data, more rapid than end of year tests. If that’s the only indicator, we can only innovate on a 3y cycle. Wouldn’t fly in cloud on instruments with ATC and instruments on Jupiter with 18 minute delay. Bought in to idea of starting small, all benefit from the results. If we can be subtle, e.g. change out questions on the exam in real time. Doesn’t have to be large experiments. Need culture of innovation you see in startups, not seen in the school systems. Try things in a low-risk way and innovate.
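As a rough sketch of the A/B testing idea Steve describes, the comparison of two layouts can be reduced to a two-proportion z-score. This is one common textbook way to compare two variants, not necessarily what Amazon or any company mentioned here does; the function name and the visitor counts are made up.

```python
import math


def ab_test_z(success_a, total_a, success_b, total_b):
    """Two-proportion z-score for variant B against variant A.

    Assumes independent visitors and samples large enough for the normal
    approximation. Undefined (division by zero) if the pooled rate is 0 or 1.
    """
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return (rate_b - rate_a) / std_error


# Hypothetical counts: layout A converts 40 of 100 visitors, layout B 60 of 100.
z = ab_test_z(40, 100, 60, 100)
print(f"z = {z:.2f}")  # |z| above about 1.96 is the usual 5% significance threshold
```

The appeal for rapid iteration is that the counts accumulate continuously, so a result is available as soon as enough visitors have seen each variant rather than at the end of the year.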
Al: Mentored company called No Red Ink. Entrepreneur is an educator – 9y veteran of Chicago public school system. Giving feedback on papers kids were writing, mark it up in red, spelling corrections. Got tired, wasn’t having impact. When I got my paper I looked at the grade, flicked through a bit, but maybe spent 5 minutes on it. Feedback loop pretty broken even week to week. Focused on grammar to start with. Personalisation – instead of taxonomy of all grammar things and running it through. Started by asking student what are you interested in, hired a consultant programmer to build a system given that input, construct sentences with grammatical errors related to that interest – so e.g. ‘Justin Bieber needs a haircut soon’ with an error in it. Addressed a lot of issues about how to integrate in to a classroom – it’s low bandwidth, gets a lot of traction. Uses his adaptive algorithm. Very creative ways looking at solutions. Doesn’t have any educational research background to look at larger scale analytics.
Nicole: To build on that, some of us classically trained in research. But with our data, if we want to run these analytics, like Amazon does, important we keep our enquiries theory informed but data driven. We have great theories, often in different contexts. Or construct not yet thought of, now we have rich deep data, something else might come out of it. Some fields are coming round to it. Theory inform if you can, but see what the data gives you.
Steve: Disheartened by Doug’s study about knowledge about the algorithms is more important than domain knowledge. I still believe academic knowledge and understanding cog sci will help. Example if takes 3 exposures to learn something and you do 2, you’ll never realise that from your data. A place where we need collaboration with that knowledge.
Nicole: Sanity-driven data science is important. What does domain knowledge tell us ought to be there. Sanity check to throw out stuff we know makes no sense.
Ruth, Bristol: Research in the UK. Things are radically changing, in 2014 embark on next research evaluation framework, 20% of funding depends on impact, and we have to measure it. It’s not a lecture, citations in publications, or what journals. You have to prove that something in the real world has changed as a direct result. Forcing us to look at relationship between research, policy and enterprise. Struggled with that for 10y, suddenly become very valuable. Question – any thoughts on the governance architectures to hold this creative tension between demands of commerce, practice, research and policymaking. They are reasonable – want rigorous assessments for our students. If kids assessed with rubbish data, that’s an issue. But have to scale fast. These tensions are very real. Need new governance architectures that we’ve not got to the bottom of.
Steve: Great idea. School policies, union negotiated, if get rid of teacher with least seniority, has worse impact than getting rid of least effective technology. Capitalism, if incentives aligned with outcomes, superintendents don’t have that incentive now to take risk. What are the metrics? The state end of year test scores aren’t things we think are important goals. The governance structure needs to work out what’s important. Might vary. Teachers should have a voice in setting those goals, and be reviewed every year.
Nicole: To speed up the feedback process, make quick data assessment count as impact, then a win-win.
Paolo: Misconception about role of Schs of Ed and researchers. Startup people come to my office, have an idea, can you do the evaluation. But I do design. That’s the common view – entrepreneur wants a stamp to say ‘it works’. At Stanford have 1y masters programme, 3 startups or so, should rethink how we think about that link. Here it’s a good example of how it’s extremely synergistic, better companies, but faster too. If better idea get product sooner. Lot of experience in Schs of Ed with schools, and ideas. If match with ability to code, create things quickly. A lot of interesting synergies.
Al: The K12 ed tech ecosystem. In the spectrum of tools vs content/curriculum, the incubator model works Ok for the tools stuff, platforms being built. But the further you get in to the hard stuff like content where you need the feedback loop to get it right and it’s hard to get it right in the first place. Gathering at AERA, VCs and educators, of a mind that we need to get educators to push ideas forward. Misguided to have rubber stamp – ‘please evaluate this, positively’ – there’s no feedback loop. Tools have gotten much better. Can have educators not being coders, but being designers, using tools to start prototyping things, explain good design principles. Then you get the coders involved. In K12, have to show up with a technical cofounder. ‘Can you help me find a technical cofounder’ – they’re in short supply and they don’t talk. But plenty of technical folk who once they’ve realised there are hard problems
Steve: Ask entrepreneur about how long they’ve spent in the classroom, how many teachers have you talked to – if it’s zero, there’s your evaluation.
Alyssa: What compromises do you make to have an impact. We try to do things the best. But trying to do things better is important. Sometimes we start to think about industry to get things we think are great out in the world. But also, what can we learn from things out in the world. Interesting synergy – like design-based research – improve products, and research. Requires deep collaboration. What are the first steps of that dance?
Nicole: Coming at it from an interdisciplinary point of view, what you bring to the conversation. Difference between consulting and research. ID a few points of perspective to explore difficult, interesting questions.
Al: Working with new school venture fund, invest in companies from K12 and other places. Outside ed tech, common to have advisory boards – great place to start for someone with a broad understanding of things. Startup willing and valuing that input. People in a startup, can say one has to be a founder, big bar. Or get involved peripherally, on e.g. board of advisers. You’re in but not all the way in, keep seat in your institution. If you’re interested in that, come and find me.
Janet, Colorado: Run a tech meet up in Boulder. Educators, K12 and HE, excited about ed tech, groupies, and entrepreneurs. Sticky issue in the room – relationship between academia and entrepreneurs. Entrepreneurs exhausting their immediate community, need to get in deeper. How do they reach out effectively?
Al: Asking friends, build a watering hole of interested research institutions, make folks available. Teachers College has put out something saying we have researchers here come talk to us – honeypot strategy. Having a tie in with ?TFA, ??, say raise your hand if you’re this type of person. Everyone comes with what they want and what they’re looking for. This is a problem. Watering hole problem. Researchers and entrepreneurs don’t mix normally. They just don’t run in to each other. Even at Stanford.
Nicole: Have a couple of advisers.
Steve: Get out of mainstream tech, joined a nonprofit incubator. Over 1/3 members work on ed projects, partner up. Began building network. New school venture fund summit, met lots of entrepreneurs with great ideas, inspired me, saw need for people. The GSV annual education innovation summit, was in Scottsdale. Meet thousands of entrepreneurs. Some well-connected folks, Taylor, people who can rattle off who’s interested in it too.
This work by Doug Clow is copyright but licensed under a Creative Commons BY Licence.
No further permission needed to reuse or remix (with attribution), but it’s nice to be notified if you do use it.