Workshop 1: Intro to Cloud Computing and Google App Engine
Chair: Stephanie Teasley
Speakers: Wesley Chun, Felipe Hoffa
Wesley – Google App Engine
From the developer relations team – they don’t know about the consumer products. Represent the developer tools, it’s about you integrating Google technologies in to your products. We run hackathons, do talks, to raise awareness. Several talks, also hands-on. Generally relaxed, but can dig in to some deep queries. Starting on Cloud computing services – especially Google App Engine. Can start playing with just a browser! But there’s downloads too.
Wesley is a teacher, all grades. Written books on Python. Is a software engineer, built Yahoo! Mail with Python (originally RocketMail), worked at Slide, IronPort. Also software to help doctors analyse spinal fractures, osteoporosis.
Scale is key. Approx size of Google search index (2012) – 100 petabytes (100 million Gb). How much video uploaded to YouTube every minute (in h) (2013) – 100h/minute? Monthly uniques on YouTube 2013 – 1bn. HTTP hits on App Engine. Google ever down? 500s?
Out-of-this-world scaling and security – only recently allowed photos.
Three levels to understand. Top level – Software as a Service – SaaS. Must have a browser, application you don’t download. At the bottom level – Infrastructure as a Service – IaaS – renting raw hardware (e.g. Amazon EC2, S3, Google Compute Engine) – you’re responsible for everything from the hardware up. Platform as a Service – PaaS – takes care of a lot of work for you – Google App Engine, Windows Azure, Heroku. Only have to worry about the application code. At a SaaS level, you can’t fix e.g. a problem in Gmail if it’s not working for you. At the Platform level, you host apps on the same infrastructure. Use PaaS to create a SaaS app. In between – Google BigQuery, Cloud SQL, Cloud Datastore, Translate, Prediction – they’re services, but they’re not application-creation platforms. Don’t compare App Engine to EC2 – Amazon EC2 compares to Google Compute Engine.
Google’s Cloud – comprehensive, integrated platform. Serves at all three levels. Extension of internal infrastructure. Plus other APIs and services. As we improve this infrastructure, you reap benefits/upgrades without effort. Use what they use. Same scale, performance, security and value. “The mother of all clouds”
About Google App Engine
All the load balancing, databases, applications, monitoring, etc. As well as the code to make it happen, there’s a whole load of other things – jam those in to one large box called App Engine – Python, Java, PHP, Go. Don’t worry about sysadmin. Let us wear the pagers. Build it properly and it’ll scale for you. Have extended language support through JVM.
It’s a shared service. Everyone has to run in a sandbox. Can’t create a local file. Can’t open an inbound socket connection. Can’t make an OS system call.
Ask why – e.g. send an email, write to database, etc. So have higher-level API – memcache, datastore (can choose relational or noSQL, writes take a long time because distributed – so use memcache), URL fetch, Mail, XMPP, Images (rotate, resize, crop), Blobstore (audio video store), User Service, Task Queue. There’s a University that uses XMPP for chat – student asks question on IM, simple querying, then sends a chat back to them. (Mail and XMPP allow send as well as receive.) We penalise you if you’re slow. If >1000ms. Your app has to respond within 60s or we terminate your app. If you need a long query, that’s what Task Queue is for – can take up to 10 minutes. If need longer have that too. But get a response back to user fast as possible. Also channel API, so e.g. if they fire off 10 long jobs, can signal them back when something’s done e.g. check mark, so you don’t lose context. Also authentication – use your own, or use Google accounts, or OpenID. More than 36 now – a search API that indexes text data, date/time stamps, geopoint. Also MapReduce.
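The advice above (datastore access is slow because it is distributed, so put memcache in front of it) is the usual cache-aside pattern. A minimal sketch in plain Python, assuming two hypothetical stand-ins, a SlowStore class and a dict used as the cache, rather than the real App Engine memcache and datastore APIs:

```python
import time


class SlowStore:
    """Hypothetical stand-in for a distributed datastore (not the App Engine API)."""

    def __init__(self, delay_s=0.01):
        self._rows = {}
        self._delay_s = delay_s
        self.reads = 0  # counts how often the slow read path is taken

    def get(self, key):
        self.reads += 1
        time.sleep(self._delay_s)  # simulate a slow distributed read
        return self._rows.get(key)

    def put(self, key, value):
        time.sleep(self._delay_s)  # simulate a slow distributed write
        self._rows[key] = value


def cached_get(cache, store, key):
    """Cache-aside read: try the cache first, fall back to the store on a miss."""
    if key in cache:
        return cache[key]
    value = store.get(key)
    if value is not None:
        cache[key] = value  # populate the cache for later requests
    return value


def cached_put(cache, store, key, value):
    """Write to the store, then drop the cached copy so readers do not see stale data."""
    store.put(key, value)
    cache.pop(key, None)


store = SlowStore()
cache = {}
cached_put(cache, store, "high_score:player1", 42)
first = cached_get(cache, store, "high_score:player1")   # miss: goes to the store
second = cached_get(cache, store, "high_score:player1")  # hit: served from the cache
```

Only the first read pays the slow-store cost; the key name and delay are made up for illustration.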
You also don’t have a VM, you can’t ssh in, can’t grab the logs. But you’d want to monitor your traffic, errors, etc. So have administration console – have App Engine Dashboard. Now very confusing, a lot of stuff. Can see hit rates, error rates, are you close to budget, etc. We give you knobs and dials and tuners to tweak it for you. Can see most popular end points for your app, also what generate the most errors. Can view datastore, which devs can upload new code, manage multiple versions. Access to your logs. You can run a MapReduce on them.
How many users of App Engine? 300,000+ active developers – i.e. logging in to dashboard or uploading code this month. >3m active applications. Half of the world’s IP addresses touch an App Engine server per week. 7.5 bn hits per day, 4.5 tn datastore requests.
App Engine is an orchestrator of activity to other Google products – e.g. BigQuery, Cloud Storage. So as an ed researcher you could use App Engine to fire off queries to BigQuery.
Many interesting users – e.g. Khan Academy.
Example – Social networking at scale. BuddyPoke. In 2011 they had > 64m users. So 64m rows. 3.6 mm users on Facebook daily plus others makes more like 6m. Also Gigya – do things for event-based promotions, where it’ll get no usage then shedloads. Got no traffic, then up to 800 requests per second, jumps up to 1,600 users every second. Prince William & Catherine Middleton wedding official blog and live stream apps. 2k requests per second for the blog, big burst was 42k requests per second on the live stream during the kiss.
So far only web apps. But many App Engine apps aren’t – interface could be on a mobile phone or tablet. Backend server processing. Keep data off the device (which is at risk) – user info like high scores, contacts, levels/badges. Better UI – move user data off phone & make universally available. Just via HTTP.
Pulse application – newspaper 2.0. App Engine with mobile backend. Python, gets thousands of QPS (queries per second). 100m daily requests. Many games – even the Chrome version of Angry Birds. Snapchat – picture sharing but pictures only live temporarily, so your photo from party doesn’t stay there, they’ll disappear. Songpop – like name that tune – 18 Tb of music served a day, 1mm+ DAUs (daily active users).
If you’re considering a mobile version, there’s Google Cloud Endpoints. You want a REST API eventually. Set up the tool, it creates the boilerplate that iOS or Android or any apps need, it’ll generate the code in Objective C or Java. Academia apps – any course where students build web or mobile apps, research projects, IT/Operational apps.
Vendor lock-in. That is a concern. Benefit is taking advantage of Google’s infrastructure. The cost is you need to write against Google APIs. There are choices. Many platforms are transferable – e.g. Python / Django or web2py. There are open source backend systems too, claim 100% API compatibility.
Questions: About lock in and data – audience fairly happy. Also about product cancellation, API changes.
The label the product has – if it’s labelled experimental, tricky, but if not, the ToS will specify how long it’ll work.
Prices: If fewer than about 5m pageviews a month, probably won’t exceed the free quota. If you do, $9/app/pcm, Premier is $150+/pcm. If you’re paying money, you get extended free quota. Also get SLA, and for Premier you get paid support. Info at cloud
All security compliance there. Can choose from a US- or EU-based service.
Features listed here https://developers.google.com/appengine/docs/features also with roadmap.
Google Apps Users – e.g. third party vendor that you can’t quite find – you can write custom app, integrate in to your custom app domain just as if you did it yourself.
Developers Guide here https://developers.google.com/appengine/
Also Google Prediction – machine learning service. Translate. BigQuery. Cloud SQL. Google Cloud Storage. Google Drive is an apps level thing, you don’t put multi-terabytes there – that’s Google Cloud Storage.
Google Compute Engine – goes beyond App Engine boundaries, spin up VMs, do stuff with them. Have persistent disk that all the VMs can boot off. Can ssh in, use web-based interface, or your own code via REST API. Other players in this market have more features, but they compete on performance – latency low, big throughput pipe. Create and can ssh in in less than 30s.
Google Cloud Storage – <5Tb per object. Can choose US or EU data centres. Strong read-your-write consistency. Access control lists. Get at it via REST API, App Engine API, Web UI, command-line.
Cloud SQL – MySQL-compatible. Cloud Datastore – native, NoSQL database. Translate – use it programmatically.
BigQuery – main idea, use a SQL-like query language over terabytes of data and get results. cloud.google.com/bigquery-tour
Prediction API – machine learning service. Only supervised learning. Train models, get predictions. developers.google.com/prediction.
Fusion Tables – Google Maps + spreadsheets on steroids. Aggregate data.
Course Builder MOOC. Open Source. A lot of the team will be coming in tomorrow. It runs on App Engine so you don’t need to worry about scaling. You control the code.
CloudCourse – manages course offerings. Create classes, say only take up to 50 people, or only contractors, or need management approval.
Have university consortium – developers.google.com/university
Evaluation and feedback or further questions from this talk here.
Cloud Playground – get started on App Engine without downloading the SDK.
Four ways to access BigQuery – App Engine, command line, REST API, ?
Felipe – BigQuery
If you have a 7 Gb data set, working with it in mySQL.
Dan: Issues are the time it takes to run a query, and the expertise in writing queries intelligently. Set a query off on Thursday, checking in on it now, it’s swapping badly.
Taylor/someone: 5 Tb data in SQL (!). Not many rows, but it’s very wide, has different challenges. Thousands of columns. Every action a child takes on a level, down to mouse movements in columns.
Taylor: InBloom, bringing 10 states’ worth of data. Eventually all kids in the US. Also MOOC data.
Zack: EdX system, how many courses at a time can we look at, counting things is one thing, but correlating things, mapping X features to Y outcomes, then some modelling. What capabilities does this afford.
Michael: It’s sometimes unclear, how many courses you might be dealing with e.g. in MOOCs. Try one, then three, then seven, tie all that to survey data … gradual progression. Do I have to write all these scripts, are they going to work on the others. The MOOC platforms haven’t standardised, just give a big data dump. A Coursera course anything from 20k-150k students. Across 9-10 week engagement. Maybe 10-15% stay engaged across that timeframe. Video watching, discussions data.
You can run SQL query that’ll take days, or you run a MapReduce or Hadoop.
Michael: Just getting started getting in to rows and columns I can run regressions on. R, SPSS, multilevel factor, modelling.
People are writing a query to connect R to BigQuery, would be good to have them connected.
Someone: Machine learning people. Basically, I design unsupervised models.
Prediction API is very interesting. Not fully formed yet. Experiment with it!
What is it?
It’s a database. Not like MySQL, or NoSQL. You can query it with SQL, so don’t need a new language, but works in a different way. It’s a cloud database, so don’t need to worry about servers, RAM. Two things you do need to worry about: the data you send to it, and your queries.
Pricing: Storage, $80 per TB/month. Interactive queries – results in seconds, while you’re having a dialogue with the material – $35 per TB processed. Batch queries – take a little longer, but if you know what you’re going for and don’t mind waiting half an hour – $20 per TB processed. First 100 GB of data processed per month is at no charge. Only counts the columns you’re querying.
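The pricing above can be turned into a rough monthly estimate. A minimal sketch using only the figures quoted in the talk; the column sizes are made-up toy values, 1 TB is taken as 1000 GB, and the choice to spend the free 100 GB on interactive queries first is an assumption, not something the talk specified:

```python
STORAGE_USD_PER_TB_MONTH = 80.0
INTERACTIVE_USD_PER_TB = 35.0
BATCH_USD_PER_TB = 20.0
FREE_TB_PER_MONTH = 0.1  # "first 100 GB", assuming 1 TB = 1000 GB


def scanned_tb(column_sizes_tb, queried_columns):
    """Only the columns a query touches count towards data processed."""
    return sum(column_sizes_tb[name] for name in queried_columns)


def monthly_cost(stored_tb, interactive_tb=0.0, batch_tb=0.0):
    """Estimate a monthly bill in USD from storage and TB processed."""
    # Assumption: the free allowance is used up by interactive queries first.
    free_left = FREE_TB_PER_MONTH
    billable_interactive = max(0.0, interactive_tb - free_left)
    free_left = max(0.0, free_left - interactive_tb)
    billable_batch = max(0.0, batch_tb - free_left)
    return (stored_tb * STORAGE_USD_PER_TB_MONTH
            + billable_interactive * INTERACTIVE_USD_PER_TB
            + billable_batch * BATCH_USD_PER_TB)


# Toy table: three columns with made-up sizes in TB.
columns = {"username": 0.2, "timestamp": 0.3, "board_state": 1.5}
per_query_tb = scanned_tb(columns, ["username", "timestamp"])
cost = monthly_cost(stored_tb=2.0, interactive_tb=per_query_tb * 10)
```

Because only queried columns are counted, leaving the wide board_state column out of a query is what keeps the processed volume (and the bill) down in this toy case.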
When to use? When you want to get through terabytes in seconds. (But not subseconds.) You still need MySQL/NoSQL. Normal operations in a db when you run an app, you want them to last less than a second, so website is very fast. 3-4s is too slow, BigQuery isn’t for that. If getting <1s response, BigQuery won’t make that much faster. It’s more when it’s taking minutes, or days. Also not good for frequently-updated data.
First we have relational dbs, good, fast, but problems scaling. Then BigTable model, now called noSQL, many alternatives. Fast, scale, but for some operations. When you say give me this key, comes back instantly. But when you want to sum, or run e.g. partition by type of student over data, those full table scans are very slow – mySQL because it doesn’t scale, or Cassandra because you have to go record by record running MapReduce. There’s no index to answer these questions. noSQL, we recompute things, don’t go to db and e.g. count records.
Case study – 1d of data, MySQL faster (<1s), BigQuery slower (few seconds). But as it scales to 7d, 360d – MySQL time taken scales nearly linearly, but BigQuery is pretty much level. (Ad serving logs for 500 websites, about 300m rows/day.) Not just the number of machines, but how they are architecting it. Every operation on BigQuery is a whole-table scan, system optimised for that.
BigQuery empowers the analyst – no good getting the answer to a question say 2 weeks later. Lets you analyse very quickly, go deeper, get answers in seconds, enables different use case – more exploratory analytics, create and destroy.
Tableau sits on BigQuery, a very nice interface for it. App Engine good place to store data. Upload – BigQuery takes CSV or JSON files.
Then can query through API, SQL, analyse it. Serve up in e.g. Google Spreadsheets, Tableau, Excel/Drive, others.
[break to upload data]
Introduces Ani, who has a dataset, Felipe’s never seen it.
Ani: Game log data, 27 students. Educational game. We have a lot of actions, move-based data – 66,000 rows, 24 columns. Has username, timestamps. Each student plays about 50 levels, each level has 2-200 moves. Wanted to spot most frequent patterns of moves. Information on each move, how long it takes. First move says so many seconds, second move includes that. Includes what the move looks like (board state), number of correct/incorrect targets. Success or not.
Part of the problem with working with the cloud is how to get your data from your computer to Google – would’ve taken too long to upload all of this stuff. They do hard drive ingestion – FedEx the hard drive, we pull it in to the cloud.
Uploaded it, just from the web UI. Create a new table, choose file to upload – AppEngine Datastore Backup, JSON, CSV – normally store it in Google Cloud Storage. Uploaded compressed version to make it even faster. Give it a schema – type of each. It asks you to specify. Can declare them all strings for speed – but then problem to run stats. But even there there’s a solution. Few more tweaks – e.g. errors, skip header rows, etc.
New query – SELECT max_depth FROM [database name] order by max_depth desc LIMIT 10.
avg(max_depth) doesn’t work because it’s a string. But here can just do avg(integer(max_depth)).
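The string-versus-number issue above can be tried locally. A minimal sketch using Python's built-in sqlite3 as a stand-in for BigQuery; the table name game_log and the three rows are made up, and SQLite spells the cast as CAST(... AS INTEGER) rather than integer(...). SQLite is more forgiving than BigQuery about averaging text, so the sketch shows the sort-order pitfall and the explicit cast rather than the error itself:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Every column declared as text, as when a schema is all strings for upload speed.
conn.execute("CREATE TABLE game_log (username TEXT, max_depth TEXT)")
conn.executemany(
    "INSERT INTO game_log VALUES (?, ?)",
    [("s1", "3"), ("s2", "10"), ("s3", "5")],  # toy rows, not the real dataset
)

# Ordering a text column sorts character by character, so "10" comes last.
as_text = [row[0] for row in conn.execute(
    "SELECT max_depth FROM game_log ORDER BY max_depth DESC LIMIT 10")]

# Casting first gives the numeric ordering and a numeric average.
as_number = [row[0] for row in conn.execute(
    "SELECT CAST(max_depth AS INTEGER) AS d FROM game_log ORDER BY d DESC LIMIT 10")]
average_depth = conn.execute(
    "SELECT AVG(CAST(max_depth AS INTEGER)) FROM game_log").fetchone()[0]
conn.close()
```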
More advanced – board_state_id, the state where the player is, where they’re going next. So write analytic function to select the next state – using a window function. Multiple processing, took about 2-3s.
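The next-state pairing that the window function computes can be sketched in plain Python for a small sample. This is an illustration of the idea, not the query Felipe ran; the column layout (username, level, move number, board_state_id) and the rows are assumptions:

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

# Toy rows: (username, level, move_number, board_state_id); made-up values.
moves = [
    ("s1", 1, 1, "A"), ("s1", 1, 2, "B"), ("s1", 1, 3, "C"),
    ("s2", 1, 1, "A"), ("s2", 1, 2, "B"), ("s2", 1, 3, "D"),
]


def next_state_pairs(rows):
    """Pair each move's state with the following state within the same (user, level)."""
    ordered = sorted(rows, key=itemgetter(0, 1, 2))
    pairs = []
    for _, group in groupby(ordered, key=itemgetter(0, 1)):
        states = [row[3] for row in group]
        # The last move of a level has no successor, so it is paired with None.
        pairs.extend(zip(states, states[1:] + [None]))
    return pairs


pairs = next_state_pairs(moves)
transitions = Counter((a, b) for a, b in pairs if b is not None)
most_common = transitions.most_common(1)
```

Counting the pairs gives a first cut at the "most frequent patterns of moves" question: here the A to B transition appears for both students.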
Q: Can you define your own functions?
No, we add them at the moment. But if you ask us, and enough ask, we could add them. Can’t add your own for data security reasons.
Q: Mainly more descriptive operations. Could do something more complex like in R, set something up, point R at this data?
Here you can do SQL, has percentiles, ranking, some functions. For more advanced things you might want R, or a spreadsheet. Here running SQL on terabytes, but might want it on a spreadsheet or R. There we’re working on connectors. Can connect Apps Script to a spreadsheet now.
Q: Where does R run then?
I’d only bring aggregated data to R. Run R locally, on the aggregates rather than having it go through terabytes. App Engine doesn’t run R. Compute Engine yes. You have fast network connection between Google products. Pandas scripts too.
Write Python running on App Engine, that would talk to BigQuery via an API – it’s RESTful. Example in IP[y] – uses it as a service. This is a really fast, powerful way to have a conversation here. Full power of Python and Google speed.
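As a rough idea of what talking to BigQuery over REST looks like from Python, the sketch below builds (but does not send) a synchronous query request with the standard library. The endpoint path and the query/useLegacySql body fields follow the BigQuery v2 jobs.query method as I understand it and should be checked against the current documentation; the project ID and token are placeholders, and real code would obtain an OAuth access token first:

```python
import json
import urllib.parse
import urllib.request


def build_query_request(project_id, sql, access_token):
    """Build an HTTP POST for a BigQuery query; the caller decides whether to send it."""
    url = "https://www.googleapis.com/bigquery/v2/projects/%s/queries" % (
        urllib.parse.quote(project_id, safe=""))
    # useLegacySql is set because the bracketed table name is legacy syntax.
    body = json.dumps({"query": sql, "useLegacySql": True}).encode("utf-8")
    headers = {
        "Authorization": "Bearer " + access_token,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_query_request(
    project_id="my-project",           # hypothetical project ID
    sql="SELECT corpus FROM [publicdata:samples.shakespeare] GROUP BY corpus",
    access_token="PLACEHOLDER_TOKEN",  # placeholder, not a real credential
)
# Nothing is sent here; urllib.request.urlopen(request) would make the call.
```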
Q: What libraries, Python packages, if we don’t have ssh?
Here we have Compute Engine so you can. It’s your own computer, can do whatever you want. IP[y], it’s built in to a browser, run it on any computer (your laptop to a VM on Compute Engine or EC2 or whatever).
Step by step example (slides)
Raw data and Chilean census. Data.gov for US. datos.gob.cl from Chile. Full data from 2002, 15m. 3 gzip files, 1 min to download, 189M. 10s just to uncompress on a very fast computer. Now 1.1Gb dataset, SPSS file. But read it in to R, using library ff. Created about 37 CSV files, best – each file is a column from the census. With a bit of Python, combined to a big CSV – on a fast Google-quality computer, took 6 minutes.
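The "bit of Python" that combined the per-column files was not shown. One way it could look, assuming each file holds a single column with a header line and all files have the same number of rows; the file names and values below are toy data, not the census:

```python
import csv
import os
import tempfile
from contextlib import ExitStack


def combine_columns(column_paths, out_path):
    """Write one wide CSV whose i-th row joins the i-th row of every column file."""
    with ExitStack() as stack:
        readers = [
            csv.reader(stack.enter_context(open(path, newline="")))
            for path in column_paths
        ]
        out = stack.enter_context(open(out_path, "w", newline=""))
        writer = csv.writer(out)
        # zip streams the files in lockstep, so memory use stays small.
        for cells in zip(*readers):
            writer.writerow([value for row in cells for value in row])


with tempfile.TemporaryDirectory() as tmp:
    toy_columns = {"age": ["34", "7"], "sex": ["F", "M"]}  # made-up values
    paths = []
    for name, values in toy_columns.items():
        path = os.path.join(tmp, name + ".csv")
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows([[name]] + [[v] for v in values])
        paths.append(path)
    combined_path = os.path.join(tmp, "census.csv")
    combine_columns(paths, combined_path)
    with open(combined_path, newline="") as f:
        combined_rows = list(csv.reader(f))
```

Note that zip stops at the shortest file, so a real script would want to check that every column file has the same length.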
Now what? Spreadsheet too big, write code – 7s to run, MySQL, Hadoop need a cluster – BigQuery!
Compress CSV file. Took 1min to run query to upload. After 3-5min was inside BigQuery.
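Compressing the CSV before upload can be done with the standard library. A minimal sketch, with a made-up repetitive file standing in for the census CSV:

```python
import gzip
import os
import shutil
import tempfile


def gzip_file(src_path, dst_path):
    """Stream src_path into a gzip file without loading it all into memory."""
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


with tempfile.TemporaryDirectory() as tmp:
    toy_content = b"age,sex\n" + b"34,F\n" * 10000  # repetitive toy data
    csv_path = os.path.join(tmp, "census.csv")
    with open(csv_path, "wb") as f:
        f.write(toy_content)
    gz_path = csv_path + ".gz"
    gzip_file(csv_path, gz_path)
    original_size = os.path.getsize(csv_path)
    compressed_size = os.path.getsize(gz_path)
    with gzip.open(gz_path, "rb") as f:
        round_trip_ok = f.read() == toy_content
```

How much a real file shrinks depends on its content; this toy file compresses unusually well because every row is identical.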
URL to play with it: bigquery.cloud.google.com to see that data. Can ask queries, takes 1.5 Gb in seconds.
If you don’t know SQL, or want to give tools to people who can’t – connect it to a Google spreadsheet. Code to handle the connection, easy to authenticate from a spreadsheet to BigQuery because Google knows it all. Easy to e.g. graph things. Someone has to write the glue to connect the spreadsheet to BigQuery. Can do it in Tableau too.
Can share datasets – with e.g. all authenticated users, and more fine-grained.
Big message: Use BigQuery! Share your datasets with the world!
Site to share things – bigqueri.es. Shakespeare, US natality, Wikipedia, weather, Github timeline, HTTP archive, Trigrams, MtGox Bitcoin trades, crime data. Can run a join with these sets to connect them to your data.
Q: If I upload my data, share it with others. If I’ve put it up, do I pay for their queries?
You don’t force them, the model is you pay for your storage, user pays for their queries. Share easily and cheaply.
Demos weather database, shakespeare. Shakespeare –
select corpus from publicdata:samples.shakespeare group by corpus;
Then tweak to get word counts from each play.
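The tweak to get word counts per play is a GROUP BY with a SUM. A sketch of that shape using sqlite3 as a local stand-in; the column names word, word_count and corpus are assumed to mirror the public sample table, and the rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shakespeare (word TEXT, word_count INTEGER, corpus TEXT)")
conn.executemany(
    "INSERT INTO shakespeare VALUES (?, ?, ?)",
    [  # toy rows, not real counts
        ("the", 5, "hamlet"), ("king", 2, "hamlet"),
        ("the", 4, "macbeth"), ("dagger", 1, "macbeth"),
    ],
)

# One row per play, with the total number of words in it.
totals = conn.execute(
    "SELECT corpus, SUM(word_count) AS total_words "
    "FROM shakespeare GROUP BY corpus ORDER BY corpus"
).fetchall()
conn.close()
```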
Can also do it from command line. – bq – find most frequent without a handful of stopwords.
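The stopword filter itself is independent of the bq tool. A small sketch of the logic in Python, with a hypothetical handful of stopwords and made-up (word, count) rows:

```python
from collections import Counter

STOPWORDS = {"the", "and", "of", "to", "a"}  # hypothetical handful of stopwords


def top_words(rows, n=2):
    """Sum counts per lower-cased word, skipping stopwords, and return the top n."""
    totals = Counter()
    for word, count in rows:
        key = word.lower()
        if key not in STOPWORDS:
            totals[key] += count
    return totals.most_common(n)


rows = [("the", 9), ("king", 4), ("and", 7), ("love", 5), ("King", 2), ("sword", 2)]
top = top_words(rows)
```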
Can also access through your own applications, App Engine, Python script.
gsutil – command line tool, lets you copy your huge dataset to Google Cloud Storage. Separate from BigQuery storage. Here you persist your data; in BigQuery you might not leave them there. Just leave them on Cloud Storage. Can also copy stuff from Amazon to Google Cloud.
Q: Is there a dump/snapshot thing, download it then put it back.
Cloud Storage is good to upload first. Ingesting in to BigQuery is faster.
Q: Do you just have to export it, or is there a snapshot tool?
Will talk offline.
Documentation here for people to take away. Send us information about your success stories!
Q: Someone here messed up some data storage – had permissions because they were new. Or e.g. grad student who accidentally obliterates a dataset. Is there a way for you as an administrator to manage this, to say, this person accessed it, can I roll back? Where you need to work collaboratively.
Finer level storage on Google Cloud Storage. Can enforce security policy via ACLs, object-level rather than directory. Also has SDK, realtime API as well – can build in the collaborative stuff.
This work by Doug Clow is copyright but licenced under a Creative Commons BY Licence.
No further permission needed to reuse or remix (with attribution), but it’s nice to be notified if you do use it.