Talk Data to Me: Sparking Insights at Elsevier (Emlyn Whittick)
Thank you. I'm Emlyn, I work for Elsevier, and I'm the tech lead for Elsevier's big data platform. For those of you that don't know, Elsevier started off in 1880 as a traditional book-based publisher and later specialized in scientific and medical publishing. It's a very old company, but over time it's evolved, and now it really sees itself as an information solutions provider. What Elsevier is all about is trying to empower scientists and health professionals, through insights, to make better decisions. They're looking to lead the way in advancing science, technology and health, and a lot of that is through all of the data they have at their disposal.

Elsevier has a couple of major products that you may have heard of. There's Scopus, which is the largest abstract and citation database, with about 65 million peer-reviewed journal articles and, I think, about 27 million patents. There's also ScienceDirect, which has 14 and a half million full-text articles. Then there's all kinds of information from its social networks: Mendeley has over 5 million users, and there's institutional data and funding data, and all of this is spread across the organization. The real challenge comes when we ask how to take all that data, spread across the organization, do something with it, and provide valuable insights for our customers.

Many of you may have heard similar talks about how to take data and build something like a data lake, so in order to spice things up a little bit, I'm going to be talking about it in the context of taking some ingredients and trying to create some delicious dishes: taking little bits of food, combining them together through our master chefs, and providing something really delicious for our end users. The first step of this process is to actually get the ingredients. Now, when you go shopping you can find your ingredients all over the place, so you'll have your mainstream supermarkets.
You'll have your specialist suppliers, and you might have your local farm who you know does the best potatoes you've ever seen. All of these different places have slightly different guarantees when it comes to how they deliver your ingredients and at what kind of frequency. Your supermarket might have a fairly standardized, automated collection and delivery process, so you know it's pretty much going to turn up on time, and if not, there'll be something to help you cater for that, and it's going to be pretty much the same regardless of which supermarket you go to. Your specialist suppliers may be equally reliable and have these automated processes, but they might be completely proprietary, whereas your local farm might be completely manual: the farmer jumps on his bike, brings over the sack of potatoes and dumps it at your door, and if he's got a flat tire one day then it might not turn up.

The same goes with data. We have pieces of data in our organization with fully automated collection and delivery processes, so those interfaces are really well defined and it's actually quite easy to get that data. We've got others which are more proprietary, so they need a little bit of fine-tuning, and others that are completely manual, and we have to cater for all of this in our collection process.

The next problem is getting access to the data. With your food, getting your ingredients is not enough: you also need your tin openers, your pair of scissors, your nails to get under that cellophane. Data is much the same: it may be locked away inside a proprietary database, or it might be over in an AWS account, locked by some kind of proprietary keys. One of the main things we've tried to achieve is to provide unified access to that data. We'll provide you with the tin openers, we'll provide you with the scissors, and we'll let you get at that data without having to worry about the access, by building this kind of centralized platform. Once we've done that, the next stage is actually starting to get to the cooking, but before that I'm going to talk about how Spark fits into the picture.
Spark first came along to Elsevier in around 2014, when one of our teams looked to apply some natural language processing across Elsevier's ScienceDirect articles, the full published works. That was about 14 million XML files, and they wanted to do natural language processing across all of that content. Databricks was selected in this context, for a number of reasons. First, a lot of this data was spread throughout the organization, and the ability to mount that data within Databricks, and not have to worry about getting access to it, made it really easy to deal with. Secondly, the team didn't really want to deal with the operational overhead of managing their own infrastructure, and again the Databricks platform enabled them to avoid that. The third thing they wanted was to be able to present their results to their wider teams, and through the notebooks feature Databricks allowed them to achieve that.

If we move on to 2015, by that point the team had scaled up to about 15 users, and Databricks was being used for a variety of different content analytics use cases across various pieces of data. In parallel, Spark was also chosen as the main processing engine for Elsevier's big data platform. We chose Spark for a number of reasons. First of all, there are the obvious performance improvements over traditional MapReduce, but one of the things we also wanted was some level of convergence: we wanted to build a big data platform for everyone in Elsevier, something fairly generalized such that regardless of the use case, whether it was generic aggregations or something more complex like machine learning, and whether it was analytics teams, data scientists or developers, we could have a centralized platform where people could process all the data in our organization in one place, with one platform. Spark gave us that. So, moving on to this year.
We run both Spark 2.0 and Spark 1.6 in production for our production workflows. Our development processes have evolved as Spark itself has: our applications started off very much RDD-based, we've transitioned to DataFrames, and now we're looking to transition to Datasets. One of the great things about Spark is that there's so much invested in its development and its progress, and by adopting it as our centralized platform we get to come along for the ride: we get all the performance improvements, we get all the new features, and we just benefit from that. All our production workflows are written in Scala, but the great thing about Spark, of course, is that it doesn't have to be: a lot of our data science and analytics teams may well use Python or R and have their workflows deployed on top of Spark using those languages.
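To make that RDD-to-DataFrames-to-Datasets progression concrete, here's a minimal sketch in Scala, not our production code, of the same simple aggregation expressed against each of the three APIs; the Article case class and the sample values are purely illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("api-evolution-sketch").getOrCreate()
import spark.implicits._

// Illustrative record type; not Elsevier's actual schema.
case class Article(id: String, citations: Long)

// RDD style: functional transformations, no schema visible to the engine.
val rdd = spark.sparkContext.parallelize(Seq(("a", 3L), ("a", 2L), ("b", 1L)))
val rddCounts = rdd.reduceByKey(_ + _)

// DataFrame style: untyped rows plus a schema, so Catalyst can optimize the plan.
val df = rdd.toDF("id", "citations")
val dfCounts = df.groupBy("id").sum("citations")

// Dataset style: the same optimizations, but with compile-time types.
val ds = rdd.map { case (id, c) => Article(id, c) }.toDS()
val dsCounts = ds.groupByKey(_.id)
  .reduceGroups((a, b) => Article(a.id, a.citations + b.citations))

rddCounts.collect().foreach(println)
dfCounts.show()
dsCounts.show()
```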
So now that we've got our tools, it's time to actually do something and get some insights. What we want to do is empower those master chefs, those really smart people who can create those fantastic dishes. Before that, we need to prepare the ingredients; then we want to focus on sharing those pre-prepared and pre-cooked dishes amongst the organization; and then, of course, we want to serve that up to our customers.

So, step one: we need to prepare the ingredients. We've now collected all our ingredients together and we've got them at our disposal, but they need some preparation. Take our humble CSV potato: a bit of an old classic, it comes in all shapes and sizes, but it's quite often really, really grubby. We've got lots of CSVs hanging around the organization, there may be all kinds of bits of dirtiness hidden inside, and quite often a lot of little tweaking is needed to get the proprietary information out. Sometimes what looks like a CSV is not a CSV at all: you might have some nested JSON in there and all kinds of mixed formats. The great thing with the Spark libraries is that we were able to use them and process that data fairly easily. We've also got other things: take your XML onion, for example. We've got lots of these too. You'll pretty much find onions everywhere, they've got lots of nested layers, and quite often, especially in our case, they really do make you want to cry. A lot of the Scopus and ScienceDirect articles are XML, so we've got millions and millions of XML files to process in our organization. It's a constant pain point, but everyone's got them; XML is great for a lot of things, and again Spark provides us with that generalized platform to process these things.
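As a concrete illustration of that kind of preparation, here's a minimal sketch in Scala, not our production code, of reading a grubby CSV that has a JSON blob hiding in one column, and reading an XML corpus via the spark-xml package (which has to be attached to the cluster). The paths, column names and rowTag are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.get_json_object

val spark = SparkSession.builder.appName("ingredient-prep-sketch").getOrCreate()
import spark.implicits._

// A "grubby" CSV: tolerate malformed rows and pull a field out of a JSON blob
// hiding in one of the columns. Paths and column names are hypothetical.
val deliveries = spark.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED")   // silently drop rows that don't parse
  .csv("s3a://example-bucket/deliveries/*.csv")
  .withColumn("fundingAgency", get_json_object($"metadata", "$.funding.agency"))

// XML articles via the spark-xml package: one DataFrame row per <article> element.
val articles = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "article")
  .load("s3a://example-bucket/sciencedirect-xml/")

deliveries.printSchema()
articles.printSchema()
```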
Going back to collection, we also need to cater for the fact that all of those groceries need to be delivered. The large suppliers have well-established mechanisms, but as you go down the chain your local farms may not, and our data ingestion mechanisms need to cater for that. We've got automated processes set up for the automated systems on one end, with things like Amazon's Simple Notification Service, so queues on SNS feeding in messages and then taking in data from S3; but on the other side we've got people manually taking database dumps, putting them in a bucket and emailing them over when they're ready, and our aim is to automate them too.

As an example of one piece of data preparation that was particularly challenging for us: we had a use case with 200 million article abstracts that we wanted to process. They were all stored in Amazon S3, and as was mentioned in a talk yesterday, S3 is not a filesystem, it's an object store, and sometimes those little distinctions can cause a lot of problems. All of these XML files were named after the identifier of the artifact itself, we were getting data delivery notifications via SNS, and the files varied in size from a few kilobytes to many megabytes.

So what was the problem? We had a massively skewed distribution of keys: 200 million files, all named after their identifiers, but the distribution of those keys was massively skewed, which meant that the initial abstraction Spark gave us, of just trying to read the data directly from S3, just didn't work. The listing was a huge problem, we got throttled by S3 when we tried to make the requests, and we had to be a bit smarter about it. But again, Spark gave us the tools: we used Spark Streaming to take those SNS notifications and process the data updates as they came in, then we took that data, hashed it to provide a more evenly distributed key set, and reprocessed it with Spark's batch functionality, producing a Parquet file for consumption by our other processes.
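Here's a minimal sketch of that hash-and-repack step, purely for illustration: it assumes the streamed notifications have already landed the raw records somewhere we can re-read in batch, and the column names, bucket count and paths are all hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{abs, col, hash}

val spark = SparkSession.builder.appName("abstract-repack-sketch").getOrCreate()

// Raw abstract records, keyed by their original (heavily skewed) S3 object keys.
val raw = spark.read.parquet("s3a://example-bucket/staging/abstracts-raw/")

// Hash the original key into a fixed number of evenly distributed buckets,
// then spread the records across partitions before the batch rewrite.
val numBuckets = 512
val repacked = raw
  .withColumn("bucket", abs(hash(col("s3Key"))) % numBuckets)
  .repartition(numBuckets, col("bucket"))

// One consolidated Parquet dataset for downstream jobs to consume.
repacked.write.mode("overwrite").parquet("s3a://example-bucket/curated/abstracts/")
```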
So now we've prepared our ingredients, it's time to get cooking, and one of the key things we want to do is focus on sharing those cooked ingredients, because quite often it's a pain to do that preparation. A great example of this in food is cassava: when processed, cassava can make tapioca, but if you eat it raw or don't prepare it properly, it's got enough cyanide in it to cause some serious problems, and the preparation process takes days: a lot of drying, a lot of preparation. You don't want to go through that over and over again, and you also don't want to push that out to your end users. Likewise with our data processing, especially when we have very difficult or awkward sources of data: once one team, whether it's us or somewhere else within the organization, has done that pre-processing, we'd like to share it out to the rest of the organization so they don't have to do the hard work.

The other thing we've found is that cooking in batches definitely has its advantages. A lot of our work is currently batch-based, and it makes it really easy to verify the consistency of your batches. If you're cooking a sauce, it's very easy to check whether your sauce from day one and day two are very much the same, because you can take one batch, compare it to the next, do that testing between batches, and then release based on how happy you are with the batch. Likewise, if you get a bad batch, if you've put a bit too much salt or a bit too much chilli in your sauce, it's very easy to throw it away and go back to the previous batch. The same goes for data. And of course, what we're trying to do is take this data and share it: your end goal might be to produce a beautiful tomato sauce as part of your bolognese, but that tomato sauce may also be a key component of your lasagne, and again the same goes for our data. When we processed all of those articles for that first use case, the Parquet file we generated turned out to be useful for many other use cases within the organization.

We do pretty much all of our cooking with Spark. As I mentioned, we mainly cook in batch, but we are looking at doing more and more streaming, and we're looking at Spark Streaming for that. When it comes to streaming, it does get significantly harder, and the easy things the batch approaches give you become a bit more tricky: if you add a bit too much salt to your batch you can throw it out, but if you add a bit too much salt to a big pot of stew, it's a bit more difficult to take it back out again. We've ended up using Parquet for intermediate storage for most of our datasets, because of the obvious performance benefits and the optimizations it gives us when used within Spark for big data processing.

Going back to the initial use case I talked about: we took those 200 million XML abstracts, pre-processed them with Spark, and generated that Parquet file, and what we wanted to do next was generate some citation statistics from that data. We used Spark to do some simple aggregations across that data, parsing the XML and calculating the article citations by both article and author. There were two things we wanted to do with the results. We wanted to serve them up to our front-end teams so they could put them in a dashboard, and to do that we had another Spark job that transformed the data into key-value JSON so we could serve it from an API. But we were also able to take that dataset, the article citations, mount it back into Databricks and share it with the rest of our community, which meant that other data scientists and analytics teams could take that data and derive their own insights from it.
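A minimal sketch of what that citation-statistics job might look like, assuming a curated Parquet dataset with one row per citation event; the column names and paths are hypothetical rather than our actual schema.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("citation-stats-sketch").getOrCreate()

// One row per citation event, with hypothetical column names.
val citations = spark.read.parquet("s3a://example-bucket/curated/citations/")

// Citation counts by article and by author.
val byArticle = citations.groupBy("citedArticleId").count()
  .withColumnRenamed("count", "citationCount")
val byAuthor = citations.groupBy("citedAuthorId").count()
  .withColumnRenamed("count", "citationCount")

// Shared back with the wider community as Parquet...
byArticle.write.mode("overwrite").parquet("s3a://example-bucket/shared/citations-by-article/")
byAuthor.write.mode("overwrite").parquet("s3a://example-bucket/shared/citations-by-author/")

// ...and flattened to key/value JSON lines for the front-end API to serve.
byArticle
  .selectExpr("citedArticleId AS key", "citationCount AS value")
  .write.mode("overwrite")
  .json("s3a://example-bucket/serving/citations-by-article/")
```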
That's really where we're using Databricks, and since we adopted it, its usage has gone up and up: from that original use case of about 15 people, we're now nearing the hundred mark. What we do is take all the data from our data lake and mount it into the Databricks File System (DBFS), which allows very easy access, and there are also some performance gains that come with that, which make it even easier for people to use the data. We've also set things up so that we currently have a single shared cluster for general use: most of our users, when they want to do some analysis or analytics, go onto this single shared multi-tenant cluster and do their work there. When that's not enough, we also have people spinning up their own clusters for custom workloads, and we get all kinds of funky stuff. Reza, who's over in the audience here, did a talk yesterday about mentor-mentee relationships, and a lot of that was developed in Databricks; this one shows some entity relationships, rendered using D3 within the Databricks platform. We've got a lot of different use cases across the organization for this data: people building author relationship graphs, doing disambiguation research, natural language processing, recommendations around both articles and people, and various bits of data profiling and exploration. Quite often, when we do have issues with our production pipelines or pieces of data, our development teams will use Databricks to do that data analysis, that comparison, that debugging.
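Going back to the DBFS mounting mentioned above, here's a minimal sketch of what that looks like in a Databricks notebook (where dbutils, spark and display are provided), assuming the cluster's IAM role already has access to the bucket; the bucket name and mount point are hypothetical.

```scala
// Run once: mount the data-lake bucket so every user on the shared cluster
// sees it as an ordinary path, without handling credentials themselves.
dbutils.fs.mount(
  source = "s3a://example-data-lake-bucket",
  mountPoint = "/mnt/data-lake"
)

// Curated datasets can then be read directly from the mount.
val abstracts = spark.read.parquet("/mnt/data-lake/curated/abstracts/")
display(abstracts.limit(10))
```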
Databricks is also a key artifact for learning: we've had a couple of new graduates come into the team, and it was a really valuable resource for them to get up and running with Spark, not only to understand Scala and Spark a bit better, but also to understand the context of our data. It's very easy for them to go and explore, build dashboards, analyze that data, and get some useful insights.

So what's next for us? We've got plenty to come, and in a lot of ways we've done the easy stuff and it's the hard stuff that's still to come. We've got a lot to look at around much more fine-grained data access, privacy and security: especially when you're building a data lake, you're trying to spread this data out to all the different people and share it within the organization, so privacy, access and security are of the utmost importance, but they're also hard. Likewise data discovery and provenance: if you've got a data lake with a huge amount of data in it, it's not useful to the end users unless they can find it, and unless they know where it's come from, how reliable it is, and what processes were used to get it into place. Likewise, a lot of the data across any organization has varying degrees of quality, so there's going to be a lot of work needed around cleansing and classifying that data to make it of maximal usefulness to the organization. And finally, especially as we move into more of a real-time or streaming context, there's enhanced operational support.
That's going to be of the utmost importance. Having a batch workflow means you don't really need to worry about it too much, but as we move forward it's going to become more and more key. So that's basically how we at Elsevier have taken all the data across our organization, used Spark to both pre-process and cook our data, shared it amongst the organization, and then used Databricks for our wider community to get access to that data. You can reach me on this email address, and you can find me around the floor. Thanks very much.

Thank you, Emlyn. Now we have some time for Q&A, so a rare chance.

Question: Great talk, thank you. With regards to the shared cluster, I was wondering whether the need has come up across the organization at Elsevier to restart the cluster for whatever reason, which obviously affects other people and other teams that are using it. How do you manage that?
It is a difficult problem, and we do need to restart that cluster fairly often. Moving forward, I think the way we're probably going to start looking at it is to make more use of things like autoscaling. I think the big challenge is figuring out whether people actually need all the data they've got on the cluster, because quite often it will slow down because there's a lot of cached data in there and a lot of notebooks attached, and people just forget to detach their notebooks or uncache their data, so that's the tricky thing. Usually, just because of our use cases, I'll notify everyone, and everyone will have been experiencing the same slowdowns anyway; we can generally see from the UIs who's been using it or working on it, and then we'll go and bounce it. As we move forward we're going to need something a little more robust there, and I think autoscaling will definitely come into that. The other thing I've heard is that increasing the size of the driver, so having a larger driver and smaller workers, can also help.

All right, thank you. Any other questions?

Question: Sure. You mentioned security as one of the things as you move forward. I come from a telecoms background, where a lot of our data is sensitive to customers, things like call records. Would you consider, as it were, moving away from a cloud-based solution to an on-premise solution to get around some of the security issues you may face with the cloud?

I don't think it's something we've looked at, just because we get so much from being in the cloud, and I think we feel that by utilizing some of the things the cloud gives you, like VPCs and things of that nature, and having those security groups set up, that gives us enough in our case. Any more questions? OK, well, thank you.
Emlyn, great talk.