Professor Eric Meyer, University of Oxford


So I'm Professor Eric Meyer. I'm from the Oxford Internet Institute at the University of Oxford, where I'm Professor of Social Informatics; I'll explain what that means, in case you don't know what social informatics is, in a slide or two. I'm also a Faculty Fellow here at the Alan Turing Institute, and I've been involved with the Turing since its very first startup. We had a lot of workshops last year and I attended a number of those, so today I'm going to talk about one or two topics I'm interested in that might get other people interested in some of the opportunities here at the Turing Institute. First, though, I'll tell you about two things I'm not going to talk about today but would be happy to discuss on another occasion, since they might be of interest to some of the people at the Turing. We've got a project right now on automation in the primary health sector. It's a joint project that involves both qualitative and quantitative researchers: we've got an ethnographer who's going out and spending time in doctors' offices to figure out how they spend their time and what tasks they do, and we're currently in the process of hiring a postdoc in the machine learning area who's going to work with Mike Osborne in the Engineering department at Oxford to build models of the future of automation in the health sector. If you know of anybody who's interested in a postdoc, send them my way, because we've got an opening right now and we're looking for somebody good on that project.
The other project I won't really talk about today has to do with blockchain. It's with Vili Lehdonvirta, who you might have met, who's also a Faculty Fellow here. He and I are working with an organization in London called DACS, the Design and Artists Copyright Society, to understand the potential of blockchain in the art community.

How artists can use blockchain to keep track of provenance and payments, those sorts of things. Both of those projects are a bit more on the machine learning side. The topic today isn't so much about machine learning; it's about a resource that could potentially be leveraged because of the Turing's location here in the British Library, something that has been underused over time. I'm going to talk about the web as a knowledge machine: what a knowledge machine is, a bit of web history, web archives, and internet research.

First, social informatics. I promised I would tell you what that means, in case you don't know. The way I like to describe social informatics is this: if you spell the word socio-technical, which is about people and the technologies they operate, you use a hyphen, and I like that hyphen for a reason. You'll see why, because I've written an article about understanding the hyphen. Essentially I look at the hyphen, the connection between people and the technologies they use: how people make their choices about technologies, how technologies shape the choices people make, and how people shape the technologies that get built. We've done this in a lot of different areas over the last couple of decades. One of the things we did was a book I wrote with my colleague Ralph Schroeder in 2015 called Knowledge Machines: Digital Transformations of the Sciences and Humanities. It's about the socio-technical configurations of researchers in different disciplines, and how they're using computational approaches, much like many of the people in this building, to generate new knowledge. That's really a change in some areas. It's probably less of a new change in the sciences, where we've had a lot of big data for a long time, but in the humanities it's much more recent; dealing with computational approaches to understanding things isn't part of the DNA of the humanities. The book goes into quite a bit of detail about that if you'd like to read it.

Now, one of the things we're interested in at the Oxford Internet Institute is understanding the internet. Obviously it becomes a bit of an obsession to understand everything about the internet when all your colleagues are talking about it. So I wanted to start today with a couple of pictures from a paper we published in 2016 that was looking at this idea of the net as a knowledge machine.

So how has the internet become embedded in different disciplines across the world, and become not only a tool of research but a topic of research? This isn't big data and doesn't use any fancy machine learning; it's just a simple scientometrics study using bibliometric data. These are data from Scopus, which many of you are probably familiar with: all the publications that come out in about 19,000 different journals. I should explain exactly what's going on here. This is 1990 and 1995, and then you'll see a more recent picture in just a second. The underlying map with the gray dots comes from Loet Leydesdorff in the Netherlands, who has built an underlying map of science. This covers all of the publications in Scopus over a 50-year period: they look at all the publications and all the citations in those publications and build a citation map, so any journals that cite each other are closer to each other, and the more they cite each other, the closer together they are. So these two journals are really close to each other, almost overlapping, while that one and that one have probably never cited each other at all. You build this underlying map of science, and then you can extract data from Scopus on a particular topic and overlay it on the map.
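To make the closeness rule concrete, here is a minimal sketch in Python of the counting step behind such a map. The journal names and citation records are invented for illustration, not Scopus data: the idea is simply that the more two journals cite each other, the closer they are drawn on the map.

```python
from collections import Counter

# Toy citation records: (citing_journal, cited_journal).
# Journal names are invented for illustration only.
citations = [
    ("J. Web Sci.", "Internet Stud."),
    ("Internet Stud.", "J. Web Sci."),
    ("J. Web Sci.", "Internet Stud."),
    ("Phys. Rev. X", "J. Appl. Phys."),
    ("J. Web Sci.", "Phys. Rev. X"),
]

def mutual_citation_counts(citations):
    """Count citations between each unordered pair of journals.

    In a map of science, pairs with higher counts are drawn closer
    together; pairs that never cite each other end up far apart.
    """
    counts = Counter()
    for citing, cited in citations:
        if citing != cited:
            counts[frozenset((citing, cited))] += 1
    return counts

counts = mutual_citation_counts(citations)
# "J. Web Sci." and "Internet Stud." cite each other three times in
# total here, so they would sit close together on the map, while the
# physics pair would sit far from both.
```

A real map-of-science layout then feeds these pairwise counts into a force-directed or multidimensional-scaling layout, but the proximity signal is this simple.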

Then you can ask: how does the topic I'm interested in map onto all of science? In 1990, when the internet is first coming around and the web is first starting, you see things that include the internet as a topic scattered sparsely across knowledge, and it grows quite rapidly, so that by 1995 you start to see it showing up everywhere. On the underlying map, the social sciences are over here, the humanities are down there, the physical sciences are up here, and medicine is down in this area, and you can see the topic growing right across the map. By 2015 it's basically across all of science: the internet is part of the publications coming out in journals across all of the scientific disciplines. Now, this isn't necessarily terribly surprising. We all know intuitively that this is happening, but to the best of our knowledge we were the first to actually show it using any kind of data, the extent to which the internet has become a topic. This uses a fairly complex set of search terms to extract papers that contain some kind of reference that is actually about the web or the internet; things like "look at www.whatever" are excluded by our methods. The idea, as with the net as a knowledge machine, is that because the internet has become so embedded in what we do, it becomes a source of knowledge creation across disciplines, almost a general-purpose tool that can be used in multiple disciplines in ways we don't necessarily think much about, because it's become so ubiquitous. One of the things this also highlights is that the web has been around for 25 or more years. The 25th anniversary of the web has happened several times in the last few years, because everyone can't quite decide exactly when it started: whether it was when
Tim Berners-Lee turned on the machine, or when he did other things at CERN. But essentially the web has been around for over 25 years, and the question is: can't we do anything with that collected body of information, at least as much of it as has been saved? Some of you might be familiar with the Wayback Machine, which was set up by the Internet Archive.

The Internet Archive started collecting web archives in 1996 and has been doing it ever since. They send out spiders and crawlers to grab stuff off the web; this was Brewster Kahle's big idea in the 90s. The Internet Archive is an interesting organization, because it's set up as a digital library in the state of California; that's its legal incorporation status. They decided very early on that the internet was going to be important and that somebody should be saving it, rather than just letting pages be deleted and disappear from our memory. But of course, the Wayback Machine — how many of you have used the Wayback Machine? Anybody here? A couple of people. The Wayback Machine doesn't really lend itself to anything other than saying: I know where a page is, I'm going to go find that page and look at it in the context it was in at a particular point in time. So if you put in Turing, you can see that there are 21 captures of the Turing page between September 2015 and February 2017.
Obviously there's nothing before that, because the page didn't exist then. But there are only 21 captures, and if you look at any of a variety of pages you'll get different levels of granularity in the number of times they were captured: little snapshots that you can look at, but uneven over time. Only fairly recently did they put up a beta version of the Wayback Machine that lets you do search; until very recently you couldn't search anything at all. I've done some projects with the IA, and they said essentially that they didn't have indexing tools that would run fast enough to keep up with their accumulation: they were accumulating data faster than they could index it, so they didn't bother trying, because they couldn't do it fast enough and they didn't have the kind of infrastructure Google has for indexing. But they've recently built in some indexing that looks at just home page terms, not everything else, so you can put in homepage terms.
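As an aside for anyone who wants those capture lists programmatically rather than through the browser interface: the Internet Archive exposes its index through the CDX API, which returns one space-separated line per capture. A minimal Python sketch follows; the endpoint and field order are the documented defaults, but the sample response lines are invented for illustration, not real captures.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"
# Default plain-text CDX response fields, space-separated per line:
CDX_FIELDS = ["urlkey", "timestamp", "original", "mimetype",
              "statuscode", "digest", "length"]

def cdx_query_url(url, **params):
    """Build a CDX API query for all captures of `url`."""
    return CDX_ENDPOINT + "?" + urlencode({"url": url, **params})

def parse_cdx(text):
    """Parse a plain-text CDX response into dicts, one per capture."""
    rows = []
    for line in text.strip().splitlines():
        rows.append(dict(zip(CDX_FIELDS, line.split(" "))))
    return rows

# Invented sample lines in the CDX format (not real captures):
sample = (
    "uk,ac,turing)/ 20150901000000 https://turing.ac.uk/ text/html 200 AAAA 5120\n"
    "uk,ac,turing)/ 20170201000000 https://turing.ac.uk/ text/html 200 BBBB 6301\n"
)
captures = parse_cdx(sample)
```

Fetching `cdx_query_url("turing.ac.uk")` over HTTP would return capture lines in this same format, which is what makes counts like "21 captures between September 2015 and February 2017" easy to reproduce in code.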

It'll get you some pages that come up. You're exploring 279 billion web pages over time, and it pulls up some things that match Alan Turing. Here we've got alanturing.net with 238,000 captures from 2000 to 2016, or turing.org.uk with 21,000 captures from 1999 to 2016. Again, this is all very non-computational. You can go look at individual captures; if you want to see what was on a page at one point, all well and good, but the interface doesn't lend itself to anything large-scale and interesting. Now, there have been some efforts to do something more with the data that is in the Internet Archive. This is the SHINE prototype that was built here at the British Library. It was part of a number of efforts, including projects we were involved in at the OII in partnership with the Institute of Historical Research and the British Library, funded by JISC. JISC bought an extract of the Internet Archive from the Internet Archive, called the UK web archive, containing a lot of data related to the UK, and built at least some faceted searching on top of it. So if you put in Alan Turing it brings up a couple of pages, but then you can also filter by content type (HTML, PDF), by postcode where there's evidence about the location of the page, by crawl year, by suffix, and so forth. You can at least start to use these facets to find more detailed information, which was largely impossible using data directly from the Wayback Machine. We did fund a number of small projects to use the SHINE interface, but again they tended to move back toward the typical methods of historians and humanities scholars, which is looking at individual pages.

Once they found things, they would use these facets to narrow down what they were looking for, but then work at the individual level. We wanted to do something a bit bigger, so we worked with the BL to try to extract data from the whole collection and look at the whole thing at once, to see if we could do anything computationally. This is actually harder than one might think, and I'll give you some reasons why on the final slide. This is from a new publication: we published an earlier version at the ACM Web Science conference in 2014, and a new version is coming out in the book The Web as History from UCL Press, which is online and open access; it comes out in March, I believe. We extracted all the data from the web archive, looking just at the links within the web pages. We weren't able to get the content, and I'll tell you why on the final slide — legal restrictions. But just looking at the linking pages, who links to whom and what the links are, between 1996 and 2010, where that particular archive ends, you can see the growth of the UK web subdomains. Again a relatively simple question, but nobody had done this before. This is on a logarithmic scale. You can see that the company domain is the largest and the academic domain is pretty significant, and by 2003, essentially everybody in the academic and gov domains that was going to have a web page had one. After that you still see growth in the number of web pages, but by 2003, if you're a university, you've got a web page, and that stays pretty stable. You still see companies going up a bit, because new companies form.
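The underlying data here is essentially a long list of link records. As a sketch of the kind of counting behind a subdomain growth chart like this one, here is a small Python example; it assumes a simplified (year, source URL, target URL) triple format with invented URLs, not the BL extract's actual schema.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Toy link triples: (crawl_year, source_url, target_url).
# URLs are invented; the real extract is far larger.
links = [
    (1996, "http://www.ox.ac.uk/", "http://www.cam.ac.uk/"),
    (1996, "http://www.cam.ac.uk/", "http://www.ox.ac.uk/"),
    (1996, "http://www.acme.co.uk/", "http://www.gov.uk/"),
    (2003, "http://physics.ox.ac.uk/", "http://www.ox.ac.uk/"),
    (2003, "http://www.newco.co.uk/", "http://www.acme.co.uk/"),
]

def second_level_domain(url):
    """Map a URL to its UK second-level domain, e.g. 'ac.uk' or 'co.uk'."""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def hosts_per_domain_per_year(links):
    """Count distinct source hosts per second-level domain per crawl year."""
    seen = defaultdict(set)
    for year, src, _tgt in links:
        seen[(second_level_domain(src), year)].add(urlparse(src).hostname)
    return {key: len(hosts) for key, hosts in seen.items()}

counts = hosts_per_domain_per_year(links)
```

Plotting `counts` by year, one line per second-level domain, on a log scale gives exactly the sort of growth chart described above.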

This looks at the relative sector size on the web. You can see that academics were more prominent in the early days; again, something we suspected and knew intuitively, but now we have data for it. Still, even in 1996, companies were much more prominent than academic institutions, but academic institutions had a significant portion, and that shrinks to almost nothing in terms of the overall size of the web by the current era; we've become much less prominent. But then, how do these sectors link to each other? This is sectoral linking on the UK web, looking at the top four second-level domains. I'm not going to point out everything in this picture; you can look at it in the paper. But here we've focused on the second diagram, where we've normalized for the size of the domain, because otherwise, in the first one, links going to the .co domain just overwhelm everything; it's too big. One of the things you can see is that academics link back to themselves quite a lot: not necessarily to their own pages, but to other academic pages. There's a lot of within-academia linking going on. You see a lot less of that in the corporate world: companies don't link much to other company pages, but they link a lot to government pages, due to regulations and other kinds of things going on in government. You see a lot of sectors linking into government pages, but the government doesn't link out that much: government pages don't send links out to other domains very frequently, and you can see that from the lines starting in this color.
There's gov-to-gov, but gov-to-academic is this little line, and gov-to-business is that tiny little line there, so there's not much linking going out of government domains. You can start to understand something about the relationships between different sectors of the UK by looking at data like these. We also dug into a bit more detail, and I've only brought one of these slides, about what's going on in the academic world. In this picture we looked at hyperlinks between universities, dividing them into their different affiliation groups: the Russell Group, the 1994 Group, the University Alliance. These are rough stand-ins for different types of universities; the Russell Group are generally the research-intensive universities that could be clustered together. We wanted to see:

Do these clusters also get reflected on the web? Are Russell Group universities more likely to link to each other? Are Cathedrals Group universities more likely to link to each other? It turns out the answer is mostly no. The only group with any significantly increased likelihood of linking to each other was the Russell Group, and that has more to do with the fact that they're all research-intensive universities than with their membership of the Russell Group as such. We also did some additional data analysis that I didn't bring today, because it's a bit too complex to look at on screen, on the question of whether we see a Matthew effect over time, meaning the rich get richer: do the highly linked, highly visible Russell Group universities become even more highly linked over time, or does that old-fashioned meme about the internet being democratizing hold, where the web opens things up to everybody and everybody can do better? What we find is that the most research-intensive, most prominent universities become more prominent over this 25-year period: being prominent leads to more prominence on the web, at least when we look at links. So this was all relatively simple. We're dealing with large files, but they're quite simple: just text listing the linking page, the linked-to page, and when, so it's a really quite narrow text file, even though it's quite long. Even so, it took quite a long time to deal with, and there are a lot of challenges to doing anything more with web archives. I've got a slide on this coming up next.

It shows some of the publications we've done. I find web archives really frustrating for a lot of reasons. One has to do with legal limits on using them. The Legal Deposit Libraries Act 2003, which was not implemented until regulations were put in place in 2013, was written in a way that made the kind of research we might want to do nearly impossible, because up until 2013 the British Library couldn't archive a web page in the UK unless they had the written permission of the site owner. That's a much more restrictive view of what's possible than the Internet Archive in San Francisco took: the IA says, we'll grab everything, and if somebody complains we'll take it out, whereas in the UK the approach was, we need positive assurance that we can include something before we do. The 2013 regulations changed that and said the British Library and the other deposit libraries can archive anything in the UK web space without asking permission. They can just go grab it, because it's a published item in the UK, just like any other legal deposit book: the British Library, the Bodleian, and the other deposit libraries are entitled to a single copy of every document published in the UK, at no charge. So they can go out and grab these digital items. But the regulations were written in such a way that the library can hold a single copy, accessible on a machine in the library. So somewhere in this building is a machine where you can go and access the web archive data, but for most of us, if we want to do anything even remotely complex, the kind of stuff we did with the web links, what can you do on a machine sitting in the library that doesn't have your tools on it and doesn't have your data on it?

A machine that doesn't have any way of doing anything with the data? Not very much. So we've been working with people like Adam Farquhar here at the British Library to investigate ways that we can legally open up that data to more kinds of uses, ways that still fall within the rules of the law but let us do something more interesting, possibly some kind of mining across time. Now, the data within these web archives are also quite incomplete and inconsistent, because the crawlers went out and got whatever they could. News sites and other things they expect to change a lot, they'll crawl with greater frequency than other sites; but if you're interested in something, you might find that a particular kind of page hasn't been crawled very consistently, and you might have entire years missing when there was no crawl of a particular site. So it becomes difficult to work with these data over time. Also, the data is stored in WARC files, web archive files, a standard that was adopted a number of years ago. The problem with WARC files is that you can't just take tools that work on the live web, point them at WARC files, and have them do anything, because they don't work that way: WARC files are structured differently, so a lot of the tools we're used to on the live web essentially don't work with them. There are also a lot of missing object, page, and data types in the web archiving tools. They largely don't grab things like Flash objects or video, and they certainly don't capture database-driven sites: if you've got a site that's being driven by a database, you just get the HTML part at the front, and more and more sites are built that way. And they certainly don't grab anything that's running in an app, so you're missing all that part of the ecosystem.
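To give a feel for why live-web tools don't transfer: a WARC file is a sequence of records, each with its own header block, rather than a tree of linked pages you can crawl. Below is a minimal Python sketch that pulls record headers out of a simplified, text-only fragment. Real WARC files are binary, with payloads delimited by Content-Length, and in practice you'd reach for a dedicated WARC library rather than this; the sample records are invented.

```python
def warc_record_headers(text):
    """Extract header dicts from a simplified, text-only WARC fragment.

    Each record starts with a 'WARC/x.y' version line, followed by
    'Name: value' header lines, ended by a blank line. This sketch
    ignores payloads entirely; real parsing must honor Content-Length.
    """
    records = []
    current = None
    for line in text.splitlines():
        if line.startswith("WARC/"):          # start of a new record
            current = {"version": line.strip()}
            records.append(current)
        elif current is not None and ": " in line:
            key, _, value = line.partition(": ")
            current[key] = value.strip()
        elif not line.strip():                # blank line ends the headers
            current = None
    return records

# Invented two-record WARC fragment (payloads omitted):
sample = """WARC/1.0
WARC-Type: response
WARC-Target-URI: http://example.co.uk/
Content-Length: 0

WARC/1.0
WARC-Type: request
WARC-Target-URI: http://example.co.uk/
Content-Length: 0
"""
records = warc_record_headers(sample)
```

Nothing here looks like the live web: no DOM to scrape, no links to follow until you've first unpacked every record, which is why ordinary web tools have to be rebuilt for archived data.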
Another problem is that many of these web archives have a real national focus.

I've been mentioning the UK web archive, which is held here; there's a Dutch archive held at the Koninklijke Bibliotheek, the national library of the Netherlands; there are ones in the US. But of course the internet isn't a national place, at least not yet. The internet is global, and many of the questions we want to ask don't really make sense if we limit ourselves to a national focus. And then the final challenge is that there's just a small block of researchers interested in doing anything with web archives. We've written a number of papers (they'll pop up in a second) that have tried to find evidence of people wanting to do things with this material, and there's very little research or focus. I think there are two reasons for this. One is that the people who study the social science of the internet, like my colleagues at the OII, or web science, like the people at Southampton, largely look at the contemporary web as it is today, things happening live on the web, rather than looking at the history. Is the last 25 years of the web an internet science question or a history question? It's something contemporary historians are starting to be interested in, but they largely haven't gotten into this space at all; the people interested in the last 25 years haven't started to engage with web archives as a way of understanding it. But I would argue that if the internet continues to develop the way it is, then in a hundred years' time, if you want to understand anything about this era, web archives are probably going to be one of your best sources for understanding the history of today. And largely we haven't been able to interest that community in asking these kinds of questions today and developing the field as we go forward. So we really run the risk of having dusty digital archives.
They've been collected very carefully by libraries, but they see very little use. I think one of the interesting challenges for us at the Turing is to ask whether, by dint of our relationship with the British Library and our location here, we can dream up something new and interesting to do with these web archives sitting in this building, largely untapped. I think there's lots of untapped potential here, and if you're interested you can talk to me or to Adam Farquhar.

There are other people involved in this who can really help come up with some interesting challenges we might be able to tackle with web archives. These slides will be available somewhere afterwards; they list a lot of the publications we've had about web archives over the years, some of which detail my frustrations in dealing with web archives, because there are a lot of them. But I do think this is a promising area of research: it's largely an open field, and we can do anything we like with it.