Monday, 9 August 2010

Uncovering hidden connections in Research Activity Data

I recently submitted an article to Sconul Focus which, I hope, will be published in a few months' time. The topic of the article was data harvesting and aggregation: a simple and (hopefully) easy-to-read explanation of a rather complex topic. I used our entity registry as an example and described how it harvests data from University and external sources, how it converts sources into RDF format so data can be aggregated and, finally, how it uncovers hidden connections. I thought this last step, uncovering hidden connections, was a very interesting and value-adding process, so I will blog about it here.

When we collect data from different sources, there is a chance that some of these data are interconnected. Different sources may contain replicated data, e.g. a researcher's profile can be found on both his college's and his department's websites. Different sources may also contain data that complement each other, for example a researcher's profile on his college website and a list of his projects and publications on his departmental website. Originally, these webpages do not include links to each other, so unless we know about the other source we will not get a complete profile for that researcher. Add more sources, external ones too (e.g., funders' websites containing information about grants), and we will get information about researchers which is spread all over the place but disconnected.

Uncovering hidden connections means making connections between data in different sources more evident, as in adding links between sources so that each one points to all the others. But how do we create those links? Here is a non-technical introduction:

For example, if we get Prof Francis Matthew Kellner's profile in source 1 and Prof Francis Matthew Kellner's research interests and publications in source 2, we can establish with some level of certainty that these two sources refer to the same person. Therefore we can connect the data in these two sources and build a more complete record containing Prof Kellner's profile, research interests and publications.

There are, however, other cases which are not so straightforward, where names are similar but we cannot be sure they belong to the same person. For example, we might get Prof Francis Kellner’s biography in source 1, Prof F. M. Kellner listed as Principal Investigator on a project in source 2 and Francis M. Kellner as author of publications in source 3. How do we know if these data belong to the same person?

For cases like these we have developed a ‘same-as’ process.

‘Same-as’ has a set of rules which use information such as people’s first name and surname, researchers’ affiliation and email. Depending on the availability of information and whether the sets of data match, ‘same-as’ determines whether two or more records belong to the same person. If the records belong to the same person, ‘same-as’ merges them. If the information available is not enough to do the matching, or if the data do not match, ‘same-as’ keeps the records separate.

The following is the logic used. This is a technical-ish explanation written by Anusha.

Search for people with the same last name, who are part of a group of sources, e.g. sources belonging to the social sciences (we group sources together, based on likelihood of information overlapping)

Case 1
Match people only if each person has at least the fields below and they all match.
  • first name (not just initials)
  • last name
  • affiliation
  • source(s)
If the source is a trusted source:
  • staff_id
Note: A subset of the first name will also be matched. Example: John P M, John P, John
Create a person superset and add all of their info
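
The note about first-name subsets can be illustrated with a small sketch. This is not the registry's actual code: the function name first_names_compatible and the reading of "subset" as "one name's words are the leading words of the other" are assumptions made for illustration only.

```python
def first_names_compatible(a: str, b: str) -> bool:
    """Hypothetical check: True when one name's tokens lead the other's.

    Assumption: "subset" means a leading run of tokens, so that
    "John P M", "John P" and "John" are all compatible with each other.
    """
    tokens_a = a.replace(".", " ").lower().split()
    tokens_b = b.replace(".", " ").lower().split()
    if not tokens_a or not tokens_b:
        return False
    # Case 1 asks for a full first name, not just an initial.
    if len(tokens_a[0]) < 2 or len(tokens_b[0]) < 2:
        return False
    shorter, longer = sorted((tokens_a, tokens_b), key=len)
    return longer[:len(shorter)] == shorter
```

Under these assumptions "John P M" and "John" are compatible, while "John P" and "John Q" are not, and "J P" is rejected because it starts with an initial only.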

Case 2
For people with information in at least the following fields, with all of them matching:
  • initials
  • last name
  • affiliation
  • source
Create a new person superset and add them to that. Treat them as a separate person and do not add them to the person above
If the person matches the information above, suggest a connection.

Case 3
For people with information in at least the following fields, with all of them matching:
  • initials
  • last name
  • source
Create a new person superset and add them to that. Treat them as a separate person and do not add them to the person above.
If the person matches the information above, suggest a connection.
If the source is a trusted source (data harvested from within Oxford), display them in the browse / search results pages, else do not display them.
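
The three cases above can be sketched as a single function that compares two records. This is only a rough illustration under stated assumptions, not Anusha's implementation. The PersonRecord class, the classify function and the result labels are hypothetical names. The sketch also assumes that "source" means the group of sources mentioned earlier, that staff_id is only compared when both records come from trusted sources, that initials can be derived from a first name when they are missing, and that full first names must be identical (the subset rule is left out for brevity).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonRecord:
    # Hypothetical record layout; the field names are illustrative only.
    last_name: str
    source_group: str
    first_name: Optional[str] = None
    initials: Optional[str] = None
    affiliation: Optional[str] = None
    staff_id: Optional[str] = None
    trusted: bool = False


def _key(value: Optional[str]) -> str:
    """Normalise a field: drop dots and spaces, lower-case."""
    return "".join(value.replace(".", " ").lower().split()) if value else ""


def _same(a: Optional[str], b: Optional[str]) -> bool:
    """Both values must be present and equal after normalisation."""
    return bool(_key(a)) and _key(a) == _key(b)


def _initials(rec: PersonRecord) -> Optional[str]:
    """Use the stored initials, or derive them from the first name (assumption)."""
    if rec.initials:
        return rec.initials
    if rec.first_name:
        return "".join(tok[0] for tok in rec.first_name.replace(".", " ").split())
    return None


def classify(a: PersonRecord, b: PersonRecord) -> str:
    """Return a label describing which case (if any) a pair of records falls into."""
    if not (_same(a.last_name, b.last_name) and _same(a.source_group, b.source_group)):
        return "no_match"
    both_trusted = a.trusted and b.trusted
    affiliation_ok = _same(a.affiliation, b.affiliation)
    # Case 1: staff_id is only required when both sources are trusted.
    staff_ok = _same(a.staff_id, b.staff_id) if both_trusted else True
    if _same(a.first_name, b.first_name) and affiliation_ok and staff_ok:
        return "case1_merge"
    if _same(_initials(a), _initials(b)):
        if affiliation_ok:
            return "case2_suggest"
        # Case 3: only trusted-source people appear in browse / search results.
        return "case3_suggest_visible" if both_trusted else "case3_suggest_hidden"
    return "no_match"
```

With these assumptions, two "Francis Kellner" records with the same affiliation in the same source group come back as "case1_merge", whereas "F. M. Kellner" and "FM Kellner" with the same affiliation only produce "case2_suggest", so they stay separate and a connection is merely suggested.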

The ‘same-as’ process focuses on ‘people’ entities. Projects, publications, funders and academic units usually have fixed (or standard) names which are used consistently across sources. However, names of people are frequently written in different ways, depending on the context.

So now, as you can imagine, every time we add more data to the registry, we pass these data through the same-as process to see if there are any hidden connections to the data we already have. Or, to put it a different way, every time we add a new source, we add not only its data but also the connections we find with same-as, which were probably unknown or at least not evident in the original sources.

If you want to know more you can wait for the Sconul Focus paper or if you are impatient e-mail us.

Tuesday, 13 July 2010

Southampton ECS release their data into the public domain

The School of Electronics and Computer Science (ECS) at the University of Southampton releases all public data in open linked data format.

"In what is believed also to be a world-first, ECS has become the UK’s first University department to release all its public data in open linked data format.

The School of Electronics and Computer Science (ECS) at the University of Southampton is at the forefront of the open linked data initiative through the work of its Professors Sir Tim Berners-Lee and Nigel Shadbolt."

You can read the complete article here or here

We think this is good news and a good example to follow.

Wednesday, 30 June 2010

BRII Summer Project

This is another update to explain our summer 2010 activities. At BRII we are working on a reporting system where users can notify us and official sources of data about errors they find in Research Activity Data. This system will help us and our sources to improve the quality of data. As we are harvesting data from other sources, we are designing a system where users can flag errors and send notifications to the appropriate people (sources and BRII). These notifications will contain enough information to decide on a suitable action to take.

Errors could originate from the content of the data themselves or from the process of aggregation we perform at BRII. Examples include misspellings and wrong information in source data, information which has been aggregated but which belongs to different people with the same name, information belonging to the same person but which appears as belonging to two or more people with the same name, etc.

In relation to aggregation errors, Anusha has been working hard to design a system to accurately identify sets of data which belong to the same person. For example Prof John Smith in source 1 and J. P. Smith in source 2 could be the same person, or could not. For this she is using extra information that comes with data such as affiliation etc. When her algorithm is finished we will be able to merge two or more "people" into one or divide one "person" into two or more "people" as requested by administrators or users who identify inaccuracies.

For the summer task we are collaborating with the Computing Laboratory (Comlab). Anusha is currently harvesting their data and Monica is working on the reporting/notification forms within the Blue Pages. We will soon contact Comlab again to check their harvested data and participate in tests. We would like to thank Thorsten Hauler, research facilitator, and Edward Crichton, web manager, from Comlab, who have kindly given us their time.

This summer project is part of the data quality control that we are trying to establish within the registry. I have talked about this in a previous post.

Tuesday, 29 June 2010

Advisory Group Meeting

On Friday 2nd of July we will have our first advisory group meeting after the BRII Project ended. The advisory group is comprised of 10 stakeholders from across the University. The purpose of this meeting is to discuss avenues for development of the Entity Registry and related services (i.e. Blue Pages) and strategies for embedding these services within the University. This of course will be a long process and we are just starting. We have good foundations though as the BRII project successfully produced a proof of concept system which is capable of being expanded with services for and information about all University divisions. If you want to have an idea of what we have produced watch our screen cast of the Blue Pages here (wmv file).

Thursday, 17 June 2010

BRII at the SCONUL Conference

Yesterday I attended the "Organisational Leadership in a Time of Change" SCONUL Conference at Leeds Queens Hotel. Although I could not stay for the whole event (I stayed only 1 of 3 days) I enjoyed it. I attended a few presentations and presented our BRII poster.

I would like to thank Prof John Lancaster who welcomed me and who kindly helped me to move the BRII poster to a more visible place :)

Sorry for the poor quality pictures, dark place and shaky hands.

Thursday, 13 May 2010

BRII Update

This is a short update on our activities and some ideas I have come up with.

Since the end of the BRII project we have been working on the registry and user engagement. We are adding more data continuously. We are also outlining data quality control processes and planning some activities with users to evaluate our work.

Outcomes of our work in BRII gave us some insights into what the requirements for a successful service of the registry would be. For example:

From the departments' and individual users' point of view:
  • Breadth of coverage. By this is meant the inclusion of data from as many different sources as possible, both internally and externally.
  • Depth of coverage in addition to breadth of coverage. This will enable context to be clear and detailed questions to be answered. It will require a maximum quantity of data about each entity from multiple sources.
  • The ability to find information that cannot be easily found elsewhere such as all Oxford researchers working in a particular topic or collaborating with others in a specific geographical location
  • Easy to use and flexible search option on the Blue Pages
  • The ability to discover research connections between people and research interests as well as gaps or islands of subjects (groups who are not related to anyone)
  • The ability to explore information across time, as in changes in roles, research interests of people, and in their publications
  • Being able to download relevant information in formats that can be easily manipulated by users.
From the University point of view:
  • To provide services complementary to those provided by other systems (avoid duplication)
  • The ease with which data harvesting can be repeated and supported in future will be critical in the long term to delivering a lower-overhead and sustainable/affordable service.
We are working on implementing and providing the above requisites. However, this could take some time, during which we need to constantly monitor the response from our users to see if we are on track.

During the last few weeks I have been reading about privacy and data aggregation issues. Both topics are extremely relevant to BRII. Regarding privacy, I like this quote by boyd (2010):

"Fundamentally, privacy is about having control over how information flows. It's about being able to understand the social setting in order to behave appropriately. To do so, people must trust their interpretation of the context, including the people in the room and the architecture that defines the setting."

If I translate the above to the context of the registry and the Blue Pages, I would say that in order for researchers and departments to trust our work we need to help them understand what we are doing with their information and in which contexts we are going to disseminate it. Although we are not dealing with personal information, we are using information about researchers' work (which sometimes they want to be private), information which can affect (hopefully positively) their reputations and future work.

Regarding data aggregation, the general concerns I gather from the literature can be summarized as: data aggregation can threaten privacy, can lead to security problems (e.g., identity theft), can mislead people (aggregated data is not always comprehensible), can violate contextual integrity (changing data’s original meaning) and is not always used for the same purposes as originally intended.

With the above in mind, and putting it into BRII’s context (data restricted to one institution and limited to data about research), this reading has left me thinking about more requisites we may need to consider in our work. The diagram below explains what I am talking about.

It is about building trust among our users. It complements the first list, which is more focused on technical developments and data access and coverage. This second set of points, I think, would help our contributors as well as users of data. It will let them know in which ways we are going to use their information and in which ways we are allowing other parties to use that data. We need to reassure our contributors that their data will be secured and used lawfully, by constraining uses to research purposes and keeping data’s contextual integrity. We have already been taking some of these points into account but we may need to stress them and publicise them more.

Relevant literature:
Ethics of data mining and aggregation
Data aggregation: Actually a threat?
van Wel, L. and Royakkers, L. (2004), “Ethical issues in web data mining,” Ethics and Information Technology, 6, pp. 129–140.
Nissenbaum, H. (1997), “Toward an approach to privacy in public: the challenges of information technology,” Ethics and Behavior, 7(3), pp. 207–219.
Nissenbaum, H. (1998), “Protecting Privacy in an Information Age: The Problem of Privacy in Public,” Law and Philosophy, 17, pp. 559-596.

Thursday, 1 April 2010

End of Project

The BRII project has officially concluded. However, development and user engagement work are still ongoing. We will continue to add data to the registry and go around the University selling our outputs and asking people for more feedback. We will keep you updated on our progress here and on our website.

If you want to know more about our outputs and outcomes, have a look at the "BRII Papers, Reports and Presentations" section in the right-hand side column of this blog.

If you would like more information email me at: or ring me at +44 (0)1865 280028 or Sally Rumsey (BRII Manager) at e: tel: +44 (0) 1865 283860.

Friday, 19 March 2010

Blue Pages video clip

As the project will end soon (31st of March) I wanted to show you what we have done so far with the Blue Pages. We will not stop working though. We will continue working on harvesting more data and user engagement. Anusha and Mat have been working tirelessly on the Blue Pages and although there is still a lot of work to be done on it I think it is at a stage where I can show you a short demo. This is a work in progress video and we are using a few sets of harvested data. The video below lasts 3:14 minutes and has no audio.

Open wmv version on a separate window.

Monday, 8 March 2010

BRII Presentation

On Friday 5th of March I attended a book launch at the University of Hull - Business School. The book is titled "Supporting research students" by Dr Barbara Allan.

"The importance of supporting the needs of research students has recently risen higher up the academic agenda around the world. Numbers of postgraduate students have expanded, and the traditional PhD has now been joined by a new range of doctoral qualifications including professional doctorates. These developments have led to a more diverse student body which now includes senior professional practitioners. This shift has been accompanied by a recognition that universities must encourage library and information staff to make their critical contribution to students' research skill. This timely book offers guidance to enable them to support the specialist needs of these students."

The event was launched by Biddy Fisher, President of the Chartered Institute of Library and Information Professionals and Emeritus Professor Patsy Cullen, Board Member of the Museum, Libraries and Archives Council.

During the event Dr Allan presented her current project called the Graduate Virtual Research Environment, an online environment inviting PhD students and research staff in the Business School to share and learn from each other's experiences. Then Dr Chris Thomson ran a demo of the environment. He writes a blog here.

After their presentation I gave a short talk about BRII. I talked about the entity registry, trying not to use too technical language ;) and about how Research Activity Data (RAD) can be used in the Blue Pages, to design websites, in ORA, etc. I emphasised how RAD can complement research students' learning by providing overviews of what experts are doing in their fields of research. Here you have my presentation.

Tuesday, 2 March 2010

Project Evaluation

We have just finished our Project Evaluation report. The project's summative evaluation took place on the 19th of February and was directed by Neil Beagrie from Charles Beagrie Ltd. As part of this evaluation Neil ran an online survey with a sample of our interviewees and testers. The results were very positive. We chose people who knew about BRII and who had at least seen a demo or tested the Blue Pages. The reason for this was that we needed respondents who understood the basic concepts of BRII’s entity registry and RAD (Research Activity Data) and were able to assess the potential benefits that sharing and re-using RAD could have for the University.

For the purpose of the survey we asked respondents to have a look at the working version of the Blue Pages. As this tool is still under development we have limited access only from Oxford networks. However I also compiled a series of screenshots for respondents who were not on campus. You can see the status of the Blue Pages six weeks ago.

and these are the results of the survey (click on + to zoom in):

Monday, 22 February 2010

Uncovering User Perceptions of Research Activity Data

I recently wrote a paper about the user testing of the Oxford Blue Pages. It has been published in Ariadne in their January 2010 issue. Here you have a link to the article in Ariadne. The title is Uncovering User Perceptions of Research Activity Data. I chose that title to emphasise the aspect of the user testing which we think helped us the most to engage with and understand our users. Besides improving software usability, the user tests provided us with data which confirmed and complemented the findings of the Stakeholder Analysis. In this article I discuss the concepts of Usability, Perceived Ease of Use and Perceived Usefulness and provide some examples from feedback.

I have also deposited a copy of the article in the Oxford University Research Archive ORA here
You can also find the ORA record (with the full text) for the BRII Stakeholder Analysis here

I have added a section titled "BRII Papers, Reports and Presentations" in the right column of this blog (scroll down a bit). It contains links to relevant outcomes produced by BRII.

Friday, 12 February 2010

The BRII Project Use Cases Report is ready

The BRII Project Use Cases report is ready and available in PDF format from here. The Use Cases report contains four short, business-style use cases illustrating the uses and benefits of the BRII project outputs.

Any comments about the report are welcome. (Click on comments below or email me.)

Wednesday, 27 January 2010

BRII's Products - JISC Trade Fair

These are posters I designed for the Institutional Innovation Projects Trade Fair. The posters explain our three products: a vocab site, an approach to user testing and the Blue Pages.

We have many more products of course but these are the ones I thought could be more attractive to other projects. In addition to the above, we have produced an object registry containing aggregated research activity data which is actually the core of the project, we are designing processes of data harvesting, we have written a stakeholder analysis which studies the requirements and uses for research activity data at the University of Oxford, we organised a JISC assembly on stakeholder buy-in, etc.

Wednesday, 20 January 2010

Institutional Innovation Projects Trade Fair

I have been working on the products I will offer in the Projects Trade Fair, part of the JISC Institutional Innovation Exchange event. I have designed a couple of posters to explain my three products. These will be:
  • The Vocabulary site storing the new vocabularies created by BRII providing formal descriptions of concepts, terms and relationships within the "research" knowledge domain (funder information, people profiles, research data).
  • Experiences and lessons learned from the User Testing of the Blue Pages. I have related some of these experiences in previous posts. I have also written a paper which will hopefully be published at the end of this month in Ariadne. In this paper I describe our approach to user testing, the reasons for choosing that approach and the lessons we have learned from it.
  • The Oxford Blue Pages... or at least the concept of the Blue Pages, displaying aggregated Research Activity Data as research objects and the connections between them.
The JISC have asked us to make our project's products directly related to our final report and outputs.

I will also have to buy products from other projects attending the event. I asked our JISC contact if we had to buy products which could be used only in our current JISC projects. He said there was no problem with making purchases for related activities. So I can buy products which can be useful in other activities in our office or perhaps other departments in the University.

Tuesday, 12 January 2010

What is going on in BRII

Since September we have been carrying out user tests of the Oxford Blue Pages. These tests have been very useful to improve the Blue Pages' usability and to get the perspectives of potential users. These perspectives were not limited to the Blue Pages but extended to the BRII registry. After seeing the Blue Pages, testers asked about the sources of data and the processes of updating these in the registry. Some testers raised issues about accuracy and validity of data; others asked about data coverage (within and outside Oxford), the processes to update or synchronize these data with changes in the original sources, and scope. Regarding this last one, scope, some academics suggested harvesting their personal websites and blogs (i.e., to not only harvest data from official departmental sources). They said personal sites "reflect more their career and research experience as they are not limited to their activities in Oxford but relate previous experiences." Other testers suggested ways of organising and presenting information, for example how to present a list of publications. Some people were interested in the abstracts while others just wanted basic information such as the title, year and author.

All this feedback and these questions are food for thought for us. User tests are an ongoing activity. Although we are not doing as many as we did last year, we are still running them whenever we get an opportunity.

In addition to the user tests, we have also run some dissemination activities, such as meetings and demos with interested parties. We recently had a meeting with some people in the Social Sciences division. They are working on a divisional system which will cover areas such as teaching, HR and research within their division. Most of our conversation focused on the nature of the data harvested. We discussed whether the aggregation of data in the registry and the Blue Pages could change their original nature and intentions. We also talked about how the Blue Pages will make these data, which are already publicly available, more visible. The Blue Pages will enhance all the positive aspects of research within the University. However, it will also make some data problems more obvious. Social Sciences participants realised that some inconsistencies, for example wrong names or affiliations, could be made more obvious in the Blue Pages. Correcting these inconsistencies is of course not a task for BRII but for the people responsible for the original sources. However, BRII can potentially help them by letting them know where these problems are, e.g. Blue Pages users can report errors in data.

We have also been working on a new Graduate Opportunities website for the Medical Sciences division. This is a second planned output of BRII. Anne Bowtell is in charge of this. She is using the BRII API and data from the registry to populate this website. These are the data Anusha is harvesting from Medical Sciences sources which of course will also be available from the Blue Pages.

On a different matter, in a couple of weeks I will attend the next JISC Innovation Exchange event. This will take place in Birmingham. One of the main events will be a kind of trade fair where we will have to sell to and buy products from other JISC Institutional Innovation projects. This sounds exciting! Now I need to think about how to sell our BRII products, and I need to get some updates on the status of other projects so I know in advance what will be on offer.