Monday 9 August 2010

Uncovering hidden connections in Research Activity Data

I recently submitted an article to SCONUL Focus which, I hope, will be published in a few months' time. The topic of the article was data harvesting and aggregation: a simple and (hopefully) easy-to-read explanation of a rather complex topic. I used our entity registry as an example and described how it harvests data from University and external sources, how it converts sources into RDF format so data can be aggregated, and finally how it uncovers hidden connections. I think this last step, uncovering hidden connections, is a very interesting and value-adding process, so I will blog about it here.

When we collect data from different sources, there is a chance that some of these data are interconnected. Different sources may contain replicated data, e.g. a researcher's profile can be found on both his college's and his department's websites. Different sources may also contain data that complement each other, for example a researcher's profile on his college website and a list of his projects and publications on his departmental website. Originally, these webpages do not include links to each other, so unless we know about the other source we will not get a complete profile for that researcher. Add more sources, external ones too (e.g. funders' websites containing information about grants), and we get information about researchers which is spread all over the place but disconnected.

Uncovering hidden connections means making connections between data in different sources more evident, as in adding links between sources so that they point to each other. But how do we create those links? Here is a non-technical introduction:

For example, if we get Prof Francis Matthew Kellner's profile in source 1 and Prof Francis Matthew Kellner's research interests and publications in source 2, we can establish with some level of certainty that these two sources refer to the same person. Therefore we can connect the data in these two sources and build a more complete record containing Prof Kellner's profile, research interests and publications.

There are, however, other cases which are not so straightforward, where names are similar but we cannot be sure they belong to the same person. For example, we might get Prof Francis Kellner’s biography in source 1, Prof F. M. Kellner listed as Principal Investigator on a project in source 2 and Francis M. Kellner as author of publications in source 3. How do we know if these data belong to the same person?

For cases like these we have developed a ‘same-as’ process.

‘Same-as’ has a set of rules which use information such as people’s first name and surname, and researchers’ affiliation and email. Depending on the availability of information and whether the sets of data match, ‘same-as’ determines whether two or more records belong to the same person. If the records belong to the same person, ‘same-as’ merges them. If the information available is not enough to do the matching, or if the data do not match, ‘same-as’ keeps the records separate.

The following is the logic used. This is a technical-ish explanation written by Anusha.

Search for people with the same last name who are part of the same group of sources, e.g. sources belonging to the social sciences (we group sources together based on the likelihood of their information overlapping).

Case 1
Match people only if each person has at least the fields below and they all match.
  • first name (not just initials)
  • last name
  • affiliation
  • source(s)
If the source is a trusted source:
  • staff_id
Note: a subset of the first name will be matched. For example, John P M, John P and John all match.
Create a person superset and add all of their info.

Case 2
For people with information in at least the following fields, with all of them matching:
  • initials
  • last name
  • affiliation
  • source
Create a new person superset and add them to that. Treat them as a separate person and do not add them to the person above.
If the person matches the information above, suggest a connection.

Case 3
For people with information in at least the following fields, with all of them matching:
  • initials
  • last name
  • source
Create a new person superset and add them to that. Treat them as a separate person and do not add them to the person above.
If the person matches the information above, suggest a connection.
If the source is a trusted source (data harvested from within Oxford), display them in the browse / search results pages; otherwise do not display them.
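
To make the three cases a bit more concrete, here is a minimal Python sketch of how such rules could be coded. It is only an illustration and not the registry's actual code: the Record fields, the helper names and the labels returned ("merge", "suggest", "suggest-hidden") are all made up for this example. It also makes some assumptions the rules above do not spell out: "source" is read as the group of sources, initials match when one set is a prefix of the other (so F and F. M. are compatible), and the staff_id check only applies when both records come from trusted sources.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    """One harvested person record (field names are illustrative)."""
    first_name: str                    # "Francis Matthew", "Francis M." or "F. M."
    last_name: str
    source_group: str                  # e.g. "social-sciences" (assumed grouping)
    affiliation: Optional[str] = None
    staff_id: Optional[str] = None
    trusted: bool = False              # harvested from a trusted source?


def name_tokens(first_name: str) -> List[str]:
    """Split 'F.M.' or 'Francis M.' into lower-case tokens without dots."""
    parts = first_name.replace(".", ". ").split()
    return [p.strip(".").lower() for p in parts if p.strip(".")]


def is_prefix(a, b) -> bool:
    """True if the shorter, non-empty sequence starts the longer one."""
    shorter, longer = sorted((a, b), key=len)
    return len(shorter) > 0 and longer[:len(shorter)] == shorter


def has_full_first_name(rec: Record) -> bool:
    """A full first name is more than a single initial."""
    toks = name_tokens(rec.first_name)
    return bool(toks) and len(toks[0]) > 1


def initials(rec: Record) -> str:
    return "".join(t[0] for t in name_tokens(rec.first_name))


def classify(a: Record, b: Record) -> Optional[str]:
    """Return 'merge', 'suggest', 'suggest-hidden' or None for two records."""
    # Starting point: same last name within the same group of sources.
    if a.last_name.lower() != b.last_name.lower():
        return None
    if a.source_group != b.source_group:
        return None

    same_affiliation = a.affiliation is not None and a.affiliation == b.affiliation
    same_initials = is_prefix(initials(a), initials(b))

    # Case 1: full first names (subset rule), affiliation, and staff_id
    # when both sources are trusted (assumption). Records are merged.
    if (same_affiliation and has_full_first_name(a) and has_full_first_name(b)
            and is_prefix(name_tokens(a.first_name), name_tokens(b.first_name))):
        both_trusted = a.trusted and b.trusted
        if not both_trusted or (a.staff_id is not None and a.staff_id == b.staff_id):
            return "merge"

    # Case 2: initials and affiliation match. Keep separate, suggest a link.
    if same_affiliation and same_initials:
        return "suggest"

    # Case 3: only initials match. Suggest a link, shown only if trusted.
    if same_initials:
        return "suggest" if (a.trusted and b.trusted) else "suggest-hidden"
    return None
```

With the Kellner example from earlier, and an invented affiliation of "Example College", "Francis Matthew" and "Francis" would be merged under Case 1, "F. M." with the same affiliation would only produce a suggested connection under Case 2, and a "Francis M." with no affiliation would fall into Case 3.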


The ‘same-as’ process focuses on ‘people’ entities. Projects, publications, funders and academic units usually have fixed (or standard) names which are used consistently across sources. However, names of people are frequently written in different ways, depending on the context.

So now, as you can imagine, every time we add more data to the registry, we pass these data through the same-as process to see if there are any hidden connections to the data we already have. Or, to put it a different way, every time we add a new source, we add not only its data but also the connections we find with same-as, which were probably unknown or at least not evident in the original sources.

If you want to know more, you can wait for the SCONUL Focus paper or, if you are impatient, e-mail us.

Tuesday 13 July 2010

Southampton ECS release their data into the public domain

The School of Electronics and Computer Science (ECS) at the University of Southampton releases all public data in open linked data format.

"In what is believed also to be a world-first, ECS has become the UK’s first University department to release all its public data in open linked data format.

The School of Electronics and Computer Science (ECS) at the University of Southampton is at the forefront of the open linked data initiative through the work of its Professors Sir Tim Berners-Lee and Nigel Shadbolt."

You can read the complete article here http://www.ecs.soton.ac.uk/about/news/3313 or here http://www.alphagalileo.org/ViewItem.aspx?ItemId=81065&CultureCode=en

We think this is good news and a good example to follow.

Wednesday 30 June 2010

BRII Summer Project

This is another update to explain our summer 2010 activities. At BRII we are working on a reporting system where users can notify us and the official sources of data about errors they find in Research Activity Data. This system will help us and our sources to improve the quality of data. As we are harvesting data from other sources, we are designing the system so that users can flag errors and send notifications to the appropriate people (sources and BRII). These notifications will contain enough information to decide on a suitable action to take.

Errors could originate from the content of the data themselves or from the process of aggregation we perform at BRII. Examples include misspellings and wrong information in source data, information which has been aggregated but which belongs to different people with the same name, and information belonging to the same person which appears as belonging to two or more people with the same name.

In relation to aggregation errors, Anusha has been working hard to design a system to accurately identify sets of data which belong to the same person. For example, Prof John Smith in source 1 and J. P. Smith in source 2 may or may not be the same person. To decide, she is using extra information that comes with the data, such as affiliation. When her algorithm is finished we will be able to merge two or more "people" into one, or divide one "person" into two or more "people", as requested by administrators or users who identify inaccuracies.
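
As a rough illustration of what merging and dividing "people" could look like, here is a small Python sketch. It is not Anusha's algorithm, only a hypothetical bookkeeping structure: the PersonRegistry class, its method names and the record identifiers are invented for this example, and it assumes each "person" is simply a set of harvested record ids.

```python
class PersonRegistry:
    """Hypothetical store mapping each person id to a set of record ids."""

    def __init__(self):
        self._people = {}      # person id -> set of record ids
        self._next_id = 1

    def add_person(self, record_ids):
        """Create a new person holding the given records; return its id."""
        pid = self._next_id
        self._next_id += 1
        self._people[pid] = set(record_ids)
        return pid

    def records(self, person_id):
        """Read-only view of the records attached to a person."""
        return frozenset(self._people[person_id])

    def merge(self, *person_ids):
        """Combine two or more people into one new person."""
        unique = list(dict.fromkeys(person_ids))   # drop repeated ids
        missing = [p for p in unique if p not in self._people]
        if missing:
            raise KeyError(f"unknown person ids: {missing}")
        merged = set()
        for pid in unique:
            merged |= self._people.pop(pid)
        return self.add_person(merged)

    def split(self, person_id, record_ids):
        """Move some records out of a person into a new, separate person."""
        current = self._people[person_id]
        moved = set(record_ids)
        remaining = current - moved
        # Both halves must be non-empty and the moved records must exist.
        if not moved or not remaining or not moved <= current:
            raise ValueError("split must move some, but not all, existing records")
        self._people[person_id] = remaining
        return self.add_person(moved)
```

For instance, an administrator could merge the "people" built from two sources, and later split one record back out if a user reports that it belongs to somebody else.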

For the summer task we are collaborating with the Computing Laboratory (Comlab). Anusha is currently harvesting their data and Monica is working on the reporting/notification forms within the Blue Pages. We will soon contact Comlab again to check their harvested data and participate in tests. We would like to thank Thorsten Hauler, research facilitator, and Edward Crichton, web manager, from Comlab, who have kindly given us their time.

This summer project is part of the data quality control that we are trying to establish within the registry. I have talked about this in a previous post.

Tuesday 29 June 2010

Advisory Group Meeting

On Friday 2nd of July we will have our first advisory group meeting since the BRII Project ended. The advisory group comprises 10 stakeholders from across the University. The purpose of this meeting is to discuss avenues for development of the Entity Registry and related services (i.e. Blue Pages) and strategies for embedding these services within the University. This of course will be a long process and we are just starting. We have good foundations, though, as the BRII project successfully produced a proof of concept system which is capable of being expanded with services for and information about all University divisions. If you want to have an idea of what we have produced, watch our screencast of the Blue Pages here (wmv file).

Thursday 17 June 2010

BRII at the SCONUL Conference

Yesterday I attended the "Organisational Leadership in a Time of Change" SCONUL Conference at Leeds Queens Hotel. Although I could not stay for the whole event (I stayed only 1 of 3 days) I enjoyed it. I attended a few presentations and presented our BRII poster.

I would like to thank Prof John Lancaster who welcomed me and who kindly helped me to move the BRII poster to a more visible place :)

Sorry for the poor quality pictures, dark place and shaky hands.

Thursday 13 May 2010

BRII Update

This is a short update on our activities and some ideas I have come up with.

Since the end of the BRII project we have been working on the registry and user engagement. We are adding more data continuously. We are also outlining data quality control processes and planning some activities with users to evaluate our work.

Outcomes of our work in BRII gave us some insights into what the requirements for a successful service of the registry would be. For example:

From the departments' and individual users' point of view:
  • Breadth of coverage. By this is meant the inclusion of data from as many different sources as possible, both internally and externally.
  • Depth of coverage in addition to breadth of coverage. This will enable context to be clear and detailed questions to be answered. It will require a maximum quantity of data about each entity from multiple sources.
  • The ability to find information that cannot be easily found elsewhere, such as all Oxford researchers working on a particular topic or collaborating with others in a specific geographical location.
  • Easy to use and flexible search option on the Blue Pages
  • The ability to discover research connections between people and research interests as well as gaps or islands of subjects (groups who are not related to anyone)
  • The ability to explore information across time, as in changes in people's roles, research interests and publications.
  • Being able to download relevant information in formats that can be easily manipulated by users.
From the University point of view:
  • To provide services complementary to those provided by other systems (avoid duplication)
  • The ease with which data harvesting can be repeated and supported in future will be critical in the long term to delivering a lower overhead and a sustainable/affordable service.
We are working on implementing and providing the above requisites. However, this could take some time, during which we need to constantly monitor the response from our users to see if we are on track.

During the last few weeks I have been reading about privacy and data aggregation issues. Both topics are extremely relevant to BRII. Regarding privacy, I like this quote by boyd (2010):

"Fundamentally, privacy is about having control over how information flows. It's about being able to understand the social setting in order to behave appropriately. To do so, people must trust their interpretation of the context, including the people in the room and the architecture that defines the setting."

If I translate the above to the context of the registry and the Blue Pages, I would say that in order for researchers and departments to trust our work we need to help them understand what we are doing with their information and in which contexts we are going to disseminate it. Although we are not dealing with personal information, we are using information about researchers' work (which sometimes they want to be private), information which can affect (hopefully positively) their reputations and future work.

Regarding data aggregation, the general concerns I gather from the literature can be summarized as follows: data aggregation can threaten privacy, can lead to security problems (e.g., identity theft), can mislead people (aggregated data is not always comprehensible), can violate contextual integrity (changing data’s original meaning) and is not always used for the same purposes as originally intended.

Having the above in mind and putting it into BRII’s context (data restricted to one institution and limited to data about research), this reading has left me thinking about further requisites we may need to consider in our work. The diagram below explains what I am talking about.

It is about building trust among our users. It complements the first list, which is more focused on technical developments and data access and coverage. This second set of points, I think, would help our contributors as well as users of data. It will let them know in which ways we are going to use their information and in which ways we are allowing other parties to use that data. We need to reassure our contributors that their data will be secured and used lawfully, by constraining uses to research purposes and keeping the data’s contextual integrity. We have already been taking into account some of these points but we may need to stress them and publicise them more.

Relevant literature:
Ethics of data mining and aggregation
Data aggregation: Actually a threat?
van Wel, L. and Royakkers, L. (2004), “Ethical issues in web data mining,” Ethics and Information Technology, 6, pp. 129–140.
Nissenbaum, H. (1997), “Toward an approach to privacy in public: the challenges of information technology,” Ethics and Behavior, 7(3), pp. 207–219.
Nissenbaum, H. (1998), “Protecting Privacy in an Information Age: The Problem of Privacy in Public,” Law and Philosophy, 17, pp. 559-596.

Thursday 1 April 2010

End of Project

The BRII project has officially concluded. However, development and user engagement work is still ongoing. We will continue to add data to the registry and go around the University selling our outputs and asking people for more feedback. We will keep you updated on our progress here and on our website http://brii.bodleian.ox.ac.uk

If you want to know more about our outputs and outcomes, have a look at the "BRII Papers, Reports and Presentations" section in the right-hand side column of this blog.

If you would like more information, email me at cecilia.loureiro-koechlin@bodleian.ox.ac.uk or ring me on +44 (0)1865 280028, or contact Sally Rumsey (BRII Manager) at sally.rumsey@bodleian.ox.ac.uk, tel: +44 (0)1865 283860.