Monday 9 August 2010

Uncovering hidden connections in Research Activity Data

I recently submitted an article to Sconul Focus which, I hope, will be published in a few months' time. The topic of the article was data harvesting and aggregation: a simple and (hopefully) easy-to-read explanation of a rather complex topic. I used our entity registry as an example and described how it harvests data from University and external sources, how it converts the sources into RDF format so the data can be aggregated, and finally how it uncovers hidden connections. This last step, uncovering hidden connections, struck me as a very interesting and value-adding process, so I will blog about it here.

When we collect data from different sources, there is a chance that some of these data are interconnected. Different sources may contain replicated data, e.g. a researcher's profile can be found on both his college's and his department's websites. Different sources may also contain data that complement each other, for example a researcher's profile on his college website and a list of his projects and publications on his departmental website. Originally, these webpages do not link to each other, so unless we know about the other source we will not get a complete profile for that researcher. Add more sources, external ones too (e.g. funders' websites containing information about grants), and we end up with information about researchers that is spread all over the place but disconnected.

Uncovering hidden connections means making the connections between data in different sources more evident, for instance by adding links between the sources so that each one points to the others. But how do we create those links? Here is a non-technical introduction:

For example, if we get Prof Francis Matthew Kellner's profile in source 1 and Prof Francis Matthew Kellner's research interests and publications in source 2, we can establish with some level of certainty that these two sources refer to the same person. Therefore we can connect the data in these two sources and build a more complete profile containing Prof Kellner's profile, research interests and publications.

There are, however, other cases which are not so straightforward, where names are similar but we cannot be sure they belong to the same person. For example, we might get Prof Francis Kellner’s biography in source 1, Prof F. M. Kellner listed as Principal Investigator on a project in source 2 and Francis M. Kellner as an author of publications in source 3. How do we know if these data belong to the same person?

For cases like these we have developed a ‘same-as’ process.

‘Same-as’ has a set of rules which use information such as people’s first name and surname, and researchers’ affiliation and email. Depending on the availability of information and whether the sets of data match, ‘same-as’ determines whether two or more records belong to the same person. If the records belong to the same person, ‘same-as’ merges the records. If the information available is not enough to do the matching, or if the data do not match, ‘same-as’ keeps the records separate.

The following is the logic used. This is a technical-ish explanation written by Anusha.

Search for people with the same last name who are part of the same group of sources, e.g. sources belonging to the social sciences (we group sources together based on the likelihood of their information overlapping).
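To make this search step a little more concrete, here is a minimal Python sketch of how records might be grouped before any comparison happens. It is not the registry's actual code: the source names, the group names, the SOURCE_GROUPS mapping and the record fields are all assumptions made up for the illustration.

```python
from collections import defaultdict

# Hypothetical mapping from a source to its source group (illustrative names only).
SOURCE_GROUPS = {
    "college_site": "social_sciences",
    "department_site": "social_sciences",
    "funder_site": "medical_sciences",
}


def candidate_blocks(records):
    """Group records by (source group, normalised last name).

    Only records that land in the same block are ever compared with each other.
    """
    blocks = defaultdict(list)
    for rec in records:
        group = SOURCE_GROUPS.get(rec["source"])
        if group is None:
            continue  # source not assigned to any group: never compared
        key = (group, rec["last_name"].strip().lower())
        blocks[key].append(rec)
    # A block with a single record cannot contain a match, so drop it.
    return {key: recs for key, recs in blocks.items() if len(recs) > 1}
```

Grouping first keeps the number of comparisons small: two people called Kellner are only compared when their sources sit in the same group.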

Case 1
Match people only if each person has at least the fields below and they all match.
  • first name (not just initials)
  • last name
  • affiliation
  • source(s)
If the source is a trusted source:
  • staff_id
Note: a subset of the first name will also be matched, for example John P M, John P and John.
Create a person superset and add all of their info.
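A rough Python sketch of the Case 1 check could look like the following. It is an illustration under stated assumptions rather than the real implementation: the record fields, the TRUSTED_SOURCES set and the function names are hypothetical, "subset of the first name" is read as "one name is a leading part of the other", and the records are assumed to come from the same group of sources already, so the source field only has to be present.

```python
# Hypothetical set of trusted sources (illustrative names only).
TRUSTED_SOURCES = {"college_site", "department_site"}


def name_tokens(first_name):
    """Split 'John P. M.' into ['john', 'p', 'm']."""
    return first_name.replace(".", " ").lower().split()


def first_names_compatible(a, b):
    """True if one first name is a leading subset of the other.

    Both names must start with a full first name, not just an initial.
    """
    ta, tb = name_tokens(a), name_tokens(b)
    if not ta or not tb:
        return False
    if len(ta[0]) < 2 or len(tb[0]) < 2:
        return False  # initials only: not enough for Case 1
    shorter, longer = sorted((ta, tb), key=len)
    return longer[:len(shorter)] == shorter


def case1_match(a, b):
    """Apply the Case 1 rule to two records from the same source group."""
    required = ("first_name", "last_name", "affiliation", "source")
    if any(not a.get(f) or not b.get(f) for f in required):
        return False
    if a["last_name"].lower() != b["last_name"].lower():
        return False
    if a["affiliation"].lower() != b["affiliation"].lower():
        return False
    if not first_names_compatible(a["first_name"], b["first_name"]):
        return False
    # Assumption: staff_id is compared only when both sources are trusted.
    if a["source"] in TRUSTED_SOURCES and b["source"] in TRUSTED_SOURCES:
        return bool(a.get("staff_id")) and a.get("staff_id") == b.get("staff_id")
    return True
```

With this reading, "John P M" and "John" are treated as compatible first names, while "John P" and "John Q" are not.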

Case 2
For people with information in at least the following fields, with all of them matching:
  • initials
  • last name
  • affiliation
  • source
Create a new person superset and add them to that. Treat them as a separate person and do not add them to the person above.
If the person matches the information above, suggest a connection.

Case 3
For people with information in at least the following fields, with all of them matching:
  • initials
  • last name
  • source
Create a new person superset and add them to that. Treat them as a separate person and do not add them to the person above.
If the person matches the information above, suggest a connection.
If the source is a trusted source (data harvested from within Oxford), display them in the browse / search results pages; otherwise do not display them.
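The three cases differ mainly in how much information a record carries. The Python sketch below shows one way to decide which case a single record qualifies for, plus the display rule for Case 3. It is only an illustration: the field names, the TRUSTED_SOURCES set and the function names are hypothetical, and it assumes that a record from a trusted source needs a staff_id to reach Case 1.

```python
# Hypothetical set of trusted sources, i.e. data harvested from within Oxford.
TRUSTED_SOURCES = {"college_site", "department_site"}


def _filled(rec, field):
    """True if the record has a non-empty value for the field."""
    return bool((rec.get(field) or "").strip())


def strongest_case(rec):
    """Return 1, 2 or 3 for the strictest case the record qualifies for, else None."""
    if not (_filled(rec, "last_name") and _filled(rec, "source")):
        return None
    first = (rec.get("first_name") or "").replace(".", " ").split()
    has_initials = bool(first)
    has_full_first_name = has_initials and len(first[0]) > 1
    has_affiliation = _filled(rec, "affiliation")
    # Assumption: trusted sources must also supply a staff_id for Case 1.
    staff_id_ok = rec["source"] not in TRUSTED_SOURCES or _filled(rec, "staff_id")
    if has_full_first_name and has_affiliation and staff_id_ok:
        return 1
    if has_initials and has_affiliation:
        return 2
    if has_initials:
        return 3
    return None


def case3_visible(rec):
    """Case 3 people are shown in browse / search only for trusted sources."""
    return rec["source"] in TRUSTED_SOURCES
```

A record for "F. M. Kellner" with an affiliation would fall under Case 2 here, and the same record without an affiliation under Case 3.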


The ‘same-as’ process focuses on ‘people’ entities. Projects, publications, funders and academic units usually have fixed (or standard) names which are used consistently across sources. Names of people, however, are frequently written in different ways depending on the context.

So now, as you can imagine, every time we add more data to the registry we pass these data through the same-as process to see if there are any hidden connections to the data we already have. Or, to put it a different way, every time we add a new source we add not only its data but also the connections we find with same-as, which were probably unknown, or at least not evident, in the original sources.

If you want to know more you can wait for the Sconul Focus paper or, if you are impatient, e-mail us.
