Building the Research Information Infrastructure (BRII): Uncovering hidden connections in Research Activity Data

Monday, 9 August 2010

Uncovering hidden connections in Research Activity Data

I recently submitted an article to Sconul Focus which, I hope, will be published in a few months' time. The topic of the article was data harvesting and aggregation: a simple and (hopefully) easy-to-read explanation of a rather complex topic. I used our entity registry as an example and described how it harvests data from University and external sources, how it converts those sources into RDF format so the data can be aggregated, and finally how it uncovers hidden connections. This last step, uncovering hidden connections, struck me as a very interesting and value-adding process, so I will blog about it here.

When we collect data from different sources, there is a chance that some of these data are interconnected. Different sources may contain replicated data, e.g. a researcher's profile can be found on both his college's and his department's websites. Different sources may also contain data that complement each other, for example a researcher's profile on his college website and a list of his projects and publications on his departmental website. Originally, these webpages do not link to each other, so unless we know about the other source we will not get a complete profile for that researcher. Add more sources, external ones too (e.g., funders' websites containing information about grants), and we get information about researchers which is spread all over the place but disconnected.

Uncovering hidden connections means making the connections between data in different sources more evident, for instance by adding links between the sources so that each one points to the others. But how do we create those links? Here is a non-technical introduction:

For example, if we get Prof Francis Matthew Kellner's profile in source 1 and Prof Francis Matthew Kellner's research interests and publications in source 2, we can establish with some level of certainty that these two sources refer to the same person. Therefore we can connect the data in these two sources and build a more complete record containing Prof Kellner's profile, research interests and publications.

There are, however, other cases which are not so straightforward, where names are similar but we cannot be sure they belong to the same person. For example, we might get Prof Francis Kellner’s biography in source 1, Prof F. M. Kellner listed as Principal Investigator on a project in source 2, and Francis M. Kellner as author of publications in source 3. How do we know if these data belong to the same person?

For cases like these we have developed a ‘same-as’ process.

‘Same-as’ has a set of rules which use information such as people’s first name and surname, and researchers’ affiliation and email. Depending on the availability of information and whether the sets of data match, ‘same-as’ determines whether or not two or more records belong to the same person. If the records belong to the same person, ‘same-as’ merges them. If the information available is not enough to do the matching, or if the data do not match, ‘same-as’ keeps the records separate.

The following is the logic used. This is a technical-ish explanation written by Anusha.

Search for people with the same last name who are part of the same group of sources, e.g. sources belonging to the social sciences (we group sources together based on the likelihood of their information overlapping).
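This first search step can be sketched as a simple grouping pass. The snippet below is only an illustration, not the registry's actual code: the `SOURCE_GROUPS` mapping, the source names and the record fields are all hypothetical, and it assumes each record is a plain dictionary.

```python
from collections import defaultdict

# Hypothetical mapping from a source to its group; the real grouping
# used by the registry is not described in this post.
SOURCE_GROUPS = {
    "politics_dept": "social_sciences",
    "sociology_dept": "social_sciences",
    "physics_dept": "physical_sciences",
}


def candidate_blocks(records):
    """Group records by (source group, lower-cased last name)."""
    blocks = defaultdict(list)
    for record in records:
        group = SOURCE_GROUPS.get(record["source"])
        if group is None:
            continue  # source not assigned to any group: nothing to compare with
        key = (group, record["last_name"].strip().lower())
        blocks[key].append(record)
    # Only blocks holding two or more records can contain a match.
    return {key: recs for key, recs in blocks.items() if len(recs) > 1}


records = [
    {"first_name": "Francis", "last_name": "Kellner", "source": "politics_dept"},
    {"first_name": "F. M.", "last_name": "kellner", "source": "sociology_dept"},
    {"first_name": "Francis", "last_name": "Kellner", "source": "physics_dept"},
]
blocks = candidate_blocks(records)
```

Here only the two social sciences records end up in the same block, so only they would be passed on to the matching rules. The physics record shares the surname but sits in a different group and is never compared.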

Case 1
Match people only if each person has at least the fields below and they all match.

first name (not just initials)
last name
affiliation
source(s)

If the source is a trusted source:

staff_id

Note: a subset of the first name will be matched, for example John P M, John P, John.
Create a person superset and add all of their information to it.
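One way to read the first-name note is as a prefix comparison on the parts of the name. The sketch below assumes that reading. The function names are hypothetical, and the rule that the leading part must be a full name rather than an initial is taken from the "not just initials" requirement above.

```python
def name_tokens(first_name):
    """Split a first-name field into lower-cased parts, ignoring full stops."""
    return first_name.replace(".", " ").lower().split()


def first_names_match(a, b):
    """True if the parts of one name are a leading subset of the other's.

    The leading part of each name must be a full name, not an initial.
    """
    tokens_a, tokens_b = name_tokens(a), name_tokens(b)
    if not tokens_a or not tokens_b:
        return False
    if len(tokens_a[0]) < 2 or len(tokens_b[0]) < 2:
        return False  # initials only: not eligible for Case 1
    shorter, longer = sorted((tokens_a, tokens_b), key=len)
    return longer[:len(shorter)] == shorter
```

With this reading, "John P M", "John P" and "John" all match one another, while "John P" and "John Q" do not. It is deliberately strict: "John Paul" and "John P" are treated as different, which may or may not be what the real rules do.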

Case 2
For people with information in at least the following fields, with all of them matching:

initials
last name
affiliation
source

Create a new person superset and add them to that. Treat them as a separate person and do not add them to the person above.
If the person matches the information above, suggest a connection.

Case 3
For people with information in at least the following fields, with all of them matching:

initials
last name
source

Create a new person superset and add them to that. Treat them as a separate person and do not add them to the person above.
If the person matches the information above, suggest a connection.
If the source is a trusted source (data harvested from within Oxford), display them in the browse/search results pages; otherwise do not display them.
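The three cases above can be sketched as a small rule table that is tried from strictest to loosest. This is a simplified illustration rather than the registry's implementation. It assumes records are dictionaries, that "source" is compared through a hypothetical `source_group` field, that a `trusted` flag marks data harvested from within Oxford, and that `staff_id` is required only when both records are trusted. First names are compared for plain equality here, without the subset rule from the Case 1 note.

```python
# (case number, fields that must be present in both records and equal)
CASES = [
    (1, ("first_name", "last_name", "affiliation", "source_group")),
    (2, ("initials", "last_name", "affiliation", "source_group")),
    (3, ("initials", "last_name", "source_group")),
]


def normalise(value):
    """Lower-case a value, drop full stops and collapse whitespace."""
    return " ".join(str(value).replace(".", " ").lower().split())


def fields_match(a, b, fields):
    """True if every field is present in both records and the values agree."""
    for field in fields:
        value_a, value_b = a.get(field), b.get(field)
        if not value_a or not value_b:
            return False
        if normalise(value_a) != normalise(value_b):
            return False
    return True


def classify_pair(a, b):
    """Return (case, action) for the strictest case the pair satisfies."""
    for case, fields in CASES:
        required = fields
        if case == 1 and a.get("trusted") and b.get("trusted"):
            required = fields + ("staff_id",)  # assumption: both must be trusted
        if fields_match(a, b, required):
            return (case, "merge" if case == 1 else "suggest")
    return (None, "keep_separate")


def shown_in_results(case, record):
    """Case 3 people are shown only if harvested from a trusted source.

    Showing every Case 1 and Case 2 person is an assumption of this sketch.
    """
    return case != 3 or bool(record.get("trusted"))


full = {"first_name": "Francis", "initials": "F M", "last_name": "Kellner",
        "affiliation": "Dept of Politics", "source_group": "social_sciences"}
same = dict(full, initials="F. M.")
sparse = {"initials": "F. M.", "last_name": "Kellner",
          "source_group": "social_sciences"}

result_full = classify_pair(full, same)
result_sparse = classify_pair(full, sparse)
```

In this example the two complete records fall under Case 1 and would be merged into one superset. The sparse record only satisfies Case 3, so it stays a separate person with a suggested connection, and because it carries no `trusted` flag it would be hidden from the browse/search results.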

The ‘same-as’ process focuses on ‘people’ entities. Projects, publications, funders and academic units usually have fixed (or standard) names which are used consistently across sources. Names of people, however, are frequently written in different ways depending on the context.

So now, as you can imagine, every time we add more data to the registry, we pass these data through the same-as process to see if there are any hidden connections to the data we already have. Or, to put it a different way, every time we add a new source, we add not only its data but also the connections we find with same-as, which were probably unknown, or at least not evident, in the original sources.

If you want to know more, you can wait for the Sconul Focus paper or, if you are impatient, e-mail us.


Our Goal

Building the Research Information Infrastructure (BRII) aims to support the efficient sharing of Research Activity Data (RAD) captured from a wide range of sources. BRII develops an infrastructure that harvests and archives RAD, and Web services which disseminate and reuse this kind of data by using a lightweight solution based on semantic web technologies. Phases of the project include: a stakeholder analysis to collect views from interested parties (e.g., academics and administrators); an iterative development process which uses information collected in the analysis phase; and an embedding and sustainability phase where user acceptance is assessed and strategies to support the expansion of the research information infrastructure are designed. Additional outputs of BRII include: an application programming interface (API) for harvesting and querying data; a collection of ontologies and taxonomies used to organise and classify data; a themed Web site; and the Oxford Blue Pages displaying RAD in creative ways. By facilitating access to RAD, BRII expects to improve the research visibility of the institution and its research impact, as well as boost collaboration.

BRII Papers, Reports and Presentations

Rumsey, S. and Loureiro-Koechlin, C. (Forthcoming 2010) The role of an entity registry in scholarly communication: exploring creative uses of research activity data. New Review of Academic Librarianship.

Loureiro-Koechlin, C. (Forthcoming 2010) "Explaining abstract concepts with concrete examples - entity registry and research activity data." Sconul Focus.

BRII Project Completion Report to the JISC.

BRII Project Final Report to the JISC.

Blue Pages Video Clip A short demo of the Blue Pages (recorded 19th March 2010.)

BRII Stakeholder Grid A list of BRII's stakeholders, interests and challenges.

BRII Summative Evaluation report An independent evaluation led and facilitated by Neil Beagrie of Charles Beagrie Limited. (March 2010.)

Loureiro-Koechlin C. (2009) BRII Presentation at the Supporting research students - a unique book launch at Hull University Business School. (5th March 2010.)

Rumsey, S. (2010) BRII registry & other outputs A description of the pilot Research Activity Data Registry functionality, services and other outputs that will be developed by the project end (March 2010) and suggestions for further work.

Adding a researcher profile. Video clip demonstrating how to search for a researcher profile in the ORA registry and then embed this in a content managed website.

Loureiro-Koechlin, C. (2010) Uncovering user perceptions of research activity data (published in Ariadne, January 2010.)

Loureiro-Koechlin C. (2009) BRII Project - Use Cases report (project milestone, February 2010.)

Oxford Blue Pages Screenshots

Rumsey, S. (2009) A case analysis of registering research activity for institutional benefit (published in the International Journal of Information Management, 2009.)

Loureiro-Koechlin C. (2009) Selling an abstract concept to a practical audience (presented at the Modular e-Administration of Teaching (MEAoT) Assembly, Centre for Applied Research in Educational Technologies (CARET), University of Cambridge, 10 December 2009.)

Loureiro-Koechlin C. (2009) Building the Research Information Infrastructure (BRII) (published in Inside OR, November 2009.)

Loureiro-Koechlin C. (2009) Making sense of research activity data (presented at the OR51 conference, University of Warwick, 8-10 September 2009.)

Loureiro-Koechlin C. (2009) BRII Stakeholder Analysis report (project milestone, July 2009.)

Loureiro-Koechlin C. (2009) Reaching out to a big, complex university (presented at the Stakeholder Buy-In Assembly, SERS, Oxford University Library Services, University of Oxford, 9 June 2009.)

Bowtell, A. and Loureiro-Koechlin C. (2009) BRII Stakeholder Analysis and Sample Applications (poster presented at the Making Connections JISC event, 23-24 April 2009, Manchester.)
