When we collect data from different sources, there is a chance that some of these data are interconnected. Different sources may contain replicated data, e.g. a researcher's profile can be found on both his college's and his department's websites. Different sources may also contain data that complement each other, for example a researcher's profile on his college website and a list of his projects and publications on his departmental website. Originally, these webpages do not link to each other, so unless we know about the other source we will not get a complete profile for that researcher. Add more sources, external ones too (e.g., funders' websites containing information about grants), and we end up with information about researchers which is spread all over the place but disconnected.
Uncovering hidden connections means making the connections between data in different sources more evident, for instance by adding links between the sources so that each one points to the others. But how do we create those links? Here is a non-technical introduction:
For example, if we get Prof Francis Matthew Kellner's profile in source 1 and Prof Francis Matthew Kellner's research interests and publications in source 2, we can establish with some level of certainty that these two sources refer to the same person. Therefore we can connect the data in these two sources and build a more complete profile with Prof Kellner's profile, research interests and publications.
There are, however, other cases which are not so straightforward, where names are similar but we cannot be sure they belong to the same person. For example, we might get Prof Francis Kellner’s biography in source 1, Prof F. M. Kellner listed as Principal Investigator on a project in source 2 and Francis M. Kellner as author of publications in source 3. How do we know if these data belong to the same person?
For cases like these we have developed a ‘same-as’ process.
‘Same-as’ has a set of rules which use information such as people’s first name and surname, researchers’ affiliation and email. Depending on the availability of information and whether the sets of data match, ‘same-as’ determines if two or more records belong to the same person or not. If the records belong to the same person, ‘same-as’ merges the records. If the information available is not enough to do the matching, or if the data do not match, ‘same-as’ will keep the records separate.
The following is the logic used. This is a technical-ish explanation written by Anusha.
Search for people with the same last name who appear in the same group of sources, e.g. sources belonging to the social sciences. (We group sources together based on how likely their information is to overlap.)
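To make this first step a bit more concrete, here is a minimal sketch in Python of how candidates could be grouped before any matching happens. It is not the actual registry code: the `SOURCE_GROUPS` mapping, the source names and the record fields `source` and `last_name` are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical mapping from a source to the group of sources it belongs to.
SOURCE_GROUPS = {
    "sociology_dept": "social_sciences",
    "politics_dept": "social_sciences",
    "physics_dept": "physical_sciences",
}


def candidate_blocks(records):
    """Group records by (source group, lower-cased last name)."""
    blocks = defaultdict(list)
    for record in records:
        group = SOURCE_GROUPS.get(record["source"])
        if group is None:
            continue  # source has not been assigned to a group
        key = (group, record["last_name"].strip().lower())
        blocks[key].append(record)
    # Only blocks holding two or more records can produce a match.
    return {key: recs for key, recs in blocks.items() if len(recs) > 1}
```

Only records that land in the same block would then be compared against the cases that follow, which keeps the number of comparisons small.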
Case 1
Match people only if each person has at least the fields below and they all match.
- first name (not just initials)
- last name
- affiliation
- source(s)
- staff_id
Create a person superset and add all of their info
Case 2
For people with information in at least the following fields, with all of them matching:
- initials
- last name
- affiliation
- source
If the person matches the information above, suggest a connection.
Case 3
For people with information in at least the following fields, with all of them matching:
- initials
- last name
- source
If the person matches the information above, suggest a connection.
If the source is a trusted source (data harvested from within Oxford), display them in the browse / search results pages; otherwise do not display them.
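The three cases can be sketched roughly as follows. This is only an illustration under several assumptions, not the code we run: the field names, the normalisation (lower-casing and ignoring dots), the `TRUSTED_SOURCES` set and the function names are all hypothetical. The sketch also keeps the trusted-source check as a separate helper rather than deciding which cases it applies to.

```python
# Fields that must be present in both records and match, per case.
CASE_FIELDS = {
    1: ("first_name", "last_name", "affiliation", "source", "staff_id"),
    2: ("initials", "last_name", "affiliation", "source"),
    3: ("initials", "last_name", "source"),
}

# Hypothetical set of trusted sources (data harvested from within Oxford).
TRUSTED_SOURCES = {"social_sciences"}


def _norm(value):
    """Lower-case, drop dots and collapse whitespace; empty if missing."""
    if value is None:
        return ""
    return " ".join(str(value).lower().replace(".", " ").split())


def _all_match(a, b, fields):
    """True only if every field is non-empty in both records and equal."""
    for field in fields:
        va, vb = _norm(a.get(field)), _norm(b.get(field))
        if not va or not vb or va != vb:
            return False
    return True


def _has_full_first_name(record):
    """Case 1 needs a real first name, not just initials."""
    return any(len(token) > 1 for token in _norm(record.get("first_name")).split())


def classify(a, b):
    """Return (action, case) for a pair of person records."""
    if (_has_full_first_name(a) and _has_full_first_name(b)
            and _all_match(a, b, CASE_FIELDS[1])):
        return ("merge", 1)      # build the person superset
    if _all_match(a, b, CASE_FIELDS[2]):
        return ("suggest", 2)    # suggest a connection
    if _all_match(a, b, CASE_FIELDS[3]):
        return ("suggest", 3)    # suggest a connection
    return ("separate", None)    # keep the records apart


def merge_records(a, b):
    """Person superset: every distinct value seen for each field."""
    merged = {}
    for field in sorted(set(a) | set(b)):
        values = {v for v in (a.get(field), b.get(field)) if v is not None}
        merged[field] = sorted(values, key=str)
    return merged


def display_suggestion(source):
    """Show a suggested connection only if its source is trusted."""
    return source in TRUSTED_SOURCES
```

Taking the Kellner example from earlier: a record with first name "Francis", initials "F M", an affiliation and a staff id would merge with another record carrying the same details, but would only get a suggested connection to a record that just says "F. M. Kellner" with the same affiliation and source.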
The ‘same-as’ process focuses on ‘people’ entities. Projects, publications, funders and academic units usually have fixed (or standard) names which are used consistently across sources. Names of people, however, are frequently written in different ways depending on the context.
So now, as you can imagine, every time we add more data to the registry, we pass these data through the same-as process to see if there are any hidden connections to the data we already have. Or, to put it a different way, every time we add a new source we add not only its data but also the connections we find with same-as, which were probably unknown or at least not evident in the original sources.
If you want to know more, you can wait for the SCONUL Focus paper or, if you are impatient, e-mail us.