
Errors can originate in the content of the source data themselves or in the aggregation process we perform at BRII. Examples include misspellings and wrong information in the source data, information that has been aggregated under one person but belongs to different people with the same name, and information that belongs to one person but appears as belonging to two or more people with the same name.
In relation to aggregation errors, Anusha has been working hard to design a system that accurately identifies sets of data belonging to the same person. For example, Prof John Smith in source 1 and J. P. Smith in source 2 may or may not be the same person. To decide, she is using extra information that comes with the data, such as affiliation. When her algorithm is finished, we will be able to merge two or more "people" into one, or divide one "person" into two or more "people", as requested by administrators or users who identify inaccuracies.
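To give a flavour of the kind of comparison involved, here is a minimal sketch in Python. It is not Anusha's algorithm, which is still being designed; the function names, the record fields (`name`, `affiliation`), the list of titles and the three-way result are all assumptions made up for illustration. It only checks that surnames match, that given-name initials do not conflict, and then uses affiliation as supporting evidence.

```python
import re

# Hypothetical list of titles to ignore when comparing names.
TITLES = {"prof", "professor", "dr", "mr", "mrs", "ms"}


def parse_name(raw):
    """Split a display name into (surname, list of given-name initials)."""
    tokens = [t for t in re.split(r"[\s.]+", raw.strip()) if t]
    tokens = [t for t in tokens if t.lower() not in TITLES]
    if not tokens:
        return "", []
    surname = tokens[-1].lower()
    initials = [t[0].lower() for t in tokens[:-1]]
    return surname, initials


def names_compatible(a, b):
    """True if surnames match and no given-name initial conflicts."""
    surname_a, initials_a = parse_name(a)
    surname_b, initials_b = parse_name(b)
    if not surname_a or surname_a != surname_b:
        return False
    # zip stops at the shorter list, so a missing middle initial
    # is not treated as a conflict.
    return all(x == y for x, y in zip(initials_a, initials_b))


def same_person(rec_a, rec_b):
    """Return 'likely', 'possible' or 'no' for two harvested records."""
    if not names_compatible(rec_a["name"], rec_b["name"]):
        return "no"
    aff_a = rec_a.get("affiliation", "").strip().lower()
    aff_b = rec_b.get("affiliation", "").strip().lower()
    if aff_a and aff_a == aff_b:
        return "likely"
    return "possible"


source_1 = {"name": "Prof John Smith", "affiliation": "Computing Laboratory"}
source_2 = {"name": "J. P. Smith", "affiliation": "Computing Laboratory"}
print(same_person(source_1, source_2))
```

In this sketch a "possible" result is exactly the case where a person would have to decide whether to merge or divide records, which is why the reporting forms matter.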
For the summer task we are collaborating with the Computing Laboratory (Comlab). Anusha is currently harvesting their data and Monica is working on the reporting/notification forms within the Blue Pages. We will soon contact Comlab again to check their harvested data and participate in tests. We would like to thank Thorsten Hauler, research facilitator, and Edward Crichton, web manager, both from Comlab, who have kindly given us their time.
This summer project is part of the data quality control that we are trying to establish within the registry. I have talked about this in a previous post.