MSc Seminar: "Effective Name Similarity Measures for the Task of Name Matching in Record Linkage" by Mitra Kaviani

Date and Time

Location

J.D. MacLachlan Room 228

Details

ABSTRACT:
Databases are an essential part of IT-based systems and applications. Many businesses and industries rely on accurate databases to carry out their operations, so the quality of the stored data plays an important role in supporting business functions. Organizations store personal names in their databases; names appear, for instance, in emails, customer and patient records, and health care and banking systems. These organizations use personal names in particular to identify an individual's record. Because databases face the problem of identifying records with similar names when data are deduplicated or linked across organizations, it is crucial to use effective name similarity measures to improve matching quality and enrich the data.

Names, such as surnames, are important pieces of information. A growing number of applications rely on personal name matching, including search engines, text and web mining, information extraction and retrieval, and data linkage or deduplication systems. Name variations are problematic, however, particularly when names are used to identify an individual in a database, because it is not easy to determine whether a variation that appears in a record is a different spelling of the same name or the name of a different person. The challenge of the name-matching problem is therefore to decide whether two strings are instances of the same name entity in spite of such variations. Depending on the type of name variation they address, name similarity measures can be divided into three main categories: character-based, token-based, and phonetic-based.
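To make the three categories concrete, here is a minimal Python sketch with one representative measure per category. The choice of measures (normalized Levenshtein distance, Jaccard overlap of name tokens, and American Soundex) and all function names are illustrative assumptions; the abstract does not say which measures the seminar evaluates.

```python
# Illustrative only: one common measure per category, not the seminar's own selection.

def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def char_similarity(a, b):
    """Character-based: edit distance scaled to the range [0, 1]."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def token_similarity(a, b):
    """Token-based: Jaccard overlap of whitespace-separated name parts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

# Soundex digit for each coded consonant; vowels, h, w and y carry no digit.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(name):
    """Phonetic-based: four-character American Soundex code."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":      # h and w do not separate equal codes
            prev = digit
    return (code + "000")[:4]

print(char_similarity("Smith", "Smyth"))                # 0.8
print(token_similarity("Mary Ann Smith", "Smith Mary"))  # 2 shared of 3 tokens
print(soundex("Smith"), soundex("Smyth"))                # S530 S530
```

The example pairs show why the categories complement each other: a spelling variant scores high on the character-based measure, reordered or missing name parts are tolerated by the token-based measure, and names that sound alike share a phonetic code.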

This seminar will compare name similarity measures with the help of a clustering method that uses the names' phonetic similarity as a pattern. Traditional clustering methods cannot handle large datasets because of slow convergence and low accuracy. The seminar will also present an improved fuzzy c-means (FCM) algorithm that is combined with Simulated Annealing and a Genetic Algorithm to optimize the search ability of FCM clustering; the names will then be grouped by their representative. We will compare how these different categories of similarity measures perform and investigate which name matching technique is effective for record linkage.
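For readers unfamiliar with FCM, the following is a minimal sketch of the standard fuzzy c-means iteration in plain Python. It is not the improved variant presented in the seminar: the Simulated Annealing and Genetic Algorithm components are omitted, and the function name, parameters, and the assumption that each name has already been mapped to a numeric feature vector are hypothetical choices made for illustration.

```python
import random

def fuzzy_c_means(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Standard FCM: returns (centers, memberships) for a list of numeric vectors.

    points: list of equal-length lists of floats (assumed name features)
    c:      number of clusters
    m:      fuzzifier (> 1); larger values give softer memberships
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Start from random memberships, each row normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(iters):
        # Center update: mean of the points weighted by membership ** m.
        centers = []
        for k in range(c):
            w = [u[i][k] ** m for i in range(n)]
            total = sum(w)
            centers.append([sum(w[i] * points[i][d] for i in range(n)) / total
                            for d in range(dim)])

        # Membership update from relative distances to all centers.
        shift = 0.0
        for i in range(n):
            dist = [max(sum((points[i][d] - centers[k][d]) ** 2
                            for d in range(dim)) ** 0.5, 1e-12)
                    for k in range(c)]
            new_row = [1.0 / sum((dist[k] / dist[j]) ** (2.0 / (m - 1.0))
                                 for j in range(c))
                       for k in range(c)]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row

        if shift < tol:   # memberships have stopped changing
            break
    return centers, u

# Two well-separated groups of one-dimensional toy points.
data = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
centers, memberships = fuzzy_c_means(data, c=2)
labels = [row.index(max(row)) for row in memberships]
print(labels)
```

Unlike hard clustering, each point receives a degree of membership in every cluster; taking the largest membership, as in the last lines, yields one group per point.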
