ProbDup

Identifies probable duplicates of germplasm accessions in KWIC indexes created from PGR passport databases using fuzzy, phonetic and semantic matching strategies.

Usage:
ProbDup(
kwic1,
kwic2 = NULL,
method = c("a", "b", "c"),
excep = NULL,
chunksize = 1000,
useBytes = TRUE,
fuzzy = TRUE,
max.dist = 3,
force.exact = TRUE,
max.alpha = 4,
max.digit = Inf,
phonetic = TRUE,
encoding = c("primary", "alternate"),
phon.min.alpha = 5,
min.enc = 3,
semantic = FALSE,
syn = NULL
)
Arguments:

kwic1          An object of class KWIC.

kwic2          An object of class KWIC. Required for methods "b" and "c" only (see Details).

method         The method to be followed for identification of probable duplicates. Either "a", "b" or "c" (see Details).

excep          A vector of keywords in the KWIC index not to be used for the probable duplicate search (see Details).

chunksize      The size of the KWIC index keyword block to be used for searching for matches at a time, relevant when the number of keywords is large (see Note).

useBytes       logical. If TRUE, performs byte-wise comparison instead of character-wise comparison (see Note).

fuzzy          logical. If TRUE, identifies probable duplicates based on fuzzy matching.

max.dist       The maximum Levenshtein distance between keyword strings allowed for a match. Default is 3 (see Details).

force.exact    logical. If TRUE, enforces exact matching instead of fuzzy matching for keyword strings which match the criteria specified in the arguments max.alpha and max.digit (see Details).

max.alpha      Maximum number of alphabet characters in a keyword string up to which exact matching is enforced rather than fuzzy matching. Default is 4 (see Details).

max.digit      Maximum number of numeric characters in a keyword string up to which exact matching is enforced rather than fuzzy matching. Default is Inf (see Details).

phonetic       logical. If TRUE, identifies probable duplicates based on phonetic matching.

encoding       The double metaphone encoding to be used for phonetic matching. The default is "primary" (see Details).

phon.min.alpha Minimum number of alphabet characters that must be present in a keyword string for phonetic matching (see Details).

min.enc        Minimum number of characters that must be present in the double metaphone encoding of a keyword string for phonetic matching (see Details).

semantic       logical. If TRUE, identifies probable duplicates based on semantic matching.

syn            A list with character vectors of synsets (see Details).
Value:

A list of class ProbDup containing the following data frames of probable duplicate sets identified, along with the corresponding keywords and set counts:

  FuzzyDuplicates
  PhoneticDuplicates
  SemanticDuplicates

Each data frame has the following columns:

  SET_NO    The set number.
  TYPE      The type of probable duplicate set: 'F' for fuzzy, 'P' for phonetic and 'S' for semantic matching sets.
  ID        The primary IDs of the records of accessions comprising a set.
  ID:KW     The 'matching' keywords along with the IDs.
  COUNT     The number of elements in a set.

The prefix [K*] indicates the KWIC index of origin of the keyword or primary ID.
Details:

This function performs fuzzy, phonetic and semantic matching of keywords in KWIC indexes of PGR passport databases (created using the KWIC function) to identify probable duplicates of germplasm accessions. The function can execute matching according to any of the following three methods, as specified by the method argument (each is demonstrated in the Examples below).

  "a" : Perform string matching of keywords in a single KWIC index to identify probable duplicates of accessions in a single PGR passport database.

  "b" : Perform string matching of the keywords in the first KWIC index (query) with the keywords in the second index (source) to identify probable duplicates of accessions of the first PGR passport database among the accessions in the second database.

  "c" : Perform string matching of keywords in two different KWIC indexes jointly to identify probable duplicates of accessions from among two PGR passport databases.
Fuzzy matching or approximate string matching of keywords is carried out by computing the generalized Levenshtein (edit) distance between them. This distance measure counts the number of deletions, insertions and substitutions necessary to turn one string into another. Distances of up to max.dist are considered a match.
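For instance, the underlying edit distances can be inspected directly with the stringdist package, which this function uses internally; the keyword pairs below are illustrative, not drawn from any bundled data.

# Illustrative only: generalized Levenshtein (edit) distances between keywords
stringdist::stringdist("GUJARAT", "GUJRAT", method = "lv")   # 1 (one deletion)
stringdist::stringdist("EC21089", "EC21099", method = "lv")  # 1 (one substitution)

With the default max.dist = 3, both pairs would qualify as fuzzy matches.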
Exact matching will be enforced when the argument force.exact is TRUE. It can be used to avoid fuzzy matching when the number of alphabet characters in keywords is less than a critical value (max.alpha). Similarly, the value of max.digit can also be set according to the requirements. The default value of Inf avoids fuzzy matching and enforces exact matching for all keywords having any numeric characters. If max.digit and max.alpha are both set to Inf, exact matching will be enforced for all the keywords.

When exact matching is enforced, for keywords having both alphabet and numeric characters and with the number of alphabet characters greater than max.alpha, matching will be carried out separately for the alphabet and numeric characters present.
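A minimal sketch of varying these thresholds (GNKWIC is the KWIC index built in the Examples below; the threshold values here are illustrative, not recommendations):

# Fuzzy match within 2 edits, but match exactly when a keyword has
# 3 or fewer alphabet characters or any numeric characters (max.digit = Inf)
GNdup2 <- ProbDup(kwic1 = GNKWIC, method = "a", fuzzy = TRUE,
                  max.dist = 2, force.exact = TRUE,
                  max.alpha = 3, max.digit = Inf)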
Phonetic matching of keywords is carried out using the Double Metaphone phonetic algorithm (DoubleMetaphone) to identify keywords that have similar pronunciation. Either the "primary" or "alternate" encoding can be used by specifying the encoding argument. The argument phon.min.alpha sets the minimum number of alphabet characters that must be present in a keyword string for phonetic matching to be attempted. Similarly, min.enc sets the minimum number of characters that must be present in the double metaphone encoding of a keyword for phonetic matching.
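The encodings themselves can be examined with the DoubleMetaphone function; a small sketch, assuming (as documented for DoubleMetaphone) that it returns the primary and alternate encodings of each input string:

# Illustrative keywords; identical encodings indicate a phonetic match
enc <- DoubleMetaphone(c("RENOLDS", "REYNOLDS"))
enc$primary  # identical primary encodings would make these a phonetic match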
Semantic matching matches keywords based on a list of accession name synonyms supplied as a list of character vectors of synonym sets (synsets) to the syn argument. Synonyms in this context refer to interchangeable identifiers or names by which an accession is recognized. Multiple keywords specified as members of the same synset in syn are merged together. To facilitate accurate identification of synonyms from the KWIC index, identical data standardization operations using the MergeKW and DataClean functions are recommended for both the original database fields and the synset list.
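For instance, a synset list of the form expected by syn, followed by a semantic-only run (a sketch; GNKWIC and the synonym pairs are those used in the Examples below):

# Each character vector is one synset of interchangeable accession names
syn <- list(c("CHANDRA", "AH114"), c("TG1", "VIKRAM"))
GNdupSem <- ProbDup(kwic1 = GNKWIC, method = "a", fuzzy = FALSE,
                    phonetic = FALSE, semantic = TRUE, syn = syn)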
The probable duplicate sets identified initially may intersect with other sets. To get disjoint sets after the union of all the intersecting sets, use the DisProbDup function.
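A sketch of that step, assuming GNdup is a ProbDup object such as the one created in the Examples, and assuming (per the DisProbDup documentation) that combine = NULL keeps the fuzzy, phonetic and semantic set types separate:

# Disjoint sets within each set type, without combining across types (assumption)
disGNdup <- DisProbDup(GNdup, combine = NULL)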
The function AddProbDup can be used to add the information associated with the identified sets in an object of class ProbDup as fields (columns) to the original PGR passport database.
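A hedged sketch, assuming GNdup from the Examples and that addto = "I" appends the set information for the first (query) database, as documented for AddProbDup:

# Append set number, type and keyword columns to the original database
# (assumption: addto = "I" targets the database behind the first KWIC index)
GNwithdup <- AddProbDup(pdup = GNdup, db = GN1000, addto = "I")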
All of the string matching operations here are executed through functions in the stringdist package (van der Loo 2014).
Note:

As the number of keywords in the KWIC indexes increases, the memory consumption by the function also increases. For string matching, this function relies upon the creation of an \(n \times m\) matrix of all possible keyword pairs for comparison, where \(n\) and \(m\) are the number of keywords in the query and source indexes respectively. This can lead to "cannot allocate vector of size" errors in the case of very large KWIC indexes, where the comparison matrix is too large to reside in memory. In such cases, adjust the chunksize argument to get an appropriate size for the KWIC index keyword block to be used for searching for matches at a time. However, a smaller chunksize may lead to longer computation time due to the memory-time trade-off.
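For instance, a smaller keyword block could be requested as follows (a sketch; GNKWIC as in the Examples, and 500 is an arbitrary illustrative value):

# Trade memory for time: compare 500 keywords per block instead of the default 1000
GNdup <- ProbDup(kwic1 = GNKWIC, method = "a", chunksize = 500,
                 fuzzy = TRUE, phonetic = TRUE)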
The progress of matching is displayed in the console as the number of blocks completed out of the total (e.g. 6 / 30), the percentage completed (e.g. 30%) and a text-based progress bar.
In case of multi-byte characters in keywords, the matching speed is further dependent upon the useBytes argument, as described under 'Encoding issues' in the help for the stringdist function, which is used here for string matching.
References:

van der Loo, M. P. J. 2014. "The stringdist Package for Approximate String Matching." The R Journal 6 (1): 111-22. https://journal.r-project.org/archive/2014/RJ-2014-011/index.html.

Examples:
# \dontshow{
threads_dt <- data.table::getDTthreads()
threads_OMP <- Sys.getenv("OMP_THREAD_LIMIT")
data.table::setDTthreads(2)
Sys.setenv(`OMP_THREAD_LIMIT` = 2)
# }
if (FALSE) {
# Method "a"
#===========
# Load PGR passport database
GN <- GN1000
# Specify as a vector the database fields to be used
GNfields <- c("NationalID", "CollNo", "DonorID", "OtherID1", "OtherID2")
# Clean the data
GN[GNfields] <- lapply(GN[GNfields], function(x) DataClean(x))
y1 <- list(c("Gujarat", "Dwarf"), c("Castle", "Cary"), c("Small", "Japan"),
c("Big", "Japan"), c("Mani", "Blanco"), c("Uganda", "Erect"),
c("Mota", "Company"))
y2 <- c("Dark", "Light", "Small", "Improved", "Punjab", "SAM")
y3 <- c("Local", "Bold", "Cary", "Mutant", "Runner", "Giant", "No.",
"Bunch", "Peanut")
GN[GNfields] <- lapply(GN[GNfields], function(x) MergeKW(x, y1, delim = c("space", "dash")))
GN[GNfields] <- lapply(GN[GNfields], function(x) MergePrefix(x, y2, delim = c("space", "dash")))
GN[GNfields] <- lapply(GN[GNfields], function(x) MergeSuffix(x, y3, delim = c("space", "dash")))
# Generate KWIC index
GNKWIC <- KWIC(GN, GNfields)
# Specify the exceptions as a vector
exep <- c("A", "B", "BIG", "BOLD", "BUNCH", "C", "COMPANY", "CULTURE",
"DARK", "E", "EARLY", "EC", "ERECT", "EXOTIC", "FLESH", "GROUNDNUT",
"GUTHUKAI", "IMPROVED", "K", "KUTHUKADAL", "KUTHUKAI", "LARGE",
"LIGHT", "LOCAL", "OF", "OVERO", "P", "PEANUT", "PURPLE", "R",
"RED", "RUNNER", "S1", "SAM", "SMALL", "SPANISH", "TAN", "TYPE",
"U", "VALENCIA", "VIRGINIA", "WHITE")
# Specify the synsets as a list
syn <- list(c("CHANDRA", "AH114"), c("TG1", "VIKRAM"))
# Fetch probable duplicate sets
GNdup <- ProbDup(kwic1 = GNKWIC, method = "a", excep = exep, fuzzy = TRUE,
phonetic = TRUE, encoding = "primary",
semantic = TRUE, syn = syn)
GNdup
# Method "b and c"
#=================
# Load PGR passport databases
GN1 <- GN1000[!grepl("^ICG", GN1000$DonorID), ]
GN1$DonorID <- NULL
GN2 <- GN1000[grepl("^ICG", GN1000$DonorID), ]
GN2 <- GN2[!grepl("S", GN2$DonorID), ]
GN2$NationalID <- NULL
# Specify as a vector the database fields to be used
GN1fields <- c("NationalID", "CollNo", "OtherID1", "OtherID2")
GN2fields <- c("DonorID", "CollNo", "OtherID1", "OtherID2")
# Clean the data
GN1[GN1fields] <- lapply(GN1[GN1fields], function(x) DataClean(x))
GN2[GN2fields] <- lapply(GN2[GN2fields], function(x) DataClean(x))
y1 <- list(c("Gujarat", "Dwarf"), c("Castle", "Cary"), c("Small", "Japan"),
c("Big", "Japan"), c("Mani", "Blanco"), c("Uganda", "Erect"),
c("Mota", "Company"))
y2 <- c("Dark", "Light", "Small", "Improved", "Punjab", "SAM")
y3 <- c("Local", "Bold", "Cary", "Mutant", "Runner", "Giant", "No.",
"Bunch", "Peanut")
GN1[GN1fields] <- lapply(GN1[GN1fields], function(x) MergeKW(x, y1, delim = c("space", "dash")))
GN1[GN1fields] <- lapply(GN1[GN1fields], function(x) MergePrefix(x, y2, delim = c("space", "dash")))
GN1[GN1fields] <- lapply(GN1[GN1fields], function(x) MergeSuffix(x, y3, delim = c("space", "dash")))
GN2[GN2fields] <- lapply(GN2[GN2fields], function(x) MergeKW(x, y1, delim = c("space", "dash")))
GN2[GN2fields] <- lapply(GN2[GN2fields], function(x) MergePrefix(x, y2, delim = c("space", "dash")))
GN2[GN2fields] <- lapply(GN2[GN2fields], function(x) MergeSuffix(x, y3, delim = c("space", "dash")))
# Remove duplicated DonorID records in GN2
GN2 <- GN2[!duplicated(GN2$DonorID), ]
# Generate KWIC index
GN1KWIC <- KWIC(GN1, GN1fields)
GN2KWIC <- KWIC(GN2, GN2fields)
# Specify the exceptions as a vector
exep <- c("A", "B", "BIG", "BOLD", "BUNCH", "C", "COMPANY", "CULTURE",
"DARK", "E", "EARLY", "EC", "ERECT", "EXOTIC", "FLESH", "GROUNDNUT",
"GUTHUKAI", "IMPROVED", "K", "KUTHUKADAL", "KUTHUKAI", "LARGE",
"LIGHT", "LOCAL", "OF", "OVERO", "P", "PEANUT", "PURPLE", "R",
"RED", "RUNNER", "S1", "SAM", "SMALL", "SPANISH", "TAN", "TYPE",
"U", "VALENCIA", "VIRGINIA", "WHITE")
# Specify the synsets as a list
syn <- list(c("CHANDRA", "AH114"), c("TG1", "VIKRAM"))
# Fetch probable duplicate sets
GNdupb <- ProbDup(kwic1 = GN1KWIC, kwic2 = GN2KWIC, method = "b",
excep = exep, fuzzy = TRUE, phonetic = TRUE,
encoding = "primary", semantic = TRUE, syn = syn)
GNdupb
GNdupc <- ProbDup(kwic1 = GN1KWIC, kwic2 = GN2KWIC, method = "c",
excep = exep, fuzzy = TRUE, phonetic = TRUE,
encoding = "primary", semantic = TRUE, syn = syn)
GNdupc
}
# \dontshow{
data.table::setDTthreads(threads_dt)
Sys.setenv(`OMP_THREAD_LIMIT` = threads_OMP)
# }