ProbDup object
Source: R/ViewProbDup.R
ViewProbDup plots summary visualizations of accessions within the
probable duplicate sets retrieved in a ProbDup object according to a
grouping factor field (column) in the original database(s).
ViewProbDup(
  pdup,
  db1,
  db2 = NULL,
  factor.db1,
  factor.db2 = NULL,
  max.count = 30,
  select,
  order = "type",
  main = NULL
)
| pdup | An object of class ProbDup. |
| db1 | A data frame of the PGR passport database. |
| db2 | A data frame of the PGR passport database. Required when pdup was created using more than one KWIC Index. |
| factor.db1 | The db1 column to be considered for grouping the accessions. Should be of class character or factor. |
| factor.db2 | The db2 column to be considered for grouping the accessions. Should be of class character or factor. |
| max.count | The maximum count of probable duplicate sets whose information is to be plotted (see Note). |
| select | A character vector of factor names in factor.db1 and/or factor.db2 to be considered for grouping accessions (see Note). |
| order | The order of the type of sets retrieved in the plot. The default is "type" (see Details). |
| main | The title of the plot. |
A list containing the following objects:
| Summary1 | The summary data.frame of the number of accessions per factor level. |
| Summary2 | The summary data.frame of the number of accessions and sets for each type of set, classified according to factor levels. |
| SummaryGrob | A grid graphical object (Grob) of the summary visualization plot. Can be plotted using the grid.arrange function. |
When any primary ID/key records in the fuzzy, phonetic or semantic 
  duplicate sets are found to be missing from the original databases 
  db1 and db2, then they are ignored and only the matching 
  records are considered for visualization.
This may be due to data standardization of the primary ID/key field using
  the function DataClean before creation of the KWIC
  index and subsequent identification of probable duplicate sets. In such a
  case, it is recommended to apply the identical data standardization
  operation to the databases db1 and db2 before running this function.
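A quick way to see whether this situation applies is to compare the primary keys held in the sets with those in the database before and after standardization. The following is only a minimal base R sketch: the vectors set_ids and db_ids and the helper standardize are hypothetical, and standardize is a simplified stand-in, not the behaviour of DataClean.

```r
# Hypothetical primary keys: 'set_ids' as stored in the duplicate sets
# (already standardized), 'db_ids' as found in the raw database column.
set_ids <- c("EC100277", "ICG1122", "NCAC16049")
db_ids  <- c("EC 100277", "ICG-1122", "NCAC16049")

# Keys present in the sets that cannot be matched in the raw database
missing_ids <- setdiff(set_ids, db_ids)

# Simplified stand-in for the real standardization step (assumption:
# upper-case and drop non-alphanumeric characters; in practice, apply
# the same DataClean call that was used before building the KWIC index)
standardize <- function(x) toupper(gsub("[^[:alnum:]]", "", x))

# After identical standardization, no keys should remain unmatched
missing_after <- setdiff(set_ids, standardize(db_ids))
```

If missing_after is still non-empty, the unmatched records would simply be left out of the visualization.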
For summary and visualization of the set information in an object of class
  ProbDup, ViewProbDup uses the disjoint of the retrieved sets,
  as these are more meaningful than the raw sets retrieved.
  It is therefore recommended that the disjoint sets obtained using
  DisProbDup be used as the input pdup.
All the accession records in sets with count > max.count will be 
  considered as being unique.
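The effect of max.count can be pictured as a simple threshold on set size. The sketch below uses a hypothetical table of sets with SET_NO and COUNT columns; it illustrates the rule stated above and is not the function's internal code.

```r
# Hypothetical set table: one row per set, COUNT = accessions in the set
sets <- data.frame(SET_NO = 1:4, COUNT = c(2, 45, 3, 31))
max.count <- 30

# Sets within the limit are kept as probable duplicate sets
plotted <- sets[sets$COUNT <= max.count, ]

# Accessions in larger sets would be treated as unique instead
dropped <- sets[sets$COUNT > max.count, ]
```

Here sets 2 and 4 exceed the limit, so only sets 1 and 3 would contribute to the duplicate set summaries.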
Only the factor levels in the factor.db1 and/or factor.db2 columns
  that are listed in the select argument will be
  considered for visualization. All other factor levels will be grouped
  together into a single level named "Others".
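The grouping described for select amounts to collapsing every unlisted level into "Others". A minimal base R sketch, assuming a hypothetical character vector country standing in for the grouping column:

```r
# Hypothetical grouping column and the levels chosen via 'select'
country <- c("INDIA", "USA", "NIGERIA", "INDIA", "BRAZIL")
select  <- c("INDIA", "USA")

# Keep selected levels as they are; collapse everything else to "Others"
grouped <- ifelse(country %in% select, country, "Others")
grouped <- factor(grouped, levels = c(select, "Others"))

# Number of accessions per resulting level
level_counts <- table(grouped)
```

Note that matching is exact, so the values in select need to use the same spelling and case as the column (the example at the end of this page upper-cases SourceCountry for this reason).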
The argument order can be used to specify the order in which the 
  type of sets retrieved are to be plotted in the visualization. The default 
  "type" will order according to the kind of sets, "sets" will 
  order according to the number of sets in each kind and "acc" will 
  order according to the number of accessions in each kind.
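The three order options can be thought of as three ways of sorting a per-type summary. The sketch below is illustrative only: the summary table summ, its type labels and the helper order_types are hypothetical, and the descending direction for "sets" and "acc" is an assumption rather than documented behaviour.

```r
# Hypothetical per-type summary: number of sets and accessions per kind
summ <- data.frame(
  type = c("fuzzy", "phonetic", "semantic"),
  sets = c(12, 30, 4),
  acc  = c(40, 35, 9),
  stringsAsFactors = FALSE
)

# Return the set types in plotting order for a given option
order_types <- function(summ, by = c("type", "sets", "acc")) {
  by <- match.arg(by)
  switch(by,
    type = summ$type,                                       # as listed, by kind
    sets = summ$type[order(summ$sets, decreasing = TRUE)],  # by number of sets
    acc  = summ$type[order(summ$acc, decreasing = TRUE)]    # by number of accessions
  )
}

by_type <- order_types(summ, "type")
by_sets <- order_types(summ, "sets")
by_acc  <- order_types(summ, "acc")
```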
The individual plots are made using ggplot and then
  grouped together using the gridExtra package.
# \dontshow{
threads_dt <- data.table::getDTthreads()
threads_OMP <- Sys.getenv("OMP_THREAD_LIMIT")
data.table::setDTthreads(2)
Sys.setenv(`OMP_THREAD_LIMIT` = 2)
# }
if (FALSE) {
# Method "b and c"
#=================
# Attach the package providing GN1000, DataClean, KWIC, ProbDup and ViewProbDup
library(PGRdup)
# Load PGR passport databases
GN1 <- GN1000[!grepl("^ICG", GN1000$DonorID), ]
GN1$DonorID <- NULL
GN2 <- GN1000[grepl("^ICG", GN1000$DonorID), ]
GN2 <- GN2[!grepl("S", GN2$DonorID), ]
GN2$NationalID <- NULL
GN1$SourceCountry <- toupper(GN1$SourceCountry)
GN2$SourceCountry <- toupper(GN2$SourceCountry)
GN1$SourceCountry <- gsub("UNITED STATES OF AMERICA", "USA", GN1$SourceCountry)
GN2$SourceCountry <- gsub("UNITED STATES OF AMERICA", "USA", GN2$SourceCountry)
# Specify as a vector the database fields to be used
GN1fields <- c("NationalID", "CollNo", "OtherID1", "OtherID2")
GN2fields <- c("DonorID", "CollNo", "OtherID1", "OtherID2")
# Clean the data
GN1[GN1fields] <- lapply(GN1[GN1fields], function(x) DataClean(x))
GN2[GN2fields] <- lapply(GN2[GN2fields], function(x) DataClean(x))
y1 <- list(c("Gujarat", "Dwarf"), c("Castle", "Cary"), c("Small", "Japan"),
           c("Big", "Japan"), c("Mani", "Blanco"), c("Uganda", "Erect"),
           c("Mota", "Company"))
y2 <- c("Dark", "Light", "Small", "Improved", "Punjab", "SAM")
y3 <- c("Local", "Bold", "Cary", "Mutant", "Runner", "Giant", "No.",
        "Bunch", "Peanut")
GN1[GN1fields] <- lapply(GN1[GN1fields], function(x) MergeKW(x, y1, delim = c("space", "dash")))
GN1[GN1fields] <- lapply(GN1[GN1fields], function(x) MergePrefix(x, y2, delim = c("space", "dash")))
GN1[GN1fields] <- lapply(GN1[GN1fields], function(x) MergeSuffix(x, y3, delim = c("space", "dash")))
GN2[GN2fields] <- lapply(GN2[GN2fields], function(x) MergeKW(x, y1, delim = c("space", "dash")))
GN2[GN2fields] <- lapply(GN2[GN2fields], function(x) MergePrefix(x, y2, delim = c("space", "dash")))
GN2[GN2fields] <- lapply(GN2[GN2fields], function(x) MergeSuffix(x, y3, delim = c("space", "dash")))
# Remove duplicated DonorID records in GN2
GN2 <- GN2[!duplicated(GN2$DonorID), ]
# Generate KWIC index
GN1KWIC <- KWIC(GN1, GN1fields)
GN2KWIC <- KWIC(GN2, GN2fields)
# Specify the exceptions as a vector
exep <- c("A", "B", "BIG", "BOLD", "BUNCH", "C", "COMPANY", "CULTURE",
          "DARK", "E", "EARLY", "EC", "ERECT", "EXOTIC", "FLESH", "GROUNDNUT",
          "GUTHUKAI", "IMPROVED", "K", "KUTHUKADAL", "KUTHUKAI", "LARGE",
          "LIGHT", "LOCAL", "OF", "OVERO", "P", "PEANUT", "PURPLE", "R",
          "RED", "RUNNER", "S1", "SAM", "SMALL", "SPANISH", "TAN", "TYPE",
          "U", "VALENCIA", "VIRGINIA", "WHITE")
# Specify the synsets as a list
syn <- list(c("CHANDRA", "AH114"), c("TG1", "VIKRAM"))
GNdupc <- ProbDup(kwic1 = GN1KWIC, kwic2 = GN2KWIC, method = "c",
                  excep = exep, fuzzy = TRUE, phonetic = TRUE,
                  encoding = "primary", semantic = TRUE, syn = syn)
GNdupcView <- ViewProbDup(GNdupc, GN1, GN2, "SourceCountry", "SourceCountry",
                         max.count = 30, select = c("INDIA", "USA"), order = "type",
                         main = "Groundnut Probable Duplicates")
library(gridExtra)
grid.arrange(GNdupcView$SummaryGrob)
}
# \dontshow{
data.table::setDTthreads(threads_dt)
Sys.setenv(`OMP_THREAD_LIMIT` = threads_OMP)
# }