December 14, 2022

R code: drop duplicate values in a delimited column

Sometimes I have columns of delimited values (comma-separated, semicolon-separated, etc.) that contain duplicates, and I want to collapse each cell to its unique values. See the "Gene Names" column below.




In this example data, row 7 has duplicate values that I would like to collapse to only "RALGDS". Row 10 has duplicate values that I'd like to collapse to "KCNIP2". Row 8 has two unique values that I would like to keep. If I had another row with "geneA;geneA;geneA;geneB", I would want to collapse that to "geneA;geneB", keeping only unique values.

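SplitFunction <- function(x) {
  # Split one string on semicolons or commas
  parts <- unlist(strsplit(x, "[;,]"))
  # Keep the first occurrence of each value, preserving order
  uniq <- parts[!duplicated(parts)]
  # Rejoin the unique values with semicolons
  paste(uniq, collapse = ";")
}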
  
DropDuplicates <- Vectorize(SplitFunction)  # apply SplitFunction to each element of a vector, e.g. a data frame column

df$Gene2 <- dplyr::na_if(DropDuplicates(df$Gene), "NA")

Here, SplitFunction() removes duplicates from a single delimited string (e.g. the values in cell B7 above), and DropDuplicates() applies it to every row of a column, so I can edit the whole column at once. Note that the output is always joined with semicolons, even if the input was comma-separated.
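
To try both functions, here is a minimal sketch on a made-up data frame. The data frame df and its contents are assumptions for illustration only (the gene names are borrowed from the examples above), and the function definitions are repeated in compact form so the block runs on its own.

```r
# Definitions repeated from above so this block runs on its own
SplitFunction <- function(x) {
  b <- unlist(strsplit(x, "[;,]"))
  paste(b[!duplicated(b)], collapse = ";")
}
DropDuplicates <- Vectorize(SplitFunction)

# Hypothetical example data (not the real dataset)
df <- data.frame(
  Gene = c("RALGDS;RALGDS", "KCNIP2,KCNIP2", "geneA;geneA;geneA;geneB"),
  stringsAsFactors = FALSE
)

# One string at a time
single <- SplitFunction("geneA;geneA;geneA;geneB")
print(single)  # "geneA;geneB"

# Whole column at once; unname() drops the element names that Vectorize() attaches
df$Gene2 <- unname(DropDuplicates(df$Gene))
print(df)
```

The unname() call is optional. Vectorize() labels each result with its input string, which is harmless in a data frame column but noisy when printing the vector on its own.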

Citation for SplitFunction: https://stackoverflow.com/questions/16976921/r-split-string-with-2-delimiters-remove-duplicate-and-put-the-frame-back

Note that missing values (NA, "not available") come back as the string "NA". This is not caused by Vectorize() itself: inside SplitFunction(), paste() converts a missing value to the string "NA" when it rejoins the values. I added dplyr::na_if() to turn those strings back into real NA values.
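
The NA behavior is easy to check directly, and one alternative is to handle NA inside the function so that no clean-up step is needed afterwards. The following is only a sketch and not the approach used above: DropDuplicatesNA is a hypothetical name, the input vector is made up, and it uses base R's vapply() in place of Vectorize() and dplyr::na_if().

```r
# paste() turns a missing value into the string "NA"
print(paste(NA_character_, collapse = ";"))  # "NA" (a string, not a missing value)

# Hypothetical NA-aware variant, base R only
DropDuplicatesNA <- function(x) {
  vapply(strsplit(x, "[;,]"), function(parts) {
    # strsplit() yields a single NA for a missing input; pass it through unchanged
    if (length(parts) == 1 && is.na(parts)) return(NA_character_)
    paste(unique(parts), collapse = ";")
  }, character(1))
}

# Made-up input with one missing value
genes <- c("RALGDS;RALGDS", NA, "geneA;geneB")
result <- DropDuplicatesNA(genes)
print(result)
```

Handling NA inside the function also avoids a side effect of the na_if() approach, which would convert a cell whose genuine value is the text "NA" into a missing value.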

