December 14, 2022

R code: drop duplicate values in a delimited column

Sometimes I have columns of delimited values (comma-separated, semicolon-separated, etc.) that contain duplicates, and I want to collapse each cell to its unique values. See the "Gene Names" column below.




In this example data, row 7 has duplicate values that I would like to collapse to just "RALGDS". Row 10 has duplicate values that I'd like to collapse to "KCNIP2". Row 8 has two unique values that I would like to keep. If I had another row with "geneA;geneA;geneA;geneB", I would want to collapse that to "geneA;geneB", keeping only the unique values.

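# Split one delimited string on ";" or ",", drop repeated values, and rejoin with ";".
SplitFunction <- function(x) {
  parts <- unlist(strsplit(x, "[;,]"))
  unique_parts <- parts[!duplicated(parts)]
  paste(unique_parts, collapse = ";")
}

# Vectorize the function so it can be applied to a whole data frame column.
DropDuplicates <- Vectorize(SplitFunction, USE.NAMES = FALSE)

# Missing values come back as the string "NA"; na_if() converts them back to NA.
df$Gene2 <- dplyr::na_if(DropDuplicates(df$Gene), "NA")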

Here, I can use SplitFunction() to remove duplicates from a single delimited string (e.g. the value in cell B7 above). I can use DropDuplicates() to do the same for every row of a column, so the whole column is edited at once.
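As a quick check, here is a minimal sketch on a toy data frame. The data frame `toy` and its "TP53;MDM2" row are made up for illustration (they are not the table from the screenshot), and the two functions are repeated so the snippet runs on its own.

```r
# Repeat the two functions so this snippet is self-contained.
SplitFunction <- function(x) {
  parts <- unlist(strsplit(x, "[;,]"))
  paste(parts[!duplicated(parts)], collapse = ";")
}
DropDuplicates <- Vectorize(SplitFunction, USE.NAMES = FALSE)

# Hypothetical toy data for illustration only.
toy <- data.frame(
  Gene = c("RALGDS;RALGDS", "TP53;MDM2", "geneA;geneA;geneA;geneB"),
  stringsAsFactors = FALSE
)

# One string at a time:
SplitFunction("geneA;geneA;geneA;geneB")  # "geneA;geneB"

# The whole column at once:
toy$Gene2 <- DropDuplicates(toy$Gene)
print(toy)
```

Rows that are already unique, like "TP53;MDM2", pass through unchanged, and the original order of first appearance is kept.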

Citation for SplitFunction: https://stackoverflow.com/questions/16976921/r-split-string-with-2-delimiters-remove-duplicate-and-put-the-frame-back

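Note that NA (not available, i.e. missing) values come back as the string "NA". This is not really a bug in Vectorize(): paste() converts NA to "NA" when SplitFunction() rejoins the pieces. I added dplyr::na_if() to turn those strings back into NA values. One caveat is that a cell whose real value is literally "NA" would also become missing.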
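One way to avoid the "NA" strings in the first place is to check for missing values before pasting. The sketch below uses base R only. `DropDuplicatesNA` is a hypothetical helper name, not part of the original code, and it also trims whitespace around each value, which the original function does not do.

```r
# Hypothetical helper (an assumption, not the original method):
# dedupe each delimited string and leave NA entries as NA.
DropDuplicatesNA <- function(x, sep = ";") {
  vapply(
    strsplit(as.character(x), "[;,]"),
    function(parts) {
      # strsplit() returns NA for a missing input; pass it through untouched.
      if (anyNA(parts)) return(NA_character_)
      paste(unique(trimws(parts)), collapse = sep)
    },
    character(1)
  )
}

DropDuplicatesNA(c("KCNIP2,KCNIP2", NA, "geneA;geneA;geneB"))
```

Because strsplit() already works on a whole character vector, this version needs neither Vectorize() nor the na_if() step.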

