Sometimes I have columns where a single cell holds the same value several times, and I want to collapse each cell to its unique values. The values may be separated by commas, semicolons, etc. See the "Gene Names" column below.
In this example data, row 7 has duplicate values that I would like to collapse to only "RALGDS". Row 10 has duplicate values that I'd like to collapse to "KCNIP2". Row 8 has two unique values that I would like to keep. If I had another row with "geneA;geneA;geneA;geneB", I would want to collapse that to "geneA;geneB", keeping only unique values.
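# Split one string on ";" or ",", drop repeated values, and rejoin with ";"
SplitFunction <- function(x) {
  parts <- unlist(strsplit(x, "[;,]"))
  uniq <- unique(parts)  # keeps the first occurrence of each value, in order
  return(paste(uniq, collapse = ";"))
}
# Vectorize the function so I can use it on a whole column of a data frame
DropDuplicates <- Vectorize(SplitFunction, USE.NAMES = FALSE)
# NA inputs come back as the string "NA"; na_if() turns them back into real NA
df$Gene2 <- dplyr::na_if(DropDuplicates(df$Gene), "NA")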
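To try the idea without the original table, here is a minimal sketch on a made-up data frame. The `collapse_unique` helper name, the whitespace trimming, and the dropping of empty pieces are my own assumptions, not part of the code above. It uses only base R and returns NA for missing cells directly, so the `na_if()` step is not needed.

```r
# Hypothetical helper: collapse repeated values inside each delimited string.
collapse_unique <- function(x, split = "[;,]", sep = ";") {
  vapply(strsplit(x, split), function(parts) {
    # strsplit() returns a single NA for a missing input; keep it missing
    if (length(parts) == 1 && is.na(parts)) return(NA_character_)
    parts <- trimws(parts)           # assumption: surrounding spaces are not meaningful
    parts <- parts[nzchar(parts)]    # drop empty pieces, e.g. from "a;;b"
    paste(unique(parts), collapse = sep)
  }, character(1), USE.NAMES = FALSE)
}

# Made-up example data, not the table from this post
df <- data.frame(
  Gene = c("RALGDS;RALGDS", "KCNIP2,KCNIP2", "geneA;geneA;geneA;geneB", NA),
  stringsAsFactors = FALSE
)
df$Gene2 <- collapse_unique(df$Gene)
print(df)
```

Because `strsplit()` is already vectorized over its input, this version does not need `Vectorize()`. The output separator is always ";", even when the input used commas, which matches the behavior of the original function.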