Databricks: DataFrame groupBy agg, make collect_set include duplicate values

Keywords: scala apache-spark dataframe databricks


Suppose I have a dataset df like the following:

col1   col2 
1      A
1      B
1      C
2      B
2      B
2      C

I want to group the dataset by col1 and collect col2 into an array, using the following code:

import org.apache.spark.sql.functions.collect_set
val df2 = df.groupBy("col1").agg(collect_set("col2").alias("col2"))

Then df2 will be:

col1    col2
1       A,B,C
2       B,C

How can I change the code so that the duplicate values are kept and I get the following instead?

col1    col2
1       A,B,C
2       B,B,C
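
For reference, one possible direction is sketched below. collect_set deduplicates by design, whereas collect_list (also in org.apache.spark.sql.functions) keeps every value. This is only a minimal sketch, written for a notebook or spark-shell: the local SparkSession and the sample rows are assumptions made so that the snippet is self-contained, and wrapping the result in sort_array is an optional addition, since collect_list on its own does not guarantee the order of elements after a shuffle.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, sort_array}

// Assumption: a local session for illustration only.
// On Databricks a session named `spark` already exists, and getOrCreate() returns it.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("collect-list-sketch")
  .getOrCreate()
import spark.implicits._

// Sample data matching the table in the question.
val df = Seq((1, "A"), (1, "B"), (1, "C"), (2, "B"), (2, "B"), (2, "C"))
  .toDF("col1", "col2")

// collect_list keeps duplicates; collect_set would drop the second "B".
// sort_array is optional and only makes the element order deterministic.
val df2 = df.groupBy("col1")
  .agg(sort_array(collect_list("col2")).alias("col2"))

df2.orderBy("col1").show()
```

If the original row order within each group matters rather than sorted order, sort_array would not be appropriate, and an explicit ordering column would be needed.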