create spark dataframe from pandas dataframes inside RDD

Keywords: pandas apache-spark pyspark

Question: 

I'm trying to convert the pandas DataFrames produced on each worker node into a single Spark DataFrame distributed across all worker nodes.

Example:

import pandas as pd

def read_file_and_process_with_pandas(filename):
    data = pd.read_csv(filename)
    # some additional operations using pandas functionality
    return data

filelist = ['file1.csv', 'file2.csv', 'file3.csv']
rdd = sc.parallelize(filelist)
rdd = rdd.map(read_file_and_process_with_pandas)

Now I have an RDD of pandas DataFrames. How can I convert this into a Spark DataFrame?

I tried doing rdd = rdd.map(spark.createDataFrame), but when I do something like rdd.take(5), I get the following error:

PicklingError: Could not serialize object: Py4JError: An error occurred while calling o103.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:272)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)

Is there a way to convert the pandas DataFrames on each worker node into a single distributed Spark DataFrame?
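For reference, something along these lines is the kind of thing I'm imagining. This is a rough, untested sketch on my part: pandas_df_to_rows is just a helper name I made up, and I'm assuming that flattening each per-worker DataFrame into plain Row objects with flatMap, and only then calling createDataFrame on the driver, is one workable pattern. It reuses read_file_and_process_with_pandas from above.

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def pandas_df_to_rows(pdf):
    # Hypothetical helper: turn one pandas DataFrame into plain Row objects.
    # .item() converts numpy scalars to native Python values so that Spark's
    # schema inference does not have to deal with numpy types.
    for record in pdf.to_dict(orient="records"):
        yield Row(**{k: (v.item() if hasattr(v, "item") else v)
                     for k, v in record.items()})

filelist = ['file1.csv', 'file2.csv', 'file3.csv']
rows_rdd = sc.parallelize(filelist).flatMap(
    lambda f: pandas_df_to_rows(read_file_and_process_with_pandas(f)))

# createDataFrame is called on the driver against the RDD handle, so the
# SparkSession itself never has to be serialized out to the workers.
spark_df = spark.createDataFrame(rows_rdd)
spark_df.show()

If I understand the pickling error correctly, the problem with my original attempt is that spark (the SparkSession) only exists on the driver and cannot be pickled into the map closure, so presumably createDataFrame has to stay on the driver side regardless of how the per-worker data is flattened.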

Answers: