Spark Thrift Server Single Context Globally Available Cache

I read this article http://www.russellspitzer.com/2017/05/19/Spark-Sql-Thriftserver/ and became confused. It states: Spark Contexts are also unable to share cached resources amongst each othe

How to convert a row (array of strings) to a dataframe column [duplicate]

I have this code: from pyspark import SparkContext from pyspark.sql import SQLContext, Row sc = SparkContext() sqlContext = SQLContext(sc) documents = sqlContext.createDataFrame([ Row(id=1, ti

Error: filter Spark DataFrame on column value

Please refer to my sample code below: sampleDf -> my sample scala dataframe that I want to filter on 2 columns "startIPInt" and "endIPInt". var row = sampleDf.filter("startIPInt <=" + ip).filter("

Is TransmogrifAI compatible with MLflow through MLeap?

I have created a model using TransmogrifAI. I am trying to load that model into MLflow using MLeap, but I am unable to do so. I basically used the bundle feature in MLeap, but to no avail. Does anyone have any ideas on how to move forward?

Spark Structured Streaming - UI Storage Memory value growing

I am migrating from a DStreams Spark application to a structured streaming application. During testing, I found out that the Storage Memory in the Executors tab in Spark's UI keeps growing. It even

Apache Spark cluster not binding to IP

I am trying to run my cluster on my external IP so I can have workers from multiple PCs, but I'm getting this: spark-class org.apache.spark.deploy.master.Master --host <myIpIsHere> Using Spark's d

Is there a way I can access multiple JSON objects in array(struct) one by one in pyspark

I am a bit new to pyspark and JSON parsing, and I am stuck in a certain scenario. Let me first explain what I am trying to do: I have a JSON file in which there is a data element; that data eleme

Extract words from a string in spark hadoop with scala

I was using the code below to extract the strings I needed in Spark SQL. But now I am working with a ton of data in Spark Hadoop and I need help extracting strings. I tried the same code, but it does no

Get function to run in parallel with pyspark.mllib.linalg.distributed matrix

The following reproducible code does what I want, but is slow. I am not sure if I am correctly initiating the function map_simScore() to get the correct level of parallelism. Initializing the tes

Create Spark DataFrame from pandas DataFrames inside an RDD

I'm trying to convert a pandas dataframe on each worker node into a spark dataframe across all worker nodes. Example: def read_file_and_process_with_pandas(filename): data = pd.read(filename)

How to count repeating values in an array using an RDD, DataFrame, or Dataset in Scala

I have to count the repeating values in an array: val arr = Array(1, 2, 2, 3, 4, 5, 5, 5). For example, how do I count the number of 5s in the array using an RDD, a DataFrame, or a Dataset?
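For reference, the counting logic itself is small; here is a minimal pure-Python sketch of it, with the RDD and DataFrame equivalents noted as comments (these assume a live SparkContext/SparkSession, which is not shown here):

```python
from collections import Counter

arr = [1, 2, 2, 3, 4, 5, 5, 5]

# Count every distinct value at once on the driver side.
counts = Counter(arr)
print(counts[5])  # 3

# RDD equivalent (assuming `sc` is a SparkContext):
#   sc.parallelize(arr).filter(lambda x: x == 5).count()
# DataFrame equivalent (assuming `spark` is a SparkSession):
#   spark.createDataFrame([(x,) for x in arr], ["v"]).filter("v = 5").count()
```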

Full outer join in RDD scala spark

I have two file below: file1 0000003 杉山______ 26 F 0000005 崎村______ 50 F 0000007 梶川______ 42 F file2 0000005 82 79 16 21 80 0000001 46 39 8 5 21 0000004 58 71 20 10 6 0000009 60 89

Spark config: is there an advantage to running master and worker(s) on different VMs / machines?

I'm running a Spark cluster, with the standalone cluster manager, and I'm wondering whether I should be setting up a separate VM for the master. The alternative is to run the master on the same phys

How to get the most common element for each array in a list (PySpark)

I have a list of arrays [array(0,1,1), array(0,0,1), array(1,1,0)] for which I need to find the highest-frequency element for each array in the list: def finalml(listn): return Counter(listn).most_c
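For reference, the per-array logic can be sketched in pure Python; finalml here mirrors the helper named in the question (in PySpark this would typically be wrapped in a pyspark.sql.functions.udf before being applied to a column):

```python
from collections import Counter

def finalml(listn):
    # most_common(1) returns [(element, count)]; take just the element.
    return Counter(listn).most_common(1)[0][0]

arrays = [(0, 1, 1), (0, 0, 1), (1, 1, 0)]
print([finalml(a) for a in arrays])  # [1, 0, 1]
```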

Spark on Kubernetes: building Docker image fails

I'm trying out Spark on Kubernetes. I just downloaded Spark 2.4.3 on an EC2 instance in my VPC. I have set up my proxy in /etc/sysconfig/docker and am able to import and run Docker images from Docker Hu

Best practices for performing multiple window functions in Spark SQL

I need help tuning my code with multiple windows. When I use just one window, the execution finishes in just a few seconds, but when I add more windows, the code runs for hours. I tried to group fe

Spark Java Dataset filter condition not working

I am trying to filter the dataset on a filter condition which has multiple string checks Example "vic_cpc_qid = 'OCC_C_CSI' or vic_cpc_qid = 'OCW_A_RSI' or vic_cpc_qid = 'OCC_C_RSI' or vic_cpc_qi
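For reference, one common way to avoid long chains of equality ORs is a single membership test. Here is a pure-Python sketch using the column name and a few values from the excerpt (the `val` field is a made-up illustration); in Spark, Column.isin(...) plays the same role:

```python
# Rows are hypothetical; only vic_cpc_qid and a few values come from the question.
rows = [
    {"vic_cpc_qid": "OCC_C_CSI", "val": 1},
    {"vic_cpc_qid": "OTHER", "val": 2},
    {"vic_cpc_qid": "OCW_A_RSI", "val": 3},
]
wanted = {"OCC_C_CSI", "OCW_A_RSI", "OCC_C_RSI"}

# One membership test instead of many chained `== ... or == ...` checks.
kept = [r for r in rows if r["vic_cpc_qid"] in wanted]
print([r["val"] for r in kept])  # [1, 3]

# Spark Java/Scala equivalent (assuming a Dataset `ds`):
#   ds.filter(col("vic_cpc_qid").isin("OCC_C_CSI", "OCW_A_RSI", "OCC_C_RSI"))
```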

Error: Error while compiling statement: FAILED: SemanticException line 1:undefined:-1 Invalid function 'replace'

I am using HIVE I keep receiving the error message below whenever I run my code: Error while compiling statement: FAILED: SemanticException line 1:undefined:-1 Invalid function 'replace' Here is
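For reference: older Hive releases do not ship a replace() UDF, while regexp_replace() has long been available, so rewriting the call is one common workaround. A hedged SQL sketch with placeholder table and column names (escape any regex metacharacters in the search pattern, since the second argument is a regular expression):

```sql
-- Placeholder names; substitute your actual table and column.
SELECT regexp_replace(some_col, 'old', 'new') FROM some_table;
```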

Databricks: DataFrame groupBy agg, collect_set includes duplicate values

Suppose I have a dataset df like the following:

col1 col2
1    A
1    B
1    C
2    B
2    B
2    C

I want to group the dataset by col1 and make col2 into an array using the foll
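For reference, here is a pure-Python sketch of the two grouping semantics in play: a set-based aggregate (like Spark's collect_set) drops duplicates, while a list-based one (like collect_list) keeps them. The data mirrors the example in the question.

```python
rows = [(1, "A"), (1, "B"), (1, "C"), (2, "B"), (2, "B"), (2, "C")]

as_list, as_set = {}, {}
for col1, col2 in rows:
    as_list.setdefault(col1, []).append(col2)   # keeps duplicates, like collect_list
    as_set.setdefault(col1, set()).add(col2)    # drops duplicates, like collect_set

print(sorted(as_list[2]))  # ['B', 'B', 'C'] -- duplicates kept
print(sorted(as_set[2]))   # ['B', 'C']      -- duplicates removed
```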

How to expand horizontal data to vertical in a DataFrame? (Scala Spark)

I have a text file. Now I want to expand the horizontal data to vertical. Using the fields from the first field of the specified file to the field specified by num = <n> as the key, the horizontally arrang
