Dealing with huge data in API Platform

I have an entity called Companies that has a OneToMany relation with another entity called CrmItems. Entity/Company.php: /** * @ORM\OneToMany(targetEntity="App\Entity\CrmItems", mappedBy="company") *

How to join tMysqlInput and tMongoDBInput with a large amount of data

I'm dealing with a Talend Big Data job and can't find a solution. The problem: I'm querying some data from a MySQL DB via a tMysqlInput (about 500k rows) and I want to merge it with some MongoDB data. (The
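
Outside of Talend's own components, a common way to merge a 500k-row relational table with MongoDB documents is a hash join: load the smaller MongoDB side into a dictionary keyed on the join field and stream the MySQL rows past it. A minimal Python sketch of that pattern; the connection settings, the crm database, the items collection, and the company_id join key are all assumptions:

```python
import pymysql                    # pip install pymysql
from pymongo import MongoClient   # pip install pymongo

# Build an in-memory lookup from the MongoDB side (hypothetical db/collection/fields).
mongo = MongoClient("mongodb://localhost:27017")
docs_by_id = {d["company_id"]: d
              for d in mongo.crm.items.find({}, {"company_id": 1, "notes": 1})}

# Stream the MySQL side with a server-side cursor so 500k rows are never buffered at once.
conn = pymysql.connect(host="localhost", user="user", password="pw", database="crm")
with conn.cursor(pymysql.cursors.SSCursor) as cur:
    cur.execute("SELECT id, name FROM companies")
    for company_id, name in cur:
        doc = docs_by_id.get(company_id)   # O(1) hash lookup per row
        if doc is not None:
            print(company_id, name, doc.get("notes"))
```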

Hive - year, month and day from timestamp column

I am trying to extract the year, month and day parts of a timestamp column in Hive. At present the output looks like 2016-05-20 01:08:48. I want it to output only the 2016-05-20 part. I have tried u
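
For reference, Hive's built-in to_date() strips the time portion, and year()/month()/day() pull out the individual parts. The same functions exist in Spark SQL, so one way to try them quickly from Python (the sample value comes from the question; the column name ts is made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ts-demo").getOrCreate()

df = (spark.createDataFrame([("2016-05-20 01:08:48",)], ["ts"])
           .withColumn("ts", F.col("ts").cast("timestamp")))

df.select(
    F.to_date("ts").alias("day_part"),   # 2016-05-20
    F.year("ts").alias("year"),          # 2016
    F.month("ts").alias("month"),        # 5
    F.dayofmonth("ts").alias("day"),     # 20
).show()
```

In Hive itself the equivalent would be SELECT to_date(ts), year(ts), month(ts), day(ts) FROM some_table.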

How to handle massive delimited text files with Node.js

We're working with an API-based data provider that allows us to analyze large sets of GIS data in relation to provided GeoJSON areas and specified timestamps. When the data is aggregated by our prov
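
The usual answer, in Node or anywhere else, is to stream the file record by record instead of reading it whole, so memory use stays flat regardless of file size. A Python sketch of the same pattern; the file name, tab delimiter, and column names are invented for illustration:

```python
import csv

def aggregate(path):
    """Stream a huge delimited file row by row; only one row is in memory at a time."""
    totals = {}
    with open(path, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")   # header row assumed
        for row in reader:
            area = row["area_id"]                    # hypothetical columns
            totals[area] = totals.get(area, 0) + float(row["value"])
    return totals

print(aggregate("observations.tsv"))
```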

Why is Spark unable to read a large JSON text file despite sufficient memory for both execution and cache?

I have configured a standalone cluster (a node with 32 GB & 32 cores) with 2 workers of 16 cores & 10 GB memory each. The size of the JSON file is only 6 GB. I have tried tweaking different configuration
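
One frequent culprit with a single large JSON file is that it is one multi-line JSON document, which Spark cannot split across tasks, so a single executor ends up holding far more than its 10 GB share. If the file can be converted to JSON Lines (one object per line), spark.read.json parses it in parallel. A hedged pyspark sketch; the paths and memory figures here are illustrative, not a known fix for this exact cluster:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("big-json")
         .config("spark.executor.memory", "10g")   # matches one 10 GB worker
         .config("spark.executor.cores", "16")
         .getOrCreate())

# JSON Lines (one object per line) is splittable and reads in parallel:
df = spark.read.json("data/records.jsonl")

# A single multi-line JSON document forces Spark to ingest it in one piece:
# df = spark.read.option("multiLine", True).json("data/one_big_doc.json")

print(df.count())
```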

Big data on HDInsight using MapReduce

Determine, using HDInsight, the contributory words (in the review) for each rating level and sentiment category. There are three columns: review (text), rating (1 to 5), and sentiment category (e.g. very po
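
One way to frame this is a word count keyed by (rating, sentiment, word), which fits Hadoop Streaming on HDInsight. A rough Python mapper/reducer pair; the tab-separated input layout with review, rating, sentiment columns is an assumption:

```python
#!/usr/bin/env python
# mapper.py -- emit "rating|sentiment|word <TAB> 1" for every word in a review
import sys

for line in sys.stdin:
    try:
        review, rating, sentiment = line.rstrip("\n").split("\t")
    except ValueError:
        continue                      # skip malformed rows
    for word in review.lower().split():
        print(f"{rating}|{sentiment}|{word}\t1")
```

Hadoop Streaming sorts the mapper output by key before the reducer runs, so the reducer only has to sum consecutive counts:

```python
#!/usr/bin/env python
# reducer.py -- sum counts per (rating, sentiment, word); input arrives sorted by key
import sys

current, total = None, 0
for line in sys.stdin:
    key, count = line.rsplit("\t", 1)
    if key != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = key, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")
```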

How can you store and modify large datasets in Node.js?

Basics: So basically I have written a program which generates test data for MongoDB in Node. The problem: For that, the program reads a schema file and generates a specified amount of test data out
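
Whatever the language, the trick for large generated datasets is to never hold them all in memory: generate documents lazily and flush them to the database in fixed-size batches. A Python sketch of that shape; the document fields, batch size, and testdb.testdata collection are invented:

```python
import random
import string
from pymongo import MongoClient   # pip install pymongo

def fake_docs(n):
    """Lazily generate n test documents; nothing accumulates in memory."""
    for i in range(n):
        yield {
            "seq": i,
            "name": "".join(random.choices(string.ascii_lowercase, k=8)),
            "score": random.randint(1, 100),
        }

coll = MongoClient("mongodb://localhost:27017").testdb.testdata
batch = []
for doc in fake_docs(1_000_000):
    batch.append(doc)
    if len(batch) == 10_000:      # flush in fixed-size batches
        coll.insert_many(batch)
        batch.clear()
if batch:
    coll.insert_many(batch)       # flush the remainder
```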

How big does data need to be to be worth executing on Apache Spark?

What should the size of the data be while working with Apache Spark? Is it useful to execute Python code on a Spark cluster with MBs of data? Will it decrease the execution time of the code on Spark as compared to local execution?
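
As a rough rule, Spark only pays off once the data no longer fits comfortably on one machine; for a few MB, the job-scheduling and serialization overhead usually makes it slower than plain local code. An admittedly unscientific way to see this for yourself (the file path is a placeholder, and exact timings will vary):

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("overhead").getOrCreate()

lines = open("small.txt").read().splitlines()      # a few MB of text

t0 = time.time()
local_count = sum(len(l.split()) for l in lines)   # plain Python word count
t1 = time.time()
spark_count = (spark.sparkContext.parallelize(lines)
               .map(lambda l: len(l.split()))
               .sum())                             # same work, plus Spark overhead
t2 = time.time()

print(local_count, t1 - t0, "s local")
print(spark_count, t2 - t1, "s Spark")
```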

jps results and hdfs dfsadmin report are different

Here are my jps results on the master node: NameNode, SecondaryNameNode. And the slave node jps output: DataNode. If I look at the jps results, it seems everything is good, but when I run hdfs dfsad

Sort key-value pairs first by value and then by key (similar to radix sort). I need to maintain the key-value relationship

I'm trying to remove duplicates from key-value pairs, and sorting the data first seems like the best way to do this. I have tuples (both values are Integers), so the code doesn't necessarily have to wo
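
For data that fits in memory, deduplication and the two-level ordering are one line each in Python. Because Python's sort is stable, sorting by key and then by value would also work radix-style, but a composite sort key is simpler; the sample pairs below are made up:

```python
pairs = [(3, 10), (1, 10), (3, 10), (2, 5)]

# Deduplicate via a set, then sort by value first and key second;
# each tuple keeps its key-value relationship intact.
result = sorted(set(pairs), key=lambda kv: (kv[1], kv[0]))

print(result)   # [(2, 5), (1, 10), (3, 10)]
```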

How to wrangle unstructured log data streamed from Twitter through Flume?

I have extracted log data from Twitter through Apache Flume. The data obtained is in a file named like FLUMEDATA.12334555678. The data in the file looks as follows: {"type":"record","name"
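
If the {"type":"record",... prefix is an embedded Avro schema, the file is Avro-serialized and a library such as fastavro reads it directly. If the sink wrote plain JSON, one record per line, the wrangling can start with a line-by-line parse that skips anything malformed. A Python sketch of the JSON-lines case; the user/text fields extracted are guesses at typical tweet attributes:

```python
import json

def parse_flume_file(path):
    """Yield (screen_name, text) from a Flume sink file with one JSON record per line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue                  # Flume event headers / partial lines
            yield record.get("user", {}).get("screen_name"), record.get("text")

for user, text in parse_flume_file("FLUMEDATA.12334555678"):
    print(user, text)
```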

Hadoop cluster configuration recommendation for MapReduce and YARN on 2 GB nodes

I'm using Hadoop 2.6 on a multinode cluster. I have 3 nodes (a master and 2 slaves), all of them with 2 GB of memory. I have some trouble with MapReduce (jobs stuck) because of the resource configuration!
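
With only 2 GB per node, the defaults (yarn.nodemanager.resource.memory-mb = 8192 with 1 GB minimum containers) overcommit the machine, and jobs hang waiting for containers that can never be granted. A back-of-the-envelope sizing helper in Python, mirroring the arithmetic usually done by hand; the OS reservation and the four-containers-per-node target are assumptions to tune, not official figures:

```python
def yarn_settings(node_ram_mb, reserved_for_os_mb=512, min_container_mb=256):
    """Rough YARN/MapReduce memory sizing for a small node."""
    usable = node_ram_mb - reserved_for_os_mb
    container = max(min_container_mb, usable // 4)   # aim for ~4 containers per node
    return {
        "yarn.nodemanager.resource.memory-mb": usable,
        "yarn.scheduler.minimum-allocation-mb": container,
        "mapreduce.map.memory.mb": container,
        "mapreduce.reduce.memory.mb": container * 2,
        "mapreduce.map.java.opts": f"-Xmx{int(container * 0.8)}m",  # heap below container limit
    }

for key, value in yarn_settings(2048).items():
    print(f"{key} = {value}")
```

The printed values would then go into yarn-site.xml and mapred-site.xml on every node.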
