Abstract:
Volume and velocity are two defining characteristics of big data: data arrives at high velocity, so its volume grows quickly, and techniques are needed to manage this growth. This paper presents a big data reduction technique that periodically reduces incoming big data. The results, patterns that represent the original data at a much smaller size, can be retained for further analysis, while the voluminous raw data can be discarded. Clustering is one technique for reducing data; based on our study, we find that agglomerative clustering is suitable for reducing big data with a low to medium number of attributes. Our proposed technique is based on Hadoop MapReduce, a computing framework for distributed systems in which Map and Reduce functions run in parallel on machine nodes. In brief, our technique works as follows: Map preprocesses the big data and randomly divides it into disjoint partitions; Reduce constructs a cluster tree (dendrogram) from each partition and computes patterns from the clusters formed from the tree. The output is a collection of patterns with far fewer objects and attributes than the original data. For flexibility, we design a few user-set input parameters, and our experimental results show their effects. From experiments on big data in a Hadoop cluster of up to 15 commodity computers, we conclude that the Hadoop file system block size and the number of nodes affect both the execution time and the size of the incoming big data that can be processed.
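To make the partition-then-cluster pipeline concrete, the following is a minimal single-machine sketch of the Map and Reduce roles described above, not the authors' Hadoop implementation. The parameter names (NUM_PARTITIONS, CLUSTERS_PER_PARTITION) are hypothetical stand-ins for the user-set parameters, and agglomerative clustering is done here with SciPy's hierarchical clustering; this sketch reduces the number of objects only.

```python
# Illustrative sketch of the Map/Reduce roles (assumed single-machine analogue).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

NUM_PARTITIONS = 4           # hypothetical user parameter
CLUSTERS_PER_PARTITION = 10  # hypothetical user parameter

def map_phase(data, rng):
    """Preprocess and randomly divide rows into disjoint partitions."""
    idx = rng.permutation(len(data))
    return np.array_split(data[idx], NUM_PARTITIONS)

def reduce_phase(partition):
    """Build a dendrogram for one partition, cut it into flat clusters,
    and emit one pattern (here, the centroid) per cluster."""
    tree = linkage(partition, method="average")  # agglomerative clustering
    labels = fcluster(tree, t=CLUSTERS_PER_PARTITION, criterion="maxclust")
    return np.array([partition[labels == c].mean(axis=0)
                     for c in np.unique(labels)])

rng = np.random.default_rng(0)
data = rng.random((10_000, 8))  # stand-in for incoming big data
patterns = np.vstack([reduce_phase(p) for p in map_phase(data, rng)])
print(patterns.shape)  # far fewer objects than the original 10,000
```

Under these assumptions, 10,000 incoming objects are reduced to 40 patterns (4 partitions × 10 clusters), which can be kept for further analysis while the raw data is discarded.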