Big Data Reduction Technique using Parallel Hierarchical Agglomerative Clustering

Moertini, Veronica Sri; Suarjana, Gde W.; Venica, Liptia; Karya, Gede

dc.contributor.author	Moertini, Veronica Sri
dc.contributor.author	Suarjana, Gde W.
dc.contributor.author	Venica, Liptia
dc.contributor.author	Karya, Gede
dc.date.accessioned	2018-05-08T08:37:15Z
dc.date.available	2018-05-08T08:37:15Z
dc.date.issued	2018
dc.identifier.issn	1819-9224 (online version)
dc.identifier.other	artsc285
dc.identifier.uri	http://hdl.handle.net/123456789/5917
dc.description	IAENG INTERNATIONAL JOURNAL OF COMPUTER SCIENCE; Vol.45 No.1, 2018	en_US
dc.description.abstract	Volume and velocity are two characteristics of big data. Big data arrives at such high velocity that its volume grows quickly, and efforts are needed to cope with both. This paper presents a big data reduction technique that can be used to reduce incoming big data periodically. The results, patterns that represent the original data at a smaller size, can be kept for further analysis, while the voluminous raw data can be discarded. Clustering is one technique that can be used for reducing data. Based on our study, we find that agglomerative clustering is suitable for reducing big data with a low to medium number of attributes. Our proposed technique is based on Hadoop MapReduce, a computing framework for distributed systems in which Map and Reduce functions run in parallel on machine nodes. In brief, Map preprocesses the big data and randomly divides it into disjoint partitions; Reduce constructs cluster trees (dendrograms) from the partitions and computes patterns from the clusters formed from the trees. The output is a collection of patterns with far fewer objects and attributes than the input. To provide flexibility, we design a few input parameters to be set by users. The effects of those parameters are shown by our experimental results. From experiments on big data in a Hadoop cluster of up to 15 commodity computers, we conclude that the Hadoop file system block size and the number of nodes affect the execution time and the size of incoming big data that can be processed.	en_US
dc.description.uri	http://www.iaeng.org/IJCS/index.html
dc.language.iso	en	en_US
dc.publisher	International Association of Engineers - Hong Kong	en_US
dc.relation.ispartofseries	IAENG INTERNATIONAL JOURNAL OF COMPUTER SCIENCE;Vol.45 No.1, 2018
dc.subject	CLUSTER PATTERN	en_US
dc.subject	PARALLEL CLUSTERING	en_US
dc.subject	BIG DATA REDUCTION	en_US
dc.subject	MAPREDUCE	en_US
dc.title	Big Data Reduction Technique using Parallel Hierarchical Agglomerative Clustering	en_US
dc.type	Journal Articles	en_US
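
The Map and Reduce roles described in the abstract can be illustrated with a minimal single-process sketch. This is not the authors' Hadoop implementation: it simulates the two phases in plain Python, uses a naive centroid-linkage merge in place of whatever linkage the paper adopts, and summarizes each cluster as a count plus centroid, which is an assumed pattern format. The function names and the max_dist and n_partitions parameters are hypothetical.

```python
import random
from math import dist


def map_partition(records, n_partitions, seed=0):
    """Map phase (simulated): send each record to a random partition."""
    rng = random.Random(seed)
    partitions = {key: [] for key in range(n_partitions)}
    for record in records:
        partitions[rng.randrange(n_partitions)].append(record)
    return partitions  # partitions are disjoint by construction


def centroid(points):
    """Attribute-wise mean of a non-empty list of numeric tuples."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def agglomerate(points, max_dist):
    """Naive agglomerative clustering (assumed centroid linkage).

    Starts with one cluster per point and repeatedly merges the two
    closest clusters until the closest pair is farther than max_dist.
    """
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_dist:
            break  # cut point reached; stop merging
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def reduce_patterns(points, max_dist):
    """Reduce phase (simulated): cluster one partition, emit its patterns."""
    return [
        {"count": len(cluster), "centroid": centroid(cluster)}
        for cluster in agglomerate(points, max_dist)
    ]


def reduce_big_data(records, n_partitions, max_dist, seed=0):
    """Run the simulated Map phase, then Reduce over every partition."""
    patterns = []
    for part in map_partition(records, n_partitions, seed).values():
        patterns.extend(reduce_patterns(part, max_dist))
    return patterns
```

Calling reduce_big_data(records, n_partitions=2, max_dist=3.0) on a list of numeric tuples returns one small dictionary per cluster, so the stored output grows with the number of clusters rather than with the number of input records. In a real MapReduce job each partition would be clustered by a separate Reduce task on its own node.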