Data Science and Business Analytics Articles

Using k-means Machine Learning Algorithm with Apache Spark and R

In this post, I will demonstrate the usage of the k-means clustering algorithm in R and in Apache Spark. Apache Spark (hereinafter Spark) offers two implementations of the k-means algorithm: one is packaged with its MLlib library; the other one lives in Spark’s spark.ml package. While both implementations are currently more or less functionally equivalent, the Spark ML team recommends the spark.ml package, pointing to its support for pipeline processing (inspired by the scikit-learn machine learning library in Python) and its versatile DataFrame data structure (probably inspired by R’s data.frame, a matrix-like structure similar to tables in relational databases, also wildly popular in Python’s pandas library).

k-means clustering is an example of an unsupervised ML algorithm: you are only required to give the computer a hint as to how many clusters (classes of objects) you expect to be present in your data set. The algorithm then uses your data as the training data set to build a model, trying to figure out the boundaries of those clusters. You can then proceed to the classification phase with your test data.

With k-means, you essentially have your computer (or a cluster of computers) partition your data into Voronoi cells, where the cells represent the identified clusters.
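To give a taste of the spark.ml flavor, here is a minimal Scala sketch; the input file, the column names x and y, and the choice of k = 3 are all illustrative assumptions, not part of the post's data set.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object KMeansSketch extends App {
  val spark = SparkSession.builder.appName("KMeansSketch").getOrCreate()

  // Hypothetical input: a headered CSV file with numeric columns x and y
  val raw = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("data/points.csv")

  // spark.ml estimators expect the features packed into a single vector column
  val points = new VectorAssembler()
    .setInputCols(Array("x", "y"))
    .setOutputCol("features")
    .transform(raw)

  // The "hint": we expect 3 clusters in the data
  val model = new KMeans().setK(3).setSeed(42L).fit(points)

  model.clusterCenters.foreach(println)   // centers of the identified clusters
  model.transform(points).show(5)         // each row gets a "prediction" column
}
```

Note that the classification phase is just the transform() call: the fitted model assigns each point to its nearest cluster center.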


Spark RDD Performance Improvement Techniques (Post 2 of 2)

In this post, we will review the most important aspects of RDD checkpointing. We will continue working with the over500 RDD we created in the previous post on caching.

You will remember that checkpointing is the process of truncating an RDD’s lineage graph and saving its materialized version to a persistent store.
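As a quick refresher, here is a minimal sketch from the Spark shell, where sc is already provided; the checkpoint directory is an assumed path, and over500 is the RDD from the previous post:

```scala
// Checkpointing needs a reliable location, typically on HDFS
sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // assumed path

// Cache first, so the RDD is not recomputed just to be checkpointed
over500.cache()
over500.checkpoint()   // only marks the RDD for checkpointing

// The actual checkpointing happens on the next action; after that,
// the lineage graph is truncated and reads come from the saved copy
over500.count()
println(over500.toDebugString)   // the lineage now shows a CheckpointRDD
```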


Spark RDD Performance Improvement Techniques (Post 1 of 2)

Spark offers developers two simple and quite efficient techniques to improve the performance of RDDs and of operations against them: caching and checkpointing.

Caching allows you to save a materialized RDD in memory, which greatly improves iterative or multi-pass operations that need to traverse the same data set over and over again (e.g., in machine learning algorithms).
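For example, here is a minimal sketch from the Spark shell, using a made-up data set of movie rating counts (the file and field layout are assumptions; over500 is the RDD we will keep working with in the next post):

```scala
// Hypothetical input: lines of the form "movieId,ratingCount"
val counts = sc.textFile("data/rating_counts.csv")
  .map(_.split(","))
  .map(fields => (fields(0), fields(1).toInt))

// Keep only the movies rated more than 500 times
val over500 = counts.filter { case (_, n) => n > 500 }

over500.cache()                    // equivalent to persist(StorageLevel.MEMORY_ONLY)
over500.count()                    // the first action materializes and caches the RDD
over500.take(5).foreach(println)   // later passes are served from memory
```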



Apache Spark class development complete

Last week I completed development of our two-day Apache Spark class, which will be integrated into our Big Data and Data Science classes after the QA cycle.
I will be posting fragments of the material, along with additional comments and notes, to give you a taste of what the new content is all about and to help you see whether it can be of use in your work.
Stay tuned!


SparkR on CDH and HDP

Spark added support for R back in version 1.4.1, and you can use it in Spark Standalone mode.

Big Hadoop distros that bundle Spark, like Cloudera’s CDH and Hortonworks’ HDP, have varying degrees of support for R. For the time being, Cloudera has opted out of supporting R (their latest CDH 5.8.x version does not even have the sparkR binaries), while HDP (versions 2.3.2, 2.4, …) includes SparkR as a technical preview and bundles some R-related components, like the sparkR script. Making it all work (if that is at all possible at present) is another story, and making it run on YARN may be a whole novel the size of War and Peace. So you can view this more as a demonstration of Hortonworks’ commitment to Spark, and we are left with the original supported language triad: Scala, Python, and Java.


Simple Algorithms for Effective Data Processing in Java

Big Data processing requires specific tools, which nowadays are, in many cases, drawn from the Hadoop product ecosystem.

When I speak to people who work with Hadoop, they say that their deployments are usually pretty modest: about 20 machines, give or take. This is probably explained by the fact that most companies are still in the technology adoption phase, evaluating this Big Data platform; with time, the number of machines in their Hadoop clusters will probably grow into the 3- or even 4-digit range.

Development on Hadoop is becoming more agile, with shorter execution cycles; Apache Tez, Cloudera’s Impala, and Databricks’ Spark are some of the technologies that help along the way.



Spark SQL

In this post, I will show you how to configure and run Spark SQL on the Cloudera Distribution of Hadoop (CDH). I used the QuickStart VM version 5.4, running Spark SQL version 1.3 from inside the Spark shell (the Scala REPL).
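As a teaser, this is the general shape of such a session in the Spark 1.3-era shell, where sc is already created; the JSON file and its name and age fields are assumptions for illustration:

```scala
// Wrap the shell's SparkContext in a SQLContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Hypothetical input: one JSON object per line, e.g. {"name": "Ada", "age": 36}
val people = sqlContext.jsonFile("/tmp/people.json")   // Spark 1.3-style API

people.registerTempTable("people")
val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()
```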


Using the k-Nearest Neighbors Algorithm in R

k-Nearest Neighbors is a supervised machine learning algorithm for object classification that is widely used in data science and business analytics.

In this post, I will show how to use R’s knn() function, which implements the k-Nearest Neighbors (kNN) algorithm, in a simple scenario that you can extend to cover your more complex and practical scenarios. R is free, and kNN has not been patented by some evil patent trolls (“patent assertion entities”), so there are no legal or other restrictions on us going ahead with the demonstration.
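The post itself uses R’s knn() from the class package; just to show how little machinery the algorithm needs, here is a toy version sketched in Scala with made-up points, using Euclidean distance and a simple majority vote (ties are resolved arbitrarily):

```scala
object KnnSketch extends App {
  type Point = Array[Double]

  def distance(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Classify a query point by majority vote among its k nearest neighbors
  def classify(train: Seq[(Point, String)], query: Point, k: Int): String =
    train
      .sortBy { case (p, _) => distance(p, query) }
      .take(k)
      .groupBy { case (_, label) => label }
      .maxBy { case (_, neighbors) => neighbors.size }
      ._1

  // Made-up training data: two well-separated classes
  val train = Seq(
    (Array(1.0, 1.0), "red"),  (Array(1.2, 0.8), "red"),  (Array(0.9, 1.1), "red"),
    (Array(5.0, 5.0), "blue"), (Array(5.1, 4.9), "blue"), (Array(4.8, 5.2), "blue")
  )

  println(classify(train, Array(1.1, 1.0), k = 3))   // prints "red"
  println(classify(train, Array(5.0, 5.1), k = 3))   // prints "blue"
}
```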
