Using the k-Nearest Neighbors Algorithm in R


k-Nearest Neighbors is a supervised machine learning algorithm for object classification that is widely used in data science and business analytics.

In this post, I will show how to use R’s knn() function which implements the k-Nearest Neighbors (kNN) algorithm in a simple scenario which you can extend to cover your more complex and practical scenarios. R is free and kNN has not been patented by some evil patent trolls (“patent assertion entities”), so there is no legal or other restrictions for us to go ahead with the demonstration.

kNN has many useful characteristics, one of which being its insensitivity to outliers that makes it resilient to any errors in the classification data (the supervised learning phase). As a downside, the algorithm is noted for its CPU and memory greediness. The R interpreter may also fail to take advantage of the multi-core CPU capabilities of modern computers. If and when such constrains hamper your data analytics work, the kNN algorithm can be implemented in a language that supports the creation of multi-threaded applications (e.g. Java or C#) that would evenly load all the CPU cores of your computer and improve the computational throughput of the application.

To simplify the demonstration of kNN and make it easy to follow, we will have only two classes used in object classification, which we label A and B.

Objects in classes A and B have two numeric attributes/properties that we map to X and Y Cartesian coordinates so that we can plot class instances (cases) as points on a 2-D chart. In other words, our cases are represented as points with X and Y coordinates (p(X,Y)).

Our simple classes A and B will have 3 object instances (cases) each.

Class A will include points with coordinates (0,0), (1,1), and (2,2).
Class B will include points with coordinates (6,6), (5.5, 7), and (6.5, 5).

In R, we can write down the above arrangement as follows:

# Class A training object instances (cases)

A1=c(0,0)
A2=c(1,1)
A3=c(2,2)

# Class B training objects instances (cases)

B1=c(6,6)
B2=c(5.5,7)
B3=c(6.5,5)

Here is how the classification training objects for class A and class B are arranged on the chart.

Classification training data set

Fig. 1 Classification Training Data Set

The knn() function is housed in the class package and is invoked as follows:

knn(train, test, cl, k)

where

train is a matrix or a data frame of training (classification) cases
test is a matrix or a data frame of test case(s) (one or more rows)
cl is a vector of classification labels (with the number of elements matching the number of classes in the training data set)
k is an integer value of closest cases (the k in the k-Nearest Neighbor Algorithm); normally, it is a small odd integer number

For full description of the knn() function, read its help page that you can see by typing ?knn in the R console.

The points (cases) from both classification classes A and B must be packed in the same matrix used in the classification exercise.

In R you do this with a one-liner:

train=rbind(A1,A2,A3, B1,B2,B3)

This command will build the following matrix:

A1: 0.0 0
A2: 1.0 1
A3: 2.0 2
B1: 6.0 6
B2: 5.5 7
B3: 6.5 5

Now, when we try out classification of a test object (with properties expressed as X and Y coordinates), the kNN algorithm will use the Euclidean distance metric calculated for every row (case) in the training matrix to find the closest one for k=1 and the majority of closest ones for k > 1 (where k should be an odd number).

We also need to construct the cl parameter (the vector of classification labels). We can do this with this command:

cl=factor(c(rep("A",3),rep("B",3)))

This command uses the factor() function to create a vector of factors (discrete, enumerated values) that are used as class literals. In our setup, we have two factors (a.k.a. levels): A and B.

The rep() function replicates the first parameter (a printable symbol) the number of times passed on to it as the second parameter.

The resulting factor vector has the following content: A A A B B B (3 As followed by 3 Bs to match the layout of our train cases – we have 3 cases of A followed by 3 cases of B).

To run the knn() function, we need to supply the test case(s). In our runs, we will start with a single test object that we create as follows:

test=c(4,4)

which corresponds to a point sitting approximately in the middle of the distance between A and B.

At this point, we have everything in place to run knn(). Let’s do it for k = 1 (classification by its proximity to a single neighbor).

For more informative reports of test object classification results, we are going to use the summary() function that can polymorphically act on its input parameter depending on its type. In our case, the input parameter to the summary() function is the output of the knn() function.

That’s how the nested call looks like:

summary(knn(train, test, cl, k = 1))

Here is the complete code listing that you can copy and paste into the R console.


# Class A cases 
A1=c(0,0)
A2=c(1,1)
A3=c(2,2)

# Class B cases
B1=c(6,6)
B2=c(5.5,7)
B3=c(6.5,5)

# Build the classification matrix
train=rbind(A1,A2,A3, B1,B2,B3)

# Class labels vector (attached to each class instance) 
cl=factor(c(rep("A",3),rep("B",3)))

# The object to be classified
test=c(4, 4)

# Load the class package that holds the knn() function 
library(class)

# call knn() and get its summary
summary(knn(train, test, cl, k = 1))

# End of listing 

After pasting the above code in the R console, press ENTER to submit it to the R interpreter for execution.

You should see the following output in your console:

A B
0 1

This result indicates that the test case has been classified as belonging to class B.

Type in the following command in console:

test=c(3.5, 3.5)

Visually, this test case point looks to be closer to the cluster of the A class cases (points). Let’s verify our assumption.

Repeat the same knn summary command as we did a moment ago:

summary(knn(train, test, cl, k = 1))

The result comes back as we expected:

A B
1 0

The point has been classified as belonging to class A.

Let’s increase the number of closest neighbors that are involved in voting during the classification step.

Type in the following command:

summary(knn(train, test, cl, k = 3))

Now, the positions of class B points make them closer as a whole (according to the Euclidean distance metric) to the test point, so the (3.5, 3.5) case is classified as belonging to class B this time.

A B
0 1

If you wish, you can further experiment with the k number (this is one of the exercises data scientists perform trying out better fitting classification models).

Now, let’s build a matrix of test cases containing four points:

(4,4) - should be classified as B
(3,3) - should be classified as A
(5,6) - should be classified as B
(7,7) - should be classified as B

As a result, we should be expecting our test batch to have 3 cases (points) classified as belonging to the B class and one A case.

Type in the following code:

test = matrix (c(4,4,3,3,5,6,7,7), ncol=2, byrow=TRUE)

This command will help us map our points into a two-column matrix containing the X and Y coordinates of our test points.

Now run again the previous knn summary command:

summary(knn(train, test, cl, k = 3))

R should print the following summary of its classification job:

A B
1 3

Which supports our assumptions.

I hope, now you are well equipped to start applying R’s knn() function in your problem domain.

  1. #1 by Kavita on February 20, 2014 - 1:01 am

    Hi,
    This is really helpful.
    But I would like to know how to export the final output (if test data is of bigger size) in csv/txt/excel file

  2. #2 by Abhishek on June 12, 2014 - 1:47 am

    Use write.csv(“filename.csv”); to write that data

  3. #3 by Kushal Bansal on September 30, 2014 - 6:05 pm

    Thanks. really helpful.

  4. #4 by Daumantas on November 5, 2014 - 4:42 am

    Thanks, that was helpful!

  5. #5 by hossam on December 12, 2014 - 12:17 pm

    thanks so much… I needed some simple example…

  6. #6 by Sina on January 19, 2015 - 3:00 pm

    Hi
    I have a little problem, i have a 7 dimension column which the label of rows attached to it. How should I define the factor now?

  7. #7 by Tom on February 21, 2015 - 7:00 pm

    I need help with information Retrieval question
    Using the K-Nearest-Neighbor approach for document categorization with K = 3 (see notes determine how Doc8 given below will be classified. Show your steps. Note: Use non-normalized tf*idf weights and cosine similarity when computing similarities.
    Doc8 T1 T2 T3 T4 T5 T6 T7 T8
    3 1 0 4 1 0 2 1

  8. #8 by segwat on April 3, 2015 - 12:53 am

    what about using mahalanobis instead of euclidean? I havent found any good R solutions for using mahalanobis instead. knnMCN supposedly

  9. #9 by Kartik on April 23, 2015 - 5:14 am

    Hey, Thansk a ton, really nice documentation.

    Is there a way to get the list of (say top 3) closest neighbors using knn.

  10. #10 by Andrew on April 27, 2015 - 1:02 pm

    very helpful! most other examples just use a few lines with iris and do not go into great detail

  11. #11 by Hitalo on May 15, 2015 - 8:03 pm

    Thanks, very helpful.

  12. #12 by Steve on July 7, 2015 - 10:28 am

    Thanks! Very clear and useful for work.

  13. #13 by James on July 19, 2015 - 3:15 am

    Very helpful. How do we go around dealing with categorical predictors where getting the euclidean distance is not possible.?

  14. #14 by James on July 19, 2015 - 3:19 am

    I have just seen the knncat package. I haope it helps

  15. #15 by jazz on October 25, 2015 - 12:56 pm

    this is great!! do you have any suggestions on how you would use this in a production environment where you had 300 to 400 cases to test each day?

(will not be published)

*