Learn how to do real estate value prediction using XGBoost in this two-part series by Ahmed Sherif, a data scientist who develops machine learning and deep learning solutions on the cloud using Azure, and Amrith Ravindra, a machine learning enthusiast with a passion for data science.

Real Estate Value Prediction Using XGBoost

The real estate market is one of the most competitive markets when it comes to pricing. Prices vary significantly based on a number of factors such as location, age of the property, and size. Accurately predicting property prices has therefore become a modern-day challenge, and this article deals with precisely that.

Downloading the King County House sales dataset

To build a model, you can use the data from Kaggle (https://www.kaggle.com/harlfoxem/housesalesprediction), a platform for predictive modeling and analytics competitions. The King County House Sales dataset contains records of 21,613 houses sold in King County, Washington between May 2014 and May 2015.

The dataset is freely available to download and use and contains 21 different variables such as location, zip code, number of bedrooms, and area of the living space for each house. Once you’re on the website, click on the Download button, as shown in the following screenshot:

King County House Sales Dataset

  1. A file named kc_house_data.csv can be found in the downloaded zip file, housesalesprediction.zip.
  2. Save this file in the current working directory; it will serve as your dataset, to be loaded into the IPython notebook for analysis and predictions.
  3. The libraries used as well as their functions in this article are as follows:
  • NumPy – used to wrangle data in the form of arrays, as well as to store lists of names as arrays
  • pandas – used for all data wrangling and for managing data in the form of dataframes
  • Seaborn – a visualization library required for exploratory analysis and plots
  • mpl_toolkits – contains a number of functions and dependencies required by Matplotlib
  • Functions from the scikit-learn library – the primary scientific and statistical library required for this article
  • You’ll also need some other libraries, such as XGBoost, but those will be imported as required while building the model

Now, import these libraries using the following code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import mpl_toolkits
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.feature_selection import RFE
from sklearn import linear_model
from sklearn.model_selection import train_test_split  # cross_validation was removed in newer scikit-learn
%matplotlib inline

 

  1. The preceding step should result in an output, as shown in the following screenshot:

  1. It is always a good idea to check the current working directory and set it to the directory in which the dataset is stored, as shown in the following screenshot:

  1. The data in the file is read into a Pandas dataframe named dataframe using the read_csv() function and the features/headers are listed out using the list(dataframe) command:
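The loading step can be sketched as follows. The miniature CSV below is a hypothetical stand-in for kc_house_data.csv (the real file has 21,613 rows and 21 columns); only the read_csv() and list(dataframe) calls mirror the actual workflow:

```python
import io
import os

import pandas as pd

# Hypothetical three-row stand-in for kc_house_data.csv; the column names
# match the Kaggle dataset, but the values are made up.
csv_text = """id,date,price,bedrooms,bathrooms,sqft_living
7129300520,20141013T000000,221900,3,1.0,1180
6414100192,20141209T000000,538000,3,2.25,2570
5631500400,20150225T000000,180000,2,1.0,770
"""

print(os.getcwd())  # confirm the working directory before loading the real file

# With the real file in place, you would call pd.read_csv('kc_house_data.csv')
dataframe = pd.read_csv(io.StringIO(csv_text))

print(list(dataframe))  # the features/headers
print(len(dataframe))   # number of records read
```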

As you may have noticed, the dataset contains 21 different variables such as id, date, price, bedrooms, and bathrooms.

Performing exploratory analysis and visualization

In situations where the goal is to predict a variable such as price, it helps to visualize the data and figure out how the dependent variable is influenced by other variables. Exploratory analysis yields insights that are not readily apparent from looking at the raw data.

  • The head of the dataframe can be printed using the dataframe.head() function which produces an output, as shown in the following screenshot:

  • Similarly, the tail of the dataframe can be printed using the dataframe.tail() function, which produces an output, as shown in the following screenshot:

  • The dataframe.describe() function is used to obtain some basic statistics such as the maximum, minimum, and mean values under each column as illustrated below:

dataframe.describe() function output

  • On taking a closer look at the statistics, you’ll realize that most houses sold have about three bedrooms on average. You can also see that the minimum number of bedrooms in a house is 0 and the largest house has 33 bedrooms and a living area of 13,540 square feet.
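These summary figures come straight out of describe(). A minimal sketch on a hypothetical bedrooms column (the values below are made up, chosen only to echo the 0-to-33 range quoted above):

```python
import pandas as pd

# Hypothetical bedroom counts; the real column has a minimum of 0,
# a maximum of 33, and a mean of about 3.
dataframe = pd.DataFrame({'bedrooms': [0, 2, 3, 3, 4, 33]})

stats = dataframe['bedrooms'].describe()
print(stats['min'], stats['max'], stats['mean'])
```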
  1. Now, plot the count of bedrooms in the whole dataset to see how three-bedroom houses compare to houses with one or two bedrooms, using the following code:
dataframe['bedrooms'].value_counts().plot(kind='bar')
plt.title('No. of bedrooms')
plt.xlabel('Bedrooms')
plt.ylabel('Count')
sns.despine()

 

The plot of bedroom counts should produce an output, as shown in the following screenshot:

It is evident that three-bedroom houses sell the most, followed by four-bedroom houses, then two-bedroom houses, and then, surprisingly, five-bedroom and six-bedroom houses.
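The ranking read off the bar plot can also be checked numerically, since value_counts() sorts by frequency. A sketch on hypothetical data chosen to echo that ranking:

```python
import pandas as pd

# Hypothetical bedroom counts mirroring the plot's ranking:
# three-bedroom houses most common, then four, then two, then five.
bedrooms = pd.Series([3, 3, 3, 3, 4, 4, 4, 2, 2, 5])

counts = bedrooms.value_counts()  # sorted most-frequent first
print(counts.index.tolist())      # bedroom counts, most to least common
```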

  1. You can also plot a pie chart of the same data using the following command:
dataframe['bedrooms'].value_counts().plot(kind='pie')
plt.title('No. of bedrooms')

 

The pie chart of the number of bedrooms gives an output that looks as follows:

You’ll notice that three-bedroom houses account for roughly 50% of all houses sold in King County. About 25% are four-bedroom houses, and the remaining 25% is made up of houses with two, five, or six bedrooms, and so on.
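The pie-chart shares can be computed directly by passing normalize=True to value_counts(), which returns fractions instead of raw counts. A sketch on hypothetical data built to match the rough 50%/25% split described above:

```python
import pandas as pd

# Hypothetical data: 8 of 16 houses have three bedrooms (50%),
# 4 of 16 have four bedrooms (25%), and the rest are mixed.
bedrooms = pd.Series([3] * 8 + [4] * 4 + [2] * 2 + [5, 6])

shares = bedrooms.value_counts(normalize=True)  # fractions, not counts
print(shares[3], shares[4])
```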

  1. Now, look at which numbers of floors are most common among houses sold in King County. This may be done by plotting a bar graph using the following commands:
dataframe['floors'].value_counts().plot(kind='bar')
plt.title('Number of floors')
plt.xlabel('No. of floors')
plt.ylabel('Count')
sns.despine()

 

Running the script produces the following plot of houses sold, categorized by the number of floors:

It is quite clear that single-floor houses sell the most, followed by two-story houses. The count of houses with more than two stories is rather low, which is perhaps an indication of the family sizes and incomes of residents living in King County.
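The most common floor count can be pulled out without a plot via idxmax() on the value counts. A sketch on hypothetical data echoing the single-floor dominance:

```python
import pandas as pd

# Hypothetical floor counts; in the dataset, single-floor houses dominate,
# followed by two-story houses.
floors = pd.Series([1.0, 1.0, 1.0, 2.0, 2.0, 1.5, 3.0])

most_common = floors.value_counts().idxmax()  # most frequent floor count
print(most_common)
```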

  1. You need to have an idea of which locations have the highest number of houses sold. You can obtain this using the latitude and longitude variables from the dataset:
plt.figure(figsize=(20,20))
sns.jointplot(x=dataframe.lat.values, y=dataframe.long.values, size=9)
plt.xlabel('Latitude', fontsize=10)
plt.ylabel('Longitude', fontsize=10)
plt.show()
sns.despine()

 

On inspecting the density of houses sold at different locations, you’ll obtain an output, as shown in the following screenshot. It is pretty clear that some locations see a higher density of house sales compared to others:

From the trends observed in the preceding figure, it is easy to notice that a great number of houses are sold between longitudes -122.2 and -122.4. Similarly, the density of houses sold between latitudes 47.5 and 47.8 is higher compared to other latitudes. This could perhaps be an indication of safer and better-living communities compared to the other communities.
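The density observation can be quantified by counting rows inside that latitude/longitude box with a boolean mask. A sketch on hypothetical coordinates clustered the same way:

```python
import pandas as pd

# Hypothetical coordinates; four of the five points fall inside the
# high-density box (latitudes 47.5-47.8, longitudes -122.4 to -122.2).
dataframe = pd.DataFrame({
    'lat':  [47.51, 47.62, 47.75, 47.30, 47.68],
    'long': [-122.25, -122.30, -122.38, -121.90, -122.35],
})

in_cluster = dataframe[
    dataframe['lat'].between(47.5, 47.8)
    & dataframe['long'].between(-122.4, -122.2)
]
print(len(in_cluster))  # houses inside the high-density box
```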

  1. Also, take a look at how the prices compare for houses with different numbers of bedrooms. A plot of the price of houses versus the number of bedrooms can be obtained using the following commands:
plt.figure(figsize=(20,20))
sns.jointplot(x=dataframe.bedrooms.values, y=dataframe.price.values, size=9)
plt.xlabel('Bedrooms', fontsize=10)
plt.ylabel('Price', fontsize=10)
plt.show()
sns.despine()

 

On plotting the prices of houses versus the number of bedrooms, you can see that price is roughly proportional to the number of bedrooms up to six bedrooms, after which it falls off, as shown in the following screenshot:
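The rise-and-fall trend can be checked numerically with a groupby mean. A sketch on hypothetical prices constructed to peak at six bedrooms:

```python
import pandas as pd

# Hypothetical bedroom counts and prices; mean price rises with bedroom
# count up to six bedrooms, then falls off.
dataframe = pd.DataFrame({
    'bedrooms': [2, 2, 3, 3, 4, 4, 6, 6, 8, 8],
    'price':    [250_000, 270_000, 350_000, 370_000, 500_000, 520_000,
                 800_000, 820_000, 600_000, 620_000],
})

mean_price = dataframe.groupby('bedrooms')['price'].mean()
print(mean_price.idxmax())  # bedroom count with the highest mean price
```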

  1. Similarly, see how the price compares to the living area of all the houses sold:
plt.figure(figsize=(8,8))
plt.scatter(dataframe.price, dataframe.sqft_living)
plt.xlabel('Price')
plt.ylabel('Square feet')
plt.show()

 

Plotting the living area of each house against the price gives an expected trend of increasing prices with the increasing size of the house. The most expensive house seems to have a living area of 12,000 square feet, as shown in the following screenshot:
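The strength of that relationship between living area and price can be summarized with a correlation coefficient. A sketch on hypothetical values chosen to show the near-linear trend visible in the scatter plot:

```python
import numpy as np

# Hypothetical living areas (sq ft) and prices, roughly linear by design.
sqft  = np.array([800, 1500, 2200, 3000, 4000])
price = np.array([200_000, 330_000, 480_000, 640_000, 850_000])

# Pearson correlation: close to 1 indicates a strong positive relationship.
r = np.corrcoef(sqft, price)[0, 1]
print(r)
```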

  1. The condition of the houses sold will give you some important information. Plot this against the prices to get a better idea of the general trends:
plt.figure(figsize=(5,5))
plt.bar(dataframe.condition, dataframe.price)
plt.xlabel('Condition')
plt.ylabel('Price')
plt.show()

 

On plotting the condition of houses versus price, you’ll notice an expected trend of increasing prices with higher condition ratings. Interestingly, the earlier bedroom plot also showed that five-bedroom houses have a lower mean price than four-bedroom houses, possibly because there are fewer buyers for such large houses:
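That the mean price rises with the condition rating can be verified with a groupby. A sketch on hypothetical ratings and prices built to follow the same trend:

```python
import pandas as pd

# Hypothetical condition ratings (1 = poor, 5 = excellent) and prices.
dataframe = pd.DataFrame({
    'condition': [1, 2, 2, 3, 3, 4, 4, 5, 5],
    'price':     [150_000, 200_000, 220_000, 300_000, 320_000,
                  420_000, 440_000, 550_000, 570_000],
})

mean_by_condition = dataframe.groupby('condition')['price'].mean()
print(mean_by_condition.is_monotonic_increasing)  # True if price rises with rating
```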

  1. To see how house prices vary across zip codes in King County, and which zip codes see the most sales, use the following commands:
plt.figure(figsize=(8,8))
plt.scatter(dataframe.zipcode, dataframe.price)
plt.xlabel('Zipcode')
plt.ylabel('Price')
plt.show()

 

A plot of the zip code of the house versus price shows trends in the prices of houses in different zip codes. You may have noticed that certain zip codes, like those between 98100 and 98125, have a higher density of houses sold compared to other areas, while the prices of houses in zip codes like 98040 are higher than the average price, perhaps indicating a richer neighborhood, as shown in the following screenshot:
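The standout zip code can likewise be identified by grouping on zipcode. A sketch on hypothetical data in which 98040 carries the highest mean price, as observed in the plot:

```python
import pandas as pd

# Hypothetical zip codes and prices; 98040 is priced well above the others.
dataframe = pd.DataFrame({
    'zipcode': [98040, 98040, 98103, 98103, 98118, 98118],
    'price':   [1_200_000, 1_100_000, 600_000, 650_000, 450_000, 500_000],
})

mean_by_zip = dataframe.groupby('zipcode')['price'].mean()
print(mean_by_zip.idxmax())  # zip code with the highest mean price
```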

  1. Finally, plot the grade of each house versus the price to figure out the trends in house sales based on the grade given to each house:
plt.figure(figsize=(10,10))
plt.scatter(dataframe.grade, dataframe.price)
plt.xlabel('Grade')
plt.ylabel('Price')
plt.show()

 

A plot of the grade of the house versus price shows a consistent increase in price with increasing grade. There seems to be a clear linear relationship between the two, as observed in the output of the following screenshot:
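A quick way to quantify that linear relationship is to fit a line and read off the slope. A sketch on hypothetical grades and prices constructed to lie on a line:

```python
import numpy as np

# Hypothetical grades and prices with an exactly linear relationship,
# echoing the trend seen in the plot.
grade = np.array([4, 5, 6, 7, 8, 9, 10])
price = np.array([200_000, 260_000, 320_000, 380_000, 440_000,
                  500_000, 560_000])

# Least-squares line fit: slope is the price increase per grade point.
slope, intercept = np.polyfit(grade, price, 1)
print(slope)
```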

This concludes part one of the series. In part two, you’ll learn how to plot the correlation between price and other features and predict the price of a house using XGBoost. If you found this article helpful, you can explore Apache Spark Deep Learning Cookbook to gain expertise in training and deploying efficient deep learning models on Apache Spark. With the help of this book, you’ll be able to work through specific recipes to generate outcomes for deep learning algorithms, without getting bogged down in theory.