R Analysis

The sample data was analysed with RStudio Version 0.99.489 by using the randomForest package version 4.6-12 to determine which variables were most important.

The source code is shown below. An explanation follows.

# load the raw data
# note: randomForest does not allow NA (cf. na.strings = c("9999", "notset"))
raw = read.csv (file="RawData.csv", header = TRUE, sep = ",", quote = "\"", dec = ".", fileEncoding = "UTF-8", encoding="UTF-8", stringsAsFactors = TRUE)
meta = read.csv (file="MetaData.csv", header = TRUE, sep = ",", quote = "\"", dec = ".", fileEncoding = "UTF-8", encoding="UTF-8", stringsAsFactors = FALSE)

# eliminate rows where data is not available
raw = raw[complete.cases(raw),]

# eliminate columns on type of water heating
raw = subset (raw, select = -WaterHeatingType)

# eliminate rows where water heater is solar
# raw = raw[as.character(raw$WaterHeatingType) != "solar",]

# eliminate columns where the importance measure MeanDecreaseAccuracy is small
raw = subset (raw, select = -QSSmart)
raw = subset (raw, select = -HasSmartMeter)
raw = subset (raw, select = -QSLamps)
raw = subset (raw, select = -QSGender)
raw = subset (raw, select = -SentInvitesAccepted)
raw = subset (raw, select = -InvitesAccepted)
raw = subset (raw, select = -QSAge)
raw = subset (raw, select = -NumMeters)
raw = subset (raw, select = -SavingTipStatusCount_1)
raw = subset (raw, select = -SavingTipStatusCount_3)
raw = subset (raw, select = -InvitesSent)
raw = subset (raw, select = -QSEcoEnergy)
raw = subset (raw, select = -QuizAnswered)
raw = subset (raw, select = -DaysLoggedIn)
raw = subset (raw, select = -LotteriesParticipated)
raw = subset (raw, select = -NumAppliancesEntered)
raw = subset (raw, select = -HouseholdType)
raw = subset (raw, select = -SavingTipStatusCount_2)
# raw = subset (raw, select = -HouseholdMembers)
# raw = subset (raw, select = -HadAudit)
raw = subset (raw, select = -MobilePhoneEmpty)
raw = subset (raw, select = -NumDevices)
# raw = subset (raw, select = -QSKnowMeter)
raw = subset (raw, select = -DaysReadingEntered)
raw = subset (raw, select = -ReadingCount)
raw = subset (raw, select = -WeeksReadingEntered)
raw = subset (raw, select = -QSInterest)
raw = subset (raw, select = -QuizAnsweredCorrectly)
raw = subset (raw, select = -NumWeeksMember)
raw = subset (raw, select = -NumDaysMember)
raw = subset (raw, select = -NumVisit)
raw = subset (raw, select = -WeeksLoggedIn)
raw = subset (raw, select = -Points)
# raw = subset (raw, select = -LivingArea)
# raw = subset (raw, select = -MainHeatingType)

# split the variable of interest out of the data.frame, change to factors for classification
solar = factor (raw$QSSolar == 28)

# make a new data.frame without the solar variable
data = subset (raw, select = -QSSolar)

# load the analysis package
library (randomForest)

# run the analysis
forest = randomForest (x = data, y = solar, importance = TRUE, proximity = TRUE, do.trace = TRUE)

# show some results
plot (forest, log="y")
MDSplot(forest,solar)
print (forest)

imp = importance (forest)

Program Steps

The data set and metadata were loaded into R using the built-in CSV reader.

Since randomForest does not operate when there are NA values (missing data), only complete cases were retained.
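As a minimal sketch of this step, the following uses a small made-up table (the values are hypothetical, not taken from RawData.csv) and assumes, following the note in the source code, that sentinel entries such as "9999" and "notset" stand for missing data. Passing them as na.strings makes read.csv turn them into NA, so that complete.cases can drop the affected rows.

```r
# hypothetical inline CSV standing in for RawData.csv
csv_text = "LivingArea,MainHeatingType,QSSolar
120,gas,29
9999,oil,28
85,notset,29
140,heatpump,28"

# treat the sentinel values as missing while reading
toy = read.csv (text = csv_text, header = TRUE, na.strings = c("9999", "notset"), stringsAsFactors = TRUE)

# keep only the rows without any NA
toy_complete = toy[complete.cases (toy),]

print (nrow (toy))           # number of rows read
print (nrow (toy_complete))  # number of rows with no missing values
```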

The WaterHeatingType column includes information about solar water heaters, which overlaps with the variable being predicted, so it was removed from the analysis.

Using trial and error, columns were eliminated step by step based on the importance vector to establish the minimal set. Nearly all columns were eliminated. The columns retained have the subset statement commented out, e.g. HouseholdMembers.
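The long run of subset calls can be written more compactly. The sketch below, on a made-up data frame whose column names are borrowed from the listing, drops a whole vector of names at once. The helper name drop_columns is hypothetical and is not part of the original program.

```r
# hypothetical helper: remove the named columns, ignoring names that are absent
drop_columns = function (df, names_to_drop)
{
    df[, !(names (df) %in% names_to_drop), drop = FALSE]
}

# made-up data with a few of the column names used in the listing
toy = data.frame (QSSmart = c(1, 2, 3), QSGender = c(0, 1, 0), LivingArea = c(120, 85, 140), QSSolar = c(29, 28, 29))

# names judged unimportant; "Points" is absent from toy and is silently skipped
unimportant = c("QSSmart", "QSGender", "Points")
reduced = drop_columns (toy, unimportant)
print (names (reduced))
```

Keeping the names in one vector makes each trial-and-error round a one-line change: add or remove a name and rerun the analysis.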

The target vector (QSSolar column) was extracted and removed from the training set. It was converted from its encoded value (28 or 29) into a two-level factor (a factor is R's type for categorical data) in order to trigger the classification algorithm. Otherwise the randomForest algorithm attempts a regression analysis.
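To illustrate why the conversion matters: randomForest assumes classification when y is a factor and regression otherwise. The sketch below uses a made-up QSSolar vector; only the encoding 28 (mapped to TRUE, as in the listing) is taken from the source.

```r
qs_solar = c(28, 29, 29, 28, 29)   # hypothetical encoded answers

solar = factor (qs_solar == 28)    # logical comparison, then a two-level factor

print (is.numeric (qs_solar))  # the raw column is numeric, which would lead to regression
print (is.factor (solar))      # the converted target is a factor, which leads to classification
print (levels (solar))
print (table (solar))
```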

The randomForest library was loaded and the analysis run. A couple of plots were generated to aid the analysis, and the error (confusion) matrix was printed out.

Analysis

The error rate was consistently below 10%, but this was due in large part to the overwhelming majority of the data being non-solar customers. In other words, with so many customers not having solar power, the best guess is a FALSE.
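One way to put the error rate into context is to compare it with the error of always predicting the majority class. The sketch below uses a made-up target vector; the 5% share of solar customers is an assumption for illustration, not a figure from the data set.

```r
# hypothetical target: 5 solar customers out of 100
solar = factor (c(rep (TRUE, 5), rep (FALSE, 95)))

# error rate of a classifier that always answers with the most frequent class
counts = table (solar)
baseline_error = 1 - max (counts) / sum (counts)
print (baseline_error)
```

A forest is only informative if its out-of-bag error is clearly below this baseline. The per-class errors in the confusion matrix printed by print (forest) show how the minority class fares.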

The importance matrix was extracted from the results, and sorting it in descending order of MeanDecreaseAccuracy showed the relative usefulness of each column. Columns that fell to the bottom of this list contributed little to the accuracy of the result.
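The sorting step is not part of the listing. A minimal sketch of it follows, using a made-up matrix in place of the one returned by importance (forest). The row names are borrowed from the listing, but the numbers are purely illustrative and are not results of the analysis. For a classification forest the real matrix also has one column per class.

```r
# made-up stand-in for the matrix returned by importance (forest)
imp = matrix (c(0.2, 12.1, 3.4,
                0.3, 10.5, 2.9),
              ncol = 2,
              dimnames = list (c("QSGender", "MainHeatingType", "LivingArea"), c("MeanDecreaseAccuracy", "MeanDecreaseGini")))

# order the rows by descending MeanDecreaseAccuracy
ranked = imp[order (imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), , drop = FALSE]
print (ranked)
```

The rows at the bottom of ranked are the candidates for removal in the next trial.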

It was found that the MainHeatingType column alone was an accurate predictor of the QSSolar column.

For the purposes of demonstration, additional columns that could be asked in a survey were included in the machine learning demonstration program.

  • project/hotornot/r.txt
  • Last modified: 2016/04/10 21:03
  • by derrickoswald