project:hotornot:r

Differences

This shows you the differences between two versions of the page.

project:hotornot:r [2016/04/09 12:55] – created derrickoswald
project:hotornot:r [2016/04/10 21:03] (current) – finish description derrickoswald
Line 1: Line 1:
 ===R Analysis=== ===R Analysis===
-The sample data was analysed with the [[http://rstudio.com/|RStudio]] Version 0.99.489 by using the [[https://cran.r-project.org/web/packages/randomForest/index.html|randomForest]] package version 4.6-12 to determine which variables were most important.+The sample data was analysed with [[http://rstudio.com/|RStudio]] Version 0.99.489 by using the [[https://cran.r-project.org/web/packages/randomForest/index.html|randomForest]] package version 4.6-12 to determine which variables were most important.
  
-The source code is show below. An explaination follows. +The source code is shown below. An explanation follows. 
  
 <code> <code>
Line 72: Line 72:
 plot (forest, log="y") plot (forest, log="y")
 MDSplot(forest,solar) MDSplot(forest,solar)
 +print (forest)
  
 imp = importance (forest) imp = importance (forest)
 </code> </code>
  
-==Steps==+===Program Steps===
  
-The data set was loaded into R using the built in CSV reader.+The data set and metadata were loaded into R using the built-in CSV reader.
  
-Since randomForest does not operate when there are NA values, only complete cases were retained.+Since randomForest does not operate when there are NA values (missing data), only complete cases were retained.
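The complete-case filtering described above might look roughly like the following sketch. The data frame and its values are invented for illustration; only the column names QSSolar, HouseholdMembers and MainHeatingType come from this page, and the original program may well use `na.omit` or a different idiom.

```r
# Hypothetical stand-in for the survey data (values are invented).
survey <- data.frame(
  HouseholdMembers = c(2, NA, 4, 3),
  MainHeatingType  = c("gas", "oil", NA, "gas"),
  QSSolar          = c(28, 29, 28, 29),
  stringsAsFactors = FALSE
)

# complete.cases() returns TRUE for rows that contain no NA in any column,
# so indexing with it keeps only the fully populated rows.
complete <- survey[complete.cases(survey), ]

print(nrow(survey))    # rows before filtering
print(nrow(complete))  # rows after filtering
```

Filtering whole rows is the simplest option, but it can discard a lot of data when many columns are sparsely filled, which is one reason to drop unneeded columns first.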
  
 +The type of water heating system included information about solar heaters, so it was removed from analysis.
 +
 +Using trial and error, columns were eliminated step by step based on the importance vector to establish the minimal set.
 +Nearly all columns were eliminated. Each column is removed by a subset statement in the source; for the columns that were retained, e.g. HouseholdMembers, that statement is commented out.
 +
 +The target vector (QSSolar column) was extracted and removed from the training set. It was converted from its encoded value (28 or 29) into a two-level factor (a factor is R's type for categorical data) in order to trigger the classification algorithm. Otherwise the randomForest algorithm attempts to do a regression analysis.
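A minimal sketch of this target extraction, under stated assumptions: the data values are invented, and which of the two codes (28 or 29) stands for "has solar" is not given on this page, so the `solar_code` variable below is purely hypothetical.

```r
# Hypothetical training data with the encoded target column.
training <- data.frame(
  HouseholdMembers = c(2, 3, 4, 1),
  QSSolar          = c(28, 29, 29, 28)
)

# ASSUMPTION: the code that means "has solar" is not confirmed by the source.
solar_code <- 28

# Extract the target and convert it to a two-level factor so that
# randomForest performs classification rather than regression.
solar <- factor(training$QSSolar == solar_code, levels = c(FALSE, TRUE))

# Remove the target column from the predictors.
training$QSSolar <- NULL

str(solar)
print(names(training))
```

Fixing the levels explicitly keeps both classes present in the factor even if a subset of the data happens to contain only one of them.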
 +
 +The randomForest library was loaded and the analysis run. A couple of plots were generated to aid the analysis, and the error (confusion) matrix was printed out.
 +
 +===Analysis===
 +The error rate was consistently below 10%, but this was due in large part to the overwhelming majority of the data being non-solar customers. In other words, with so many customers not having solar power, the best guess is a FALSE.
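One way to put such an error rate in context is to compare it with the error of always guessing the majority class. The sketch below uses invented class counts (95 non-solar, 5 solar), not the real survey proportions.

```r
# Hypothetical target vector with a strong class imbalance (invented counts).
solar <- factor(c(rep(FALSE, 95), rep(TRUE, 5)), levels = c(FALSE, TRUE))

# Share of each class in the data.
class_share <- prop.table(table(solar))

# Error rate of a trivial classifier that always predicts the majority class.
baseline_error <- 1 - max(class_share)

print(class_share)
print(baseline_error)
```

A model is only informative to the extent that its error rate is below this baseline, so the per-class errors in the confusion matrix are usually more telling than the overall rate.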
 +
 +The importance vector was extracted from the results and by sorting descending on MeanDecreaseAccuracy the relative usefulness of a column could be determined. Columns that fell to the bottom of this list contributed little to the accuracy of the result.
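The ranking step could be sketched as follows. The data here is synthetic and the column names `signal` and `noise` are invented; the sketch also assumes the forest is built with `importance = TRUE`, which randomForest needs in order to report MeanDecreaseAccuracy. Whether the original program passes that argument is not visible in this excerpt.

```r
library(randomForest)

set.seed(42)
n <- 200

# Synthetic predictors (invented): 'signal' determines the class, 'noise' does not.
x <- data.frame(signal = rnorm(n), noise = rnorm(n))
y <- factor(x$signal > 0, levels = c(FALSE, TRUE))

# importance = TRUE enables the permutation-based MeanDecreaseAccuracy measure.
forest <- randomForest(x, y, importance = TRUE)

# importance() returns a matrix with one row per predictor.
imp <- importance(forest)

# Sort rows so the most useful predictors come first.
ranked <- imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]
print(ranked)
```

Predictors that end up at the bottom of `ranked` are the candidates for removal in the next trial-and-error round.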
 +
 +It was found that just the MainHeatingType column was an accurate predictor of the QSSolar column.
 +
 +For the purposes of demonstration, additional columns corresponding to questions that could be asked in a survey were included in the machine learning demonstration program.
  
project/hotornot/r.1460199346.txt.gz · Last modified: 2016/04/09 12:55 by derrickoswald