Differences
This shows you the differences between two versions of the page.
project:hotornot:r [2016/04/09 13:00] – derrickoswald | project:hotornot:r [2016/04/10 21:03] (current) – finish description derrickoswald | ||
---|---|---|---|
Line 1: | Line 1: | ||
===R Analysis=== | ===R Analysis=== | ||
- | The sample data was analysed with the [[http:// | + | The sample data was analysed with [[http:// |
- | The source code is shown below. An explaination | + | The source code is shown below. An explanation |
< | < | ||
Line 72: | Line 72: | ||
plot (forest, log=" | plot (forest, log=" | ||
MDSplot(forest, | MDSplot(forest, | ||
+ | print (forest) | ||
imp = importance (forest) | imp = importance (forest) | ||
</ | </ | ||
- | ==Steps== | + | ===Program |
- | The data set was loaded into R using the built-in CSV reader. | + | The data set and metadata were loaded into R using the built-in CSV reader. |
- | Since randomForest does not operate when there are NA values, only complete cases were retained. | + | Since randomForest does not operate when there are NA values |
The type of water heating system included information about solar heaters, so it was removed from analysis. | The type of water heating system included information about solar heaters, so it was removed from analysis. | ||
- | Columns | + | Using trial and error, columns |
+ | Nearly all columns were eliminated. The retained columns are those whose subset statement is commented out, e.g. HouseholdMembers. | ||
+ | The target vector (the QSSolar column) was extracted and removed from the training set. It was converted from its encoded value (28 or 29) into a two-level factor (a factor is R's representation of a categorical variable) in order to trigger the classification algorithm; otherwise randomForest attempts a regression analysis. | ||
+ | |||
+ | The randomForest library was loaded and the analysis run. A couple of plots were generated to aid the analysis, and the error matrix was printed. | ||
+ | |||
+ | ===Analysis=== | ||
+ | The error rate was consistently below 10%, but this was due in large part to the overwhelming majority of the data being non-solar customers. In other words, with so few customers having solar power, always guessing FALSE is already correct most of the time. | ||
+ | |||
+ | The importance values were extracted from the results; sorting them in descending order of MeanDecreaseAccuracy showed the relative usefulness of each column. Columns that fell to the bottom of this list contributed little to the accuracy of the result. | ||
+ | |||
+ | It was found that the MainHeatingType column alone was an accurate predictor of the QSSolar column. | ||
+ | |||
+ | For the purposes of demonstration, | ||
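
The preparation steps and the class-imbalance caveat described in the new revision can be sketched in base R. This is only an illustration, not the page's actual script: the toy data frame and its values are invented, and the mapping of the codes (28 taken to mean "has solar", 29 "no solar") is an assumption, since the page does not say which code is which. The column names QSSolar, MainHeatingType and HouseholdMembers come from the description.

```r
# Toy stand-in for the survey data. The values are invented for illustration.
survey <- data.frame(
  HouseholdMembers = c(2, 4, NA, 3, 1, 5, 2, 3),
  MainHeatingType  = c(1, 2, 1, 3, 1, 1, 2, 1),
  QSSolar          = c(29, 29, 28, 28, 29, 29, 29, 29)
)

# randomForest does not operate on NA values, so keep only complete rows.
complete <- survey[complete.cases(survey), ]

# Extract the target and remove it from the training columns.
# ASSUMPTION: code 28 means "has solar"; the page does not state the mapping.
target <- factor(complete$QSSolar == 28, levels = c(FALSE, TRUE))
predictors <- complete[, setdiff(names(complete), "QSSolar"), drop = FALSE]

# Baseline for comparison: always predict the majority class.
majority <- names(which.max(table(target)))
baseline_error <- mean(target != majority)

print(table(target))
print(baseline_error)
```

Because `target` is a factor, passing it to randomForest as the response would select classification rather than regression. Comparing the forest's reported error rate against `baseline_error` is one way to judge how much of a low error rate simply reflects the scarcity of solar customers.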