Differences
This shows you the differences between two versions of the page.
| Next revision | Previous revision | ||
| project:hotornot:r [2016/04/09 12:55] – created derrickoswald | project:hotornot:r [2016/04/10 21:03] (current) – finish description derrickoswald | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| ===R Analysis=== | ===R Analysis=== | ||
| - | The sample data was analysed with the [[http:// | + | The sample data was analysed with [[http:// | 
| - | The source code is show below. An explaination  | + | The source code is shown below. An explanation  | 
| < | < | ||
| Line 72: | Line 72: | ||
| plot (forest, log=" | plot (forest, log=" | ||
| MDSplot(forest, | MDSplot(forest, | ||
| + | print (forest) | ||
| imp = importance (forest) | imp = importance (forest) | ||
| </ | </ | ||
| - | ==Steps== | + | ===Program  | 
| - | The data set was loaded into R using the built in CSV reader. | + | The data set and metadata were loaded into R using the built-in CSV reader. | ||
| - | Since randomForest does not operate when there are NA values, only complete cases were retained. | + | Since randomForest does not operate when there are NA values  | 
| + | The type of water heating system included information about solar heaters, so it was removed from analysis. | ||
| + | |||
| + | Using trial and error, columns were eliminated step by step based on the importance vector to establish the minimal set. | ||
| + | Nearly all columns were eliminated. The columns retained have the subset statement commented out, e.g. HouseholdMembers. | ||
| + | |||
| + | The target vector (QSSolar column) was extracted and removed from the training set. It was converted from its encoded value (28 or 29) into a factor with two levels (a factor is a categorical value in an R data set) in order to trigger the classification algorithm. Otherwise the randomForest algorithm attempts a regression analysis. | ||
| + | |||
| + | The randomForest library was loaded and the analysis run. A couple of plots were generated to aid the analysis, and the error matrix was printed out. | ||
| + | |||
| + | ===Analysis=== | ||
| + | The error rate was consistently below 10%, but this was due in large part to the overwhelming majority of the data being non-solar customers. In other words, with so many customers not having solar power, always guessing FALSE already gives a low error rate. | ||
| + | |||
| + | The importance vector was extracted from the results, and by sorting it in descending order of MeanDecreaseAccuracy the relative usefulness of each column could be determined. Columns that fell to the bottom of this list contributed little to the accuracy of the result. | ||
| + | |||
| + | It was found that just the MainHeatingType column was an accurate predictor of the QSSolar column. | ||
| + | |||
| + | For the purposes of demonstration, | ||
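
The steps described in the new revision (dropping incomplete rows, turning the encoded target into a factor, and ranking columns by importance) can be sketched roughly as follows. This is not the page's original source code, which is elided in the diff. The data frame is synthetic and its values are invented, and which of the codes 28 and 29 stands for "solar" is an assumption made only for illustration. The randomForest step runs only if the package is installed.

```r
# Synthetic stand-in for the survey data; all values are invented.
set.seed(42)
n <- 200
survey <- data.frame(
  QSSolar          = sample(c(28, 29), n, replace = TRUE, prob = c(0.1, 0.9)),
  MainHeatingType  = sample(1:4, n, replace = TRUE),
  HouseholdMembers = sample(1:6, n, replace = TRUE)
)
survey$HouseholdMembers[c(3, 17)] <- NA   # inject a few missing values

# Keep only complete cases: randomForest stops on NA values by default.
survey <- survey[complete.cases(survey), ]

# Extract the target and convert the numeric code into a two-level factor,
# so that randomForest performs classification rather than regression.
# ASSUMPTION: 28 means "solar"; the source does not say which code is which.
target   <- factor(survey$QSSolar == 28, levels = c(FALSE, TRUE))
training <- subset(survey, select = -QSSolar)

# Baseline for comparison: error rate of always guessing the majority class.
baseline_error <- 1 - max(table(target)) / length(target)
print(baseline_error)

if (requireNamespace("randomForest", quietly = TRUE)) {
  forest <- randomForest::randomForest(training, target, importance = TRUE)
  print(forest)   # shows the out-of-bag error estimate and confusion matrix
  imp <- randomForest::importance(forest)
  # Sort columns by MeanDecreaseAccuracy, most useful first.
  imp <- imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), , drop = FALSE]
  print(imp)
}
```

Comparing the forest's reported error with `baseline_error` shows how much of a low error rate is explained by class imbalance alone rather than by the predictors.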