To solve the case, it was necessary to follow two technical approaches.
1. DEALING WITH EXCEL LIMITATIONS
Firstly, as students are using Excel, the limitation arises where the number of "X" variables cannot exceed 16. This presents a challenge in reducing the number of "X" variables. While other software packages allow for a greater number of "X" variables by using dummies for categorical variables, this is not feasible in Excel. Thus, this becomes challenge number one.
To overcome this challenge, the best approach is to transform the categorical variables into numerical ones. By doing so, you can eliminate the need for introducing new columns through dummyfication.
In the given case's context, you can create a numerical table with scores assigned to each class type within your categorical variables. For instance:
2. DEALING WITH STRUCTURAL BREAKS
Secondly, it is observed that students often tend to directly proceed with running a regression model without thoroughly examining the data. In this particular case, upon careful examination, it is noticed that the data downloaded by the Professor includes information from three different wholesalers. It becomes evident from a scatter plot between "Price" and "Carat" that one of the wholesalers exclusively sells low carat/low price diamonds.
Now the question arises whether all the data should be used in a single regression model or if there are "structural" breaks in the data that would make a single regression (linear regression) inappropriate. This can be tested using a model stability test such as the "Chow" test, which has been covered in a video lesson. Conducting a Chow test reveals that all three wholesalers cannot be included in a single regression model.
Considering that the Professor's diamond has a higher carat weight, it is appropriate to exclude wholesaler 3 when developing a predictive model for the diamond price.
It is important to note that when wholesaler 3 data is dropped from the model, the R-square value significantly decreases. However, this highlights the trade-off between having a "better" fitting but potentially incorrect model versus a "less fitting" but correct model for prediction purposes.
By using a "correct" model for prediction, considering numerical variables for all categories, the following results are obtained. It should be noted that this is not the only correct regression model, as different approaches to handling categorical variables may result in varied models. However, the primary objective is to make a prediction regarding the fairness of the price, as per the quote received by the professor.