# An Oracle-based approach to the "taxi fare" prediction problem - episode 2

This is the second part of the series on the Taxi Fare prediction problem from an Oracle perspective. You can read Episode 1
here. In this Episode, I will put our model to work. That is, I will show several queries that predict taxi fares based on the features previously selected. I will also review model performance metrics. Remember, performance in this case does not relate to response time, but to fitness or how accurate the predictions are expected to be. Finally, I will comment on the pros and cons of this approach.

## Show me the... results!

Well, now that we have a model what we all want is to see how it fares (see what I did here?). Of course, Oracle includes predictive functions in the Analytics option feature Data Mining SQL Functions. For this exercise, I will use the PREDICTION function because, well, we are predicting stuff. This function has a relatively simple but powerful syntax allowing the introduction of grouping options, cost base data and predictors for regression, classification and anomaly detection problems. It also has two modes of operation: Syntax and Analytic Syntax. Simply put, we can use an existing model with the Syntax option, or we can use the Analytic Syntax option to "create a model on the fly". Let's see a simple example showing both the actual fare amount and the one predicted by the model just for the first 10 rows in the data. First, using the linear regression model created on the 54 million rows train data table we can see that the variation in percentage is quite high, as much as 49% only for this small test set.```
SQL> WITH PREDICTION AS
2 (SELECT FARE_AMOUNT, PREDICTION (LIN_REG_TAXI_FARE_3 USING *) PREDICTED_FARE
3 FROM TRAIN_DISTANCE_BUCKET_3
4 WHERE CASEID < 11
5 )
6 SELECT FARE_AMOUNT,
7 ROUND(PREDICTED_FARE,2) PREDICTED_FARE,
8 ROUND(ABS(FARE_AMOUNT-PREDICTED_FARE),2) ABS_DIFF,
9 ROUND(ABS(FARE_AMOUNT-PREDICTED_FARE)/FARE_AMOUNT*100,2) "%DIFF"
10 FROM PREDICTION;
FARE_AMOUNT PREDICTED_FARE ABS_DIFF %DIFF
----------- -------------- ---------- ----------
6.5 8.81 2.31 35.48
5.7 8.53 2.83 49.71
10.9 12.25 1.35 12.4
18 18.64 .64 3.54
6.1 7.92 1.82 29.89
13 12.64 .36 2.81
8.5 10.56 2.06 24.2
9 10.99 1.99 22.07
6.5 8.67 2.17 33.36
9.3 10.35 1.05 11.26
10 rows selected.
Elapsed: 00:00:04.275
```

And now for the support vector machine model created on the 2 million rows train data table. Simply based on these cases, it looks like the SVM model is more accurate that the GLM one.
```
SQL> WITH PREDICTION AS
2 (SELECT FARE_AMOUNT, PREDICTION (LIN_REG_TAXI_FARE_3_SVM USING *) PREDICTED_FARE
3 FROM TRAIN_DISTANCE_BUCKET_3
4 WHERE CASEID < 11
5 )
6 SELECT FARE_AMOUNT,
7 ROUND(PREDICTED_FARE,2) PREDICTED_FARE,
8 ROUND(ABS(FARE_AMOUNT-PREDICTED_FARE),2) ABS_DIFF,
9 ROUND(ABS(FARE_AMOUNT-PREDICTED_FARE)/FARE_AMOUNT*100,2) "%DIFF"
10 FROM PREDICTION;
FARE_AMOUNT PREDICTED_FARE ABS_DIFF %DIFF
----------- -------------- ---------- ----------
6.5 7.79 1.29 19.91
5.7 7.24 1.54 26.96
10.9 11.99 1.09 10.04
18 20.4 2.4 13.32
6.1 6.38 .28 4.58
13 12.51 .49 3.75
8.5 9.9 1.4 16.43
9 9.61 .61 6.73
6.5 7.31 .81 12.53
9.3 8.62 .68 7.32
10 rows selected.
Elapsed: 00:00:05.736
```

Another interesting function that Oracle provides is the
PREDICTION_DETAILS function. This function returns an XML string that describes the attributes of the prediction, that is for a prediction problem, the topN attributes that have the most influence in the prediction and their relative weights. Let's see another simple example with the GLM:
```
SQL> SELECT FARE_AMOUNT,
ROUND(PREDICTION (LIN_REG_TAXI_FARE_3 USING *),2) PREDICTED_FARE,
REDICTION_DETAILS (LIN_REG_TAXI_FARE_3 USING *) DETAILS
FROM TRAIN_DISTANCE_BUCKET
WHERE CASEID <11;
FARE_AMOUNT PREDICTED_FARE DETAILS
----------- -------------- ------------------------------------------------------------------------------------------
6.5 8.81
5.7 8.53
10.9 12.25
18 18.64
6.1 7.92
13 12.64
8.5 10.56
9 10.99
6.5 8.67
9.3 10.35
10 rows selected. Elapsed: 00:00:04.176
```

Something to note in the example above is that Oracle does not always choose the same attributes and the number of attributes does not seem to imply higher accuracy. But this is only a small portion of the data and may not be representative. Now, the same with the SVM model:
```
SQL> SELECT FARE_AMOUNT, ROUND(PREDICTION (LIN_REG_TAXI_FARE_3_SVM USING *),2) PREDICTED_FARE,
2 PREDICTION_DETAILS (LIN_REG_TAXI_FARE_3_SVM USING *) DETAILS
3 FROM TRAIN_DISTANCE_BUCKET_3
4 WHERE CASEID <11;
FARE_AMOUNT PREDICTED_FARE DETAILS
----------- -------------- ------------------------------------------------------------------------------------------------------------------------
6.5 7.79
5.7 7.24
10.9 11.99
18 20.4
6.1 6.38
13 12.51
8.5 9.9
9 9.61
6.5 7.31
9.3 8.62
10 rows selected. Elapsed: 00:00:04.346
```

Again, we can see that Oracle is choosing different attributes for each case. In any case, once the model has been created it is quite easy to use SQL, either in batch mode or real time, to obtain predictions.
## Model fitness

There are several methods to evaluate the accuracy of a regression model. The Oracle documentation mentions two commonly used: Root Mean Squared Error (RMSE) and Mean Absolute Error (MEA), and provides SQL to calculate them. So let's do it first for the GLM and then for the SVM model.```
-- First for the GLM model
SQL> with prediction as
2 (SELECT FARE_AMOUNT, PREDICTION (Lin_reg_taxi_fare_3 USING *) PREDICTED_FARE
3 FROM train_distance_bucket_3)
4 select SQRT(AVG((predicted_fare - fare_amount) * (predicted_fare - fare_amount))) rmse,
5 AVG(ABS(predicted_fare - fare_amount)) mae
6* from prediction p;
RMSE MAE
---------- ----------
6.66524393 2.96620921
Elapsed: 00:01:39.752
-- Then for the SVM model
SQL> with prediction as
2 (SELECT FARE_AMOUNT, PREDICTION (LIN_REG_TAXI_FARE_3_SVM USING *) PREDICTED_FARE
3 FROM train_distance_bucket_3 )
4 select SQRT(AVG((predicted_fare - fare_amount) * (predicted_fare - fare_amount))) rmse,
5 AVG(ABS(predicted_fare - fare_amount)) mae
6 from prediction p ;
RMSE MAE
---------- ----------
7.76230946 2.39739187
Elapsed: 00:01:42.077
```

With the data above we can say that the GLM model is better in general error terms, smaller RMSE. But based on the MAE, the SVM model is slightly better. Note that in the code, both calculations have been performed on the 54 million rows table, although the SVM model was trained with only 2 million rows of data. While Oracle
indicates that the SVM model is more complex and powerful than GLM, for this particular data set it shows worse performance. This may be because the SVM algorithm is better suited to a data set with a large number of attributes, which is not the case here. For those of you who have worked with the original Taxi Fare problem you may know that the RMSE for a basic algorithm is approximately 8, so getting a value of 6.6 is quite an improvement.