Example 1

We illustrate how to evaluate a Random Forest classifier on a preprocessed version of the GeoLife dataset. This example seeks to reproduce the preprocessing conducted in [1].

1. Setup dependencies

Import all the dependencies:

from pactus import Dataset, featurizers
from pactus.models import RandomForestModel

2. Definition of parameters

We define a random seed for reproducibility:

SEED = 0

3. Loading Data

To load the original GeoLife dataset we can simply do:

dataset = Dataset.geolife()

Then, we can process it to keep only the desired classes, merge similar classes, and create a train/test split as proposed in [1]:

# Classes that are going to be used
use_classes = {"car", "taxi-bus", "walk", "bike", "subway", "train"}

# Preprocess the dataset and split it into train and test sets
train, test = (
    # Remove short (at most 10 points) and sparsely sampled (dt >= 8) trajectories
    dataset.filter(lambda traj, _: len(traj) > 10 and traj.dt < 8)
    # Join "taxi" and "bus" into "taxi-bus"
    .map(lambda traj, label: (traj, "taxi-bus" if label in ("bus", "taxi") else label))
    # Only use the classes defined in use_classes
    .filter(lambda _, label: label in use_classes)
    # Split the dataset into train and test
    .split(train_size=0.7, random_state=SEED)
)
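
To make the effect of each step concrete, the following standalone sketch mimics the same filter, map, filter, split chain on plain Python tuples. It does not use pactus: the `Traj` class, the toy records, and the `split` helper are hypothetical stand-ins, `dt` is assumed to be the mean sampling interval, and the split is a plain seeded shuffle, which may differ from how pactus partitions the data.

```python
import random
from dataclasses import dataclass

SEED = 0
USE_CLASSES = {"car", "taxi-bus", "walk", "bike", "subway", "train"}


@dataclass
class Traj:
    """Hypothetical stand-in for a trajectory: a sample count and a time step."""
    n_points: int
    dt: float  # assumed to be the mean sampling interval

    def __len__(self):
        return self.n_points


def split(records, train_size, seed):
    """Shuffle a copy with a fixed seed and cut it at the requested fraction."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]


records = [
    (Traj(50, 2.0), "bus"),
    (Traj(5, 2.0), "walk"),       # too short: dropped by the first filter
    (Traj(80, 30.0), "car"),      # sampled too sparsely: dropped by the first filter
    (Traj(40, 1.0), "taxi"),
    (Traj(60, 5.0), "airplane"),  # class not in USE_CLASSES: dropped later
    (Traj(30, 3.0), "bike"),
]

# Same order of operations as the pactus chain above
kept = [(t, l) for t, l in records if len(t) > 10 and t.dt < 8]
merged = [(t, "taxi-bus" if l in ("bus", "taxi") else l) for t, l in kept]
selected = [(t, l) for t, l in merged if l in USE_CLASSES]
train, test = split(selected, train_size=0.7, seed=SEED)
```

Because the shuffle is driven by `SEED`, running the sketch twice produces the same partition, which is the purpose of passing `random_state=SEED` in the real pipeline.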

4. Loading the model

Since we are going to use a Random Forest model, we need an object that converts every trajectory into a fixed-size feature vector. In this case, we use the UniversalFeaturizer, which includes all the features available in pactus:

featurizer = featurizers.UniversalFeaturizer()
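
As a rough illustration of what "fixed-size feature vector" means, the sketch below reduces trajectories of different lengths to the same three numbers (total distance, mean speed, maximum speed). The `toy_featurize` helper and the `(x, y, t)` sample format are assumptions made for this example; the actual feature set of the UniversalFeaturizer is larger and is not reproduced here.

```python
import math


def toy_featurize(points):
    """Map a trajectory given as (x, y, t) samples to a 3-element feature vector.

    Hypothetical helper for illustration only; not part of pactus.
    """
    steps = list(zip(points, points[1:]))
    # Distance covered between consecutive samples
    dists = [math.hypot(b[0] - a[0], b[1] - a[1]) for a, b in steps]
    # Speed over each step: distance divided by elapsed time
    speeds = [d / (b[2] - a[2]) for d, (a, b) in zip(dists, steps)]
    return [sum(dists), sum(speeds) / len(speeds), max(speeds)]


short_traj = [(0, 0, 0), (3, 4, 1), (6, 8, 2)]
long_traj = [(0, 0, 0), (1, 0, 1), (2, 0, 2), (3, 0, 3), (3, 4, 4)]

# Both trajectories yield vectors of the same length despite different sizes
vec_short = toy_featurize(short_traj)
vec_long = toy_featurize(long_traj)
```

A classifier such as a Random Forest expects every input row to have the same number of columns, which is why this reduction step is needed before training.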

Then, we can create the desired model using the aforementioned featurizer:

model = RandomForestModel(
    featurizer=featurizer,
    max_features=16,
    n_estimators=200,
    bootstrap=False,
    random_state=SEED,
    warm_start=True,
    n_jobs=6,
)

5. Training and evaluation

Training and evaluation can be conducted as follows:

# Train the model
model.train(data=train, cross_validation=5)

# Evaluate the model on a test dataset
evaluation = model.evaluate(test)

# Show the evaluation results
evaluation.show()
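
The `cross_validation=5` argument presumably requests 5-fold cross-validation on the training set. How pactus implements this is not shown in this example, so the sketch below only illustrates the general k-fold idea: the data is cut into k folds, and each fold serves once as the validation set while the rest is used for fitting. The `k_fold_indices` helper is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


folds = list(k_fold_indices(n_samples=12, k=5))
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))
```

Every sample lands in exactly one validation fold, so the averaged validation score uses all of the training data without ever scoring a sample the model was fitted on.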

The evaluation results should look similar to the following:

General statistics:
Accuracy: 0.913
F1-score: 0.892
Mean precision: 0.910
Mean recall: 0.877

Confusion matrix:
bike      car       subway    taxi-bus  train     walk      precision
======================================================================
89.83     0.56      0.74      1.29      0.0       1.41      94.03
0.25      79.1      0.74      1.94      0.0       0.11      90.32
0.0       0.56      82.35     1.46      0.0       0.22      90.32
2.48      18.08     10.29     91.42     12.12     2.5       87.19
0.0       0.56      0.0       0.32      87.88     0.0       90.62
7.44      1.13      5.88      3.56      0.0       95.76     93.43
----------------------------------------------------------------------
89.83     79.1      82.35     91.42     87.88     95.76
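
The matrix carries no row labels. Judging from the numbers, each column appears to correspond to a true class (the six class entries in every column sum to roughly 100), the bottom row repeats the diagonal as per-class recall, and the last column gives per-class precision. Under that reading, the summary statistics can be cross-checked from the table itself. The sketch below assumes macro averaging (an unweighted mean over classes); the values are copied from the output above.

```python
classes = ["bike", "car", "subway", "taxi-bus", "train", "walk"]
# Per-class precision (last column) and recall (bottom row), in percent
precision = [94.03, 90.32, 90.32, 87.19, 90.62, 93.43]
recall = [89.83, 79.1, 82.35, 91.42, 87.88, 95.76]

# Macro averages: unweighted mean over the six classes
mean_precision = sum(precision) / len(precision) / 100
mean_recall = sum(recall) / len(recall) / 100

# Macro F1: harmonic mean of precision and recall per class, then averaged
f1_per_class = [2 * p * r / (p + r) for p, r in zip(precision, recall)]
macro_f1 = sum(f1_per_class) / len(f1_per_class) / 100

print(f"{mean_precision:.3f} {mean_recall:.3f} {macro_f1:.3f}")
# prints: 0.910 0.877 0.892
```

These values match the reported mean precision, mean recall, and F1-score. The accuracy cannot be recovered the same way, because it depends on the number of test samples per class, which the percentage table does not show.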

6. References

[1] Zheng, Yu, et al. “Understanding mobility based on GPS data.” Proceedings of the 10th international conference on Ubiquitous computing. 2008.