
The COMPAS dataset contains processed COMPAS data from 2013-2014. Data cleaning followed the guidance in the original COMPAS repository. The dataset contains 6172 observations and 14 features. The target column can be either "is_recid" or "two_year_recid"; "two_year_recid" is usually preferred. The column "sex" is set as the protected attribute, although "race" is used more often.

A classification task for the compas data set.

A classification task for the compas data set. The observations have been filtered, keeping only observations with race "Caucasian" and "African-American". The protected attribute has been set to "race".

Format

R6::R6Class inheriting from TaskClassif.


Pre-processing

  • Identifying columns are removed

  • Removed outliers with abs(days_b_screening_arrest) >= 30.

  • Kept only observations where is_recid != -1.

  • Kept only observations where c_charge_degree != "O".

  • Kept only observations where score_text != 'N/A'.

  • Factorize the features that are categorical.

  • Add length of stay (c_jail_out - c_jail_in) to the dataset.

  • Pre-processing resource: https://github.com/propublica/compas-analysis/blob/master/Compas%20Analysis.ipynb
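
The filtering steps above can be sketched in base R. This is only an illustration under stated assumptions, not the package's actual cleaning script: the data frame `raw`, its `id` column, and all values in it are made up, and the conditions on is_recid, c_charge_degree and score_text are read as the conditions that retained rows satisfy, as in the ProPublica notebook linked above.

```r
# Hypothetical stand-in for the raw ProPublica table; every value is invented.
raw <- data.frame(
  id = 1:6,  # hypothetical identifying column
  days_b_screening_arrest = c(0, -1, 45, 2, 0, 1),
  is_recid = c(0, 1, 0, -1, 1, 0),
  c_charge_degree = c("F", "M", "F", "F", "O", "M"),
  score_text = c("Low", "High", "Low", "Medium", "Low", "N/A"),
  c_jail_in = as.POSIXct(
    c("2013-01-01 08:00:00", "2013-02-01 10:00:00", "2013-03-01 08:00:00",
      "2013-04-01 08:00:00", "2013-05-01 08:00:00", "2013-06-01 08:00:00"),
    tz = "UTC"
  ),
  c_jail_out = as.POSIXct(
    c("2013-01-03 08:00:00", "2013-02-01 22:00:00", "2013-03-02 08:00:00",
      "2013-04-02 08:00:00", "2013-05-02 08:00:00", "2013-06-02 08:00:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Row filter: each condition describes the rows that are retained.
keep <- with(
  raw,
  abs(days_b_screening_arrest) < 30 &  # drop screening/arrest outliers
    is_recid != -1 &                   # drop rows flagged with -1
    c_charge_degree != "O" &           # drop charge degree "O"
    score_text != "N/A"                # drop rows without a score category
)
clean <- raw[keep, ]

# Length of stay in days, computed from the jail in/out timestamps.
clean$length_of_stay <- as.numeric(
  difftime(clean$c_jail_out, clean$c_jail_in, units = "days")
)

# Remove the identifying column and convert categorical features to factors.
clean$id <- NULL
cat_cols <- c("c_charge_degree", "score_text")
clean[cat_cols] <- lapply(clean[cat_cols], factor)

str(clean)
```

The exact boundary of the outlier rule (30 days inclusive or exclusive) follows the list above; the linked notebook is the place to check it.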

Metadata

  • (integer) age : The age of defendants.

  • (factor) c_charge_degree : The charge degree of defendants. F: Felony M: Misdemeanor

  • (factor) race: The race of defendants.

  • (factor) age_cat: The age category of defendants.

  • (factor) score_text: The score category of defendants.

  • (factor) sex: The sex of defendants.

  • (integer) priors_count: The prior criminal records of defendants.

  • (integer) days_b_screening_arrest: The number of days between the screening date and the (original) arrest date. If they are too far apart, that may indicate an error. A negative value indicates that the screening date was before the arrest date.

  • (integer) decile_score: Indicates the risk of recidivism (Min=1, Max=10).

  • (integer) is_recid: Binary variable indicating whether the defendant was rearrested at any time.

  • (factor) two_year_recid: Binary variable indicating whether the defendant was rearrested within two years.

  • (numeric) length_of_stay: The number of days spent in jail.

Construction

mlr_tasks$get("compas")
tsk("compas")

mlr_tasks$get("compas_race_binary")
tsk("compas_race_binary")
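
As a minimal usage sketch, assuming the mlr3 and mlr3fairness packages are installed, the tasks can be constructed and the protected attribute inspected or changed through the `pta` column role. The variable names `task` and `task_binary` are chosen here for illustration only.

```r
library(mlr3)
library(mlr3fairness)

# Construct the full task and look at its target and protected attribute.
task <- tsk("compas")
task$target_names
task$col_roles$pta

# Use "race" instead as the protected attribute.
task$col_roles$pta <- "race"

# The filtered two-group variant already uses "race".
task_binary <- tsk("compas_race_binary")
task_binary$col_roles$pta
```

Changing the column role only affects which attribute fairness measures group by; it does not filter or otherwise modify the data.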

Examples

data("compas", package = "mlr3fairness")