Dataset used to predict whether income exceeds $50K/yr based on census data.
Also known as the "Census Income" dataset.
Train dataset contains 13 features and 30178 observations.
Test dataset contains 13 features and 15315 observations.
Target column is "target": A binary factor where 1: <=50K and 2: >50K for annual income.
The column "sex" is set as protected attribute.
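As a small illustration of the target coding and the protected attribute described above, the sketch below decodes the two target levels and counts rows per group. It is a hedged example only: the function name `tabulate_by_sex`, the level labels "Female" and "Male", and the sample rows are assumptions made for illustration and are not taken from the dataset.

```python
from collections import Counter

# Level codes as described above: 1 is "<=50K", 2 is ">50K".
TARGET_LEVELS = {1: "<=50K", 2: ">50K"}


def tabulate_by_sex(rows):
    """Count rows per (sex, decoded target label) pair."""
    return Counter((row["sex"], TARGET_LEVELS[row["target"]]) for row in rows)


# Made-up rows for illustration only; the sex labels are assumed.
rows = [
    {"sex": "Female", "target": 1},
    {"sex": "Female", "target": 2},
    {"sex": "Male", "target": 2},
    {"sex": "Male", "target": 2},
]

counts = tabulate_by_sex(rows)
```

Comparing such counts across the levels of "sex" is the kind of per-group summary a protected attribute is typically used for.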
Source
Dua, Dheeru, Graff, Casey (2017). “UCI Machine Learning Repository.” http://archive.ics.uci.edu/ml/.
Pre-processing
fnlwgt: Remove final weight, which is the number of people the census believes the entry represents.
native-country: Remove Native Country, which is the country of origin for an individual.
Rows containing NA in workclass and occupation have been removed.
Pre-processing inspired by article: https://cseweb.ucsd.edu//classes/sp15/cse190-c/reports/sp15/048.pdf
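The pre-processing steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used to build the dataset: the function name `preprocess`, the use of `None` to stand in for NA, and the sample rows are all hypothetical.

```python
# Hypothetical sketch of the pre-processing steps; not the original pipeline.
DROPPED_COLUMNS = ("fnlwgt", "native-country")
REQUIRED_COLUMNS = ("workclass", "occupation")


def preprocess(rows):
    """Drop the two removed columns and skip any row whose workclass or
    occupation is missing (represented here as None, or absent entirely)."""
    cleaned = []
    for row in rows:
        if any(row.get(col) is None for col in REQUIRED_COLUMNS):
            continue
        cleaned.append({k: v for k, v in row.items() if k not in DROPPED_COLUMNS})
    return cleaned


# Made-up sample rows for illustration only.
sample = [
    {"age": 30, "workclass": "Private", "fnlwgt": 1000,
     "occupation": "Sales", "native-country": "Exampleland"},
    {"age": 45, "workclass": None, "fnlwgt": 2000,
     "occupation": None, "native-country": "Exampleland"},
]

cleaned = preprocess(sample)
```

Only the first sample row survives, and it no longer carries the fnlwgt or native-country fields.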
Metadata
(integer) age: The age of the individuals
(factor) workclass: A general term to represent the employment status of an individual
(factor) education: The highest level of education achieved by an individual.
(integer) education_num: the highest level of education achieved in numerical form.
(factor) marital_status: marital status of an individual.
(factor) occupation: the general type of occupation of an individual
(factor) relationship: whether the individual is in a relationship
(factor) race: Descriptions of an individual’s race
(factor) sex: the biological sex of the individual
(integer) capital-gain: capital gains for an individual
(integer) capital-loss: capital loss for an individual
(integer) hours-per-week: the hours an individual has reported to work per week
(factor) target: whether or not an individual makes more than $50,000 annually