7.1. Toy datasets
scikit-learn comes with a few small standard datasets that do not require
downloading any file from an external website.
They can be loaded using the following functions:
- load_iris(*[, return_X_y, as_frame]): Load and return the iris dataset (classification).
- load_diabetes(*[, return_X_y, as_frame, scaled]): Load and return the diabetes dataset (regression).
- load_digits(*[, n_class, return_X_y, as_frame]): Load and return the digits dataset (classification).
- load_linnerud(*[, return_X_y, as_frame]): Load and return the physical exercise Linnerud dataset.
- load_wine(*[, return_X_y, as_frame]): Load and return the wine dataset (classification).
- load_breast_cancer(*[, return_X_y, as_frame]): Load and return the breast cancer wisconsin dataset (classification).
These datasets are useful to quickly illustrate the behavior of the
various algorithms implemented in scikit-learn. They are however often too
small to be representative of real world machine learning tasks.
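As a minimal sketch of how these loaders are typically called: by default a loader returns a dictionary-like object holding the data, the target and some metadata, while return_X_y=True returns only the (data, target) pair. The variable names below are illustrative only.

```python
from sklearn.datasets import load_iris, load_wine

# Default call: a dictionary-like object with data, target and metadata.
iris = load_iris()
print(iris.data.shape, iris.target.shape)
print(iris.feature_names)
print(iris.target_names)

# return_X_y=True: only the feature matrix and the target vector.
X_wine, y_wine = load_wine(return_X_y=True)
print(X_wine.shape, y_wine.shape)
```

The as_frame=True option returns pandas objects instead of NumPy arrays; it is left out here so that the sketch does not depend on pandas being installed.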
7.1.1. Iris plants dataset
Data Set Characteristics:
- Number of Instances:
150 (50 in each of three classes)
- Number of Attributes:
4 numeric, predictive attributes and the class
- Attribute Information:
sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
- class:
Iris-Setosa
Iris-Versicolour
Iris-Virginica
- Summary Statistics:
|               | Min | Max | Mean | SD   | Class Correlation |
| sepal length: | 4.3 | 7.9 | 5.84 | 0.83 | 0.7826            |
| sepal width:  | 2.0 | 4.4 | 3.05 | 0.43 | -0.4194           |
| petal length: | 1.0 | 6.9 | 3.76 | 1.76 | 0.9490 (high!)    |
| petal width:  | 0.1 | 2.5 | 1.20 | 0.76 | 0.9565 (high!)    |
- Missing Attribute Values:
None
- Class Distribution:
33.3% for each of 3 classes.
- Creator:
R.A. Fisher
- Donor:
Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
- Date:
July, 1988
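The summary statistics listed above can be approximately recomputed from the loaded data. The sketch below assumes that the class correlation is the Pearson correlation between a feature and the integer class label (0, 1, 2) and that the standard deviation is the sample standard deviation (ddof=1); neither convention is stated in the listing. Small differences from the listed values are possible, for example because of the two data points in which this copy differs from the UCI one.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

summary = {}
for j, name in enumerate(iris.feature_names):
    col = X[:, j]
    summary[name] = {
        "min": col.min(),
        "max": col.max(),
        "mean": col.mean(),
        # Assumption: sample standard deviation (ddof=1).
        "sd": col.std(ddof=1),
        # Assumption: Pearson correlation with the integer class label.
        "class_corr": np.corrcoef(col, y)[0, 1],
    }

for name, s in summary.items():
    print(
        f"{name:18s} {s['min']:4.1f} {s['max']:4.1f} "
        f"{s['mean']:5.2f} {s['sd']:5.2f} {s['class_corr']:8.4f}"
    )
```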
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher’s paper. Note that it’s the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher’s paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
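One way to check the separability statement numerically is a linear-programming feasibility test: two groups are linearly separable exactly when some (w, b) satisfies s_i * (w . x_i + b) >= 1 for every sample, with s_i = +1 or -1. The sketch below uses scipy.optimize.linprog for this; the helper name linearly_separable is hypothetical, and the code assumes that class 0 is setosa, as reported by target_names.

```python
import numpy as np
from scipy.optimize import linprog
from sklearn.datasets import load_iris


def linearly_separable(X, positive):
    """Return True if a hyperplane separates rows where `positive` is True
    from rows where it is False (hypothetical helper, feasibility LP)."""
    signs = np.where(positive, 1.0, -1.0)
    n_samples, n_features = X.shape
    # s_i * (w . x_i + b) >= 1  rewritten as  A_ub @ [w, b] <= -1.
    A_ub = -signs[:, None] * np.hstack([X, np.ones((n_samples, 1))])
    b_ub = -np.ones(n_samples)
    result = linprog(
        c=np.zeros(n_features + 1),  # no objective, feasibility only
        A_ub=A_ub,
        b_ub=b_ub,
        bounds=[(None, None)] * (n_features + 1),  # w and b are unbounded
        method="highs",
    )
    return result.status == 0  # 0: a feasible (optimal) point was found


X, y = load_iris(return_X_y=True)

# Class 0 (setosa) against the other two classes.
setosa_vs_rest = linearly_separable(X, y == 0)

# Classes 1 and 2 (versicolor, virginica) against each other.
mask = y != 0
versicolor_vs_virginica = linearly_separable(X[mask], y[mask] == 1)

print("setosa vs rest separable:", setosa_vs_rest)
print("versicolor vs virginica separable:", versicolor_vs_virginica)
```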
References
Fisher, R.A. “The use of multiple measurements in taxonomic problems”
Annals of Eugenics, 7, Part II, 179-188 (1936); also in “Contributions to
Mathematical Statistics” (John Wiley, NY, 1950).
Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
Dasarathy, B.V. (1980) “Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments”. IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
Gates, G.W. (1972) “The Reduced Nearest Neighbor Rule”. IEEE Transactions
on Information Theory, May 1972, 431-433.
See also: 1988 MLC Proceedings, 54-64. Cheeseman et al.’s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
Many, many more …
7.1.2. Diabetes dataset
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.
Data Set Characteristics:
- Number of Instances:
442
- Number of Attributes:
First 10 columns are numeric predictive values
- Target:
Column 11 is a quantitative measure of disease progression one year after baseline
- Attribute Information:
age age in years
sex
bmi body mass index
bp average blood pressure
s1 tc, total serum cholesterol
s2 ldl, low-density lipoproteins
s3 hdl, high-density lipoproteins
s4 tch, total cholesterol / HDL
s5 ltg, possibly log of serum triglycerides level
s6 glu, blood sugar level
Note: Each of these 10 feature variables has been mean-centered and scaled by the standard deviation times the square root of n_samples
(i.e. the sum of squares of each column totals 1).
Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) “Least Angle Regression,” Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
7.1.3. Optical recognition of handwritten digits dataset
Data Set Characteristics:
- Number of Instances:
1797
- Number of Attributes:
64
- Attribute Information:
8x8 image of integer pixels in the range 0..16.
- Missing Attribute Values:
None
- Creator:
Alpaydin (alpaydin ‘@’ boun.edu.tr)
- Date:
July, 1998
This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.
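A short sketch of what the loaded object looks like: the 64 attributes are the flattened 8x8 images, which are also available in image form. The n_class parameter from the load_digits signature restricts the data to the first n_class digits.

```python
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.data.shape)    # one row of 64 pixel counts per image
print(digits.images.shape)  # the same values kept as 8x8 arrays
print(digits.data.min(), digits.data.max())
print(sorted(set(digits.target.tolist())))

# Keep only the digits 0, 1 and 2.
X_small, y_small = load_digits(n_class=3, return_X_y=True)
print(X_small.shape, sorted(set(y_small.tolist())))
```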
Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. Of a
total of 43 people, 30 contributed to the training set and a different 13
to the test set. Each 32x32 bitmap is divided into non-overlapping blocks of
4x4, and the number of “on” pixels is counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small
distortions.
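The block-counting step can be sketched with NumPy as follows. This is an illustration of the idea only, not the NIST preprocessing code; the function name block_counts and the random bitmap are made up for the example.

```python
import numpy as np


def block_counts(bitmap, block=4):
    """Count the 'on' pixels in each non-overlapping block x block tile.

    Assumes a 2-D array of 0/1 values whose sides are multiples of `block`.
    """
    bitmap = np.asarray(bitmap)
    rows, cols = bitmap.shape
    # Split each axis into (tile index, position inside the tile),
    # then sum over the two "inside the tile" axes.
    tiles = bitmap.reshape(rows // block, block, cols // block, block)
    return tiles.sum(axis=(1, 3))


# Hypothetical 32x32 binary bitmap standing in for a normalized digit scan.
rng = np.random.default_rng(0)
bitmap = rng.integers(0, 2, size=(32, 32))

features = block_counts(bitmap)
print(features.shape)
print(features.min(), features.max())
```

Each 4x4 tile holds 16 pixels, which is why every entry of the resulting 8x8 matrix lies in the range 0..16.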
For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
1994.
References
C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
Graduate Studies in Science and Engineering, Bogazici University.
E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
Linear dimensionality reduction using relevance weighted LDA. School of
Electrical and Electronic Engineering Nanyang Technological University.
2005.
Claudio Gentile. A New Approximate Maximal Margin Classification
Algorithm. NIPS. 2000.