IndexedRowMatrix

class pyspark.mllib.linalg.distributed.IndexedRowMatrix(rows, numRows=0, numCols=0)[source]

Represents a row-oriented distributed Matrix with indexed rows.

Parameters
  • rows – An RDD of IndexedRows or (long, vector) tuples or a DataFrame consisting of a long typed column of indices and a vector typed column.

  • numRows – Number of rows in the matrix. A non-positive value means unknown, at which point the number of rows will be determined by the max row index plus one.

  • numCols – Number of columns in the matrix. A non-positive value means unknown, at which point the number of columns will be determined by the size of the first row.

Methods

Attributes

Methods Documentation

columnSimilarities()[source]

Compute all cosine similarities between columns.

>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(6, [4, 5, 6])])
>>> mat = IndexedRowMatrix(rows)
>>> cs = mat.columnSimilarities()
>>> print(cs.numCols())
3
computeGramianMatrix()[source]

Computes the Gramian matrix A^T A.

Note

This cannot be computed on matrices with more than 65535 columns.

>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(1, [4, 5, 6])])
>>> mat = IndexedRowMatrix(rows)
>>> mat.computeGramianMatrix()
DenseMatrix(3, 3, [17.0, 22.0, 27.0, 22.0, 29.0, 36.0, 27.0, 36.0, 45.0], 0)

New in version 2.0.0.

computeSVD(k, computeU=False, rCond=1e-09)[source]

Computes the singular value decomposition of the IndexedRowMatrix.

The given row matrix A of dimension (m X n) is decomposed into U * s * V’T where

  • U: (m X k) (left singular vectors) is a IndexedRowMatrix

    whose columns are the eigenvectors of (A X A’)

  • s: DenseVector consisting of square root of the eigenvalues

    (singular values) in descending order.

  • v: (n X k) (right singular vectors) is a Matrix whose columns

    are the eigenvectors of (A’ X A)

For more specific details on implementation, please refer the scala documentation.

Parameters
  • k – Number of leading singular values to keep (0 < k <= n). It might return less than k if there are numerically zero singular values or there are not enough Ritz values converged before the maximum number of Arnoldi update iterations is reached (in case that matrix A is ill-conditioned).

  • computeU – Whether or not to compute U. If set to be True, then U is computed by A * V * s^-1

  • rCond – Reciprocal condition number. All singular values smaller than rCond * s[0] are treated as zero where s[0] is the largest singular value.

Returns

SingularValueDecomposition object

>>> rows = [(0, (3, 1, 1)), (1, (-1, 3, 1))]
>>> irm = IndexedRowMatrix(sc.parallelize(rows))
>>> svd_model = irm.computeSVD(2, True)
>>> svd_model.U.rows.collect() 
[IndexedRow(0, [-0.707106781187,0.707106781187]),        IndexedRow(1, [-0.707106781187,-0.707106781187])]
>>> svd_model.s
DenseVector([3.4641, 3.1623])
>>> svd_model.V
DenseMatrix(3, 2, [-0.4082, -0.8165, -0.4082, 0.8944, -0.4472, 0.0], 0)

New in version 2.2.0.

multiply(matrix)[source]

Multiply this matrix by a local dense matrix on the right.

Parameters

matrix – a local dense matrix whose number of rows must match the number of columns of this matrix

Returns

IndexedRowMatrix

>>> mat = IndexedRowMatrix(sc.parallelize([(0, (0, 1)), (1, (2, 3))]))
>>> mat.multiply(DenseMatrix(2, 2, [0, 2, 1, 3])).rows.collect()
[IndexedRow(0, [2.0,3.0]), IndexedRow(1, [6.0,11.0])]

New in version 2.2.0.

numCols()[source]

Get or compute the number of cols.

>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(1, [4, 5, 6]),
...                        IndexedRow(2, [7, 8, 9]),
...                        IndexedRow(3, [10, 11, 12])])
>>> mat = IndexedRowMatrix(rows)
>>> print(mat.numCols())
3
>>> mat = IndexedRowMatrix(rows, 7, 6)
>>> print(mat.numCols())
6
numRows()[source]

Get or compute the number of rows.

>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(1, [4, 5, 6]),
...                        IndexedRow(2, [7, 8, 9]),
...                        IndexedRow(3, [10, 11, 12])])
>>> mat = IndexedRowMatrix(rows)
>>> print(mat.numRows())
4
>>> mat = IndexedRowMatrix(rows, 7, 6)
>>> print(mat.numRows())
7
toBlockMatrix(rowsPerBlock=1024, colsPerBlock=1024)[source]

Convert this matrix to a BlockMatrix.

Parameters
  • rowsPerBlock – Number of rows that make up each block. The blocks forming the final rows are not required to have the given number of rows.

  • colsPerBlock – Number of columns that make up each block. The blocks forming the final columns are not required to have the given number of columns.

>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(6, [4, 5, 6])])
>>> mat = IndexedRowMatrix(rows).toBlockMatrix()
>>> # This IndexedRowMatrix will have 7 effective rows, due to
>>> # the highest row index being 6, and the ensuing
>>> # BlockMatrix will have 7 rows as well.
>>> print(mat.numRows())
7
>>> print(mat.numCols())
3
toCoordinateMatrix()[source]

Convert this matrix to a CoordinateMatrix.

>>> rows = sc.parallelize([IndexedRow(0, [1, 0]),
...                        IndexedRow(6, [0, 5])])
>>> mat = IndexedRowMatrix(rows).toCoordinateMatrix()
>>> mat.entries.take(3)
[MatrixEntry(0, 0, 1.0), MatrixEntry(0, 1, 0.0), MatrixEntry(6, 0, 0.0)]
toRowMatrix()[source]

Convert this matrix to a RowMatrix.

>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(6, [4, 5, 6])])
>>> mat = IndexedRowMatrix(rows).toRowMatrix()
>>> mat.rows.collect()
[DenseVector([1.0, 2.0, 3.0]), DenseVector([4.0, 5.0, 6.0])]

Attributes Documentation

rows

Rows of the IndexedRowMatrix stored as an RDD of IndexedRows.

>>> mat = IndexedRowMatrix(sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                                        IndexedRow(1, [4, 5, 6])]))
>>> rows = mat.rows
>>> rows.first()
IndexedRow(0, [1.0,2.0,3.0])