gicentre / tidy / Tidy

Tidy and shape tabular data.


type Table

The basic organisational unit for tabular data. Each column in a table has a unique name by which it may be referenced. Table cell values are represented as Strings, but can be converted to other types via column output functions (e.g. numColumn).

Create

fromCSV : String -> Table

Create a table from a multi-line comma-separated string. For example

  myTable =
      """colA,colB,colC
  a1,b1,c1
  a2,b2,c2
  a3,b3,c3"""
          |> fromCSV

fromDelimited : Char -> String -> Table

Create a table from a multi-line string where values are separated by the given delimiter (first parameter). For example, to process a tab-delimited values file (TSV):

myTable =
    """colA\tcolB\tcolC
a1\tb1\tc1
a2\tb2\tc2
a3\tb3\tc3"""
        |> fromDelimited '\t'

fromGrid : String -> Table

Transform multi-line string input values in the form:

"""
   z00,z01,z02,z03, etc.
   z10,z11,z12,z13, etc.
   z20,z21,z22,z23, etc.
   z30,z31,z32,z33, etc.
   etc."""

into a tidy table in the form:

| row | col |   z |
| --: | --: | --: |
|   0 |   0 | z00 |
|   0 |   1 | z01 |
|   0 |   2 | z02 |
|   0 |   3 | z03 |
|   1 |   0 | z10 |
|   1 |   1 | z11 |
|     |     |     |

Leading and trailing whitespace is trimmed from each value unless the value is enclosed in quotes, and entirely blank lines are ignored. Input can be ragged, with different numbers of columns in each row.

Note the common convention that in grids, the origin (row 0, col 0) is at the top-left, whereas in Cartesian coordinate systems the origin (x=0, y=0) is at the bottom-left. You may therefore wish to reverse the order of row values in the input string if you are mapping onto a Cartesian coordinate system. For example,

"""z00,z01,z02,z03
z10,z11,z12,z13
z20,z21,z22,z23
z30,z31,z32,z33"""
    |> String.split "\n"
    |> List.reverse
    |> String.join "\n"
    |> fromGrid
    |> renameColumn "row" "y"
    |> renameColumn "col" "x"

fromGridRows : List (List String) -> Table

Transform list of input string lists in the form:

[ [z00, z01, z02, z03, ...]
, [z10, z11, z12, z13, ...]
, [z20, z21, z22, z23, ...]
, [z30, z31, z32, z33, ...]
, [...]
]

into a tidy table in the form:

| row | col | z   |
| --: | --: | --- |
|   0 |   0 | z00 |
|   0 |   1 | z01 |
|   0 |   2 | z02 |
|   0 |   3 | z03 |
|   1 |   0 | z10 |
|   1 |   1 | z11 |
|     |     |     |

Input can be ragged with different numbers of columns in each row. Entirely empty rows (i.e. []) are ignored, but cells with empty strings (e.g. [""]) are captured.

As with fromGrid, you may wish to reverse the input row order if you are mapping onto a Cartesian coordinate system.
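As a minimal sketch of the transformation described above, the following uses a small grid of made-up elevation values (hypothetical data, as are the names elevations and elevationTable). The middle row is deliberately shorter to show ragged input, and the rows are reversed before conversion as suggested for a Cartesian mapping.

```elm
import Tidy exposing (Table, fromGridRows, renameColumn)


-- Hypothetical elevation grid; the middle row is deliberately ragged.
elevations : List (List String)
elevations =
    [ [ "12", "15", "11" ]
    , [ "14", "18" ]
    , [ "13", "16", "10" ]
    ]


-- Reverse the rows so the last input row becomes row 0, then rename the
-- generated "row" and "col" columns to Cartesian "y" and "x".
elevationTable : Table
elevationTable =
    elevations
        |> List.reverse
        |> fromGridRows
        |> renameColumn "row" "y"
        |> renameColumn "col" "x"
```

Each of the eight input cells should become one row of the resulting table, with its value stored in the z column.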

empty : Table

Create an empty table. Useful if table items are to be added programmatically with insertRow and insertColumn.

Edit

insertRow : List ( String, String ) -> Table -> Table

Add a row of values to a table. The new values are represented by a list of (columnName,columnValue) tuples. If the table being appended to is not empty, the column names should correspond to existing columns in the table or they will be ignored. Any unspecified columns will have an empty string value inserted.
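The sketch below builds a small table row by row. It assumes that the first row inserted into an empty table establishes the column names; the city data and the name cities are purely illustrative.

```elm
import Tidy exposing (Table, empty, insertRow)


cities : Table
cities =
    empty
        -- Assumption: the first row added to an empty table defines its columns.
        |> insertRow [ ( "city", "Bristol" ), ( "temperature", "12" ) ]
        -- "temperature" is unspecified, so an empty string is inserted for it.
        |> insertRow [ ( "city", "Aberdeen" ) ]
        -- "country" is not an existing column, so per the description above
        -- that value is ignored.
        |> insertRow [ ( "city", "Glasgow" ), ( "country", "Scotland" ) ]
```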

filterRows : String -> (String -> Basics.Bool) -> Table -> Table

Keep rows in the table where the values in the given column satisfy the given test. The test should be a function that takes a cell value and returns either True or False depending on whether the row containing that value in the column should be retained.

isWarm : String -> Bool
isWarm s =
    case String.toFloat s of
        Just x ->
            x >= 10

        Nothing ->
            False

warmCities =
    myTable |> filterRows "temperature" isWarm

renameColumn : String -> String -> Table -> Table

Rename the given column (first parameter) with a new name (second parameter). If the new column name matches an existing one, the existing one will be replaced by the renamed column.

insertColumn : String -> List String -> Table -> Table

Add a column of data to a table. The first parameter is the name to give the column. The second is a list of column values. If the table already has a column with this name, it will get replaced with the given data. To ensure table rows are always aligned, if the table is not empty, the column values are padded to match the longest column in the table after insertion.
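As an illustrative sketch (the data and the name cityTemps are made up), the following builds a table column by column. The second column is given fewer values than the first, so according to the padding rule above it should be extended to keep the rows aligned.

```elm
import Tidy exposing (Table, empty, insertColumn)


-- "temperature" has one value fewer than "city", so it is expected to be
-- padded to three rows after insertion.
cityTemps : Table
cityTemps =
    empty
        |> insertColumn "city" [ "Bristol", "Sheffield", "Glasgow" ]
        |> insertColumn "temperature" [ "12", "11" ]
```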

insertColumnFromJson : String -> List String -> String -> Table -> Table

Add a column of data extracted from a JSON string onto a table. The first parameter is the name of the JSON object containing the data values to add. This will become the name of the column in the table. The second is a list of JSON object names that define the path to the column object. This can be an empty list if the object is in an array at the root of the JSON. The third parameter is the JSON string to parse and the fourth the table to which a new column will be added. If there is a problem finding the column object, the original table is returned.

For example,

json =
    """[
  { "person": "John Smith", "treatment": "b", "result": 2 },
  { "person": "Jane Doe", "treatment": "a", "result": 16 },
  { "person": "Jane Doe", "treatment": "b", "result": 11 },
  { "person": "Mary Johnson", "treatment": "a", "result": 3 },
  { "person": "Mary Johnson", "treatment": "b", "result": 1 }
]"""

table =
    empty
        |> insertColumnFromJson "person" [] json
        |> insertColumnFromJson "treatment" [] json
        |> insertColumnFromJson "result" [] json

would generate a table

| person       | treatment | result |
| ------------ | --------- | -----: |
| John Smith   | b         |      2 |
| Jane Doe     | a         |     16 |
| Jane Doe     | b         |     11 |
| Mary Johnson | a         |      3 |
| Mary Johnson | b         |      1 |

insertIndexColumn : String -> String -> Table -> Table

Add an index column to a table. The first parameter is the name to give the new column containing index values (replacing an existing column if it shares the same name). The second is a prefix to add to each index value, useful for giving different tables different index values (or use "" for no prefix).

Creating an index column can be useful when joining tables with keys that you wish to guarantee are unique for each row. For example, to combine the rows of two tables, table1 and table2, that may contain repeated values:

outerJoin "key"
    ( insertIndexColumn "key" "t1" table1, "key" )
    ( insertIndexColumn "key" "t2" table2, "key" )

insertSetIndexColumn : String -> String -> Table -> Table

Add an index column to a table based on the partition of values in a given column into a minimum number of sets. The first parameter is the name to give the new column containing set index values, replacing an existing column if it shares the same name. The second is the name of the column containing the values to be partitioned into sets. If the column name is not found, the original table is returned.

For example given the following treatment table,

| person       | treatment |
| ------------ | --------- |
| John Smith   | b         |
| Jane Doe     | a         |
| Jane Doe     | b         |
| Mary Johnson | a         |
| Mary Johnson | b         |

Applying insertSetIndexColumn "id" "person" would generate,

| id | person       | treatment |
| -- | ------------ | --------- |
|  1 | John Smith   | b         |
|  1 | Jane Doe     | a         |
|  2 | Jane Doe     | b         |
|  1 | Mary Johnson | a         |
|  2 | Mary Johnson | b         |

Rows with a given id will all be unique. Creating a set index column can be useful when spreading a table made up of a single pair of columns. For example, taking the treatment table and inserting a set ID before spreading,

treatmentTable
    |> insertSetIndexColumn "id" "person"
    |> spread "person" "treatment"

generates the following table:

| id | John Smith | Jane Doe | Mary Johnson |
| -- | ---------- | -------- | ------------ |
| 1  | b          | a        | a            |
| 2  |            | b        | b            |

removeColumn : String -> Table -> Table

Remove a column with the given name from a table. If the column is not present in the table, the original table is returned.

mapColumn : String -> (String -> String) -> Table -> Table

Transform the contents of the given column (first parameter) with a mapping function (second parameter). For example

newTable =
    mapColumn "myColumnHeading" impute myTable

impute val =
    if val == "" then
        "0"

    else
        val

If the column name is not found, the original table is returned.

filterColumns : (String -> Basics.Bool) -> Table -> Table

Keep columns in the table whose names satisfy the given test. The test should be a function that takes a column heading and returns either True or False depending on whether the column should be retained.

newTable =
    myTable
        |> filterColumns (\s -> String.left 11 s == "temperature")

moveColumnToEnd : String -> Table -> Table

Move the column with the given name (first parameter) to become the last column in the given table (second parameter). While column order has no effect on table processing, this can be useful for display purposes, for example when separating variables that represent observation categories from those representing observation measurements.
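A short sketch combining removeColumn and moveColumnToEnd, using made-up data and a hypothetical name (tidied). It drops a free-text column and then moves the measurement column after the category columns.

```elm
import Tidy exposing (Table, fromCSV, moveColumnToEnd, removeColumn)


-- Illustrative data: a measurement column sits between category columns.
tidied : Table
tidied =
    """location,temperature,year,notes
Bristol,14,2018,sunny
Glasgow,9,2018,wet"""
        |> fromCSV
        |> removeColumn "notes"
        |> moveColumnToEnd "temperature"
```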

Tidy

Tidy data (Wickham, 2014) is a convention for organising tabular data such that columns represent distinct variables and rows represent observations. This isolates the semantic meaning of items in any column independently of all others. The effect is to greatly simplify data interchange and many data analytical functions.

Wickham identifies some common problems with data that are not in tidy format ("messy" data), each of which can be solved with a small number of simple operations:

gather : String -> String -> List ( String, String ) -> Table -> Table

Combine several columns that represent the same variable into two columns, one referencing the original column, the other the values of the variable. For example, the following messy table

| location  | temperature2017 | temperature2018 |
| --------- | --------------: | --------------: |
| Bristol   |              12 |              14 |
| Sheffield |              11 |              13 |
| Glasgow   |               8 |               9 |
| Aberdeen  |                 |               7 |

can be gathered to create a tidy table:

| location  | year | temperature |
| --------- | ---- | ----------: |
| Bristol   | 2017 |          12 |
| Bristol   | 2018 |          14 |
| Sheffield | 2017 |          11 |
| Sheffield | 2018 |          13 |
| Glasgow   | 2017 |           8 |
| Glasgow   | 2018 |           9 |
| Aberdeen  | 2018 |           7 |

The first two parameters represent the names to be given to the new reference column (year in the example above) and variable column (temperature in the example above). The third is a list of the (columnName,columnReference) to be gathered (e.g. [ ("temperature2017", "2017"), ("temperature2018", "2018") ] above).

Only non-empty cell values in the variable column are gathered (e.g. note that only Aberdeen, 2018, 7 is gathered with no entry for 2017.)

If none of the columnNames in the third parameter is found in the table, an empty table is returned.
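The temperature example above could be expressed as follows. This is a sketch in which the names messyTemps and tidyTemps are hypothetical and the messy table is supplied as CSV text.

```elm
import Tidy exposing (Table, fromCSV, gather)


-- The messy table from the example, with Aberdeen's 2017 value missing.
messyTemps : Table
messyTemps =
    """location,temperature2017,temperature2018
Bristol,12,14
Sheffield,11,13
Glasgow,8,9
Aberdeen,,7"""
        |> fromCSV


-- Gather the two temperature columns into "year" and "temperature".
tidyTemps : Table
tidyTemps =
    messyTemps
        |> gather "year"
            "temperature"
            [ ( "temperature2017", "2017" )
            , ( "temperature2018", "2018" )
            ]
```

Because empty cells are not gathered, the result should contain seven rows rather than eight.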

For cases where more than one set of columns needs to be gathered, you can combine three stages: (a) gather all columns, adding a column group id; (b) bisect column group id and column reference; (c) spread the bisected columns. For example:

"""flowID,originLong,originLat,destLong,destLat
   1,-71.9,41.8,-71.5,41.6
   2,-80.5,34.9,-97.6,30.2
   3,-92.1,37.0,-86.8,43.6"""
    |> fromCSV
    |> gather "odCoordType"
        "value"
        [ ( "originLong", "oLong" )
        , ( "originLat", "oLat" )
        , ( "destLong", "dLong" )
        , ( "destLat", "dLat" )
        ]
    |> bisect "odCoordType" headTail ( "od", "coordType" )
    |> spread "coordType" "value"

creates the table

| flowID | od |  Long |  Lat |
| -----: | -- | ----- | ---- |
|      1 |  o | -71.9 | 41.8 |
|      1 |  d | -71.5 | 41.6 |
|      2 |  o | -80.5 | 34.9 |
|      2 |  d | -97.6 | 30.2 |
|      3 |  o | -92.1 | 37.0 |
|      3 |  d | -86.8 | 43.6 |

spread : String -> String -> Table -> Table

The inverse of gather, spreading a pair of columns rotates values to separate columns (like a pivot in a spreadsheet). This is useful if different variables are stored in separate rows of the same column. For example, the following table contains two different variables in the temperature column:

| location  | year | readingType | temperature |
| --------- | ---- | ----------- | ----------: |
| Bristol   | 2018 | minTemp     |           3 |
| Bristol   | 2018 | maxTemp     |          27 |
| Sheffield | 2018 | minTemp     |          -2 |
| Sheffield | 2018 | maxTemp     |          26 |
| Glasgow   | 2018 | minTemp     |         -10 |
| Glasgow   | 2018 | maxTemp     |          23 |
| Aberdeen  | 2018 | maxTemp     |          14 |

We can spread the temperatures into separate columns reflecting their distinct meanings, generating the table:

| location  | year | minTemp | maxTemp |
| --------- | ---- | ------: | ------: |
| Bristol   | 2018 |       3 |      27 |
| Sheffield | 2018 |      -2 |      26 |
| Glasgow   | 2018 |     -10 |      23 |
| Aberdeen  | 2018 |         |      14 |

The first parameter is the name of the column containing the values that will form the new spread column names (readingType above). The second parameter is the name of the column containing the values to be inserted in each new column (temperature above).

Missing rows (e.g. Aberdeen, 2018, minTemp above) are rotated as empty strings in the spread column. If either of the columns to spread is not found, the original table is returned.

Spreading effectively groups by values in the non-spreading columns. If the table to spread only contains the type and value columns, an empty table will be created as there are no values to group by. In these cases, adding an index column with insertSetIndexColumn can generate values to group by.
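As a sketch, the temperature readings example above could be spread like this (the names readings and spreadReadings are hypothetical, and the input table is supplied as CSV text):

```elm
import Tidy exposing (Table, fromCSV, spread)


-- The table from the example, with no minTemp reading for Aberdeen.
readings : Table
readings =
    """location,year,readingType,temperature
Bristol,2018,minTemp,3
Bristol,2018,maxTemp,27
Sheffield,2018,minTemp,-2
Sheffield,2018,maxTemp,26
Glasgow,2018,minTemp,-10
Glasgow,2018,maxTemp,23
Aberdeen,2018,maxTemp,14"""
        |> fromCSV


-- Values of "readingType" become column names; "temperature" fills them.
spreadReadings : Table
spreadReadings =
    readings |> spread "readingType" "temperature"
```

Here location and year are the non-spreading columns that the rows are grouped by, so one row per location is expected.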

bisect : String -> (String -> ( String, String )) -> ( String, String ) -> Table -> Table

Split a named column (first parameter) into two with a bisecting function (second parameter). The third parameter should be the names to give the two new columns, which are inserted into the table replacing the original bisected column.

For example, given a table

| row | col |   z |
| --: | --: | --- |
|   0 |   0 | z00 |
|   0 |   1 | z01 |
|   0 |   2 | z02 |
|   1 |   0 | z10 |
|   1 |   1 | z11 |
|   1 |   2 | z12 |

bisecting it with

bisect "z"
    (\z ->
        ( String.left 2 z
        , String.left 1 z ++ String.right 1 z
        )
    )
    ( "zr", "zc" )

produces the table

| row | col | zr | zc |
| --: | --: | -- | -- |
|   0 |   0 | z0 | z0 |
|   0 |   1 | z0 | z1 |
|   0 |   2 | z0 | z2 |
|   1 |   0 | z1 | z0 |
|   1 |   1 | z1 | z1 |
|   1 |   2 | z1 | z2 |

If the column to be bisected is not found, the original table is returned.

For more sophisticated disaggregation, such as splitting a column into more than two new ones, consider disaggregate.

splitAt : Basics.Int -> String -> ( String, String )

Convenience function for splitting a string (second parameter) at the given position (first parameter).

splitAt 4 "tidyString" == ( "tidy", "String" )

If the first parameter is negative, the position is counted from the right rather than left.

splitAt -4 "temperature2019" == ( "temperature", "2019" )

Useful when using bisect to split column values in two.
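For instance, a negative split position can separate a trailing year from a variable name. The sketch below uses a hypothetical table (fused) whose reading column fuses the two; the column names variable and year are illustrative choices.

```elm
import Tidy exposing (Table, bisect, fromCSV, splitAt)


-- Hypothetical table whose "reading" values fuse a variable name and a year.
fused : Table
fused =
    """location,reading,value
Bristol,temperature2017,12
Bristol,temperature2018,14"""
        |> fromCSV


-- Split the last four characters (the year) from the rest of each value,
-- replacing "reading" with the two new columns.
separated : Table
separated =
    fused |> bisect "reading" (splitAt -4) ( "variable", "year" )
```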

headTail : String -> ( String, String )

Convenience function for splitting a string into its first (head) and remaining (tail) characters. e.g. headTail "tidy" == ("t","idy"). Equivalent to splitAt 1. Useful when using bisect to split column values into one column of heads and another of tails.

disaggregate : String -> String -> List String -> Table -> Table

Disaggregate the values in a given column (first parameter) according to a regular expression (second parameter). The names to give to the new disaggregated columns are provided in the third parameter. The number of groups returned by the regular expression should match the number of new column names. Performs a similar function to tidyr's extract function.

For example, to disaggregate diagnosisCohort in the following table:

| diagnosisCohort | numCases |
| --------------- | -------: |
| new_sp_m0-14    |       52 |
| new_sp_m15-24   |      228 |
| new_sp_f0-14    |       35 |
| new_sp_f15-24   |      180 |
| new_sn_m0-14    |        9 |
| new_sn_m15-24   |       97 |
| new_sn_f0-14    |       11 |
| new_sn_f15-24   |       64 |

disaggregate "diagnosisCohort"
    "new_?(.*)_(.)(.*)"
    [ "diagnosis", "gender", "age" ]

produces a new table:

| numCases | diagnosis | gender | age   |
| -------: | --------- | ------ | ----- |
|       52 |        sp |      m | 0-14  |
|      228 |        sp |      m | 15-24 |
|       35 |        sp |      f | 0-14  |
|      180 |        sp |      f | 15-24 |
|        9 |        sn |      m | 0-14  |
|       97 |        sn |      m | 15-24 |
|       11 |        sn |      f | 0-14  |
|       64 |        sn |      f | 15-24 |

If the column to disaggregate cannot be found, the original table is returned.

transposeTable : String -> String -> Table -> Table

Transpose the rows and columns of a table. Provide the name of the column that will generate the column headings in the transposed table as the first parameter and the name you wish to give the new row names as the second.

For example,

newTable =
    myTable |> transposeTable "location" "temperature"

where myTable stores:

| location  | temperature2017 | temperature2018 |
| --------- | --------------: | --------------: |
| Bristol   |              12 |              14 |
| Sheffield |              11 |              13 |
| Glasgow   |               8 |               9 |

creates the following table:

| temperature     | Bristol | Sheffield | Glasgow |
| --------------- | ------: | --------: | ------: |
| temperature2017 |      12 |        11 |       8 |
| temperature2018 |      14 |        13 |       9 |

If the column to contain new headings cannot be found, an empty table is generated. If there are repeated names in the new headings column, earlier rows are replaced with later repeated ones.

normalize : String -> List String -> Table -> ( Table, Table )

Replace some columns with a single id column and store those column values in a separate table. Useful for removing redundancy in a table where multiple rows contain several values that describe the same feature. The two resulting tables are related with a new id column (named with the first parameter). The names of columns forming the key, that are moved into the key table, are provided as the second parameter. The function returns a pair of tables in the order (key table, value table).

For example, assuming the following table recording 5 people's favourite film is stored as favFilms:

| person  | age | film         | release | director  |
| ------- | --- | ------------ | ------- | --------- |
| Alice   | 51  | Vertigo      | 1958    | Hitchcock |
| Brenda  | 60  | Citizen Kane | 1941    | Welles    |
| Cate    | 23  | Vertigo      | 1958    | Hitchcock |
| Deborah | 38  | Jaws         | 1975    | Spielberg |
| Eloise  | 45  | Citizen Kane | 1941    | Welles    |

Normalizing it with

favFilms
    |> normalize "id" [ "film", "release", "director" ]

produces a tuple containing the following two tables:

| id | film         | release | director   |
| -- | ------------ | ------- | ---------- |
| 1  | Citizen Kane | 1941    | Welles     |
| 2  | Jaws         | 1975    | Spielberg  |
| 3  | Vertigo      | 1958    | Hitchcock  |

| person  | age | id |
| ------- | --- | -- |
| Alice   | 51  | 3  |
| Brenda  | 60  | 1  |
| Cate    | 23  | 3  |
| Deborah | 38  | 2  |
| Eloise  | 45  | 1  |

The process of separating a table into two can be reversed by applying a table join, for example:

let
    ( keyTable, valueTable ) =
        favFilms
            |> normalize "id" [ "film", "release", "director" ]
in
leftJoin ( valueTable, "id" ) ( keyTable, "id" )
    |> removeColumn "id"

Join

Join two tables using a common key. While not specific to tidy data, joining tidy tables is often more meaningful than joining messy ones. Joins often rely on the existence of a 'key' column containing unique row identifiers. If tables to be joined do not have such a key, one can be added with insertIndexColumn.

The examples below illustrate joining two input tables with shared key values k2 and k4:

table1:

| Key1 | colA | colB |
| ---- | ---- | ---- |
| k1   | a1   | b1   |
| k2   | a2   | b2   |
| k3   | a3   | b3   |
| k4   | a4   | b4   |

table2:

| Key2 | colC | colD |
| ---- | ---- | ---- |
| k2   | c2   | d2   |
| k4   | c4   | d4   |
| k6   | c6   | d6   |
| k8   | c8   | d8   |

leftJoin : ( Table, String ) -> ( Table, String ) -> Table

A left join preserves all the values in the first table and adds any key-matched values from columns in the second table to it. Where both tables share common column names, including key columns, only those in the left (first) table are stored in the output.

leftJoin ( table1, "Key1" ) ( table2, "Key2" )

would generate

| Key1 | colA | colB | Key2 | colC | colD |
| ---- | ---- | ---- | ---- | ---- | ---- |
| k1   | a1   | b1   |      |      |      |
| k2   | a2   | b2   | k2   | c2   | d2   |
| k3   | a3   | b3   |      |      |      |
| k4   | a4   | b4   | k4   | c4   | d4   |

If one or both of the key columns are not found, the left table is returned.

rightJoin : ( Table, String ) -> ( Table, String ) -> Table

A right join preserves all the values in the second table and adds any key-matched values from columns in the first table to it. Where both tables share common column names, including key columns, only those in the right (second) table are stored in the output.

rightJoin ( table1, "Key1" ) ( table2, "Key2" )

would generate

| Key2 | colC | colD | Key1 | colA | colB |
| ---- | ---- | ---- | ---- | ---- | ---- |
| k2   | c2   | d2   | k2   | a2   | b2   |
| k4   | c4   | d4   | k4   | a4   | b4   |
| k6   | c6   | d6   |      |      |      |
| k8   | c8   | d8   |      |      |      |

If one or both of the key columns are not found, the right table is returned.

innerJoin : String -> ( Table, String ) -> ( Table, String ) -> Table

An inner join will contain only key-matched rows that are present in both tables. The first parameter is the name to give the new key-matched column, replacing the separate key names in the two tables. Where both tables share a common column name, the one in the first table is prioritised.

innerJoin "Key" ( table1, "Key1" ) ( table2, "Key2" )

would generate

| Key | colA | colB | colC | colD |
| --- | ---- | ---- | ---- | ---- |
| k2  | a2   | b2   | c2   | d2   |
| k4  | a4   | b4   | c4   | d4   |

If one or both of the key columns are not found, this produces an empty table.

outerJoin : String -> ( Table, String ) -> ( Table, String ) -> Table

An outer join contains all rows of both joined tables. The first parameter is the name to give the new key-matched column, replacing the separate key names in the two tables.

outerJoin "Key" ( table1, "Key1" ) ( table2, "Key2" )

would generate

| Key | colA | colB | colC | colD |
| --- | ---- | ---- | ---- | ---- |
| k1  | a1   | b1   |      |      |
| k2  | a2   | b2   | c2   | d2   |
| k3  | a3   | b3   |      |      |
| k4  | a4   | b4   | c4   | d4   |
| k6  |      |      | c6   | d6   |
| k8  |      |      | c8   | d8   |

If one or both of the key columns are not found, this produces an empty table.

leftDiff : ( Table, String ) -> ( Table, String ) -> Table

Provides a table of all the rows in the first table that do not occur in any key-matched rows in the second table.

leftDiff ( table1, "Key1" ) ( table2, "Key2" )

would generate

| Key1 | colA | colB |
| ---- | ---- | ---- |
| k1   | a1   | b1   |
| k3   | a3   | b3   |

If the first key is not found, an empty table is returned; if the second key is not found, the first table is returned.

rightDiff : ( Table, String ) -> ( Table, String ) -> Table

Provides a table of all the rows in the second table that do not occur in any key-matched rows in the first table.

rightDiff ( table1, "Key1" ) ( table2, "Key2" )

would generate

| Key2 | colC | colD |
| ---- | ---- | ---- |
| k6   | c6   | d6   |
| k8   | c8   | d8   |

If the first key is not found, the second table is returned; if the second key is not found, an empty table is returned.

Output

tableSummary : Basics.Int -> Table -> List String

Provide a textual description of a table, configurable to show a given number of table rows. If the number of rows to show is negative, all rows are output. This is designed primarily to generate markdown output, but is interpretable as raw text.

columnNames : Table -> List String

Provide a list of column names for the given table.
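Since several functions in this module return the original table unchanged when a named column cannot be found, it may be helpful to check for a column before processing. A minimal sketch, where hasColumn is a hypothetical helper and not part of the module:

```elm
import Tidy exposing (Table, columnNames)


-- Hypothetical helper: True if the table contains a column with this name.
hasColumn : String -> Table -> Bool
hasColumn name table =
    List.member name (columnNames table)
```

For example, hasColumn "temperature" myTable could guard a call to mapColumn or filterRows on that column.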

toCSV : Table -> String

Provide a CSV (comma-separated values) format version of a table. Can be useful for applications that need to save a table as a file.
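One possible use is a text-in, text-out cleaning step. The sketch below parses CSV text, drops rows with no value in a given column, and writes the result back to CSV text; dropMissing is a hypothetical helper name.

```elm
import Tidy exposing (filterRows, fromCSV, toCSV)


-- Hypothetical helper: keep only rows with a non-empty value in the named
-- column and return the remaining rows as CSV text.
dropMissing : String -> String -> String
dropMissing columnName csvText =
    csvText
        |> fromCSV
        |> filterRows columnName (\value -> value /= "")
        |> toCSV
```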

toDelimited : String -> Table -> String

Provide text containing table values separated by the given delimiter (first parameter). Can be useful for applications that need to save a table as a file. For example, to create tab-delimited (TSV) text representing a table for later saving as a file:

toDelimited "\t" myTable

Column output

numColumn : String -> Table -> List Basics.Float

Extract the numeric values of a given column from a table. Any conversions that fail, including missing values in the table, are converted into zeros. If you wish to handle missing data / failed conversions in a different way, use toColumn instead, providing a custom converter function.

dataColumn =
    myTable |> numColumn "year"

strColumn : String -> Table -> List String

Extract the string values of a given column from a table. Missing values in the table are represented as empty strings. If you wish to handle missing values in a different way, use toColumn instead, providing a custom converter function.

dataColumn =
    myTable |> strColumn "cityName"

booColumn : String -> Table -> List Basics.Bool

Extract Boolean values of a given column from a table. Assumes that True values can be represented by the case-insensitive strings true, yes and 1 while all other values are assumed to be false.

dataColumn =
    myTable |> booColumn "isMarried"

toColumn : String -> (String -> a) -> Table -> List a

Extract the values of the column with the given name (first parameter) from a table. The type of values in the column is determined by the given cell conversion function. The converter function should handle cases of missing data in the table (e.g. empty strings) as well as failed conversions (e.g. attempts to convert text into a number).

imputeMissing : String -> Int
imputeMissing =
    String.toInt >> Maybe.withDefault 0

dataColumn =
    myTable |> toColumn "count" imputeMissing
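An alternative to imputing a default is to keep failed conversions visible as Maybe values. The sketch below passes String.toInt directly as the converter; the table contents and the names counts, maybeCounts and total are illustrative.

```elm
import Tidy exposing (Table, fromCSV, toColumn)


-- Illustrative data: one valid count, one missing and one non-numeric.
counts : Table
counts =
    """count,item
3,apples
,pears
many,plums"""
        |> fromCSV


-- String.toInt gives Nothing for "" and for non-numeric text, so missing
-- and failed conversions stay distinguishable from a genuine zero.
maybeCounts : List (Maybe Int)
maybeCounts =
    counts |> toColumn "count" String.toInt


-- Sum only the values that converted successfully.
total : Int
total =
    maybeCounts |> List.filterMap identity |> List.sum
```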