Tidy and shape tabular data.
The basic organisational unit for tabular data. Each column in a table has a unique name by which it may be referenced. Table cell values are represented as Strings, but can be converted to other types via column output functions (e.g. numColumn).
fromCSV : String -> Table
Create a table from a multi-line comma-separated string. For example
myTable =
"""colA,colB,colC
a1,b1,c1
a2,b2,c2
a3,b3,c3"""
|> fromCSV
fromDelimited : Char -> String -> Table
Create a table from a multi-line string where values are separated by the given delimiter (first parameter). For example, to process a tab-delimited values file (TSV):
myTable =
"""colA colB colC
a1 b1 c1
a2 b2 c2
a3 b3 c3"""
|> fromDelimited '\t'
fromGrid : String -> Table
Transform multi-line string input values in the form:
"""
z00,z01,z02,z03, etc.
z10,z11,z12,z13, etc.
z20,z21,z22,z23, etc.
z30,z31,z32,z33, etc.
etc."""
into a tidy table in the form:
| row | col | z |
| --: | --: | --: |
| 0 | 0 | z00 |
| 0 | 1 | z01 |
| 0 | 2 | z02 |
| 0 | 3 | z03 |
| 1 | 0 | z10 |
| 1 | 1 | z11 |
| | | |
Values between commas are outer-trimmed of whitespace unless enclosed in quotes and entirely blank lines are ignored. Input can be ragged with different numbers of columns in each row.
Note the common convention that in grids, the origin (row 0, col 0) is at the top-left, whereas in Cartesian coordinate systems the origin (x=0, y=0) is at the bottom-left. You may therefore wish to reverse the order of row values in the input string if you are mapping onto a Cartesian coordinate system. For example,
"""z00,z01,z02,z03
z10,z11,z12,z13
z20,z21,z22,z23
z30,z31,z32,z33"""
|> String.split "\n"
|> List.reverse
|> String.join "\n"
|> fromGrid
|> renameColumn "row" "y"
|> renameColumn "col" "x"
fromGridRows : List (List String) -> Table
Transform list of input string lists in the form:
[ [z00, z01, z02, z03, ...]
, [z10, z11, z12, z13, ...]
, [z20, z21, z22, z23, ...]
, [z30, z31, z32, z33, ...]
, [...]
]
into a tidy table in the form:
| row | col | z |
| --: | --: | --- |
| 0 | 0 | z00 |
| 0 | 1 | z01 |
| 0 | 2 | z02 |
| 0 | 3 | z03 |
| 1 | 0 | z10 |
| 1 | 1 | z11 |
| | | |
Input can be ragged with different numbers of columns in each row. Entirely empty rows (i.e. []) are ignored, but cells with empty strings (e.g. [""]) are captured.
As with fromGrid, you may wish to reverse the input row order if you are mapping onto a Cartesian coordinate system.
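The row reversal mentioned above might look like the following minimal sketch. It assumes the functions documented here are exposed by a module named Tidy; the name cartesianGrid and the cell values are purely illustrative.

```elm
import Tidy exposing (Table, fromGridRows, renameColumn)


-- Reverse the input rows so that row 0 corresponds to the bottom of the
-- grid, then rename the generated columns to Cartesian-style names.
cartesianGrid : Table
cartesianGrid =
    [ [ "z00", "z01", "z02" ]
    , [ "z10", "z11", "z12" ]
    , [ "z20", "z21", "z22" ]
    ]
        |> List.reverse
        |> fromGridRows
        |> renameColumn "row" "y"
        |> renameColumn "col" "x"
```

Because the input is an ordinary list of lists, List.reverse can be applied directly without splitting and rejoining a string as in the fromGrid example.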
empty : Table
Create an empty table. Useful if table items are to be added programmatically with insertRow and insertColumn.
insertRow : List ( String, String ) -> Table -> Table
Add a row of values to a table. The new values are represented by a list of (columnName,columnValue) tuples. If the table being appended to is not empty, the column names should correspond to existing columns in the table or they will be ignored. Any unspecified columns will have an empty string value inserted.
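As a minimal sketch of building a table row by row, assuming the functions are exposed by a module named Tidy (the table name cities and its contents are hypothetical):

```elm
import Tidy exposing (Table, empty, insertRow)


-- Build a table one row at a time, starting from an empty table.
cities : Table
cities =
    empty
        |> insertRow [ ( "city", "Bristol" ), ( "temperature", "12" ) ]
        -- No "temperature" value is given here, so, following the
        -- description above, an empty string is inserted for it.
        |> insertRow [ ( "city", "Glasgow" ) ]
        -- "country" does not match an existing column, so, following
        -- the description above, it is ignored.
        |> insertRow
            [ ( "city", "Aberdeen" )
            , ( "temperature", "7" )
            , ( "country", "Scotland" )
            ]
```

The first insertRow call fixes the set of columns, so it is worth supplying every column name in that first row.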
filterRows : String -> (String -> Basics.Bool) -> Table -> Table
Keep rows in the table where the values in the given column satisfy the given test. The test should be a function that takes a cell value and returns either True or False depending on whether the row containing that value in the column should be retained.
isWarm : String -> Bool
isWarm s =
case String.toFloat s of
Just x ->
x >= 10
Nothing ->
False
warmCities =
myTable |> filterRows "temperature" isWarm
renameColumn : String -> String -> Table -> Table
Rename the given column (first parameter) with a new name (second parameter). If the new column name matches an existing one, the existing one will be replaced by the renamed column.
insertColumn : String -> List String -> Table -> Table
Add a column of data to a table. The first parameter is the name to give the column. The second is a list of column values. If the table already has a column with this name, it will get replaced with the given data. To ensure table rows are always aligned, if the table is not empty, the column values are padded to match the longest column in the table after insertion.
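A minimal sketch of building a table column by column, assuming a module named Tidy exposes these functions (readings and its values are illustrative only):

```elm
import Tidy exposing (Table, empty, insertColumn)


readings : Table
readings =
    empty
        |> insertColumn "city" [ "Bristol", "Sheffield", "Glasgow" ]
        -- Only two values are supplied here; following the description
        -- above, the column is padded to align with the three-row
        -- "city" column.
        |> insertColumn "temperature" [ "12", "11" ]
```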
insertColumnFromJson : String -> List String -> String -> Table -> Table
Add a column of data extracted from a JSON string onto a table. The first parameter is the name of the JSON object containing the data values to add. This will become the name of the column in the table. The second is a list of JSON object names that define the path to the column object. This can be an empty list if the object is in an array at the root of the JSON. The third parameter is the JSON string to parse and the fourth the table to which a new column will be added. If there is a problem finding the column object, the original table is returned.
For example,
json =
"""[
{ "person": "John Smith", "treatment": "b", "result": 2 },
{ "person": "Jane Doe", "treatment": "a", "result": 16 },
{ "person": "Jane Doe", "treatment": "b", "result": 11 },
{ "person": "Mary Johnson", "treatment": "a", "result": 3 },
{ "person": "Mary Johnson", "treatment": "b", "result": 1 }
]"""
table =
empty
|> insertColumnFromJson "person" [] json
|> insertColumnFromJson "treatment" [] json
|> insertColumnFromJson "result" [] json
would generate a table
| person | treatment | result |
| ------------ | --------- | -----: |
| John Smith | b | 2 |
| Jane Doe | a | 16 |
| Jane Doe | b | 11 |
| Mary Johnson | a | 3 |
| Mary Johnson | b | 1 |
insertIndexColumn : String -> String -> Table -> Table
Add an index column to a table. The first parameter is the name to give the new column containing index values (replacing an existing column if it shares the same name). The second is a prefix to add to each index value, useful for giving different tables different index values (or use "" for no prefix).
Creating an index column can be useful when joining tables with keys that you wish to guarantee are unique for each row. For example, to combine the rows of two tables, table1 and table2, that may contain repeated values:
outerJoin "key"
( insertIndexColumn "key" "t1" table1, "key" )
( insertIndexColumn "key" "t2" table2, "key" )
insertSetIndexColumn : String -> String -> Table -> Table
Add an index column to a table based on the partition of values in a given column into a minimum number of sets. The first parameter is the name to give the new column containing set index values, replacing an existing column if it shares the same name. The second is the name of the column containing the values to be partitioned into sets. If the column name is not found, the original table is returned.
For example given the following treatment table,
| person | treatment |
| ------------ | --------- |
| John Smith | b |
| Jane Doe | a |
| Jane Doe | b |
| Mary Johnson | a |
| Mary Johnson | b |
Applying insertSetIndexColumn "id" "person" would generate,
| id | person | treatment |
| -- | ------------ | --------- |
| 1 | John Smith | b |
| 1 | Jane Doe | a |
| 2 | Jane Doe | b |
| 1 | Mary Johnson | a |
| 2 | Mary Johnson | b |
Rows with a given id will all be unique. Creating a set index column can be useful when spreading a table made up of a single pair of columns. For example, taking the treatment table and inserting a set ID before spreading,
treatmentTable
|> insertSetIndexColumn "id" "person"
|> spread "person" "treatment"
generates the following table:
| id | John Smith | Jane Doe | Mary Johnson |
| -- | ---------- | -------- | ------------ |
| 1 | b | a | a |
| 2 | | b | b |
removeColumn : String -> Table -> Table
Remove a column with the given name from a table. If the column is not present in the table, the original table is returned.
mapColumn : String -> (String -> String) -> Table -> Table
Transform the contents of the given column (first parameter) with a mapping function (second parameter). For example
newTable =
mapColumn "myColumnHeading" impute myTable
impute val =
if val == "" then
"0"
else
val
If the column name is not found, the original table is returned.
filterColumns : (String -> Basics.Bool) -> Table -> Table
Keep columns in the table whose names satisfy the given test. The test should be a function that takes a column heading and returns either True or False depending on whether the column should be retained.
newTable =
myTable
|> filterColumns (\s -> String.left 11 s == "temperature")
moveColumnToEnd : String -> Table -> Table
Move the column with the given name (first parameter) to become the last column in the given table (second parameter). While column order has no effect on table processing, this can be useful for display purposes, for example when separating variables that represent observation categories from those representing observation measurements.
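As an illustrative sketch, removeColumn and moveColumnToEnd can be combined to prepare a table for display. This assumes a module named Tidy; the stations table and its column names are hypothetical.

```elm
import Tidy exposing (Table, fromCSV, moveColumnToEnd, removeColumn)


stations : Table
stations =
    """station,temperature,notes,region
S1,12,checked,south
S2,8,unchecked,north"""
        |> fromCSV


-- Drop the "notes" column, then place the measurement column last so
-- that the categorical columns (station, region) sit together.
forDisplay : Table
forDisplay =
    stations
        |> removeColumn "notes"
        |> moveColumnToEnd "temperature"
```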
Tidy data (Wickham, 2014) is a convention for organising tabular data such that columns represent distinct variables and rows represent observations. This isolates the semantic meaning of items in any column independently of all others. The effect is to greatly simplify data interchange and many data analytical functions.
Wickham identifies some common problems with data that are not in tidy format ("messy" data), each of which can be solved with a small number of simple operations:
gather : String -> String -> List ( String, String ) -> Table -> Table
Combine several columns that represent the same variable into two columns, one referencing the original column, the other the values of the variable. For example, the following messy table
| location | temperature2017 | temperature2018 |
| --------- | --------------: | --------------: |
| Bristol | 12 | 14 |
| Sheffield | 11 | 13 |
| Glasgow | 8 | 9 |
| Aberdeen | | 7 |
can be gathered to create a tidy table:
| location | year | temperature |
| --------- | ---- | ----------: |
| Bristol | 2017 | 12 |
| Bristol | 2018 | 14 |
| Sheffield | 2017 | 11 |
| Sheffield | 2018 | 13 |
| Glasgow | 2017 | 8 |
| Glasgow | 2018 | 9 |
| Aberdeen | 2018 | 7 |
The first two parameters represent the names to be given to the new reference column (year in the example above) and variable column (temperature in the example above). The third is a list of (columnName,columnReference) pairs to be gathered (e.g. [ ("temperature2017", "2017"), ("temperature2018", "2018") ] above).
Only non-empty cell values in the variable column are gathered (e.g. note that only Aberdeen, 2018, 7 is gathered, with no entry for 2017).
If none of the columnNames in the third parameter is found in the table, an empty table is returned.
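The temperature example above could be expressed as the following sketch, assuming a module named Tidy exposes these functions (messyTable and tidyTable are illustrative names):

```elm
import Tidy exposing (Table, fromCSV, gather)


-- The messy input: one temperature column per year.
messyTable : Table
messyTable =
    """location,temperature2017,temperature2018
Bristol,12,14
Sheffield,11,13
Glasgow,8,9
Aberdeen,,7"""
        |> fromCSV


-- Gather the two yearly columns into a "year" reference column and a
-- "temperature" variable column.
tidyTable : Table
tidyTable =
    messyTable
        |> gather "year"
            "temperature"
            [ ( "temperature2017", "2017" )
            , ( "temperature2018", "2018" )
            ]
```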
For cases where more than one set of columns needs to be gathered, you can combine three stages: (a) gather all columns, adding a column group id; (b) bisect column group id and column reference; (c) spread the bisected columns. For example:
"""flowID,originLong,originLat,destLong,destLat
1,-71.9,41.8,-71.5,41.6
2,-80.5,34.9,-97.6,30.2
3,-92.1,37.0,-86.8,43.6"""
|> fromCSV
|> gather "odCoordType"
"value"
[ ( "originLong", "oLong" )
, ( "originLat", "oLat" )
, ( "destLong", "dLong" )
, ( "destLat", "dLat" )
]
|> bisect "odCoordType" headTail ( "od", "coordType" )
|> spread "coordType" "value"
creates the table
| flowID | od | Long | Lat |
| -----: | -- | ----- | ---- |
| 1 | o | -71.9 | 41.8 |
| 1 | d | -71.5 | 41.6 |
| 2 | o | -80.5 | 34.9 |
| 2 | d | -97.6 | 30.2 |
| 3 | o | -92.1 | 37.0 |
| 3 | d | -86.8 | 43.6 |
spread : String -> String -> Table -> Table
The inverse of gather, spreading a pair of columns rotates values to separate columns (like a pivot in a spreadsheet). This is useful if different variables are stored in separate rows of the same column. For example, the following table contains two different variables in the temperature column:
| location | year | readingType | temperature |
| --------- | ---- | ----------- | ----------: |
| Bristol | 2018 | minTemp | 3 |
| Bristol | 2018 | maxTemp | 27 |
| Sheffield | 2018 | minTemp | -2 |
| Sheffield | 2018 | maxTemp | 26 |
| Glasgow | 2018 | minTemp | -10 |
| Glasgow | 2018 | maxTemp | 23 |
| Aberdeen | 2018 | maxTemp | 14 |
We can spread the temperatures into separate columns reflecting their distinct meanings, generating the table:
| location | year | minTemp | maxTemp |
| --------- | ---- | ------: | ------: |
| Bristol | 2018 | 3 | 27 |
| Sheffield | 2018 | -2 | 26 |
| Glasgow | 2018 | -10 | 23 |
| Aberdeen | 2018 | | 14 |
The first parameter is the name of the column containing the values that will form the new spread column names (readingType above). The second parameter is the name of the column containing the values to be inserted in each new column (temperature above).
Missing rows (e.g. Aberdeen, 2018, minTemp above) are rotated as empty strings in the spread column. If either of the columns to spread is not found, the original table is returned.
Spreading effectively groups by values in the non-spreading columns. If the table to spread only contains the type and value columns, an empty table will be created as there are no values to group by. In these cases, adding an index column with insertSetIndexColumn can generate values to group by.
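The minimum and maximum temperature example above could be written as the following sketch, assuming a module named Tidy (readingsTable and spreadTable are illustrative names):

```elm
import Tidy exposing (Table, fromCSV, spread)


-- Two different variables share the single "temperature" column.
readingsTable : Table
readingsTable =
    """location,year,readingType,temperature
Bristol,2018,minTemp,3
Bristol,2018,maxTemp,27
Sheffield,2018,minTemp,-2
Sheffield,2018,maxTemp,26
Glasgow,2018,minTemp,-10
Glasgow,2018,maxTemp,23
Aberdeen,2018,maxTemp,14"""
        |> fromCSV


-- Values of "readingType" become the new column names; values of
-- "temperature" fill those columns, grouped by location and year.
spreadTable : Table
spreadTable =
    readingsTable
        |> spread "readingType" "temperature"
```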
bisect : String -> (String -> ( String, String )) -> ( String, String ) -> Table -> Table
Split a named column (first parameter) into two with a bisecting function (second parameter). The third parameter should be the names to give the two new columns, which are inserted into the table replacing the original bisected column.
For example, given a table
| row | col | z |
| --: | --: | --- |
| 0 | 0 | z00 |
| 0 | 1 | z01 |
| 0 | 2 | z02 |
| 1 | 0 | z10 |
| 1 | 1 | z11 |
| 1 | 2 | z12 |
bisecting it with
bisect "z"
(\z ->
( String.left 2 z
, String.left 1 z ++ String.right 1 z
)
)
( "zr", "zc" )
produces the table
| row | col | zr | zc |
| --: | --: | -- | -- |
| 0 | 0 | z0 | z0 |
| 0 | 1 | z0 | z1 |
| 0 | 2 | z0 | z2 |
| 1 | 0 | z1 | z0 |
| 1 | 1 | z1 | z1 |
| 1 | 2 | z1 | z2 |
If the column to be bisected is not found, the original table is returned.
For more sophisticated disaggregation, such as splitting a column into more than two new ones, consider disaggregate.
splitAt : Basics.Int -> String -> ( String, String )
Convenience function for splitting a string (second parameter) at the given position (first parameter).
splitAt 4 "tidyString" == ( "tidy", "String" )
If the first parameter is negative, the position is counted from the right rather than left.
splitAt -4 "temperature2019" == ( "temperature", "2019" )
Useful when using bisect to split column values in two.
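A minimal sketch of that combination, assuming a module named Tidy; the table contents and the new column names measure and year are hypothetical:

```elm
import Tidy exposing (Table, bisect, fromCSV, splitAt)


gathered : Table
gathered =
    """location,variable,value
Bristol,temperature2017,12
Bristol,temperature2018,14"""
        |> fromCSV


-- Split each "variable" value four characters from the right, so that
-- "temperature2017" becomes ( "temperature", "2017" ).
separated : Table
separated =
    gathered
        |> bisect "variable" (splitAt -4) ( "measure", "year" )
```

Partially applying splitAt to a position yields a function of type String -> ( String, String ), which is the shape bisect expects for its second parameter.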
headTail : String -> ( String, String )
Convenience function for splitting a string into its first (head) and remaining (tail) characters, e.g. headTail "tidy" == ("t","idy"). Equivalent to splitAt 1.
Useful when using bisect to split column values into one column of heads and another of tails.
disaggregate : String -> String -> List String -> Table -> Table
Disaggregate the values in a given column (first parameter) according to a regular expression (second parameter). The names to give to the new disaggregated columns are provided in the third parameter. The number of groups returned by the regular expression should match the number of new column names. Performs a similar function to tidyr's extract function.
For example, to disaggregate diagnosisCohort in the following table:
| diagnosisCohort | numCases |
| --------------- | -------: |
| new_sp_m0-14 | 52 |
| new_sp_m15-24 | 228 |
| new_sp_f0-14 | 35 |
| new_sp_f15-24 | 180 |
| new_sn_m0-14 | 9 |
| new_sn_m15-24 | 97 |
| new_sn_f0-14 | 11 |
| new_sn_f15-24 | 64 |
disaggregate "diagnosisCohort"
"new_?(.*)_(.)(.*)"
[ "diagnosis", "gender", "age" ]
produces a new table:
| numCases | diagnosis | gender | age |
| -------: | --------- | ------ | ----- |
| 52 | sp | m | 0-14 |
| 228 | sp | m | 15-24 |
| 35 | sp | f | 0-14 |
| 180 | sp | f | 15-24 |
| 9 | sn | m | 0-14 |
| 97 | sn | m | 15-24 |
| 11 | sn | f | 0-14 |
| 64 | sn | f | 15-24 |
If the column to disaggregate cannot be found, the original table is returned.
transposeTable : String -> String -> Table -> Table
Transpose the rows and columns of a table. Provide the name of the column that will generate the column headings in the transposed table as the first parameter and the name you wish to give the new row names as the second.
For example,
newTable =
myTable |> transposeTable "location" "temperature"
where myTable stores:
| location | temperature2017 | temperature2018 |
| --------- | --------------: | --------------: |
| Bristol | 12 | 14 |
| Sheffield | 11 | 13 |
| Glasgow | 8 | 9 |
creates the following table:
| temperature | Bristol | Sheffield | Glasgow |
| --------------- | ------: | --------: | ------: |
| temperature2017 | 12 | 11 | 8 |
| temperature2018 | 14 | 13 | 9 |
If the column to contain new headings cannot be found, an empty table is generated. If there are repeated names in the new headings column, earlier rows are replaced with later repeated ones.
normalize : String -> List String -> Table -> ( Table, Table )
Replace some columns with a single id column and store those column values in a separate table. Useful for removing redundancy in a table where multiple rows contain several values that describe the same feature. The two resulting tables are related with a new id column (named with the first parameter). The names of columns forming the key, that are moved into the key table, are provided as the second parameter. The function returns a pair of tables in the order (key table, value table).
For example, assuming the following table recording 5 people's favourite film is stored as favFilms:
| person | age | film | release | director |
| ------- | --- | ------------ | ------- | --------- |
| Alice | 51 | Vertigo | 1958 | Hitchcock |
| Brenda | 60 | Citizen Kane | 1941 | Welles |
| Cate | 23 | Vertigo | 1958 | Hitchcock |
| Deborah | 38 | Jaws | 1975 | Spielberg |
| Eloise | 45 | Citizen Kane | 1941 | Welles |
Normalizing it with
favFilms
|> normalize "id" [ "film", "release", "director" ]
produces a tuple containing the following two tables:
| id | film | release | director |
| -- | ------------ | ------- | ---------- |
| 1 | Citizen Kane | 1941 | Welles |
| 2 | Jaws | 1975 | Spielberg |
| 3 | Vertigo | 1958 | Hitchcock |
| person | age | id |
| ------- | --- | -- |
| Alice | 51 | 3 |
| Brenda | 60 | 1 |
| Cate | 23 | 3 |
| Deborah | 38 | 2 |
| Eloise | 45 | 1 |
The process of separating a table into two can be reversed by applying a table join, for example:
let
( keyTable, valueTable ) =
favFilms
|> normalize "id" [ "film", "release", "director" ]
in
leftJoin ( valueTable, "id" ) ( keyTable, "id" )
|> removeColumn "id"
Join two tables using a common key. While not specific to tidy data, joining tidy tables is often more meaningful than joining messy ones. Joins often rely on the existence of a 'key' column containing unique row identifiers. If tables to be joined do not have such a key, one can be added with insertIndexColumn.
The examples below illustrate joining two input tables with shared key values k2 and k4:
table1:
| Key1 | colA | colB |
| ---- | ---- | ---- |
| k1 | a1 | b1 |
| k2 | a2 | b2 |
| k3 | a3 | b3 |
| k4 | a4 | b4 |
table2:
| Key2 | colC | colD |
| ---- | ---- | ---- |
| k2 | c2 | d2 |
| k4 | c4 | d4 |
| k6 | c6 | d6 |
| k8 | c8 | d8 |
leftJoin : ( Table, String ) -> ( Table, String ) -> Table
A left join preserves all the values in the first table and adds any key-matched values from columns in the second table to it. Where both tables share common column names, including key columns, only those in the left (first) table are stored in the output.
leftJoin ( table1, "Key1" ) ( table2, "Key2" )
would generate
| Key1 | colA | colB | Key2 | colC | colD |
| ---- | ---- | ---- | ---- | ---- | ---- |
| k1 | a1 | b1 | | | |
| k2 | a2 | b2 | k2 | c2 | d2 |
| k3 | a3 | b3 | | | |
| k4 | a4 | b4 | k4 | c4 | d4 |
If one or both of the key columns are not found, the left table is returned.
rightJoin : ( Table, String ) -> ( Table, String ) -> Table
A right join preserves all the values in the second table and adds any key-matched values from columns in the first table to it. Where both tables share common column names, including key columns, only those in the right (second) table are stored in the output.
rightJoin ( table1, "Key1" ) ( table2, "Key2" )
would generate
| Key2 | colC | colD | Key1 | colA | colB |
| ---- | ---- | ---- | ---- | ---- | ---- |
| k2 | c2 | d2 | k2 | a2 | b2 |
| k4 | c4 | d4 | k4 | a4 | b4 |
| k6 | c6 | d6 | | | |
| k8 | c8 | d8 | | | |
If one or both of the key columns are not found, the right table is returned.
innerJoin : String -> ( Table, String ) -> ( Table, String ) -> Table
An inner join will contain only key-matched rows that are present in both tables. The first parameter is the name to give the new key-matched column, replacing the separate key names in the two tables. Where both tables share a common column name, the one in the first table is prioritised.
innerJoin "Key" ( table1, "Key1" ) ( table2, "Key2" )
would generate
| Key | colA | colB | colC | colD |
| --- | ---- | ---- | ---- | ---- |
| k2 | a2 | b2 | c2 | d2 |
| k4 | a4 | b4 | c4 | d4 |
If one or both of the key columns are not found, this produces an empty table.
outerJoin : String -> ( Table, String ) -> ( Table, String ) -> Table
An outer join contains all rows of both joined tables. The first parameter is the name to give the new key-matched column, replacing the separate key names in the two tables.
outerJoin "Key" ( table1, "Key1" ) ( table2, "Key2" )
would generate
| Key | colA | colB | colC | colD |
| --- | ---- | ---- | ---- | ---- |
| k1 | a1 | b1 | | |
| k2 | a2 | b2 | c2 | d2 |
| k3 | a3 | b3 | | |
| k4 | a4 | b4 | c4 | d4 |
| k6 | | | c6 | d6 |
| k8 | | | c8 | d8 |
If one or both of the key columns are not found, this produces an empty table.
leftDiff : ( Table, String ) -> ( Table, String ) -> Table
Provides a table of all the rows in the first table that do not occur in any key-matched rows in the second table.
leftDiff ( table1, "Key1" ) ( table2, "Key2" )
would generate
| Key1 | colA | colB |
| ---- | ---- | ---- |
| k1 | a1 | b1 |
| k3 | a3 | b3 |
If the first key is not found, an empty table is returned; if the second key is not found, the first table is returned.
rightDiff : ( Table, String ) -> ( Table, String ) -> Table
Provides a table of all the rows in the second table that do not occur in any key-matched rows in the first table.
rightDiff ( table1, "Key1" ) ( table2, "Key2" )
would generate
| Key2 | colC | colD |
| ---- | ---- | ---- |
| k6 | c6 | d6 |
| k8 | c8 | d8 |
If the first key is not found, the second table is returned; if the second key is not found, an empty table is returned.
tableSummary : Basics.Int -> Table -> List String
Provide a textual description of a table, configurable to show a given number of table rows. If the number of rows to show is negative, all rows are output. This is designed primarily to generate markdown output, but is interpretable as raw text.
columnNames : Table -> List String
Provide a list of column names for the given table.
toCSV : Table -> String
Provide a CSV (comma-separated values) format version of a table. Can be useful for applications that need to save a table as a file.
toDelimited : String -> Table -> String
Provide text containing table values separated by the given delimiter (first parameter). Can be useful for applications that need to save a table as a file. For example, to create tab-delimited (TSV) text representing a table for later saving as a file:
toDelimited "\t" myTable
numColumn : String -> Table -> List Basics.Float
Extract the numeric values of a given column from a table. Any conversions that fail, including missing values in the table, are converted into zeros. If you wish to handle missing data / failed conversions in a different way, use toColumn instead, providing a custom converter function.
dataColumn =
myTable |> numColumn "year"
strColumn : String -> Table -> List String
Extract the string values of a given column from a table. Missing values in the table are represented as empty strings. If you wish to handle missing values in a different way, use toColumn instead, providing a custom converter function.
dataColumn =
myTable |> strColumn "cityName"
booColumn : String -> Table -> List Basics.Bool
Extract Boolean values of a given column from a table. Assumes that True values can be represented by the case-insensitive strings true, yes and 1, while all other values are assumed to be false.
dataColumn =
myTable |> booColumn "isMarried"
toColumn : String -> (String -> a) -> Table -> List a
Extract the values of the column with the given name (first parameter) from a table. The type of values in the column is determined by the given cell conversion function. The converter function should handle cases of missing data in the table (e.g. empty strings) as well as failed conversions (e.g. attempts to convert text into a number).
imputeMissing : String -> Int
imputeMissing =
    String.toInt >> Maybe.withDefault 0
dataColumn =
myTable |> toColumn "count" imputeMissing
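Because the converter can return any type, missing or unparseable cells can also be kept explicit rather than imputed. The sketch below assumes a module named Tidy and uses String.toInt from Elm's core library as the converter; the counts table is hypothetical.

```elm
import Tidy exposing (Table, fromCSV, toColumn)


counts : Table
counts =
    """item,count
apples,3
pears,
plums,many"""
        |> fromCSV


-- Cells that cannot be read as integers (empty or non-numeric text)
-- are converted to Nothing rather than being replaced by a default.
maybeCounts : List (Maybe Int)
maybeCounts =
    counts |> toColumn "count" String.toInt


-- Keep only the successfully converted values.
validCounts : List Int
validCounts =
    List.filterMap identity maybeCounts
```

Deferring the decision about missing values in this way lets the caller choose between dropping, defaulting or reporting them.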