Html.Parser (danneu/html-parser)

Leniently parse html5 documents and fragments and then render them into strings or Elm's virtual dom nodes.

Definition


type Node
    = Text String
    | Comment String
    | Element String (List ( String, String )) (List Node)

An html node is a tree of text, comments, and element nodes.

An element (e.g. <div foo="bar">hello</div>) can have attributes and child nodes.
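
Because Node is an ordinary recursive custom type, it can be walked with a case expression. The following is a minimal sketch, not part of the package: `textContent` and the `NodeText` module are hypothetical names chosen for illustration. It collects the text found beneath a node and ignores comments and attributes.

```elm
module NodeText exposing (textContent)

import Html.Parser exposing (Node(..))


{-| Concatenate every text node beneath a node.
Hypothetical helper for illustration; not provided by Html.Parser.
-}
textContent : Node -> String
textContent node =
    case node of
        Text string ->
            string

        -- Comments contribute no visible text.
        Comment _ ->
            ""

        -- Tag name and attributes are ignored; recurse into the children.
        Element _ _ children ->
            children
                |> List.map textContent
                |> String.concat
```

For instance, `textContent (Element "div" [ ( "foo", "bar" ) ] [ Text "hello" ])` evaluates to `"hello"`.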


type alias Document =
    { legacyCompat : Basics.Bool
    , root : Node
    }

An html document has a <!doctype> and then a root html node.


type Config

Configure the parser. Use the config constructors to create a config object.

Config

allCharRefs : Config

A config with char reference decoding turned on.

This will add ~40kb to your bundle, but it is necessary to decode entities like "&Delta;" into "Δ".

run allCharRefs "abc&Delta;def"
    == Ok [ Text "abcΔdef" ]

noCharRefs : Config

A config with char reference decoding turned off.

If you know that the html you are parsing never has named character references, or if it's sufficient to just consume them as undecoded text, then turning this off will shrink your bundle size.

run noCharRefs "abc&Delta;def"
    == Ok [ Text "abc&Delta;def" ]

customCharRefs : Dict String String -> Config

Provide your own character reference lookup dictionary.

Note that named character references are case-sensitive. When providing your own, consult the exhaustive Html.CharRefs.all dictionary to see which names appear under several casings, like "quot" and "QUOT".

Here is an example of providing a small subset of commonly-seen character references.

config : Html.Parser.Config
config =
    [ ( "quot", "\"" )
    , ( "QUOT", "\"" )
    , ( "apos", "'" )
    , ( "gt", ">" )
    , ( "GT", ">" )
    , ( "Gt", ">" )
    , ( "lt", "<" )
    , ( "LT", "<" )
    , ( "Lt", "<" )
    , ( "amp", "&" )
    , ( "AMP", "&" )
    , ( "nbsp", "\u{00A0}" )
    ]
    |> Dict.fromList
    |> customCharRefs

run config "<span>&male; &amp; &female;</span>"
    == Ok [ Element "span" [] [ Text "&male; & &female;" ] ]

Notice that character references missing from the lookup table are simply parsed as text.
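
To have those two references decoded as well, they can be added to the lookup dictionary. This is a small sketch under stated assumptions: `symbolConfig` and the `SymbolRefs` module are hypothetical names, and the keys follow the same convention as the example above (the reference name without the leading "&" and trailing ";"). The values are the characters that the HTML standard assigns to `&male;` (U+2642) and `&female;` (U+2640).

```elm
module SymbolRefs exposing (symbolConfig)

import Dict
import Html.Parser exposing (Config, customCharRefs)


{-| A deliberately tiny lookup table (hypothetical example).
Any reference not listed here is still left as undecoded text.
-}
symbolConfig : Config
symbolConfig =
    [ ( "amp", "&" )
    , ( "male", "♂" )
    , ( "female", "♀" )
    ]
        |> Dict.fromList
        |> customCharRefs
```

Passing `symbolConfig` to run in place of `config` should decode all three references in the span above instead of only `&amp;`.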

Parse

run : Config -> String -> Result (List Parser.DeadEnd) (List Node)

Parse an html fragment into a list of html nodes.

The html fragment can have multiple top-level nodes.

run allCharRefs "<div>hi</div><div>bye</div>"
    == Ok
        [ Element "div" [] [ Text "hi" ]
        , Element "div" [] [ Text "bye" ]
        ]

runElement : Config -> String -> Result (List Parser.DeadEnd) Node

Like run, except that it parses only one top-level element and always returns a single node.
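
Since the result is a single Node rather than a list, it can be pattern matched directly. The sketch below is illustrative only: `rootTagName` and the `RootTag` module are hypothetical names, not part of the package. It returns the tag name of the parsed element and treats a parse failure or a non-element result as Nothing.

```elm
module RootTag exposing (rootTagName)

import Html.Parser exposing (Node(..), allCharRefs, runElement)


{-| Tag name of the single top-level element in `source`, if any.
Hypothetical helper for illustration.
-}
rootTagName : String -> Maybe String
rootTagName source =
    case runElement allCharRefs source of
        Ok (Element name _ _) ->
            Just name

        -- Either a parse error or a node that is not an element.
        _ ->
            Nothing
```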

runDocument : Config -> String -> Result (List Parser.DeadEnd) Document

Parses <!doctype html> and any html nodes after.

Always returns a single root node. Wraps nodes in a root <html> node if one is not present.

Caveat: If there are multiple top-level nodes and one of them is <html>, then this function will wrap them all in another <html> node.
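
Because the result always has a single root node, its children can be reached through the `root` field of the Document record. A minimal sketch, assuming only the types shown on this page; `rootChildren` and the `DocumentRoot` module are hypothetical names:

```elm
module DocumentRoot exposing (rootChildren)

import Html.Parser exposing (Node(..), allCharRefs, runDocument)
import Parser


{-| Parse a full document and return the children of its root node.
Hypothetical helper for illustration.
-}
rootChildren : String -> Result (List Parser.DeadEnd) (List Node)
rootChildren source =
    runDocument allCharRefs source
        |> Result.map
            (\document ->
                case document.root of
                    Element _ _ children ->
                        children

                    -- Defensive fallback; the root is expected to be an element.
                    _ ->
                        []
            )
```

Given the caveat above, code that inspects the children may want to check for a nested <html> element when the input has several top-level nodes.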

Render

nodeToHtml : Node -> Html msg

Turn a single node into an Elm html node that Elm can render.

nodesToHtml : List Node -> List (Html msg)

Turn multiple html nodes into Elm html that Elm can render.

view : Html Msg
view =
    Html.div
        []
        ("<p>hello world</p>"
            |> Html.Parser.run Html.Parser.allCharRefs
            |> Result.map Html.Parser.nodesToHtml
            |> Result.withDefault [ Html.text "parse error" ]
        )
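
The view above hides the cause of a failed parse behind a fixed fallback. When the error branch needs its own rendering, a case expression on the Result is one option. This is only a sketch: `viewHtml`, `parseErrorMessage`, and the `SafeView` module are hypothetical names, and the message format is an arbitrary choice.

```elm
module SafeView exposing (parseErrorMessage, viewHtml)

import Html exposing (Html)
import Html.Parser
import Parser


{-| Summarize a failed parse by counting the reported dead ends.
-}
parseErrorMessage : List Parser.DeadEnd -> String
parseErrorMessage deadEnds =
    "parse error: "
        ++ String.fromInt (List.length deadEnds)
        ++ " dead end(s)"


{-| Render parsed html, or a short error message if parsing fails.
-}
viewHtml : String -> Html msg
viewHtml source =
    case Html.Parser.run Html.Parser.allCharRefs source of
        Ok nodes ->
            Html.div [] (Html.Parser.nodesToHtml nodes)

        Err deadEnds ->
            Html.text (parseErrorMessage deadEnds)
```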

nodeToString : Node -> String

Convert an html node into a non-pretty string.

nodeToString (Element "a" [] [ Text "hi" ])
    == "<a>hi</a>"

nodesToString : List Node -> String

Convert multiple html nodes into a non-pretty string.

nodesToString
    [ Element "a" [] [ Text "hi" ]
    , Element "div" [] [ Element "span" [] [] ]
    ]
    == "<a>hi</a><div><span></span></div>"

nodeToPrettyString : Node -> String

Generate a pretty string for a single html node.

nodesToPrettyString : List Node -> String

Turn multiple html nodes into a pretty-printed, indented html string.

("<a><b><c>hello</c></b></a>"
    |> Html.Parser.run Html.Parser.allCharRefs
    |> Result.map nodesToPrettyString
)
    == Ok """<a>
    <b>
        <c>
            hello
        </c>
    </b>
</a>"""

documentToString : Document -> String

Convert a document into a string starting with <!doctype html> followed by the root html node.

documentToPrettyString : Document -> String

Convert a document into a pretty, indented string.
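
Parsing and rendering compose through Result.map, so a document formatter can be sketched in a few lines. `prettify` and the `PrettyDocument` module are hypothetical names used for illustration; the exact indentation of the output is whatever documentToPrettyString produces.

```elm
module PrettyDocument exposing (prettify)

import Html.Parser exposing (allCharRefs, documentToPrettyString, runDocument)
import Parser


{-| Parse a full html document and re-render it as an indented string.
Hypothetical helper for illustration.
-}
prettify : String -> Result (List Parser.DeadEnd) String
prettify source =
    runDocument allCharRefs source
        |> Result.map documentToPrettyString
```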