elm-explorations / benchmark / Benchmark

Benchmark Elm Programs


type alias Benchmark =
Benchmark

Benchmarks that contain potential, in-progress, and completed runs.

To make these, try benchmark, compare, or scale, and organize them with describe.

Creating and Organizing Benchmarks

benchmark : String -> (() -> a) -> Benchmark

Benchmark a single function.

benchmark "head" (\_ -> List.head [ 1 ])

The name here should be short and descriptive. Ideally, it should also uniquely identify a single benchmark among your whole suite.

Your code is wrapped in an anonymous function, which we will call repeatedly to measure speed. Note that this is slightly slower than calling functions directly. This is OK! The point of this library is to reliably measure execution speed. In this case, we get more consistent results by calling them inside thunks like this.

Now, a note about benchmark design: when you first write benchmarks, you usually think something along the lines of "I need to test the worst possible complexity!" You should test this eventually, but it's a bad first step.

Instead, benchmark the smallest real sample. If your typical use of a data structure has 20 items, measure with 20 items. You'll get to the edge cases eventually, but it's better to get the basics right first. Solve the problems you know are real instead of inventing situations you may never encounter.
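
As a rough sketch of that advice (the 20-item list and the fold below are placeholders for your real code, and we assume Benchmark is imported), a first benchmark might look like:

typicalInput : List Int
typicalInput =
    List.range 1 20

sumTwenty : Benchmark
sumTwenty =
    -- the input is built once, outside the thunk, so only the fold is measured
    benchmark "sum a typical 20-item list" (\_ -> List.foldl (+) 0 typicalInput)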

When you get to the point where you know you need to measure a bunch of different sizes, we've got your back: that's what scale is for.

compare : String -> String -> (() -> a) -> String -> (() -> b) -> Benchmark

Run two benchmarks head-to-head. This is useful when optimizing data structures or other situations where you can make apples-to-apples comparisons between different approaches.

As with benchmark, the first argument is the name for the comparison. The other string arguments are the names of the functions that follow them directly.

compare "initialize"
    "Hamt"
    (\_ -> Array.HAMT.initialize 100 identity)
    "Core"
    (\_ -> Array.initialize 100 identity)

In addition to the general advice in benchmark, try as hard as possible to make the arguments the same. It wouldn't be a valid comparison if, in the example above, we told Array.HAMT to use 1,000 items instead of 100. In the cases where you can't get exactly the same arguments, at least try to match output.
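
For instance, when the two sides can't take literally the same argument because the types differ, you can still feed them the same content. This sketch (the names and inputs are illustrative, not part of this package) measures length over the same 100 elements, once as a String and once as a List Char:

asString : String
asString =
    String.repeat 100 "x"

asChars : List Char
asChars =
    String.toList asString

lengthComparison : Benchmark
lengthComparison =
    -- different argument types, but both sides see the same 100 elements
    compare "length of 100 elements"
        "String.length"
        (\_ -> String.length asString)
        "List.length"
        (\_ -> List.length asChars)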

scale : String -> List ( String, () -> a ) -> Benchmark

Specify scale benchmarks for a function. This is especially good for measuring the performance of your data structures under differently sized workloads.

Beware that large series can make very heavy benchmarks. Adjust your expectations and measurements accordingly!

For example, this benchmark will see how long it takes to get a dictionary's size, where the sizes are powers of 10 between 1 and 100,000:

dictOfSize : Int -> Dict Int ()
dictOfSize size =
    List.range 0 size
        |> List.map (\a -> ( a, () ))
        |> Dict.fromList

dictSize : Benchmark
dictSize =
    List.range 0 5
        -- tip: prepare your data structures _outside_ the
        -- benchmark function. Here, we're measuring `Dict.size`
        -- without interference from `dictOfSize` and the
        -- functions that it uses.
        |> List.map ((^) 10)
        |> List.map (\size -> ( size, dictOfSize size ))
        -- now that we have a list of sized structures, turn each one
        -- into a named thunk and pass them all to `scale`!
        |> List.map (\( size, target ) -> ( String.fromInt size, \_ -> Dict.size target ))
        |> scale "Dict.size"

Note: the API for this function is newer than the rest of this module and is more likely to change. If you use it, please open an issue with your use case so we know which situations to optimize for in future releases.

describe : String -> List Benchmark -> Benchmark

Group a number of benchmarks together. Grouping benchmarks with describe will never affect measurement, only organization.

You'll typically have at least one call to this in your benchmark program, at the top level:

describe "your program"
    [{- your benchmarks here -}]
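
For a slightly fuller sketch, groups can nest, and the list can mix benchmark, compare, and scale. Reusing examples from earlier on this page:

suite : Benchmark
suite =
    describe "my benchmarks"
        [ describe "List"
            [ benchmark "head" (\_ -> List.head [ 1 ]) ]
        , describe "Dict"
            [ dictSize ]
        ]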

Writing Runners

step : Benchmark -> Task Basics.Never Benchmark

Step a benchmark forward to completion.

Warning: step is only useful for writing runners. As a consumer of the elm-benchmark library, you'll probably never need it!

...

Still with me? OK, let's go.

This function "advances" a benchmark through a series of states (described below.) If the benchmark has no more work to do, this is a no-op. But you probably want to know about that so you can present results to the user, so use done to figure it out before you call this.

At a high level, a runner just needs to receive benchmarks from the user, iterate over them using this function, and convert them to Reports whenever it makes sense to you to do so. You shouldn't need to care too much about the nuances of the internal benchmark state, but a little knowledge is useful for making a really great user experience, so read on.
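
As a sketch only (the Updated message and the Task wiring here are hypothetical, not part of this package's API), the heart of a runner's update loop tends to look something like this, assuming import Benchmark exposing (Benchmark) and import Task:

type Msg
    = Updated Benchmark

stepIfNeeded : Benchmark -> Cmd Msg
stepIfNeeded benchmark =
    if Benchmark.done benchmark then
        -- nothing left to measure; time to build and show a Report
        Cmd.none

    else
        -- otherwise run one more step and feed the result back into update
        Benchmark.step benchmark
            |> Task.perform Updated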

The Life of a Benchmark

     ┌─────────────┐
     │    cold     │
     │  benchmark  │
     └─────────────┘
            │
            │  warm up JIT
            ▼
     ┌─────────────┐
     │   unsized   │
     │  benchmark  │
     └─────────────┘
            │
            │  determine
            │  sample size
            ▼
    ┌──────────────┐
    │              │ ───┐
    │    sized     │    │ collect
    │  benchmark   │    │ another
    │  (running)   │    │ sample
    │              │ ◀──┘
    └──────────────┘
        │      │
     ┌──┘      └──┐
     │            │
     ▼            ▼
┌─────────┐  ┌─────────┐
│         │  │         │
│ success │  │ failure │
│         │  │         │
└─────────┘  └─────────┘

When you get a Benchmark from the user it won't have any idea how big the sample size should be. In fact, we can't know this in advance because different functions will have different performance characteristics on different machines and browsers and phases of the moon and so on and so forth.

This is difficult, but not hopeless! We can determine sample size automatically by running the benchmark a few times to get a feel for how it behaves in this particular environment. This becomes our first step. (If you're curious about how exactly we do this, check the Benchmark.LowLevel documentation.)

Once we have the base sample size, we start collecting samples. We multiply the base sample size to spread runs into a series of buckets. We do this because running a benchmark twice ought to take about twice as long as running it once. Since this relationship is linear, we can collect samples at several sizes and fit a trend to them that is resilient to outliers.

The final result takes the form of an error or a set of samples, their sizes, and a trend created from that data.

At this point, we're done! The results are presented to the user, and they make optimizations and try again for ever higher runs per second.

done : Benchmark -> Basics.Bool

Find out if a Benchmark is finished. For progress information suitable for reporting, see Benchmark.Status.progress.

The default runner uses this function to find out if it should call step any more.