A demonstration of the law of the flowering plants

The day a flower blooms is one of the earliest phenomena studied with systematic data collection and analysis. The prediction rule developed nearly three centuries ago is still used by scientists – with relatively few changes – to plan harvests, manage pests, and study ecosystems stressed by climate change. In this tutorial, Jonathan Auerbach introduces the “law of the flowering plants” and demonstrates how it can be applied to modern datasets.

Prediction
History of data science
Statistics
Author

Jonathan Auerbach

Published

April 13, 2023

Modified

January 18, 2024

This tutorial will demonstrate a popular method for predicting the day a flower will bloom. There are many reasons why you might want to predict a bloom date. You might be a scientist studying ecosystems stressed by climate change. Or you might be planning a trip to Amsterdam and would like to time your stay to when the tulips are in bloom. Or maybe you are participating in the annual Cherry Blossom Prediction Competition and want some ideas to help you get started.

In any case, you might be surprised to learn that the day a flower blooms is one of the earliest phenomena studied with systematic data collection and analysis. The mathematical rule developed in the eighteenth century to make these predictions – now called the “law of the flowering plants” – shaped the direction of statistics as a field and is still used by scientists with relatively few changes.

We present the law of the flowering plants as it was stated by Adolphe Quetelet, an influential nineteenth century statistician. Upon completing this tutorial, you will be able to:

  1. State the law of the flowering plants and explain how Quetelet derived it.
  2. Reproduce Quetelet’s findings with weather data from the Global Historical Climatology Network.
  3. Replicate Quetelet’s findings with more recent data from the USA National Phenology Network.
  4. Predict the day the lilac will bloom in Brussels in 2023 with weather forecasts from AccuWeather.
  5. Describe how the USA National Phenology Network uses the bloom dates of lilacs to monitor the start of spring.

At the end of the tutorial, we challenge you to design an algorithm that beats our predictions. The tutorial uses the R programming language. In particular, the code relies on the following packages:

```{r}
library("knitr")
library("kableExtra")
library("tidyverse")
library("plotly")
library("rnoaa")
library("rnpn")
library("rvest")
```

The law of the flowering plants

We begin by reviewing the law of the flowering plants as it was stated by Adolphe Quetelet. You may already know Quetelet as the inventor of the body mass index. Less known is that Quetelet recorded the bloom dates of hundreds of different plants between 1833 and 1852 at the Brussels Observatory, which he founded and directed. Quetelet reported that a plant flowers when exposed to a specific quantity of heat, measured in degrees of Celsius squared (°C²). For example, he calculated that a lilac blooms when the sum of the daily temperatures squared exceeds 4264°C² following the last frost.

He communicated this law in his Letters addressed to HRH the grand duke of Saxe-Coburg and Gotha (Number 33, 1846; translated 1849) and in his reporting On the climate of Belgium (Chapter 4, Part 4, 1848; data updated in Part 7, 1857). A picture of Quetelet and the title page of On the climate of Belgium are displayed in Figure 1.

Portrait of Adolphe Quetelet.

Title page of Adolphe Quetelet's 'On the climate of Belgium'.

Figure 1: Quetelet reported on the law of the flowering plants in On the climate of Belgium (1857). Sources: Wikimedia Commons, Gallica.

Quetelet was not the first to study bloom dates. Anthophiles have recorded the dates that flowers bloom for centuries. Written records of cherry trees go back as far as 812 AD in Japan and peach and plum trees as far as 1308 AD in China. Systematic record keeping began a century before Quetelet with Robert Marsham’s Indications of Spring (1789).

Quetelet was also not the first to study the relationship between temperature and bloom dates. René Réaumur (1735), an early adopter of the thermometer, noted the relationship before Marsham published his Indications. But Quetelet was the first to systematically study the relationship across a wide variety of plants and derive the amount of heat needed to bloom. An example of Quetelet’s careful record keeping can be seen in Figure 2, one of many tables he reported in his publications.

A table of bloom dates at Brussels Observatory observed by Quetelet between 1839 and 1852.

Figure 2: Bloom dates at Brussels Observatory observed by Quetelet between 1839 and 1852. Source: Gallica.

Reproducing Quetelet’s law of the flowering plants

To reproduce Quetelet’s law, we combine the data in Figure 2 with additional observations from his Letters. We focus on Quetelet’s primary example, the bloom date of the common lilac, Syringa vulgaris, row 18 of Figure 2. We do this because Quetelet carefully describes his methodology for measuring the bloom date of lilacs. For example, Quetelet considers a lilac to have bloomed when “the first corolla opens and shows the stamina.” That event is closest to what the USA Phenology Network describes as “open flowers”, depicted in the center image of Figure 3 below. This detail will become relevant when we attempt to replicate Quetelet’s law in a later section. Note that although we focus on lilacs in this tutorial, the R code is easily edited to predict the day that other plants will bloom.

Photo of lilac flower buds.

Photo of open lilac flowers.

Photo of full lilac flowers.

Figure 3: The bloom date occurs when the first corolla opens and shows the stamina (center image). Source: USA National Phenology Network.

In the R code below, the five-column tibble lilac contains the date each year that Quetelet observed the lilacs bloom at Brussels Observatory. The first three columns are the month, day, and year the lilacs bloomed between 1839 and 1852. These columns are combined to form the fourth column, the full date the lilacs bloomed. The last column converts the date to the day of the year the lilacs bloomed, abbreviated “doy.” That is, “doy” is the number of days it took for the lilacs bloom following January 1. Both “date” and “doy” representations of Quetelet’s observations will be useful throughout this tutorial.

```{r}
lilac <-                   
  tibble(month = c("May", "April", "April", "April", "April", "April", "May", 
                   "April", "May", "April", "May", "April", "May", "May"),
         day   =  c(10, 28, 24, 28, 20, 25, 13, 12, 9, 21, 2, 30, 1, 12),
         year  = 1839:1852,
         date  = as.Date(paste(month, day, year), format = "%B %d %Y"),
         doy   = parse_number(format(date, "%j"))) 

lilac %>% 
  kable(align = "c",
        caption = "Table 1: Bloom dates of lilacs observed by Quetelet between 1839 and 1852.") %>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "400px")
```
Table 1: Bloom dates of lilacs observed by Quetelet between 1839 and 1852.
month day year date doy
May 10 1839 1839-05-10 130
April 28 1840 1840-04-28 119
April 24 1841 1841-04-24 114
April 28 1842 1842-04-28 118
April 20 1843 1843-04-20 110
April 25 1844 1844-04-25 116
May 13 1845 1845-05-13 133
April 12 1846 1846-04-12 102
May 9 1847 1847-05-09 129
April 21 1848 1848-04-21 112
May 2 1849 1849-05-02 122
April 30 1850 1850-04-30 120
May 1 1851 1851-05-01 121
May 12 1852 1852-05-12 133

To reproduce Quetelet’s law of the flowering plants, we will combine these bloom dates with daily temperature. The daily maximum and minimum temperatures at Brussels Observatory between 1839 and 1852 are available from the Global Historical Climatology Network. The data can be downloaded using the ghcnd_search function contained within the R package rnoaa (2021). The station id for Brussels Observatory is “BE000006447”.

The ghcnd_search function returns the maximum and minimum temperature as separate tibbles in a list. In the R code below, we join the tibbles using the reduce function. Note that the temperature is reported in tenths of a degree (i.e. 0.1°C) so we divide by 10 before calculating the temperature midrange, our estimate of the daily temperature.

The result is a five-column tibble temp, which contains the year of the temperature record (“year”), the date of the temperature record (“date”), the maximum temperature (“tmax”), the minimum temperature (“tmin”), and the midrange temperature (“temp”). The first 10 rows of the table are below. When you produce the full table yourself, you may notice that a small portion of temperature records are missing. We found that imputing these missing values does not significantly change the results. Therefore, we ignore these days when conducting our analysis.

```{r}
temp <- 
  ghcnd_search(stationid = "BE000006447",
               var = c("tmax", "tmin"),
               date_min = "1839-01-01",
               date_max = "1852-12-31") %>%
  reduce(left_join) %>%
  transmute(year = parse_number(format(date, "%Y")), 
            date, 
            tmax = tmax / 10, 
            tmin = tmin / 10, 
            temp = (tmax + tmin) / 2)
  
temp %>% 
  kable(align = "c", 
        col.names = c("year", "date", "maximum temperature (°C)", 
                      "minimum temperature (°C)", "midrange temperature (°C)"),
        caption = "Table 2: Temperature observed at Brussels Observatory between 1839 and 1852.") %>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "400px")
```
Table 2: Temperature observed at Brussels Observatory between 1839 and 1852.
year date maximum temperature (°C) minimum temperature (°C) midrange temperature (°C)
1839 1839-01-01 5.7 -0.2 2.75
1839 1839-01-02 6.3 0.8 3.55
1839 1839-01-03 7.2 1.8 4.50
1839 1839-01-04 8.0 1.8 4.90
1839 1839-01-05 5.3 0.8 3.05
1839 1839-01-06 10.0 1.3 5.65
1839 1839-01-07 8.9 1.4 5.15
1839 1839-01-08 3.0 0.1 1.55
1839 1839-01-09 0.8 -0.1 0.35
1839 1839-01-10 2.8 -2.8 0.00

Reproducing Quetelet’s law is now a simple matter of calculating the sum of the squared daily temperature from the day of last frost until the bloom day. We could use the day of last frost reported in Quetelet’s Letters. However, since we will replicate Quetelet’s analysis with recent data in a later section, we use our own definition of the day of last frost. We define the day of last frost to be the day following the last day the maximum temperature is below 0. The R code below creates the function doy_last_frost to extract the day of last frost from the maximum temperature. To demonstrate this function, we then compare the bloom date with the last frost date in 1839, the first year Quetelet observed.

```{r}
doy_last_frost <- function(tmax, doy_max = 100) {
  dof <- which(tmax[1:doy_max] <= 0)
  if(length(dof) == 0) 1 else max(dof) + 1
  }

bloom_day <- 
  lilac %>% 
  filter(year == 1839) %>%
  pull(doy) + 
  as.Date("1839-01-01")
  
frost_day <- 
  temp %>% 
  filter(year == 1839) %>% 
  pull(tmax) %>% 
  doy_last_frost() + as.Date("1839-01-01") 

tibble(`last frost date` = frost_day, 
       `bloom date` = bloom_day) %>%
  kable(align = "c",
        caption = "Table 3: Last frost date and lilac bloom date at Brussels Observatory in 1839.") %>%
  kable_styling()
```
Table 3: Last frost date and lilac bloom date at Brussels Observatory in 1839.
last frost date bloom date
1839-03-08 1839-05-11

If Quetelet’s law of the flowering plants is correct, Table 3 has the following interpretation. On March 8, 1839 the lilacs at Brussels Observatory began “collecting” temperature. The lilacs continued to “collect” temperature until May 11, at which point they exceeded their 4264°C² quota and bloomed. We visualize this theory in Figure 4 with the R packages ggplot2, a member of the set of packages that constitute the “tidyverse” (2019), and plotly.

```{r}
(temp %>% 
  filter(date < as.Date("1839-06-01")) %>% 
  ggplot() + 
  aes(date, temp) + 
  geom_line() + 
  labs(
    x = "",
    y = "midrange temperature (°C)",
    title = 
      "Figure 4: According to Quetelet's law, the lilacs bloom when exposed to 4264°C² following the last frost.") +
  geom_vline(xintercept = as.numeric(c(bloom_day, frost_day)), 
             linetype = "dotted")) %>%
  ggplotly() %>% 
  add_annotations(x = as.numeric(c(frost_day, bloom_day)),
                  y = c(-4, -4),
                  text = c("last\nfrost", "first\nbloom"),
                  font = list(size = 14),
                  ay = 0,
                  xshift = c(-10, -12)) %>%
  config(displaylogo = FALSE)
```

A line graph of midrange temperature by day, for January to June 1839, with day of last frost (March 8) and day of first bloom (May 11) marked by vertical dashed lines.

Figure 4: According to Quetelet’s law, the lilacs bloom when exposed to 4264°C² following the last frost. Author provided, CC BY 4.0.

We now have all the ingredients necessary to reproduce Quetelet’s findings. Our reproduction is greatly simplified by using the nest function from the tidyr package, another member of the “tidyverse”. For an overview of nest, see the “Nested data” section of Grolemund and Wickham (2017). We will group the data by year, nest, calculate the cumulative squared temperature from the frost date to the bloom date within each year, and then unnest. We ignore temperatures below 0°C. That is, temperatures below 0°C are set to 0°C. We do this because it is clear from Quetelet’s derivation of the law that only positive temperatures should be squared. See the next section for details.

```{r}
quetelet <- 
  temp %>% 
  group_by(year) %>% 
  nest() %>% 
  left_join(lilac) %>% 
  mutate(law = map(data, ~ sum(pmax(.$temp, 0, na.rm = TRUE)[(doy_last_frost(.$tmax) + 1):doy]^2))) %>% 
  unnest(law) %>% 
  ungroup()

quetelet %>% 
  summarize(Quetelet = 4264, 
            est = mean(law), 
            se = sd(law)/sqrt(n()),
            ci  = str_c("[", round(est - 2 * se), ", ", round(est + 2 * se), "]")) %>%
  kable(dig = 0, 
        align = "c", 
        col.names = c("Quetelet's law (°C²)", "estimate (°C²)", 
                      "standard error (°C²)", "95% confidence interval (°C²)"),
        caption = "Table 4: Reproduction of Quetelet's analysis.") %>%
  kable_styling()
```
Table 4: Reproduction of Quetelet’s analysis.
Quetelet’s law (°C²) estimate (°C²) standard error (°C²) 95% confidence interval (°C²)
4264 4261 197 [3867, 4656]

The results show that Quetelet’s findings are indeed reproducible. Quetelet estimated that lilacs bloom once exposed to 4264°C² following the last frost. Our reanalysis suggests a similar amount. However, 4264°C² is the overall average across all years – the estimated amount needed to bloom varies year to year. As a result, the average has a 95% confidence interval of approximately 3870°C² to 4660°C². Quetelet was well aware of this variation. He argued it was due to unobserved factors that influence growing conditions and change each year, and he dedicated significant space in his Letters to discuss them.

These unobserved factors limit the accuracy of predictions made using the law. To assess the predictive accuracy of the law, we temporarily ignore the bloom dates Quetelet observed. Instead, we apply the 4264°C² quota to the temperature records at Brussels Observatory to predict the bloom date. We then compare our predictions with the bloom date Quetelet observed. The R code below creates the function doy_prediction to estimate the day the lilac will bloom from temperature records. Table 5 summarizes the accuracy of Quetelet’s law by the mean absolute error and root mean squared error.

```{r}
doy_prediction <- function(temp, tmax)
  doy_last_frost(tmax) + which.max(cumsum(pmax(temp[(doy_last_frost(tmax) + 1):365], 0, na.rm = TRUE)^2) > 4264)

quetelet %>% 
  mutate(pred = map(data, ~ doy_prediction(.$temp, .$tmax))) %>% 
  unnest(pred) %>% 
  ungroup() %>%
  summarize(mae  = mean(abs(doy - pred)),
            rmse = sqrt(mean((doy - pred)^2))) %>%
  kable(dig = 0,
        align = "c",
        col.names = c("mean absolute error (days)", "root mean squared error (days)"),
        caption = "Table 5: Predictions using Quetelet's law are accurate within a week on average.") %>%
  kable_styling()
```
Table 5: Predictions using Quetelet’s law are accurate within a week on average.
mean absolute error (days) root mean squared error (days)
5 6

Table 5 indicates that predictions made using the law are accurate to within a week on average. For comparison purposes, we also predict the day the lilacs will bloom using the average bloom date between 1839 and 1852. That is, on average the lilac bloomed on April 30 (April 29 on leap years), and we check the accuracy of simply predicting this average date each year. Table 6 indicates the average bloom date yields predictions that are less accurate by an average of two days.

```{r}
quetelet %>%
  summarize(pred = mean(doy),
            mae  = mean(abs(doy - pred)),
            rmse = sqrt(mean((doy - pred)^2))) %>%
  select(mae, rmse) %>%
  kable(dig = 0,
        align = "c",
        col.names = c("mean absolute error (days)",
                      "root mean squared error (days)"),
        caption = "Table 6: Predictions using the average bloom date are off by a week or more on average.") %>%
  kable_styling()
```
Table 6: Predictions using the average bloom date are off by a week or more on average.
mean absolute error (days) root mean squared error (days)
7 9

Quetelet’s derivation of the law of the flowering plants

Quetelet believed that, as in physics, universal laws govern social and biological phenomenon. Quetelet was not only inspired by physics to describe social and biological patterns using mathematical formulas. He often took his formulas directly from physics. In fact, you may have already recognized similarities between his law and Newton’s second law of motion.

Quetelet reasoned that temperature exerts a “force” on plants in the same way that gravity exerts a force on a falling object. Newton’s second law states that acceleration is proportional to force. It follows that an object initially at rest and subject to a constant force will travel a distance proportional to time squared. Quetelet simply substituted temperature for time.

We briefly elaborate. Let \(d(t)\) denote the distance an object travels after time \(t\). Let \(v(t) = d'(t)\) denote its speed and \(a(t) = v'(t)\) its acceleration. If acceleration is constant, i.e. \(a(t) = c\),

\(v(t) = \int_0^t a(s) \, ds = \int_0^t c \, ds = c t\)

and

\(d(t) = \int_0^t v(s) \, ds = \int_0^t c s \, ds = \tfrac{c}{2} t^2\)

Quetelet imagined plants experience time in temperature and bloom after “traveling” distance \(d_*\). If a plant is exposed to temperature \(t_i\) on day \(i = 1, 2, \ldots\), then the bloom date, \(n_*\), is the first day \(\sum_{i=1}^{n_*} \tfrac{c}{2} t_i^2 \geq d_*\). Multiplying both sides of the inequality by \(\tfrac{2}{c}\), yields Quetelet’s law: the bloom is the first day, \(n_*\), that \(\sum_{i=1}^{n_*} t_i^2 \geq \tfrac{2}{c} d_*\).

The derivation of laws like the law of the flowering plants was popular in the nineteenth century. But any similarities between the “force” of temperature and the force of gravity are likely coincidental. We are not aware of any biological mechanisms that justify Quetelet’s application of Newton’s law.

Today, the law of the flowering plants is considered a heuristic, or rule of thumb, that approximates complicated biological mechanisms. Like Quetelet, scientists model plants as experiencing time in temperature instead of calendar time. These temperature units are typically called “growing degree days”. Scientists often find that plants may only be sensitive to temperatures in specific ranges or “modified growing degree days”. Although modern statistical methods can greatly improve the accuracy of predictions, laws like Quetelet’s remain popular because they are simple to communicate and easy to replicate, as we demonstrate in the next section.

Replicating Quetelet’s law of the flowering plants

In the previous section, we explained how Quetelet derived the law of the flowering plants. Quetelet believed the law of the flowering plants was universal, describing the bloom date of all flowers around the world and in any year. Whether the law can in fact be considered universal requires replicating Quetelet’s results with new data collected at a different location in a different year.

In this section, we replicate the law of the flowering plants using lilac bloom dates observed by scientists between 1956 and 2009 at 53 locations throughout the Pacific Northwest (2015). The data can be downloaded from the USA National Phenology Network using the rnpn package (2022). For space considerations, the R code that downloads and cleans the data is provided in the Appendix. Running this code yields the tibble usa_npn. Each row of the tibble corresponds with a bloom date observed at a given site in a given year. There are 31 columns, only seven of which we use in our replication. The remaining columns are documented in the rnpn package, and we will not review them here.

Table 7 displays six of the seven columns (and only the first 10 rows of the full table). These columns are defined in the same way as the columns of Table 1, except for “site_id”, which denotes the site at which the observation was made. Table 1 does not have a “site_id” column because all observations were made at the same site, Brussels Observatory.

```{r}
load(url("https://github.com/jauerbach/miscellaneous/blob/main/usa_npn.RData?raw=true"))

usa_npn %>%
  transmute(site_id, 
            month = first_yes_month, 
            day   = first_yes_day, 
            year  = first_yes_year, 
            date  = as.Date(paste(month, day, year), format = "%m %d %Y"),
            doy) %>%
  kable(align = "c",
        caption = "Table 7: Bloom dates of lilacs observed in pacific northwest between 1956 and 2009.") %>%
  kable_styling() %>%
  scroll_box(width = "100%", 
             height = "400px")
```
Table 7: Bloom dates of lilacs observed in pacific northwest between 1956 and 2009.
site_id month day year date doy
150 5 25 1956 1956-05-25 146
150 5 22 1957 1957-05-22 142
150 5 12 1958 1958-05-12 132
150 6 3 1959 1959-06-03 154
150 5 27 1960 1960-05-27 148
150 5 27 1961 1961-05-27 147
150 5 26 1962 1962-05-26 146
150 5 24 1963 1963-05-24 144
150 5 28 1964 1964-05-28 149
150 5 26 1966 1966-05-26 146

The seventh column we review is “temp”. Each row of “temp” is a tibble of temperature records taken at the nearest station in the Global Historical Climatology Network. The first tibble (again, only the first 10 rows) is displayed in Table 8 below. The columns are defined in the same way as the columns of Table 2, except for “id”, which denotes the location at which the temperature record was made. Table 2 does not have an “id” column because all observations were made at the same site, Brussels Observatory.

```{r}
usa_npn %>%
  pull(temp) %>%
  .[[1]] %>%
  mutate(year = parse_number(format(date, "%Y"))) %>%
  select(id, year, date, tmax, tmin, temp) %>%
  kable(align = "c",
        col.names = c("id", "year", "date", "maximum temperature (°C)", 
                      "minimum temperature (°C)", "midrange temperature (°C)"),
        caption = "Table 8: Temperature observed at an example pacific northwest site in 1956.") %>%
  kable_styling() %>%
  scroll_box(width = "100%", 
             height = "400px")
```
Table 8: Temperature observed at an example pacific northwest site in 1956.
id year date maximum temperature (°C) minimum temperature (°C) midrange temperature (°C)
USC00245761 1956 1956-01-01 5.6 -5.6 0.00
USC00245761 1956 1956-01-02 1.7 -7.2 -2.75
USC00245761 1956 1956-01-03 3.3 -11.7 -4.20
USC00245761 1956 1956-01-04 4.4 -10.0 -2.80
USC00245761 1956 1956-01-05 7.8 0.0 3.90
USC00245761 1956 1956-01-06 4.4 -11.1 -3.35
USC00245761 1956 1956-01-07 2.8 -6.1 -1.65
USC00245761 1956 1956-01-08 4.4 -4.4 0.00
USC00245761 1956 1956-01-09 1.7 -9.4 -3.85
USC00245761 1956 1956-01-10 2.8 -6.1 -1.65

We are now prepared to replicate Quetelet’s findings. We will use R code nearly identical to the code we used to reproduce Quetelet’s findings earlier. The main difference is due to the fact that temperature records are dependent across sites within a year. To account for this dependence, we compute the cumulative temperature squared from the last frost to the bloom date for each site and year. We then take the average across all sites within a year. Finally, we calculate the standard error and confidence interval using only the variation of the averages across years. Table 9 displays the results.

```{r}
usa_npn %>%             
  group_by(rownames(usa_npn)) %>%
  mutate(law = 
           map(temp, ~ sum(pmax(.$temp, 0, na.rm = TRUE)[(doy_last_frost(.$tmax, doy) + 1):(doy - 1)]^2))) %>%
  unnest(law) %>% 
  group_by(year) %>%    
  summarize(law = mean(law)) %>%
  summarize(Quetelet = 4264, 
            est = mean(law), 
            se = sd(law) / sqrt(n()),
            ci  = str_c("[", round(est - 2 * se), ", ", round(est + 2 * se), "]")) %>%
  kable(dig = 0, 
        align = "c",
        col.names = c("Quetelet's law (°C²)", "estimate (°C²)",
                      "standard error (°C²)", "95% confidence interval (°C²)"),
        caption = "Table 9: Replication of Quetelet's analysis.") %>%
  kable_styling()
```
Table 9: Replication of Quetelet’s analysis.
Quetelet’s law (°C²) estimate (°C²) standard error (°C²) 95% confidence interval (°C²)
4264 4329 116 [4098, 4560]

Table 9 indicates that Quetelet’s findings are replicable in the sense that the confidence interval calculated using Quetelet’s data (Table 4) overlaps with the confidence interval calculated using the USA lilac data (Table 9). The standard error in Table 9 is smaller than Table 4 because the replication uses 54 years of data compared to Quetelet’s 14. Note that in the R code above, we subtract 1 from “doy” to correct for differences in how the bloom date is reported. This correction is not particularly important; the confidence intervals still overlap when this correction is removed.

We now investigate the accuracy of Quetelet’s law when applied to the USA lilac data. As before, we make use of the doy_prediction function.

```{r}
usa_npn <- 
  usa_npn %>% 
  mutate(pred = map(temp, ~ doy_prediction(.$temp, .$tmax))) %>% 
  unnest(pred) %>% 
  ungroup()

usa_npn %>% 
  summarize(mae  = mean(abs(doy - 1 - pred)),
            rmse = sqrt(mean((doy - 1 - pred)^2))) %>%
  kable(dig = 0,
        align = "c", 
        col.names = c("mean absolute error (days)",
                      "root mean squared error (days)"),
        caption = "Table 10: Predictions using Quetelet's law are accurate within about two weeks on average.") %>%
  kable_styling()
```
Table 10: Predictions using Quetelet’s law are accurate within about two weeks on average.
mean absolute error (days) root mean squared error (days)
10 15

Table 10 indicates that the predictions are accurate to within two weeks on average. Recall that the predictions using Quetelet’s own data were accurate to within one week on average (Table 5). We speculate that the decrease in accuracy is due in part to the fact that both Quetelet’s lilacs and the temperature were observed at the same site, Brussels Observatory. In some cases, the USA lilacs were a few miles from where the temperature was recorded.

Although the accuracy of the predictions made using Quetelet’s law is lower when applied to the USA lilac data, Figure 5 indicates that the law produces the correct bloom date on average. The figure plots the predictions made by the law against the actual bloom dates scientists observed. Note that instead of representing prediction-observation pairs as points in a scatter plot, the data are represented using blue contours. We use contours because there are more than 1,500 observations – too many to study using a scatter plot.

```{r}
(usa_npn %>% 
   mutate(doy = first_yes_doy) %>%
   unnest(pred) %>% 
   ungroup() %>%
   mutate(predicted = as.Date("2020-01-01") + pred,
          observed = as.Date("2020-01-01") + doy) %>%
   ggplot() + 
    aes(x = observed, y = predicted) +
    geom_density2d(contour_var = "ndensity") +
    geom_abline(intercept = 0, slope = 1, linetype = 2) +
    labs(x = "date observed", 
         y = "date predicted",
         title = "Figure 5: Predictions using Quetelet's law are accurate within about two weeks on average.") +
    theme(legend.position = "none")) %>%
  ggplotly(tooltip = "") %>%
  config(displaylogo = FALSE)
```

This figure plots the predictions made by Quetelet's law against the actual bloom dates scientists observed from US lilac data. Data are represented using blue contours, and a dotted line of equality is drawn through the graph. The dotted line intersects the blue contours at their peak, suggesting that the law derived from Quetelet’s data accurately predicts the typical bloom date of the USA data.

Figure 5: Predictions using Quetelet’s law are accurate within about two weeks on average. Author provided, CC BY 4.0.

The contours are easy to interpret. The blue lines are much like a mountain range observed from above. The inner circles are peaks of high elevation in which many prediction-observation pairs co-occur. The outer circles are areas of low elevation in which few prediction-observation pairs co-occur.

The dotted line is the “y = x” line, having zero intercept and unit slope. Prediction-observation pairs that lie on the line indicate perfect predictions. The fact that the dotted line intersects the blue contours at their peak suggests the law derived from Quetelet’s data accurately predicts the typical bloom date of the USA data. This accuracy is impressive given the fact that the USA lilacs were observed more than a century later and on a different continent. The blue curves deviate from the line by about two weeks in the vertical direction, which is consistent with Table 10.

An average accuracy of two weeks might not sound impressive. But it is far more accurate than using the average bloom date Quetelet observed, April 30 (April 29 on leap years). The average bloom date yields predictions that are off by an additional eleven days on average.

```{r}
usa_npn %>%
  mutate(doy = first_yes_doy) %>%
  ungroup() %>%
  summarize(
    pred = mean(quetelet$doy), 
    mae  = mean(abs(doy - pred)),
    rmse = sqrt(mean((doy - pred)^2))) %>%
  select(mae, rmse) %>%
  kable(
    dig = 0,
    align = "c",
    col.names = c("mean absolute error (days)",
                  "root mean squared error (days)"),
    caption = "Table 11: Predictions using the average bloom date are off by three weeks or more on average.") %>%
  kable_styling()
```
Table 11: Predictions using the average bloom date are off by three weeks or more on average.
mean absolute error (days) root mean squared error (days)
21 24

Predicting the day the lilac will bloom in Brussels in 2023

Any weather forecast can become a flower forecast by applying the law of the flowering plants. In this section, we use the AccuWeather forecast to predict the day a hypothetical lilac will bloom in Brussels in 2023. AccuWeather forecasts daily maximum and minimum temperatures three months into the future. We do not evaluate the quality of these forecasts. The purpose of this section is to simply convert them into flower forecasts.

We use the AccuWeather forecast as it appeared on the webpage AccuWeather.com on February 19, 2023. AccuWeather reports the forecast for each month on a separate webpage. For reproducibility, we saved each page on the Internet Archive. The following R code creates the function get_weather_table to retrieve each page we saved, extract the forecast contained within that page, and arrange the data as a tibble. The get_weather_table function combines several functions from the rvest package, which is yet another member of the “tidyverse”. In particular, the forecast on each page is contained within the div “monthly-calendar” and can be extracted with the html_nodes and html_text2 functions.

Applying the get_weather_table function to the url for each page yields a five column tibble temp_br, with columns defined in the same way as the tibble temp, discussed in previous sections. The first 10 rows are below; the data are also available on the author’s GitHub.

```{r}
 get_weather_table <- function(url)
  read_html(url) %>% 
  html_nodes("div.monthly-calendar") %>% 
  html_text2() %>%
  str_remove_all("°|Hist. Avg. ") %>%
  str_split(" ", simplify = TRUE) %>%
  parse_number() %>%
  matrix(ncol = 3, 
         byrow = TRUE,
         dimnames = list(NULL, c("day", "tmax", "tmin"))) %>%
  as_tibble() %>%
  filter(
    row_number() %in%
      (which(diff(day) < 0) %>% (function(x) if(length(x) == 1) seq(1, x[1], 1) else seq(x[1] + 1, x[2], 1))))

temp_br <-
  tibble(
    base_url = "https://web.archive.org/web/20230219151906/https://www.accuweather.com/en/be/brussels/27581/",
    month = month.name[1:5],
    year = 2023,
    url = str_c(base_url, tolower(month), "-weather/27581?year=", year, "&unit=c")) %>%
  mutate(temp = map(url, get_weather_table)) %>%
  pull(temp) %>%
  reduce(bind_rows) %>%
  transmute(date = seq(as.Date("2023-01-01"), as.Date("2023-05-31"), 1),
            year = parse_number(format(date, "%Y")),
            tmax,
            tmin,
            temp = (tmax + tmin) / 2)

temp_br %>%
  relocate(year) %>%
  kable(dig = 2,
        align = "c", 
        col.names = c("year", "date", "maximum temperature (°C)",
                      "minimum temperature (°C)", "midrange temperature (°C)"),
        caption = "Table 12: Temperature forecast for Brussels, retrieved on February 19, 2023.") %>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "400px")
```
Table 12: Temperature forecast for Brussels, retrieved on February 19, 2023.
year date maximum temperature (°C) minimum temperature (°C) midrange temperature (°C)
2023 2023-01-01 15 11 13.0
2023 2023-01-02 14 5 9.5
2023 2023-01-03 9 3 6.0
2023 2023-01-04 13 8 10.5
2023 2023-01-05 12 10 11.0
2023 2023-01-06 12 10 11.0
2023 2023-01-07 11 9 10.0
2023 2023-01-08 10 6 8.0
2023 2023-01-09 8 5 6.5
2023 2023-01-10 12 4 8.0

We now predict the day the lilacs will bloom. The R code below uses the doy_prediction and doy_last_frost functions created in earlier sections and displays the prediction in Table 13. At the time of our writing, the predicted date is April 19. The forecast is easily updated by providing the url to the updated AccuWeather webpage. (You might use the url https://web.archive.org/save to save a webpage to the Internet Archive to ensure your work is reproducible.)

```{r}
bloom_day_br <-
  temp_br %>%
  summarize(date = doy_prediction(temp, tmax) + as.Date("2023-01-01")) %>%
  pull(date)

frost_day_br <- 
  temp_br %>% 
  pull(tmax) %>% 
  doy_last_frost() + as.Date("2023-01-01") 

tibble(`last frost date` = frost_day_br, 
       `bloom date` = bloom_day_br) %>%
  kable(align = "c",
        caption = "Table 13: Last frost date and lilac bloom date in Brussels in 2023.") %>%
  kable_styling()
```
Table 13: Last frost date and lilac bloom date in Brussels in 2023.
last frost date bloom date
2023-01-27 2023-04-19

We visualize the predictions in Figure 6, which has the same interpretation as Figure 4. If the temperature forecast and Quetelet’s law are correct, on January 27, 2023 the lilacs in Brussels began “collecting” temperature. The lilacs will continue to “collect” temperature until April 19, at which point they will exceed their 4264°C² quota and bloom.

```{r}
(temp_br %>% 
  ggplot() + 
  aes(date, temp) + 
  geom_line() + 
  labs(
    x = "",
    y = "midrange temperature (°C)",
    title =
      "Figure 6: According to Quetelet's law, the lilacs will bloom once exposed to 4264°C² following the last frost.") +
  geom_vline(xintercept = as.numeric(c(frost_day_br, bloom_day_br)), 
             linetype = "dotted")) %>%
  ggplotly() %>% 
  add_annotations(x = as.numeric(c(frost_day_br, bloom_day_br)),
                  y = c(14, 14),
                  text = c("last\nfrost", "first\nbloom"),
                  font = list(size = 14),
                  ay = 0,
                  xshift = c(-14, -16)) %>%
  config(displaylogo = FALSE)
```

A line graph of midrange temperature by day in Brussels for January 1 to February 18 (observed) and February 19 to May 20 (forecast), 2023, with day of last frost (January 27) and predicted day of first bloom (April 19) marked by vertical dashed lines.

Figure 6: According to Quetelet’s law, the lilacs will bloom once exposed to 4264°C² following the last frost. Author provided, CC BY 4.0.

Quetelet’s legacy: advocate, mentor, and perhaps data scientist

In this tutorial, we stated the law of the flowering plants and explained how Quetelet derived it. We also reproduced and replicated Quetelet’s findings before using his law to predict the day the lilac will bloom in Brussels. We now conclude with a reflection on Quetelet’s legacy.

The law of the flowering plants surely stands the test of time. It continues to be used by scientists – with relatively few changes – to plan harvests, manage pests, and study ecosystems stressed by climate change. We speculate the law’s longevity is due to the fact that it balances simplicity with relatively accurate predictions.

Although Quetelet did not discover the law, he did much to advance it. Quetelet founded an international network for “observations of the periodical phenomena” (in addition to numerous statistical societies and publications, including the precursor to the Royal Statistical Society). Quetelet’s network of 80 stations collected observations throughout Europe from 1841 until 1872. In particular, Quetelet collaborated with Charles Morren – who later coined the term phenology, the name of the field that now studies biological life-cycle events like the timing of flower blooms (Demarée and Rutishauser 2011).

In recent years, the observations collected through phenology networks have become an important resource for understanding the impacts of climate change. For example, the USA National Phenology Network calculates the Spring Bloom Index, which measures the “first day of spring” using the days lilacs are observed to bloom at locations across the United States. The index is then compared to previous years. Figure 7 shows one comparison, called the Return Interval. The Return Interval is much like a p-value, calculating how frequently more extreme spring indices were observed in previous decades. Bloom dates that are uncommonly early (green) or late (purple) may indicate environments stressed by changing climate. Scientists exploit the relationship between temperature and bloom date to extrapolate the index to areas with few observations.

A map of the United States showing the Spring Bloom Index Return Interval 2020, which measures whether spring is typical when compared to recent decades.

Figure 7: The Spring Bloom Index Return Interval measures whether spring is typical when compared to recent decades. Source: USA National Phenology Network.

Quetelet’s emphasis on discovering the universal laws he believed govern social and biological phenomenon has not endured. But data scientists continue to appropriate laws from one area of science to study another. For example, data scientists use neural networks and genetic algorithms to study a wide variety of phenomenon unrelated to neuroscience or genetics. Perhaps Quetelet’s appropriation of Newton’s law, in addition to his careful use of data, make him among the first data scientists?

Your turn: Do you have what it takes to beat Quetelet’s law?

Quetelet reported that a plant flowers when the sum of the daily temperatures squared exceeds a specific quantity. His prediction rule was state of the art in 1833. But surely you, a twenty-first century data scientist, can do better. Here are some ideas to get you started.

  1. Quetelet squared the temperature before calculating the sum. Would another function of temperature produce a more accurate prediction?

    1. Remove the square so that a plant flowers once the sum of the daily temperatures exceeds a (different) specific quantity. Does this version of the law produce more accurate predictions? What if you use the daily temperatures cubed? (Beginner)

    2. Suppose a lilac only registers temperatures between 0°C and 10°C. That is, a lilac experiences temperature below the lower limit, 0°C, as 0°C, and above the upper limit, 10°C, as 10°C. Does the accuracy of the predictions improve if you use the temperature the lilac experienced instead of the ambient temperature measured by a weather station? Write a program that finds the lower and upper limits that produce the most accurate predictions. (Intermediate)

    3. Quetelet used mean absolute error to evaluate the accuracy of his predictions. But his estimate of the specific quantity of heat needed to bloom, 4264°C², does not actually minimize mean absolute error. Write a program that finds the specific quantity that minimizes mean absolute error. Redo part i. and ii. using this function. (Advanced)

  2. Quetelet calculated the sum of the daily temperature squared between the day of last frost and the bloom date. Would another time interval produce more accurate predictions?

    1. We estimated the day of last frost using the last day the maximum temperature was below 0°C. Try estimating the day of last frost by the last day the midrange temperature was below 0°C? Which estimate yields the most accurate predictions? What if you ignore the day of last frost and simply calculate the sum of the daily temperatures squared between February 1 and the bloom date? When you change the time interval, be sure to calculate the new specific quantity of heat needed to bloom. (Beginner)

    2. Write a program that finds the time interval which yields the best predictions. (Intermediate)

    3. Write a program that calculates the prediction rule for many different time intervals. Use cross-validation to combine these prediction rules into a single prediction rule. (Advanced)

  3. Quetelet’s law only considers the temperature. Would additional information provide more accurate predictions?

    1. Is the specific quantity of heat needed to bloom different in years with abnormally cold winters? Would the predictions be more accurate if you use one quantity of heat for years with cold winters and a different quantity of heat for years with warm winters? (Beginner)

    2. Is the estimated quantity of heat needed to bloom similar for locations close in space and time? Write a program that leverages spatial and temporal correlation to improve the accuracy of the predictions. (Intermediate)

    3. Some biologists report that a plant must be exposed to a fixed amount of cold temperature in the winter – in addition to a fixed amount of warm temperature in the spring – before it can bloom. Augment the law of the flowering plants to require the accumulation of a specific quantity of cold temperature before the accumulation of a specific quantity of warm temperature. Write a program that uses this new law to predict the day the lilac blooms. (Advanced)

Feeling good about your prediction algorithm? Show it off at the annual Cherry Blossom Prediction Competition!

Appendix: Preparing USA NPN Data

```{r}
# 1. download lilac data using `rnpn`
usa_npn <- 
  npn_download_individual_phenometrics(request_source = "Jonathan Auerbach",
                                       year = 1900:2050,
                                       species_ids = 36,                       
                                       phenophase_ids = c(77, 412))            

# 2. limit analysis to sites that report more than 25 times
site_ids <- 
  usa_npn %>% 
  group_by(site_id) %>% 
  summarize(n = n()) %>% filter(n > 25) %>% pull(site_id)

usa_npn <- 
  usa_npn %>% 
  filter(site_id %in% site_ids)

# 3. find nearest weather stations for each site
locations <- 
  usa_npn %>% 
  group_by(site_id) %>% 
  summarize(latitude = first(latitude), 
            longitude = first(longitude))

stations <- 
  ghcnd_stations() %>%
  filter(first_year <= min(usa_npn$first_yes_year),
         last_year  >= max(usa_npn$first_yes_year),
         state != "") %>%
  group_by(id, latitude, longitude, state) %>%
  summarize(temp_flag = sum(element %in% c("TMIN", "TMAX"))) %>%            
  filter(temp_flag == 2) %>% 
  ungroup()

dist <- function(x, y = stations %>% select(latitude, longitude)) 
  stations$id[which.min(sqrt((x[1] - y[,1])^2 + (x[2] - y[,2])^2)[,1])]

locations$station_id <- apply(locations, 1, function(x) dist(c(x["latitude"], x["longitude"])))

# 4. get weather data from nearest station using `rnoaa`
get_station_data <- function(station_id) 
  ghcnd_search(stationid = station_id,
               var = c("tmin", "tmax"),
               date_min = "1956-01-01",
               date_max = "2011-12-31") %>%
  reduce(left_join, by = c("id", "date")) %>%
  transmute(id, 
            date, 
            tmax = tmax / 10,
            tmin = tmin / 10)

usa_npn <- 
  locations %>%
  mutate(temp = map(station_id, get_station_data)) %>%
  right_join(usa_npn, by = c("site_id", "latitude", "longitude")) %>% 
  group_by(rownames(usa_npn)) %>% 
  mutate(temp = map(temp, ~ .x %>% 
                      filter(format(date, "%Y") == first_yes_year) %>%
                      mutate(temp = (tmin + tmax) / 2)),
         num_obs = map(temp,~ sum(format(.x$date,"%j") <= 150)),
         doy = first_yes_doy, year = first_yes_year) %>% 
  unnest(num_obs) %>%  
  filter(num_obs == 150) %>%
  ungroup()
```

Explore more Tutorials

About the author
Jonathan Auerbach is an assistant professor in the Department of Statistics at George Mason University. His research covers a wide range of topics at the intersection of statistics and public policy. His interests include the analysis of longitudinal data, particularly for data science and causal inference, as well as urban analytics, open data, and the collection, evaluation, and communication of official statistics. He co-organizes the annual Cherry Blossom Prediction Competition with David Kepplinger and Elizabeth Wolkovich.
Copyright and licence
© 2023 Jonathan Auerbach

Text and code are licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence. Images are not covered by this licence, except where otherwise noted.

How to cite
Auerbach, Jonathan. 2023. “A demonstration of the law of the flowering plants.” Real World Data Science, April 13, 2023. DOI

References

Chamberlain, Scott. 2021. “’NOAA’ Weather Data from r [r Package Rnoaa Version 1.3.8].” The Comprehensive R Archive Network. Comprehensive R Archive Network (CRAN). https://CRAN.R-project.org/package=rnoaa.
De Réaumur, René. 1735. “Observations Du Thermometre, Faites a Paris Pendant l’annees 1735, Comparees a Celles Qui Ont Ete Faites Sous La Ligne, a l’isle de France, a Alger Et En Quelques-Unes de Nos Isles de l’amerique.” Mémoire de l’Académie Royale Des Sciences, 545–76. https://www.academie-sciences.fr/pdf/dossiers/Reaumur/Reaumur_pdf/p545_576_vol3532m.pdf.
Demarée, Gaston R., and This Rutishauser. 2011. “From ‘Periodical Observations’ to ‘Anthochronology’ and ‘Phenology’ – the Scientific Debate Between Adolphe Quetelet and Charles Morren on the Origin of the Word ‘Phenology’.” International Journal of Biometeorology 55 (6): 753–61. https://doi.org/10.1007/s00484-011-0442-5.
Grolemund, Garrett, and Hadley Wickham. 2017. R for Data Science. Sebastopol, CA: O’Reilly Media.
Marsham, Robert. 1789. “XIII. Indications of Spring, Observed by Robert Marsham, Esquire, f. R. S. Of Stratton in Norfolk. Latitude 52° 45’.” Philosophical Transactions of the Royal Society of London 79: 154–56. https://doi.org/10.1098/rstl.1789.0014.
Observatoire royal de Bruxelles. 1848. Annales de l’observatoire Royal de Bruxelles. Bruxelles: M. Hayez. https://catalog.hathitrust.org/Record/000553895.
Quetelet, Adolphe. 1846. Lettres à s.a.r. Le Duc Régnant de Saxe-Coburg Et Gotha: Sur La Théorie Des Probabilités, Appliquée Aux Sciences Morales Et Politiques. Bruxelles: M. Hayez. https://catalog.hathitrust.org/Record/001387625.
———. 1849. Letters Addressed to h.r.h. The Grand Duke of Saxe Coburg and Gotha on the Theory of Probabilities as Applied to the Moral and Political Sciences. London: C. & E. Layton. https://catalog.hathitrust.org/Record/008956987.
———. 1857. Sur Le Climat de La Belgique : De l’état Du Ciel En Général. Bruxelles: M. Hayez. https://catalog.hathitrust.org/Record/000553895.
Rosemartin, Alyssa H., Ellen G. Denny, Jake F. Weltzin, R. Lee Marsh, Bruce E. Wilson, Hamed Mehdipoor, Raul Zurita-Milla, and Mark D. Schwartz. 2015. “Lilac and Honeysuckle Phenology Data 1956-2014.” Scientific Data 2 (1). https://doi.org/10.1038/sdata.2015.38.
Rosemartin, Alyssa, Chamberlain Scott, Lee Marsh, and Kevin Wong. 2022. “Interface to the National ’Phenology’ Network ’API’ [r Package Rnpn Version 1.2.5].” The Comprehensive R Archive Network. Comprehensive R Archive Network (CRAN). https://cran.r-project.org/package=rnpn.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.