Live from Toronto: Real World Data Science at the Joint Statistical Meetings

We’re in Toronto for this year’s Joint Statistical Meetings (JSM). Over the next few days, we’ll be sharing key takeaways from a selection of talks and sessions. Check back regularly for updates.


Brian Tarran


August 15, 2023

Sunday, August 6

Use of color in statistical charts

Haley Jeppson, Danielle Albers Szafir, and Ian Lyttle

JSM 2023 is underway, and the first session I attended today was this panel on the use of colour in statistical charts.

The topic appealed to me for two reasons:

The “Best Practices…” guide links to several useful data visualisation tools, and this session today has put a few more on my radar:

  • Color Crafting, by Stephen Smart, Keke Wu, and Danielle Albers Szafir. The authors write: “Visualizations often encode numeric data using sequential and diverging color ramps. Effective ramps use colors that are sufficiently discriminable, align well with the data, and are aesthetically pleasing. Designers rely on years of experience to create high-quality color ramps. However, it is challenging for novice visualization developers that lack this experience to craft effective ramps as most guidelines for constructing ramps are loosely defined qualitative heuristics that are often difficult to apply. Our goal is to enable visualization developers to readily create effective color encodings using a single seed color.”

  • Computing on color, a collection of Observable notebooks by Ian Lyttle that allow users to see how different colour spaces and colour scales work with different types of colour vision deficiency.

Monday, August 7

Astronomers Speak Statistics

Astrophysicist Joel Leja kicked off his JSM talk with a video of the launch of the James Webb Space Telescope – an inspiring way to start the day, and a prelude to a discussion of the statistical challenges involved in studying the deep universe.

James Webb, since launch, has “completely expanded our point of view”, said Leja, allowing astronomers to explore the first stars and galaxies at greater resolution than ever before.

Image from the James Webb telescope showing two galaxies in the process of merging, twisting each other out of shape. Credit: ESA/Webb, NASA & CSA, L. Armus, A. Evan, licenced under CC BY 2.0.

Already, after only 13 months of operation, the images and data sent back by the telescope have left observers astounded: for example, finding suspected early galaxies that are bigger than thought possible based on extreme value analysis.

But the big challenge facing those studying the early universe is trying to work out how early galaxies evolved over time. “We can’t watch this happen,” said Leja, joking that this process lasts longer than a typical PhD. So, instead, he said, “We need to use statistics to understand this, to figure out how they grow up.”

Teaching statistics in higher education with active learning

Great talk from Nathaniel T. Stevens of the University of Waterloo, explaining how a posting for a Netflix job inspired the creation of a final project for students to learn response surface methodology.

The job ad in question “really opened my eyes” to the use of online controlled experiments by companies, said Stevens. He told delegates how LinkedIn, the business social networking site, runs over 400 experiments per day, trying to optimise user experience and other aspects of site engagement.

Netflix’s job ad highlighted just how sophisticated these experiments are, said Stevens. People might hear companies refer to their use of A/B tests, but the term trivialises what’s involved, Stevens explained.

Having encountered a job ad from Netflix, looking for someone to design, run, and analyse experiments and support internal methodological research, Stevens was inspired to present students with a hypothetical business problem, based on the Netflix homepage. That homepage, for those not familiar, features rows and rows of movies and TV shows sorted by theme, each show presented as a tile that, when hovered over, leads to a pop-up with a video preview and a match score – a prediction of how likely a viewer is to enjoy the show.

Stevens explained the hypothetical goal as trying to minimise “browsing time” – the time it takes a Netflix user to pick something to watch. Browsing time was defined as time spent scrolling and searching, not including time spent watching previews.

Students were given four factors that might influence browsing time – tile size, match score, preview length, and preview type – and through a sequence of experiments based on data generated by a Shiny app, students sought to minimise browsing time.

The response from the students? Two Netflix-style thumbs up. Ta-dum!

Tuesday, August 8

The Next 50 Years of Data Science

Stanford University’s David Donoho wrestled with the question of whether a singularity is approaching in this post-lunch session on the future of data science.

Taking his cue from the 2005 Ray Kurzweil book, The Singularity is Near, Donoho reviewed recent – and sometimes rapid – advances in data science and artificial intelligence to argue that a singularity may have already arrived, just not in the way Kurzweil supposed.

Kurzweil’s book argues that at some point after the 2030s, machine intelligence will supersede human intelligence, leading to a takeover or disruption of life as we know it.

At JSM, Donoho argued that we have certainly seen a “massive scaling” of compute over the past decade, along with expanded communications infrastructure and the wider spread of information – all of which is having an impact on human behaviour.

That human behaviour can often now be directly measured thanks to the proliferation of digital devices with data collection capabilities, and this in turn is leading to a major scaling in datasets and performance scaling for machine learning models.

But does this mean that an AI singularity is near? Not according to Donoho. The notion of an AI singularity “is a kind of misdirection”, he said. Something very profound is happening, Donoho argued, and it is the culmination of three long-term initiatives in data science that have come together in recent years. “They constitute a singularity on their own.”

These three initiatives, as Donoho described, are: datafication and data sharing; adherence to the “challenge problem” paradigm; and documentation and sharing of code. These are solid achievements that came out of the last decade, said Donoho, and they are “truly revolutionary” when they come together to form what he refers to as “frictionless reproducibility.”

Slide text reads: Today's data scientists: typical interactions: What's your package name? What's your URL? QR Code? What's your stack? Today's data scientists: implicit demands: Data sharing, Specific numerical performance measures, Code sharing, Single-click access. Frictionless replications.

Photo of David Donoho’s slide, describing the scientific revolution of the “data science decade”. Photo by Brian Tarran, licenced under CC BY 4.0.

Frictionless reproducibility, when achieved, leads to a “reproducibility singularity” – the moment where it takes almost no time at all for an idea to spread. “If there is an AI singularity,” said Donoho, “it will be because this came first.”

Wednesday, August 9

New frontiers of statistics in trustworthy machine learning

Data, data everywhere, but is it safe to “drink”? A presentation this morning from Yaoliang Yu of the University of Waterloo looked at the issue of data poisoning attacks on algorithms and the effectiveness of current approaches.

Yu began by explaining how machine learning algorithms require a lot of data for training, and that large amounts of data can be obtained cheaply by scraping the web.

But, he said, when researchers download this cheap data, they are bound to worry about the quality of it. Drawing an analogy to food poisoning, Yu asked: What if the data we “feed” to algorithms is not clean? What is the impact of that?

Illustration by Yasmin Dwiputri & Data Hazards Project / Better Images of AI / Managing Data Hazards / Licenced by CC-BY 4.0.

As a real-world example of a data poisoning attack, Yu pointed to TayTweets, the Microsoft Twitter chatbot that spewed racism within hours of launch after Twitter users began engaging with it.

Yu then walked delegates through some experiments showing how, generally, indiscriminate data poisoning attacks are ineffective when the ratio of poisoned data to clean data is small. A poisoning rate of 3%, for example, leads to model accuracy drops of 1.5%–2%, Yu said.

However, he then put forward the idea of “parameter corruption” – an attack that seeks to modify a model directly. Yu showed that this would be more effective in terms of accuracy loss, though – fortunately – perhaps less practical to implement.

Data Science and Product Analysis at Google

Our final session at JSM 2023, before heading home, was a whistle-stop tour of various data science projects at Google, covering YouTube, Google Maps, and Google Search.

Jacopo Soriano kicked us off with a brief intro to the role and responsibilities of statisticians and data scientists at Google, and within YouTube specifically – the main task being to make good decisions based on uncertain data.

Soriano also spoke about the key role randomised experiments play in product development – harking back to Nathaniel Stevens’ earlier talk on this subject. YouTube runs hundreds, if not thousands, of concurrent experiments, Soriano said; statisticians can’t, therefore, be involved in each one. As Soriano’s colleague, Angela Schoergendorfer, explained later in the session, the role of the data scientist is to build methodology and metrics that others in the business can use to run their own experiments.

Photo by Pawel Czerwinski on Unsplash.

For every experiment YouTube runs, a portion of its voluminous daily traffic will be assigned to control arms and treatment arms, with traffic able to be diverted to different groups based on user type, creators, videos, advertisers, etc. Once experiments are running, metrics such as search clickthrough rates, watch time using specific devices, or daily active user numbers are monitored. Teams tend to look at percentage change as the scale to measure whether something is working or not, said Soriano, rather than comparing treatment to control group.

Next up was Lee Richardson, who spoke about the use of proxy metrics. Technology companies like Google are often guided by so-called “north star metrics”, which executive leadership use to guide the overall strategy and priorities of an organisation. However, Richardson said, these can be hard to design experiments around, and so proxy metrics stand in for the north star metrics. Proxies need to be sensitive, he said, and move in the same direction as, e.g., a long-term positive user experience.

On the subject of user experience, Christopher Haulk then explained how YouTube measures user satisfaction through single-question surveys – typically asking a YouTube user to rate the video they just watched. The company doesn’t send out that many surveys, Haulk said, and response rates are in the single-digit percentage range, so it can be hard to evaluate whether changes YouTube makes to, e.g., its video recommendation algorithm are working to improve user satisfaction. Haulk then went on to explain a modelling approach the company uses to predict how users are likely to respond in order to “fill in” for missing responses.

Over at Google Search, user feedback is also regularly sought to help support the evolution of the product. Angela Schoergendorfer explained how, with so many people already using Google Search, statistically significant changes in top-line metrics like daily active users can take months to see. Decision metics should ideally capture user value quickly, said Schoergendorfer – within days. For this, Google has 10,000 trained “search quality” raters they can call on. Random samples of user search queries and results are sent to these raters, who are asked to evaluate the quality of the search results. Users can also be asked in the moment, or offline through the Google Rewards app.

In 2021, Schoergendorfer said, Google conducted approximately 800,000 experiments and quality tests. But perhaps the most impressive statistic of the day came from Sam Morris, who works on Google Maps. Something, somewhere, is always changing in the world, said Morris – be it a road closure or a change to business hours. The Maps team cannot evaluate every single piece of data – a lot of changes are automated or algorithmic, he explained. “So far this year, we have probably processed 16 billion changes to the map,” said Morris – a staggering figure!

Back to Editors’ blog

Copyright and licence
© 2023 Royal Statistical Society

This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence.

How to cite
Tarran, Brian. 2023. “Live from Toronto: Real World Data Science at the Joint Statistical Meetings.” Real World Data Science, August 6, 2023, updated August 15, 2023. URL