Research – Interactive AI CDT Blog

ELISE Wrap up Event

Posted on July 5, 2024 (updated July 9, 2024) by j.crabbe

This blog post is written by AI CDT student, Jonathan Erskine

I recently attended the ELISE Wrap up Event in Helsinki, marking the end of just one of many programs of research conducted under the ELLIS society, which “aims to strengthen Europe’s sovereignty in modern AI research by establishing a multi-centric AI research laboratory consisting of units and institutes distributed across Europe and Israel”.

This page does a good job of explaining ELISE and ELLIS if you want more information.

Here I summarise some of the talks from the two-day event (in varying detail). I also provide some useful contacts and potential sources of funding (you can skip to the bottom for these).

Robust ML Workshop

Peter Grünwald: ‘e’ is the new ‘p’

P-values are an important indicator of statistical significance when testing a hypothesis: a result is declared significant when the calculated p-value is smaller than some predefined threshold, typically $\alpha = 0.05$. This guarantees that the probability of a Type 1 Error (falsely rejecting a true null hypothesis) is at most 5%.

“p-hacking” is a malicious practice where statistical significance can be manufactured by, for example:

stopping the collection of data once you get a P<0.05
analyzing many outcomes, but only reporting those with P<0.05
using covariates
excluding participants
etc.

Sometimes this is morally ambiguous. For example, imagine a medical trial where a new drug shows promising, but not statistically significant results. Should a p-test fail, you can simply repeat the trial, sweep the new data into the old and repeat until you achieve the desired p-value, but this can be prohibitively expensive, and it is hard to know whether you are p-hacking or haven’t tested enough people to prove your hypothesis. This approach, called “optional stopping”, can lead to violation of Type 1 Error guarantees i.e. it is hard to have faith in your threshold $\alpha$ due to the increasing cumulative probability that individual trials are in the minority case of false positives.
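To make the optional stopping problem concrete, here is a minimal simulation sketch. It assumes Gaussian data with known variance and a simple z-test, which is my own simplification and not something from the talk; the function names and batch sizes are hypothetical. Every simulated study is run under the null, so any rejection is a false positive.

```python
import math
import random

def two_sided_p_value(z):
    """Two-sided p-value for a standard-normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))

def trial_rejects(rng, max_batches, batch_size=20, alpha=0.05):
    """Simulate one study under the null (true mean 0, known variance 1).

    After each batch we re-test all data collected so far and stop
    at the first p-value below alpha.
    """
    total, n = 0.0, 0
    for _ in range(max_batches):
        for _ in range(batch_size):
            total += rng.gauss(0.0, 1.0)
            n += 1
        z = total / math.sqrt(n)  # z-statistic for H0: mean = 0
        if two_sided_p_value(z) < alpha:
            return True
    return False

def false_positive_rate(max_batches, n_trials=4000, seed=0):
    """Fraction of null studies that end in a (false) rejection."""
    rng = random.Random(seed)
    hits = sum(trial_rejects(rng, max_batches) for _ in range(n_trials))
    return hits / n_trials

fixed = false_positive_rate(max_batches=1)     # a single look at the data
peeking = false_positive_rate(max_batches=10)  # up to ten looks
print(f"single look: {fixed:.3f}, ten looks: {peeking:.3f}")
```

With a single look the false positive rate should sit near the nominal 5%, while repeatedly testing and stopping at the first significant result should push it well above that.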

Peter described the theory of hypothesis testing based on the e-value, a notion of evidence that, unlike the p-value, allows for “effortlessly combining results from several studies in the common scenario where the decision to perform a new study may depend on previous outcomes.“

Unlike with the p-value, this proposed method is “safe under optional continuation with respect to Type 1 error”; no matter when the data collection and combination process is stopped, the Type 1 error probability is preserved. For singleton nulls, e-values coincide with Bayes factors.
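As a rough illustration of why this works, the sketch below uses the simplest e-value I know of: a running likelihood ratio between a point alternative and a point null (which, for a singleton null, is a Bayes factor). The Gaussian model, the alternative mean delta = 0.5 and the function name are my own assumptions, not details from the talk. Under the null the running product is a non-negative martingale starting at 1, so by Ville's inequality the chance that it ever reaches 1/α is at most α, however often we look.

```python
import math
import random

def e_process_rejects(rng, n_max=200, delta=0.5, alpha=0.05):
    """Return True if the running e-value ever reaches 1/alpha.

    Data are drawn from the null N(0, 1). Each observation multiplies the
    e-value by the likelihood ratio of N(delta, 1) against N(0, 1).
    """
    log_e = 0.0
    threshold = math.log(1.0 / alpha)
    for _ in range(n_max):
        x = rng.gauss(0.0, 1.0)
        log_e += delta * x - 0.5 * delta ** 2  # log likelihood ratio
        if log_e >= threshold:  # we are allowed to stop at any time
            return True
    return False

rng = random.Random(1)
n_trials = 4000
rate = sum(e_process_rejects(rng) for _ in range(n_trials)) / n_trials
print(f"false positive rate with continuous monitoring: {rate:.3f}")
```

Even though this checks the evidence after every single observation, the simulated false positive rate should stay at or below 5%, which is the property that p-values lose under optional stopping.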

In any case, general e-values can be used to construct Anytime-Valid Confidence Intervals (AVCIs), which are useful for A/B testing as “with a bad prior, AVCIs become wide rather than wrong”.

In comparison to classical approaches, you need more data to apply e-values and AVCIs, with the benefit of being able to perform optional stopping without inflating the Type 1 error rate. In the worst case you need more data, but on average you can stop sooner.

This is being adopted for online A/B testing but is more challenging for expensive domains, such as medical trials; you need to reserve more patients for your trial, but you won’t need them all. This is a challenging sell, but the probabilities indicate that you should save time and effort in the majority of cases.

Other relevant literature pioneering this approach to significance testing includes Waudby-Smith and Ramdas, JRSS B, 2024.

There is an R package here for anyone who wants to play with Safe Anytime-Valid Inference.

Watch the full seminar here:

https://www.youtube.com/watch?v=PFLBWTeW0II

Tamara Broderick: Can dropping a little data change your conclusions – A robustness metric

arxiv.org

Tamara advocated the value of economics datasets as rich test beds for machine learning, highlighting that one can examine the data produced from economic trials with respect to robustness metrics and can come to vastly different conclusions than those published in the original papers.

Focusing in, she described a micro-credit experiment where economists ran randomised controlled trials on small communities, taking approximately 16,500 data points with the assumption that their findings would generalise to larger communities. But is this true?

When can I trust decisions made from data?

In a typical setup, you (1) run an analysis on a series of data, (2) come to some conclusion on that data, and (3) ultimately apply those decisions to downstream data which you hope is not so far out-of-distribution that your conclusions no longer apply.

Why do we care about dropping data?

Useful data analysis must be sensitive to some change in data – but certain types of sensitivity are concerning to us, for example, if removing some small fraction of the data $\alpha$ were to:

Change the sign of an effect
Change the significance of an effect
Generate a significant result of the opposite sign

Robustness metrics aim to tell us how much confidence we can place in our ability to generalise. Sensitivity of the kinds listed above implies a low signal-to-noise ratio, which is where Tamara introduces her novel metric, the Approximate Maximum Influence Perturbation (AMIP), to help quantify this vulnerability to noise.

Can we drop one data point to flip the sign of our answer?

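In reality, this is very expensive to test for any dataset where the sample size N is large (you would need to create N leave-one-out datasets, each of size N-1, and re-run your analysis on every one, and the cost grows combinatorially once you drop more than one point). Instead, we need an approximation.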

Let the Maximum Influence Perturbation be the largest possible change induced in the quantity of interest by dropping no more than 100α% of the data.

From the paper:

We will often be interested in the set that achieves the Maximum Influence Perturbation, so we call it the Most Influential Set.

And we will be interested in the minimum data proportion α ∈ [0,1] required to achieve a change of some size ∆ in the quantity of interest, so we call that α the Perturbation-Inducing Proportion. We report NA if no such α exists.

In general, to compute the Maximum Influence Perturbation for some α, we would need to enumerate every data subset that drops no more than 100α% of the original data. And, for each such subset, we would need to re-run our entire data analysis. If m is the greatest integer smaller than 100α, then the number of such subsets is larger than $\binom{N}{m}$. For N = 400 and m = 4, $\binom{N}{m} = 1.05\times10^9$. So computing the Maximum Influence Perturbation in even this simple case requires re-running our data analysis over 1 billion times. If each data analysis took 1 second, computing the Maximum Influence Perturbation would take over 33 years to compute. Indeed, the Maximum Influence Perturbation, Most Influential Set, and Perturbation-Inducing Proportion may all be computationally prohibitive even for relatively small analyses.
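The arithmetic in that passage is easy to check. The short sketch below recomputes the number of subsets and the running time, using the paper's own assumption of one second per re-analysis; the variable names are mine.

```python
import math

N, m = 400, 4
n_subsets = math.comb(N, m)            # ways to choose which m points to drop
seconds_per_year = 60 * 60 * 24 * 365
years = n_subsets / seconds_per_year   # assuming one second per re-analysis
print(n_subsets, round(years, 1))
```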

Further definitions are described better in the paper, but suffice to say the approximation succeeds in identifying where analyses can be significantly affected by a minimal proportion of the data. For example, in the Oregon Medicaid study (Finkelstein et al., 2012), they identify a subset containing less than 1% of the original data that controls the sign of the effects of Medicaid on certain health outcomes. Dropping 10 data points takes the result from significant to non-significant.
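To give a flavour of the approximation, here is a much-simplified sketch for the easiest possible quantity of interest, a sample mean. It is not the paper's implementation: the data are made up, the function name is hypothetical, and I am only using the first-order effect of dropping a point, -(x_i - mean) / N, in place of re-running the analysis on every subset.

```python
def amip_style_drop(data, n_drop):
    """Pick the n_drop points whose removal is predicted to lower the mean most.

    Uses the first-order influence of each point on the sample mean,
    -(x_i - mean) / N, rather than re-running the analysis per subset.
    """
    n = len(data)
    mean = sum(data) / n
    influence = [-(x - mean) / n for x in data]
    # The most negative influences lower the mean the most when dropped.
    order = sorted(range(n), key=lambda i: influence[i])
    dropped = order[:n_drop]
    predicted = mean + sum(influence[i] for i in dropped)
    return dropped, predicted

# Hypothetical data: a small positive effect carried by ten large values.
data = [-1.0] * 500 + [1.0] * 490 + [5.0] * 10
full_mean = sum(data) / len(data)

dropped, predicted = amip_style_drop(data, n_drop=10)  # drop 1% of the data

# One exact re-run on the chosen subset to check the approximation.
kept = [x for i, x in enumerate(data) if i not in set(dropped)]
exact = sum(kept) / len(kept)
print(full_mean, predicted, exact)
```

In this toy case the approximation and the single exact re-run agree that removing 1% of the points flips the sign of the estimate, and only one re-analysis was needed instead of one per subset.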

Code for the paper is available at:

https://github.com/rgiordan/AMIPPaper/blob/main/README.md

An R version of the AMIP metric is available:

https://github.com/maswiebe/metrics.git

Watch a version of this talk here:

https://www.youtube.com/watch?v=7eUrrQRpz2w

Cedric Archambeau: Beyond SHAP: Explaining probabilistic models with distributional values

Abstract from the paper:

A large branch of explainable machine learning is grounded in cooperative game theory. However, research indicates that game-theoretic explanations may mislead or be hard to interpret. We argue that often there is a critical mismatch between what one wishes to explain (e.g. the output of a classifier) and what current methods such as SHAP explain (e.g. the scalar probability of a class). This paper addresses such gap for probabilistic models by generalising cooperative games and value operators. We introduce the distributional values, random variables that track changes in the model output (e.g. flipping of the predicted class) and derive their analytic expressions for games with Gaussian, Bernoulli and categorical payoffs. We further establish several characterising properties, and show that our framework provides fine-grained and insightful explanations with case studies on vision and language models.

Cedric described how SHAP values can be reformulated as random variables on a simplex, shifting from the weight of individual players to a distribution of transition probabilities. Following this insight, they generate explanations on transition probabilities instead of individual classes, demonstrating their approach on several interesting case studies. This work is in its infancy and has plenty of opportunity for further investigation.
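For context, the classical scalar Shapley values that SHAP builds on can be computed exactly for a tiny game. The sketch below is only that classical baseline, not the distributional values from the paper, and the two-feature payoff table is a made-up example.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

# Hypothetical payoff: the model's probability for one class,
# given which features are "present".
payoff = {
    frozenset(): 0.10,
    frozenset({"a"}): 0.40,
    frozenset({"b"}): 0.20,
    frozenset({"a", "b"}): 0.70,
}
phi = shapley_values(["a", "b"], payoff.__getitem__)
print(phi)
```

The attributions sum to the difference between the full-coalition payoff and the empty one. Each attribution is a single number about a single class probability, which is the kind of scalar explanation the paper argues can be a mismatch for what we actually want to explain.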

Semantic, Symbolic and Interpretable Machine Learning Workshop

Nada Lavrač: Learning representations for relational learning and literature-based discovery

This was a survey of types of representation learning, focusing on Nada’s area of expertise in propositionalisation and relational data, Bisociative Literature-Based Discovery, and interesting avenues of research in this direction.

Representation Learning

Deep learning, while powerful (accurate), raises concerns over interpretability. Nada takes a step back to survey different forms of representation learning.

Sparse, Symbolic, Propositionalisation:

These methods tend to be less accurate but are more interpretable.
Examples include propositionalization techniques that transform relational data into a propositional (flat) format.

Dense, Embeddings:

These methods involve creating dense vector representations, such as word embeddings, which are highly accurate but less interpretable.

Recent work focuses on unifying methods that combine the strengths of both approaches.

Hybrid Methods:

Incorporate Sparse and Deep methods
DeepProp, PropDRM, PropStar – Methods discussed in their paper.

Representation learning for relational data can be achieved by:

Propositionalisation – transforming a relational database into a single-table representation. Example: Wordification
Inductive logic programming
Semantic relational learning
Relational subgroup discovery (written by Nada and our own P. Flach)
Semantic subgroup discovery system “Hedwig”, which takes as input the training examples encoded in RDF, and constructs relational rules by effective top-down search of ontologies, also encoded as RDF triples.
Graph-based machine learning
- data and ontologies are mapped to nodes and edges
- In this example, gene ontologies are used as background knowledge for improving quality assurance of literature-based Gene Ontology Annotation
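As a toy illustration of the first item in the list above, propositionalisation, the sketch below flattens a one-to-many relation into a single table with one feature row per entity. The tables, column names and aggregate features are all made up by me; this is not Wordification or any of the specific methods from the talk.

```python
from collections import defaultdict

# Hypothetical relational data: one row per customer, many orders per customer.
customers = [
    {"customer_id": 1, "country": "FI"},
    {"customer_id": 2, "country": "UK"},
]
orders = [
    {"customer_id": 1, "category": "books", "amount": 12.0},
    {"customer_id": 1, "category": "games", "amount": 30.0},
    {"customer_id": 2, "category": "books", "amount": 8.0},
]

def propositionalise(customers, orders):
    """Flatten the one-to-many relation into one feature row per customer."""
    by_customer = defaultdict(list)
    for order in orders:
        by_customer[order["customer_id"]].append(order)
    categories = sorted({o["category"] for o in orders})
    table = []
    for c in customers:
        own = by_customer[c["customer_id"]]
        row = {
            "customer_id": c["customer_id"],
            "country": c["country"],
            "n_orders": len(own),
            "total_amount": sum(o["amount"] for o in own),
        }
        # One boolean feature per category, in the spirit of logical features.
        for cat in categories:
            row[f"bought_{cat}"] = any(o["category"] == cat for o in own)
        table.append(row)
    return table

flat = propositionalise(customers, orders)
for row in flat:
    print(row)
```

The resulting rows are sparse and readable, which is the interpretability advantage described above, at the cost of hand-chosen features.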

These slides, although a little out of date, talk about a lot of what I have noted here, plus a few other interesting methodologies.

The GitHub repo for their book contains lots of Jupyter notebook examples.

https://github.com/vpodpecan/representation_learning.git

Marco Gori: Unified approach to learning over time and logic reasoning

I unfortunately found this very difficult to follow, largely due to my lack of subject knowledge. I do think what Marco is proposing requires an open mind as he re-imagines learning systems which do not need to store data to learn, and presents time as an essential component of learning for truly intelligent “Collectionless AI”.

I won’t try and rewrite his talk here, but he has a full classroom series available on Google, which he might give you access to if you email him.

Conclusions:

Emphasising environmental interactions – collectionless AI which doesn’t record data
Time is the protagonist: higher degree of autonomy, focus of attention and consciousness
Learning theory inspired by theoretical physics & optimal control: Hamiltonian learning
Neuro-symbolic learning and reasoning over time: semantic latent fields and explicit semantics
Developmental stages and gradual knowledge acquisition

Contacts & Funding Sources

For Robust ML:

e-values, AVCIs:

Aaditya Ramdas at CMU

Peter Grünwald Hiring

For anyone who wants to do a Robust ML PhD, apply to work with Ayush Bharti: https://aalto.wd3.myworkdayjobs.com/aalto/job/Otaniemi-Espoo-Finland/Doctoral-Researcher-in-Statistical-Machine-Learning_R40167

If you know anyone working in edge computing who would like 60K to develop an enterprise solution, here is a link to the funding call: https://daiedge-1oc.fundingbox.com/ The open call starts on 29 August 2024.

If you’d like to receive monthly updates with new funding opportunities from Fundingbox, you can subscribe to their newsletter: https://share-eu1.hsforms.com/1RXq3TNh2Qce_utwh0gnT0wfegdm

Yoshua Bengio said he had fellowship funding but didn’t give out specific details, or I forgot to write them down… perhaps you can send him an email.

Collaborating with a Designer to Craft Visual Resources for my PhD Project

Posted on November 29, 2023 by j.crabbe

This blog post is written by AI CDT student, Vanessa Hanschke

“With regards to AI, this mythologizing and enchantment is apparent when we explore the disjoint between the reality of the technology and its representation.”

…says Beth Singler in her analysis of what she calls the AI Creation Meme – the ubiquitous image of a human hand and a robot hand reaching out to each other with their index fingers, as in Michelangelo’s famous painting. Several researchers have commented on the bulk of images used to depict artificial intelligence, ranging from inappropriate (e.g. an anthropomorphized robot for natural language processing) to harmful (e.g. the unnecessarily sexist additions of breasts to illustrations of AI in the service industry).

Visual representations of AI matter. In a world where a lot of hype is being generated around AI in industry and policy, I think it is especially important for AI researchers to lead the way in creating better images that are grounded in more accurate conceptualizations of AI. This was one of the many reasons I decided to work with a designer to make visual materials that supported my research in responsible AI.

The Project

A little sidenote description of my research project: the Data Ethics Emergency Drill (DEED) is a method that we created to help industry practitioners of AI, Machine Learning and Data Science reflect on the societal impact and values embedded in their systems. The idea is similar to ethical roleplay. We created fictional scenarios of a data ethics dilemma, which members of a data and AI team discussed in a fake meeting. This fictional scenario is crafted together with some members of the team to address their particular context and AI application. It is presented as an urgent problem that needs fixing within this fake meeting. After trialling this process with two separate industry data science teams, we made a toolbox for other industry teams to pick up and conduct their own drills. This toolbox consists of a PowerPoint slide deck and a Miro board template. We wanted to update these toolbox resources with a professional designer to make them visually engaging and accessible. We collaborated with Derek Edwards, a designer local to Bristol, to create the designs.

The Design Process

Designing is an iterative process and it took some back and forth for the design to come together. Our initial ideas were very vague: we wanted it to be playful as the DEED was about stepping outside of the day-to-day mindset that is focused on technical delivery. We wanted it to be about the human developer responsibility in how we construct our technology today, as opposed to a long-term perspective granting AI human rights. Although “Emergency Drill” is in the title of the toolbox, it is not about hyping AI, but about establishing a safe space to reflect on values embedded in the application.

Emergency Exit Signs served as reference for our designs. Photo by Dids . from Pexels: https://www.pexels.com/photo/emergency-exit-signage-1871343/

The original metaphor that we built this method on was a fire drill. A fire drill goes beyond just looking at the fire exits on a map; it is about experiencing evacuating a building with many people at once. It is about practising collaboration between fire wardens, other security staff and everyone else. Similarly, the DEED goes beyond looking at a list of AI ethics principles; it is about going through the concrete experience of discussing ethics and values and understanding how responsible AI practices are distributed within a team.

The general look we were going for was inspired by video game arcades. Photo by Stanislav Kondratiev from Pexels : https://www.pexels.com/photo/video-arcade-games-5883539/

Because the outputs were visual, I found it helpful to use images to communicate my personal vision. I set up a folder where I would share material with Derek. Seeing some of Derek’s initial design drafts helped me clarify some of these ideas I had.

The Result

This is the final design of the title slide from the research project *drumroll*:

The final design of the logo on the slide deck for crafting scenarios.

It is inspired by the sign for assembly point, which is a great metaphor for what the DEED takes from emergency drills: creating an opportunity for an industry team to come together to understand better what is necessary for their responsible AI practice. The colours were inspired by nineties arcade video games to add a playful element of technology pop culture.

Working with a designer such as Derek was a very gratifying process and I enjoyed reflecting on what concepts of the DEED toolbox I wanted to transmit visually. The end product of the toolbox resources is a much more engaging workshop, a more user-friendly slide deck, and a more cohesive visual language of the project overall. I believe it will certainly help with getting more participant teams to engage with my research project.

Recommendations for PhD-Design Collaborations

I would recommend design collaborations to any PhD student carrying out interactive research with visual artefacts. Here are some considerations that might guide your planning:

What outputs do you need, what formats and how many? Some formats may be more suitable than others, depending on whether you need parts of your design to be editable as your research evolves. An elaborate design may be more striking, but will not always be modifiable (e.g. a hand drawn script logo).
What is your timeline? The iterative process may take a few weeks, but having that back-and-forth is essential to creating a good design end product.
What do you want your thing to look like? Collect inspiration on Pinterest and websites that you like. Often online magazines will work with an array of interesting graphic designers. I found a lot of great AI-inspired art in articles of tech magazines.

Thanks

I would like to thank Derek Edwards for the great collaboration and the Interactive AI CDT for funding this part of my research. You can find Derek’s portfolio here. If you are a data scientist, AI or ML engineer thinking about carrying out a Data Ethics Emergency Drill with your team, you can get in touch with me at vanessa.hanschke@bristol.ac.

2023 AAAI Conference Blog – Amarpal Sahota

Posted on February 22, 2023 (updated February 23, 2023) by j.crabbe

This blog post is written by AI CDT Student Amarpal Sahota

I attended the 37th AAAI Conference on Artificial Intelligence from the 7th to the 14th of February 2023. This was my first in-person conference and I was excited to travel to Washington D.C.

The conference schedule included Labs and Tutorials on February 7th – 8th, the main conference on February 9th – 12th, followed by the workshops on February 13th – 14th.

Arriving and Labs / Tutorials

I arrived at the conference venue on 7th February to sign in and collect my name badge. The conference venue (Walter E. Washington Convention Center) was huge and had within it everything you could need, from areas to work or relax to restaurants and of course many halls / lecture theatres to host talks.

I was attending the conference to present a paper at the Health Intelligence Workshop. Two of my colleagues from the University of Bristol (Jeff and Enrico) were also attending to present at this workshop (we are pictured together below!).

The tutorials were an opportunity to learn from experts on topics that you may not be familiar with yourself. I attended tutorials on Machine Learning for Causal Inference, Graph Neural Networks and AI for epidemiological forecasting.

The AI for epidemiological forecasting tutorial was particularly engaging. The speakers were very good at giving an overview of historical epidemiological forecasting methods and recent AI methods used for forecasting before introducing state-of-the-art AI methods that use machine learning combined with our knowledge of epidemiology. If you are interested, the materials for this tutorial can be accessed at: https://github.com/AdityaLab/aaai-23-ai4epi-tutorial

Main conference Feb 9th – Feb 12th

The main conference began with a welcome talk in the ‘ball room’. The room was set up with a stage and enough chairs to seat thousands. The welcome talk included an overview of the different tracks within the conference (AAAI Conference on AI, Innovative Applications of AI, Educational Advances in AI), statistics around conference participation / acceptance, and an introduction to the conference chairs.

The schedule for the main conference each day included invited talks and technical talks running from 8:30 am to 6pm. Each day this would be followed by a poster session from 6pm – 8pm allowing us to talk and engage with researchers in more detail.

For the technical talks I attended a variety of sessions from Brain Modelling to ML for Time-Series / Data Streams and Graph-based Machine Learning. Notably, not all of the sessions were fully in person: they were hybrid, with some speakers presenting online. This was disappointing but understandable given visa restrictions for travel to the U.S.

I found that many of the technical talks became difficult to follow very quickly, as they were largely aimed at experts in the respective fields. I particularly enjoyed some of the time-series talks as these relate to my area of research. I also enjoyed the poster sessions that allowed us to talk with fellow researchers in a more relaxed environment and ask questions directly to understand their work.

For example, I enjoyed the talk ‘SVP-T: A Shape-Level Variable-Position Transformer for Multivariate Time Series Classification‘ by PhD researcher Rundong Zhuo. At the poster session I was able to follow up with Rundong to ask more questions and understand his research in detail. We are pictured together below!

Workshops Feb 13th – 14th

I attended the 7th International Workshop on Health Intelligence from 13th to 14th February. The workshop began with opening remarks from the Co-chair Martin Michalowski before a talk by our first keynote speaker. This was Professor Randi Foraker, who spoke about her research relating to building trust in AI for Improving Health Outcomes.

This talk was followed by paper presentations with papers on related topics grouped into sessions. My talk was in the second session of the day titled ‘Classification’. My paper (pre-print here) is titled ‘A Time Series Approach to Parkinson’s Disease Classification from EEG’. The presentation went reasonably smoothly and I had a number of interesting questions from the audience about applications of my work and the methods I had used. I am pictured giving the talk below!

The second half of the day focused on the hackathon. The theme of the hackathon was biological age prediction. Biological ageing is a latent concept with no agreed upon method for estimation. Biological age tries to capture a sense of how much you have aged in the time you have been alive. Certain factors such as stress and poor diet can be expected to age individuals faster. Therefore two people of the same chronological age may have different biological ages.

The hackathon opened with a talk on biological age prediction by Morgan Levin (a founding Principal Investigator at Altos Labs). Our team for the hackathon included four people from the University of Bristol: myself, Jeff, Enrico and Maha. Jeff (pictured below) gave the presentation for our team. We would have to wait until the second day of the workshop to find out if we won one of the three prizes.

The second day of the workshop consisted of further research talks, a poster session and an awards ceremony in the afternoon. We were happy to be awarded the 3rd place prize of $250 for the hackathon! The final day concluded at around 5pm. I said my goodbyes and headed to Washington D.C. airport for my flight back to the U.K.

Through the AI of the storm – Emily Vosper at the Allianz climate risk award 2022

Posted on December 15, 2022 by j.crabbe

This blog post is written by CDT Student Emily Vosper

This December I travelled to Munich, Germany, to take part in the Allianz climate risk award. Allianz set up this initiative to acknowledge the work done by young scientists who aim to build resilience to and/or reduce the risk of extreme weather events that are exacerbated by climate change. The award is open to PhD candidates and post-doctoral researchers, who first submit an essay that outlines their work; the top four are invited to Munich, where they present to the Allianz team.

In previous years, finalists have been working on very different climate hazards, but by chance this year the finalists all came from a tropical cyclone and/or flooding background. The finalists consisted of Mona Hemmati (Columbia University), who is a postdoctoral researcher specialising in flood-related risks in tropical cyclones, Peter Pfeiderer (Humboldt University Berlin), whose work includes studying seasonal forecasts of tropical cyclones, and Daniel Kahl (University of California), who studies flood exposure on a demographic level to understand community vulnerability for his PhD.

On Monday evening, the finalists were invited to meet the Allianz climate risk team at a Bavarian tapas bar. This evening was a great opportunity to get to know a bit about each other in a more relaxed setting, and a chance to sample some of the local cuisine!

On Tuesday, we met at the Allianz offices for the award day. With an excited buzz in the air, the event commenced with a keynote talk by Dr. Nicola Ranger, Oxford University, who spoke on the need to implement climate resilient finance strategies, and during the Q and A session there was active discussion on how this could be achieved effectively. We also heard from Chris Townsend, a member of the board of management for Allianz SE, who introduced us to Allianz’ legacy and highlighted the exciting work going on in the climate risk space. We then heard engaging talks from Mona and Peter before a coffee break, followed by an articulate talk from Daniel. As the final speaker, I rounded off the presentations with my talk about how I’ve been using a generative adversarial network to enhance the resolution of tropical cyclone rainfall data. All presentations were followed by a group Q and A session where we discussed the exciting possibility of a collaboration between the four of us as our projects are very complementary in nature.

With the award in its sixth year, there is now an alumni network of previous finalists rich with expertise in climate hazards and ample opportunity for future collaboration, so watch this space!

Left to Right: Holger Tewes-Kampelmann (CEO Allianz Reinsurance), Peter Pfeiderer (Humboldt University Berlin), Dr. Sibylle Steimen (MD Advisory & Services, Allianz Reinsurance), Emily Vosper (University of Bristol), Mona Hemmati (Columbia University), Daniel Kahl (UC Irvine), Chris Townsend (Member of the Board of Management, Allianz SE) and Dr. Nicola Ranger (Smith School of Enterprise and the Environment, Oxford University).

Understanding Dimensionality Reduction

Posted on July 8, 2022 (updated October 18, 2022) by j.crabbe

This blog post is written by CDT Student Alex Davies

Sometimes your data has a lot of features. In fact, if you have more than three, useful visualisation and understanding can be difficult. In a machine-learning context high numbers of features can also lead to the curse of dimensionality and the potential for overfitting. Enter this family of algorithms: dimensionality reduction.

The essential aim of a dimensionality reduction algorithm is to reduce the number of features of your input data. Formally, this is a mapping from a high-dimensional space to a low-dimensional one. This could be to make your data more concise and robust, for more efficient applications in machine learning, or just to visualise how the data “looks”.

There are a few broad classes of algorithm, with many individual variations inside each of these branches. This means that getting to grips with how they work, and when to use which algorithm, can be difficult. This issue can be compounded when each algorithm’s documentation focuses more on “real” data examples, which are hard for humans to really grasp, so we end up in a situation where we are using a tool we don’t fully understand to interpret data that we also don’t understand.

The aim of this article is to give you some intuition into how different classes of algorithm work. There will be some maths, but nothing too daunting. If you feel like being really smart, each algorithm will have a link to a source that gives a fully fleshed out explanation.

How do we use these algorithms?

This article isn’t going to be a tutorial in how to code with these algorithms, because in general it’s quite easy to get started. Check the following code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

#    Load the data, here we're using sklearn's digits dataset (a small MNIST-style set)
digits     = load_digits()

#    Load labels + the data (here, 8x8 images flattened to 64 features)
labels     = digits["target"]
data       = digits["data"]

#    Initialise a dimensionality model - this could be sklearn.decomposition.PCA or some other model
TSNE_model = TSNE(verbose = 0)
#    Apply the model to the data
embedding  = TSNE_model.fit_transform(data)

#    Plot the result!
fig, ax = plt.subplots(figsize = (6,6))
for l in np.unique(labels):
    ax.scatter(*embedding[labels == l,:].transpose(), label = l)
ax.legend(shadow = True)
plt.show()

The data we’re using is sklearn’s digits dataset, a small MNIST-style collection of 8×8 greyscale images of hand-written digits from zero to nine. I’ll detail how TSNE works later on. Admittedly this code is not the tidiest, but a lot of it is also just for a nice graph at the end. The only lines we dedicated to TSNE were the initial import, calling the model, and using it to embed the data.

We still have the issue, though, of actually understanding the data. Here it’s not too bad, as these are images, but in other data forms we can’t really get to grips with how the data works. Ideally we’d have some examples of how different algorithms function when applied to data we can understand…

An image from the MNIST dataset, here of the number 3

Toy examples

Here are some examples of how different algorithms function when applied to data we can understand. I’d credit this sklearn page on clustering as inspiration for this figure.

Toy 3D data, and projections of it by different algorithms. All algorithms are run with their default parameters.

The examples down the first column are the original data in 3D. The other columns are how a given algorithm projects these examples down into scatter-plot-able 2D. Each algorithm is run with its default parameters, and these examples have in part been designed to “break” each algorithm.

We can see that PCA (Principal Component Analysis) and TruncatedSVD (the sklearn version of Singular Value Decomposition) function like a change in camera angle. MDS (Multi-Dimensional Scaling) is a little more odd, and warps the data a little. Spectral Embedding has done a similar thing to MDS.

Things get a little weirder when we get onto TSNE (t-distributed Stochastic Neighbour Embedding) and UMAP (Uniform Manifold Approximation and Projection). The data doesn’t end up looking much like the original (visually), but they have represented something about the nature of the data, often separating the data where appropriate.

PCA, SVD, similar

This is the first branch of the dimensionality reduction family tree, and these are often the first dimensionality reduction methods people are introduced to, PCA in particular. Assuming we’re in a space defined by basis vectors V, we might want to find the most useful (orthogonal) combination of these basis vectors U, as in uᵢ = α v₁ + β v₂+ …

V here can be thought of as our initial angle to the axes of the data, and U as a “better” angle to understand the data. This would be called a “linear transform”, which only means that a straight line in the original data will also be a straight line in the projection.

Linear algebra-based methods like SVD and PCA use a mathematical approach to find the optimum form of U for a set of data. PCA and SVD aim to find the angle that best explains the covariance of the data. There’s a lot of literature out there about these algorithms, and this isn’t really the place for mathematical discussion, but I will give some key takeaways.

A full article about PCA by Tony Yiu can be found here, and here’s the same for SVD by Hussein Abdulrahman.

Firstly, these algorithms don’t “change” the data, as shown in the previous section. All they can do is “change the angle” to one that hopefully explains as much as possible about the data. Secondly, they are defined by a set of equations, and there’s no internal optimisation process beyond this, so you don’t need to be particularly concerned about overfitting. Lastly, each actually produces the same number of components (dimensions) as your original data, but ordered by how much covariance that new component explains. Generally you’ll only need to use the first few of these components.
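To make those takeaways a bit more concrete, here is a minimal sketch of PCA done by hand with numpy’s SVD. The toy data and variable names are made up for illustration, and sklearn’s own PCA may differ in details such as sign conventions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 points in 3D, stretched differently along each axis
data = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])

# Centre the data, then take the SVD; the rows of Vt are the new basis vectors U
centred = data - data.mean(axis=0)
_, singular_values, Vt = np.linalg.svd(centred, full_matrices=False)

# Variance explained by each component (singular values come sorted, largest first)
explained_variance = singular_values**2 / (len(data) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Keep only the first two components to get a 2D "camera angle"
projection = centred @ Vt[:2].T

print(Vt.shape)         # as many components as original dimensions
print(explained_ratio)  # ordered by how much variance each one explains
```

There is no iterative optimisation anywhere in this: the new axes fall straight out of the decomposition, and they are orthogonal to each other.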

We can see this at play in that big figure above. All the results of these algorithms, with a little thought, are seen to explain the maximum amount of variance. Taking the corkscrews as an example, both algorithms take a top down view, ignoring the change along the z axis. This is because the maximum variance in the data is the circles that the two threads follow, so a top down view explains all of this variance.

For contrast, here’s PCA applied to the MNIST data:

PCA applied to the MNIST data

Embeddings (MDS, Spectral Embedding, TSNE, UMAP)

The other branch of this family of algorithms is quite different in design and function to the linear algebra-based methods we’ve talked about so far. The linear algebra methods look for global metrics, such as covariance and variance, across your data. That means they can be seen as taking a “global view”. The embedding algorithms can still take a global view, but they have a greater ability to represent local dynamics in your data.

We’ll start with an algorithm that bridges the gap between linear algebra methods and embedding methods: Spectral Embedding.

Spectral Embedding

Often in machine-learning we want to be able to separate groups within data. This could be the digits in the MNIST data we’ve seen so far, types of fish, types of flower, or some other classification problem. This also has applications in grouping-based regression problems.

The problem with the linear-algebra methods above is that sometimes the data just isn’t that easily separable by “moving the camera”. For example take a look at the sphere-in-sphere example in that big figure above: PCA and SVD do not separate the two spheres. At this point it can be more practical to instead consider the relationships between individual points.

Spectral Embedding does just this: it measures the distance between each point and a given number of its neighbours. What we’ve actually done here is make a graph (or network) that represents the relationships between points. A simple example of building a distance-graph is below.

A very simple example of making a distance-graph and a distance-matrix

Our basic aim is to move to a 2D layout of the data that best represents this graph, which hopefully will show how the data “looks” in high dimensions.

From this graph we construct a matrix of weights for each connection. Don’t get too concerned if you’re not comfortable with the term “matrix”, as this is essentially just a table with equal numbers of rows and columns, with each entry representing the weight of the connection between two points.

In spectral embedding these distances are then turned into “weights”, often by passing through a normal distribution or simply binary edges. Binary edges would have [1 = is neighbour, 0 = is not neighbour]. From the weight matrix we apply some linear algebra. We’ll call the weight matrix W.

First we build a diagonal matrix (off diagonal = 0) by summing across rows or columns. This means that each point now has a given weight, instead of weights for pairs of points. We’ll call this diagonal weight matrix D.

We then get the “Laplacian”, L = D-W. The Laplacian is a difficult concept to summarise briefly, but essentially is a matrix that represents the weight graph we’ve been building up until now. Spectral Embedding then performs an “eigenvalue decomposition”. If you’re familiar with linear algebra this isn’t anything new. If you’re not I’m afraid there isn’t space to have a proper discussion of how this is done, but check this article about Spectral Embedding by Elemento for more information.

The eigenvalue decomposition produces a nice set of eigenvectors, one for each point in the data (the Laplacian has one row and one column per point), and like in PCA and SVD, we take the first couple as our dimensionality reduction, here meaning those with the smallest non-zero eigenvalues. I’ve applied a Spectral Embedding to the MNIST data we’ve been using, which is in this figure:
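The steps above (neighbour graph, W, D, L = D-W, eigenvalue decomposition) can be sketched in a few lines of numpy. This is only a rough illustration under my own assumptions: binary edges, the plain unnormalised Laplacian from the text, a connected graph, and a made-up function name. sklearn’s SpectralEmbedding differs in its details.

```python
import numpy as np

def spectral_embedding_sketch(points, n_neighbours=5, n_components=2):
    """Illustrative only: binary k-nearest-neighbour graph + Laplacian L = D - W."""
    n = len(points)
    # Pairwise Euclidean distances between all points
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs**2).sum(axis=-1))

    # Binary weights: 1 = is neighbour, 0 = is not neighbour
    W = np.zeros((n, n))
    nearest = np.argsort(dists, axis=1)[:, 1:n_neighbours + 1]  # index 0 is the point itself
    rows = np.repeat(np.arange(n), n_neighbours)
    W[rows, nearest.ravel()] = 1.0
    W = np.maximum(W, W.T)  # make the graph symmetric

    # Diagonal matrix D from the row sums, then the Laplacian
    D = np.diag(W.sum(axis=1))
    L = D - W

    # eigh returns eigenvalues in ascending order for a symmetric matrix
    eigenvalues, eigenvectors = np.linalg.eigh(L)
    # Skip the first (constant) eigenvector, assuming the graph is connected
    return eigenvalues, eigenvectors[:, 1:n_components + 1]

rng = np.random.default_rng(0)
points = rng.normal(size=(60, 3))  # hypothetical toy data
eigenvalues, embedding = spectral_embedding_sketch(points)
print(embedding.shape)  # (60, 2)
```

Note that the smallest eigenvalue of a Laplacian is always zero, which is why the first eigenvector gets skipped.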

Spectral Embedding applied to the MNIST digits dataset from sklearn

Interpreting spectral embeddings can be tricky compared to other algorithms. The important thing to bear in mind is that we’ve found the vectors that we think best describe the distances between points.

For a bit more analysis we can again refer to that big figure near the start. Firstly, to emphasise that we’re looking at distances, check out the cube-in-cube and corkscrews examples. The projections we’ve arrived at actually do explain the majority of the distances between points. The cubes are squashed together, because the distance between the cubes is far less than the point-to-point diagonal distance. Similarly the greatest variation in distance in the corkscrew example is around the circle, so that’s what’s preserved in our Spectral Embedding.

As a final observation have a look at what’s happened to our intersecting gaussians. There is a greater density of points at the centre of the distributions than at their intersection, so the Spectral Embedding has pulled them apart.

The basic process for (most) embedding methods, for example MDS, UMAP and t-SNE

UMAP, TSNE, MDS

What happens when we don’t solve the dimensionality reduction problem with any (well, some) linear algebra? We arrive at the last general group of algorithms we’ll talk about.

These start much the same as Spectral Embedding: by constructing a graph/network of the data. This is done in different ways by each of these algorithms.

Simplest is MDS, which uses just the distance between all points. This is a computationally costly step, as for N points, we have to calculate O(N²) distances.

TSNE does a similar thing to Spectral Embedding, and moves from distance into a weight of connection, which represents the probability that these two points are related. This is normally done using a normal distribution. Unlike Spectral Embedding or UMAP, TSNE doesn’t consider these distances for a given number of neighbours, but instead draws a bubble around each point and gets distances and weights for all the points in that bubble. It’s not quite that simple, but this is already a long article, so check this article by Kemal Erdem for a full walkthrough.

UMAP considers a fixed number of neighbours for each point, like Spectral Embedding, but has a slightly different way of calculating distances. The “Uniform Manifold” in UMAP means that UMAP is assuming that points are actually uniformly distributed, but that the data-space itself is warped, so that points don’t show this.

Again the maths here is difficult, so check the UMAP documentation for a full walkthrough. Be warned, however, that the authors are very thorough in their explanation. As of May 2022, the “how UMAP works” section is over 4500 words.

In UMAP, TSNE and Spectral Embedding, we have a parameter we can use to change how global a view we want the embedding to take. UMAP and Spectral Embedding are fairly intuitive, as we simply control the number of neighbours considered, but in TSNE we use perplexity, which is kind of like the size of the bubble around each point.

Once these algorithms have a graph with weighted edges, they try and lay out the graph in a lower number of dimensions (for our purposes this is 2D). They do this by trying to optimise according to a given function. This just means that they try and find the best way to place the nodes in our graph in 2D, according to an equation.

For MDS this is “stress”:

Credit goes to Saul Dobilas’s article on MDS

TSNE first calculates the student-t distribution with a single degree of freedom:
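As a rough sketch, one common form of stress (Kruskal’s “stress-1”, which may not be exactly the form used in the article credited above) compares every high-D distance with the matching distance in the layout. The function names and toy data here are my own, purely for illustration.

```python
import numpy as np

def pairwise_distances(points):
    # Euclidean distance between every pair of points (an N x N matrix)
    diffs = points[:, None, :] - points[None, :, :]
    return np.sqrt((diffs**2).sum(axis=-1))

def stress(high_d, low_d):
    # Kruskal's stress-1: 0 means the layout preserves every distance exactly
    d_high = pairwise_distances(high_d)
    d_low = pairwise_distances(low_d)
    upper = np.triu_indices(len(high_d), k=1)  # count each pair once: N(N-1)/2 pairs
    return np.sqrt(((d_high[upper] - d_low[upper])**2).sum() / (d_high[upper]**2).sum())

rng = np.random.default_rng(0)
high_d = rng.normal(size=(50, 3))                      # hypothetical 3D data
flat = np.column_stack([high_d[:, :2], np.zeros(50)])  # data that is secretly 2D

print(stress(flat, flat[:, :2]))      # dropping an all-zero axis loses nothing
print(stress(high_d, high_d[:, :2]))  # dropping a real axis distorts distances
```

MDS then shuffles the 2D points around to make this number as small as it can.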

The student-t distribution with one degree of freedom, as used by TSNE

It’s important to note in the equation above that all y are actually vectors, as in the stress equation, so that with those ||a − b|| parts we’re calculating a kind of distance. q here is the similarity between points. From here TSNE uses the Kullback-Leibler divergence of the two graphs to measure their similarity. It all gets very mathsy, so check out Kemal Erdem’s article for more information.
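For a feel of what q and the Kullback-Leibler divergence are doing, here is a small numpy sketch. It uses the standard student-t form with one degree of freedom, q ∝ 1 / (1 + ||yᵢ − yⱼ||²), normalised over all pairs. The function names are my own, and as a simplification I compare two low-D layouts with each other; real TSNE compares high-D similarities (built using the perplexity) against q.

```python
import numpy as np

def student_t_similarities(embedding):
    # Squared distances between every pair of low-D points
    diffs = embedding[:, None, :] - embedding[None, :, :]
    sq_dists = (diffs**2).sum(axis=-1)
    kernel = 1.0 / (1.0 + sq_dists)  # student-t with one degree of freedom
    np.fill_diagonal(kernel, 0.0)    # a point is not its own neighbour
    return kernel / kernel.sum()     # normalise so that all q sum to 1

def kl_divergence(p, q, eps=1e-12):
    # Sum of p * log(p / q) over pairs where p is non-zero
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))).sum())

rng = np.random.default_rng(0)
layout_a = rng.normal(size=(30, 2))  # hypothetical layouts
layout_b = rng.normal(size=(30, 2))
q_a = student_t_similarities(layout_a)
q_b = student_t_similarities(layout_b)

print(kl_divergence(q_a, q_a))  # identical similarities: divergence is zero
print(kl_divergence(q_a, q_b))  # different layouts: divergence is positive
```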

UMAP again steps up the maths, and optimises the cross-entropy between the low-D and high-D layouts:

Cross-entropy used by UMAP. Subscript H indicates higher-D, subscript L indicates lower-D

The best explanation of UMAP actually comes from its own documentation. This might be because they don’t distribute it through sklearn.

This can be broken down into a repulsive and attractive “force” between points in the high-D and low-D graphs, so that the layout step acts like a set of atoms in a molecule. Using forces is quite common in graph layouts, and can be intuitive to think about compared to other metrics. It also means that UMAP (in theory) should be able to express both global and local dynamics in the data, depending on your choice of the number of neighbours considered. In practice this means that you can actually draw conclusions about your data from the distance between points across the whole UMAP embedding, including the shape of clusters, unlike TSNE.
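The attractive and repulsive parts are easy to see if the cross-entropy is written out for a handful of edges. This sketch follows the form given in the UMAP documentation, with w_high and w_low as the edge weights in the high-D and low-D graphs. The function name and example weights are made up, and the real library optimises this with sampling tricks rather than evaluating it directly.

```python
import numpy as np

def fuzzy_cross_entropy(w_high, w_low, eps=1e-12):
    # Clip to avoid log(0); weights are treated as probabilities in (0, 1)
    w_high = np.clip(w_high, eps, 1 - eps)
    w_low = np.clip(w_low, eps, 1 - eps)
    # Attractive term: penalises strong high-D edges that are weak in the layout
    attract = w_high * np.log(w_high / w_low)
    # Repulsive term: penalises weak high-D edges that are strong in the layout
    repel = (1 - w_high) * np.log((1 - w_high) / (1 - w_low))
    return float((attract + repel).sum())

w_high = np.array([0.9, 0.8, 0.1, 0.05])      # hypothetical high-D edge weights
matching = w_high.copy()                      # layout that agrees with the data
mismatched = np.array([0.1, 0.2, 0.9, 0.95])  # layout that gets it backwards

print(fuzzy_cross_entropy(w_high, matching))    # zero: nothing to push or pull
print(fuzzy_cross_entropy(w_high, mismatched))  # large: lots of force needed
```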

It’s been a while without a figure, so here’s all three applied to the MNIST data:

MDS, TSNE and UMAP applied to the sklearn version of the MNIST dataset

We can see that all three seem to have been fairly effective. But how should we understand the results of something that, while optimising the lower-D graph, uses a stochastic (semi-random) process? How do you interpret something that changes based on its random seed? Ideally, we develop an intuition as to how the algorithms function, instead of just knowing the steps of the algorithm.

I’m going to ask you to scroll back up to the top and take another look at the big figure. It’s the last time we do this, I promise.

Firstly, MDS considers all the inter-point relations, so the global shape of data is preserved. You can see this in particular in the first two examples. The inner cube has all of its vertices connected, and the vertices of each cube “line up” with each other (vague, hopefully you see what I mean). There is some disconnection in the edges of the outer cube, and some warping in all the other edges. This might be due to the algorithm trying to preserve distances, but as with all stochastic processes, it’s difficult to decipher.

UMAP and TSNE also maintain the “lines” of the cube. UMAP is actually successful at separating the two cubes, and makes interesting “exploded diagram” style representations of them. In the UMAP embedding only one of the vertices of each cube is separated. In the TSNE embedding the result isn’t as promising, possibly because the “bubble” drawn by the algorithm around points also catches the points in the other cube.

Both UMAP and TSNE separate the spheres in the second example and the gaussians in the fourth. The “bubble” vs neighbours difference between UMAP and TSNE is also illustrated by the corkscrews example (with a non-default number of neighbours considered by UMAP the result might be more similar). So these algorithms look great! Except there’s always a catch.

Check the final example, which we haven’t touched on before. This is just random noise along each axis. There should be no meaningful pattern in the projections — and the first four algorithms reflect this. SVD, MDS and the Spectral Embedding are actually able to represent the “cube” shape of the data. However, TSNE and UMAP could easily be interpreted as having some pattern or meaning, especially UMAP.

We’ve actually arrived at the classic machine-learning compromise: expressivity vs bias. As the algorithms become more complex, and more able to represent complex dynamics in the data, their propensity to capture confounding or non-meaningful patterns also becomes greater. UMAP, arguably the most complex of these algorithms, has perhaps the greatest expressivity, but also the greatest risk of bias.

Conclusions

So what have we learnt? As algorithms become more complex, they’re more able to express dynamics in the data, but risk also expressing patterns that we don’t want. That’s true all over machine learning, in classification, regression or something else.

Distance-based models with stochastic algorithms (UMAP, TSNE, MDS) represent the relationships between points, while linear-algebra methods (PCA, SVD) take a “global” view, so if you want to make sure that your reduction is true to the data, stick with the linear-algebra methods.

Parameter choice becomes more important in the newer, stochastic, distance-based models. The corkscrew and cube examples are useful here — a different choice of parameters and we might have had UMAP looking more like TSNE.

CDT Research Showcase Day 1 – 30 March 2022

Posted on April 5, 2022April 6, 2022 by nikki.horrobin

Blog post written by CDT Student Oli Deane.

This year’s IAI CDT Research Showcase represented the first real opportunity to bring the entire CDT together in the real world, permitting in-person talks and face-to-face meetings with industry partners.

Student Presentations

Grant Stevens giving his Pecha Kucha talk

The day began with a series of quickfire talks from current CDT students. Presentations had a different feel this year as they followed a Pecha Kucha style; speakers had ~6 minutes to present their research with individual slides automatically progressing after 20 seconds. As a result, listeners received a whistle-stop tour of each project without delving into the nitty gritty details of research methodologies.

Indeed, this quickfire approach highlighted the sheer diversity of projects carried out in the CDT. The presented projects had a bit of everything; from a data set for analyzing great ape behaviors, to classification models that determine dementia progression from time-series data.

It was fascinating to see how students incorporated interactivity into project designs. Grant Stevens, for example, uses active learning and outlier detection methods to classify astronomical phenomena. Tashi Namgyal has developed MIDI-DRAW, an interactive musical platform that permits the curation of short musical samples with user-provided hand-drawn lines and pictures. Meanwhile, Vanessa Hanschke is collaborating with LV to explore how better ethical practices can be incorporated into the data science workflow; for example, her current work explores an ethical ‘Fire-drill’ – a framework of emergency responses to be deployed in response to the identification of problematic features in existing data-sets/procedures. This is, however, just the tip of the research iceberg and I encourage readers to check out all ongoing projects on the IAI CDT website.

Industry Partners

Gustavo Medina Vazquez’s EDF Energy presentation with the Q&A session being led by CDT Director Peter Flach

Next, representatives from three of our industry partners presented overviews of their work and their general involvement with the CDT.

First up was Dylan Rees, a Senior Data Engineer at LV. With a data science team stationed in MVB at the University of Bristol, LV are heavily involved with the university’s research. As well as working with Vanessa to develop ethical practices in data science, they run a cross-CDT datathon in which students battle to produce optimal models for predicting fair insurance quotes. Rees emphasized that LV want responsible AI to be at the core of what they do, highlighting how insurance is a key example of how developments in transparent, and interactive, AI are crucial for the successful deployment of AI technologies. Rees closed his talk with a call to action: the LV team are open to, and eager for, any collaboration with UoB students – whether it be to assist with data projects or act as “guinea pigs” for advancing research on responsible AI in industry.

Gustavo Medina Vazquez from EDF Energy then discussed their work in the field and outlined some examples of past collaborations with the CDT. They are exploring how interactive AI methods can assist in the development and maintenance of green practices – for example, one ongoing project uses computer vision to identify faults in wind turbines. EDF previously collaborated with members of the CDT 2019 cohort as they worked on an interactive search-based mini project.

Finally, Dr. Claire Taylor, a representative from QINETIQ, highlighted how interactive approaches are a major focus of much of their research. QINETIQ develop AI-driven technologies in a diverse range of sectors: from defense to law enforcement, aviation to financial services. Dr. Taylor discussed the changing trends in AI, outlining how previously fashionable methods that have lost focus in recent years are making a come-back courtesy of the AI world’s recognition that we need more interpretable, and less compute-intensive, solutions. QINETIQ also sponsor Kevin Flannagan’s (CDT 2020 cohort) PhD project in which he explores the intersection between language and vision, creating models which ground words and sentences within corresponding videos.

Academic Partners and Poster Session

To close out the day’s presentations, our academic partners discussed their relevant research. Dr. Oliver Ray first spoke of his work in Inductive Logic Programming before Dr. Paul Marshall gave a perspective from the world of human computer interaction, outlining a collaborative cross-discipline project that developed user-focused technologies for the healthcare sector.

Finally, a poster session rounded off proceedings; a studious buzz filled the conference hall as partners, students and lecturers alike discussed ongoing projects, questioning existing methods and brainstorming potential future directions.

In all, this was a fantastic day of talks, demonstrations, and general AI chat. It was an exciting opportunity to discuss real research with industry partners and I’m sure it has produced fruitful collaborations.

I would like to end this post with a special thank you to Peter Relph and Nikki Horrobin who will be leaving the CDT for bigger and better things. We thank them for their relentless and frankly spectacular efforts in organizing CDT events and responding to students’ concerns and questions. You will both be sorely missed, and we all wish you the very best of luck with your future endeavors!

Neglected Aspects of the COVID-19 pandemic

Posted on October 13, 2021April 5, 2022 by pr16096

This week’s post is written by IAI CDT student Gavin Leech.

I recently worked on two papers looking at neglected aspects of the COVID-19 pandemic. I learned more than I wanted to know about epidemiology.

The first: how much do masks do?

There were a lot of confusing results about masks last year.

We know that proper masks worn properly protect people in hospitals, but zooming out and looking at the population effect led to very different results, from basically nothing to a huge halving of cases.

Two problems: these were, of course, observational studies, since we don’t run experiments on the scale of millions. (Or not intentionally anyway.) So there’s always a risk of missing some key factor and inferring the completely wrong thing.

And there wasn’t much data on the number of people actually wearing masks, so we tended to use the timing of governments making it mandatory to wear masks, assuming that this caused the transition to wearing behaviour.

It turns out that the last assumption is mostly false: across the world, people started to wear masks before governments told them to. (There are exceptions, like Germany.) The correlation between mandates and wearing was about 0.32. So mask mandate data provide weak evidence about the effects of mass mask-wearing, and past results are in question.

We use self-reported mask-wearing instead: the largest survey of mask wearing (n=20 million, stratified random sampling) and obtain our effect estimates from 92 regions across 6 continents. We use the same model to infer the effect of government mandates to wear masks and the effect of self-reported wearing. We do this by linking confirmed case numbers to the level of wearing or the presence of a government mandate. This is Bayesian (using past estimates as a starting point) and hierarchical (composed of per-region submodels).

For an entire population wearing masks, we infer a 25% [6%, 43%] reduction in R, the “reproduction number” or number of new cases per case (B).

In summer last year, given self-reported wearing levels around 83% of the population, this cashed out into a 21% [3%, 23%] reduction in transmission due to masks (C).

One thing which marks us out is being obsessive about checking this is robust; that different plausible model assumptions don’t change the result. We test 123 different assumptions about the nature of the virus, of the epidemic monitoring, and about the way that masks work. It’s heartening to see that our results don’t change much (D).

It was an honour to work on this with amazing epidemiologists and computer scientists. But I’m looking forward to thinking about AI again, just as we look forward to hearing the word “COVID” for the last time.

The second: how much does winter do?

We also look at seasonality: the annual cycle in virus potency. One bitter argument you heard a lot in 2020 was about whether we’d need lockdown in the summer, since you expect respiratory infections to fall a lot in the middle months.

We note that the important models of what works against COVID fail to account for this. We look at the dense causal web involved:

This is a nasty inference task, and data is lacking for most links. So instead, we try to directly infer a single seasonality variable.

It looks like COVID spreads 42% less [25% – 53%, 95% CI] from the peak of winter to the peak of summer.

Adding this variable improves two of the cutting-edge models of policy effects (as judged by correcting bias in their noise terms).

One interesting side-result: we infer the peak of winter, we don’t hard-code it. (We set it to the day with the most inferred spread.) And this turns out to be the 1st January! This is probably coincidence, but the Gregorian calendar we use was also learned from data (astronomical data)…

—

BIAS Day 4 Review: ‘Data-Driven AI’

Posted on October 6, 2021April 5, 2022 by pr16096

This review of the 4th day of the BIAS event, ‘Data-Driven AI’, is written by CDT Student Stoil Ganev.

The main focus for the final day of BIAS was Data-Driven AI. Out of the 4 pillars of the Interactive AI CDT, the Data-Driven aspect tends to have a more “applied” flavour compared to the rest. This is due to a variety of reasons but most of them can be summed up in the statement that Data-Driven AI is the AI of the present. Most deployed AI algorithms and systems are structured around the idea of data X going in and prediction Y coming out. This paradigm is popular because it easily fits into modern computer system architectures. For all of their complexity, modern at-scale computer systems generally function like data pipelines. One part takes in a portion of data, transforms it and passes it on to another part of the system to perform its own type of transformation. We can see that, in this kind of architecture, a simple “X goes in, Y comes out” AI is easy to integrate, since it will be no different from any other component.

Additionally, data is a resource that most organisations have in abundance. Every sensor reading, user interaction or system-to-system communication can be easily tracked, recorded and compiled into usable chunks of data. In fact, for accountability and transparency reasons, organisations are often required to record and track a lot of this data. As a result, most organisations are left with massive repositories of data, which they are not able to fully utilise. This is why Data-Driven AI is often relied on as a straightforward, low-cost solution for capitalising on these massive stores of data.

This “applied” aspect of Data-Driven AI was very much present in the talks given on the last day of BIAS. Compared to the other days, the talks of the final day reflected some practical considerations with regard to AI.

The first talk was given by Professor Robert Jenssen from The Arctic University of Norway. It focused on work he had done with his students on automated monitoring of electrical power lines, more specifically on how to utilise unmanned aerial vehicles (UAVs) to automatically discover anomalies in the power grid. A point he made in the talk was that the amount of time they spent on engineering efforts was several times larger than the amount spent on novel research. There was no off-the-shelf product they could use or adapt, so their system had to be written mostly from scratch. In general, this seems to be a pattern with AI systems where, even if the same model is utilised, the resulting system ends up extremely tailored to its own problem and cannot be easily reused for a different problem.

They ran into a similar problem with the data set as well. Given that the problem of monitoring power lines is rather niche, there was no directly applicable data set they could rely on. I found their solution to this problem to be quite clever in its simplicity. Since gathering real world data is rather difficult, they opted to simulate their data set. They used 3D modelling software to replicate the environment of the power lines. Given that most power masts sit in the middle of fields, that environment is easy to simulate. For more complicated problems such as autonomous driving, this simulation approach is not feasible. It is impossible to properly simulate human behaviour, which the AI would need to model, and there is a large variety in urban settings as well. However, for a mast sitting in a field, you can capture most of the variety by changing the texture of the ground. Additionally, this approach has advantages over real world data as well. There are types of anomalies that are so rare that they might simply not be captured by the data gathering process or be too rare for the model to notice them. However, in simulation, it is easy to introduce any type of anomaly and ensure it has proper representation in the data set.

In terms of the architecture of the system, they opted to structure it as a pipeline of sub-tasks. There are separate models for component detection, anomaly detection, etc. This piecewise approach is very sensible given that most anomalies are most likely independent of each other. Additionally, the more specific a problem is, the easier and faster it is to train a model for it. However, this approach tends to have larger engineering overheads. Due to the larger number of components, proper communication and synchronisation between them needs to be ensured and is not a given. Also, depending on the length of the pipeline, it might become difficult to ensure that it performs fast enough.

In general I think that the work Professor Jenssen and his students did in this project is very much representative of what deploying AI systems in the real world is like. Often your problem is so niche that there are no readily available solutions or data sets, so a majority of the work has to be done from scratch. Additionally, even if there is limited or even no need for novel AI research, a problem might still require large amounts of engineering effort to solve.

The second talk of the day was given by Jonas Pfeiffer, a PhD student from the Technical University of Darmstadt. In this talk he introduced us to his research on Adapters for Transformer models. Adapters are a light weight and faster approach to fine tuning Transformer models to different tasks. The idea is rather simple, the Adapters are small layers that are added between the Transformer layers, which are trained during fine tuning, while keeping the transformer layers fixed. While pretty simple and straight forward, this approach appears to be rather effective. However, other than focusing on his research on Adapters, Jonas is also one of the main contributors to AdapterHub.ml, a framework for training and sharing Adapters. This brings our focus to an important part of what is necessary to get AI research out of the papers and into the real world – creating accessible and easy to use programming libraries. We as researchers often neglect this step or consider it to be beyond our responsibilities. That is not without sensible reasons. A programming library is not just the code it contains. It requires training materials for new users, tracking of bugs and feature requests, maintaining and following a development road map, managing integrations with other libraries that are dependencies or dependers, etc. All of these aspects require significant efforts by the maintainers of the library. Efforts that do not contribute to research output and consequently do not contribute to the criteria by which we are judged as successful scientists. As such, it is always a delight when you see a researcher willing to go this extra mile, to make his or her research more accessible. The talk by Jonas also had a tutorial section where he led us though the process of fine tuning an off the shelf pre-trained Transformer. This tutorial was delivered through Jupyter notebooks easily accessible from the projects website. 
Within minutes we had our own working examples to dissect and experiment with. Given that Adapters and the AdapterHub.ml framework are very recent innovations, the amount and quality of documentation and training resources within this project are highly impressive. Adapters and the AdapterHub.ml framework are excellent tools that, I believe, will be useful to me in the future. As such, I am very pleased to have attended this talk and to have discovered these tools through it.
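To make the Adapter idea a little more concrete, here is a minimal sketch in plain Python. It is not the AdapterHub.ml API and not the code from the tutorial notebooks. The class `BottleneckAdapter`, the helper `frozen_layer_parameters`, the layer sizes, the GELU non-linearity and the zero initialisation of the up-projection are all assumptions made for illustration, bias terms are left out, and no training loop is shown. It only illustrates the structure described above: a small down-projection and up-projection wrapped around a residual connection, with far fewer weights than the layer it sits next to.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector (a list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def gelu(x):
    """Smooth non-linearity applied inside the bottleneck."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


class BottleneckAdapter:
    """Hypothetical adapter: project down, apply a non-linearity, project up, add a residual."""

    def __init__(self, hidden_size, bottleneck_size, seed=0):
        rng = random.Random(seed)
        # Small random down-projection: hidden_size -> bottleneck_size
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
                     for _ in range(bottleneck_size)]
        # Assumption: a zero up-projection makes the adapter an identity map before training
        self.up = [[0.0] * bottleneck_size for _ in range(hidden_size)]

    def __call__(self, hidden):
        squeezed = [gelu(v) for v in matvec(self.down, hidden)]
        expanded = matvec(self.up, squeezed)
        # Residual connection: the frozen layer's output passes through, plus a learned correction
        return [h + e for h, e in zip(hidden, expanded)]

    def num_parameters(self):
        return sum(len(row) for row in self.down) + sum(len(row) for row in self.up)


def frozen_layer_parameters(hidden_size):
    """Weights in one dense hidden-to-hidden matrix, a stand-in for a frozen Transformer sub-layer."""
    return hidden_size * hidden_size


if __name__ == "__main__":
    adapter = BottleneckAdapter(hidden_size=16, bottleneck_size=2)
    hidden_state = [0.1 * i for i in range(16)]
    print("unchanged at initialisation:", adapter(hidden_state) == hidden_state)
    print("trainable adapter weights:", adapter.num_parameters())
    print("frozen weights in one dense matrix:", frozen_layer_parameters(16))
```

In this sketch, fine-tuning would update only `down` and `up`, while the weights counted by `frozen_layer_parameters` stay fixed. This is why an Adapter is cheap to train and small enough to share.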

The final day of BIAS was an excellent wrap-up to the summer school. With its more applied focus, it showed us how the research we are conducting can be translated to the real world and how it can have an impact. We got a flavour of both what it is like to develop and deploy an AI system, and what it is like to provide a programming library for the methods we develop. These are all aspects of our research that we often neglect or overlook. Thus, this day served as a great reminder that our research is not something confined within a lab but work that lives and breathes within the context of the world that surrounds us.

BIAS Day 3 Review: ‘Responsible AI’

Posted on October 1, 2021April 5, 2022 by pr16096

This review of the 3rd day of the BIAS event, ‘Responsible AI’, is written by CDT Student Emily Vosper.

Monday was met with a swift 9:30am start, made easier to digest with a talk by Toby Walsh, ‘AI and Ethics, why all the fuss?’. This talk, and the subsequent discussion, covered the thought-provoking topic of fairness within AI. The main lesson considered whether we actually need new ethical principles to govern AI, or whether we can take inspiration from well-established areas, such as medicine. Medicine works by four key principles: beneficence, non-maleficence, autonomy and justice. AI brings some new challenges to this framework, including autonomy, decision making and culpability. Some interesting discussions were had around reproducing historical biases when using autonomous systems, for example within the justice system in predictive policing or parole decision making (COMPAS).

The second talk of the day was given by Nirav Ajmeri and Pradeep Murukannaiah on ethics in sociotechnical systems. They defined ethics as distinguishing between right and wrong, which is a complex problem full of dilemmas. One example is Les Misérables, where the protagonist steals a loaf of bread: stealing is obviously bad, but the bread is being stolen to feed a child, and therefore the notion of right and wrong becomes nontrivial. Nirav and Pradeep treated ethics as a multiagent concern, with values brought in as the building blocks of ethics. Using this values-based approach, the notion of right and wrong can be more easily broken down in a domain context, i.e. by discovering what the main values and social norms are for a certain domain, rules can be drawn up to better understand how to reach a goal within that domain. After the talk there were some thought-provoking discussions surrounding how to facilitate reasoning at both the individual and the societal level, and how to satisfy values such as privacy.

In the afternoon session, Kacper Sokol ran a practical machine learning explainability session where he introduced the concept of Surrogate Explainers – explainers that are not model specific and can therefore be used in many applications. The key takeaways are that such diagnostic tools only become explainers when their properties and outputs are well understood, and that explainers are not monolithic entities – they are complex, with many parameters, and need to be tailor-made or configured for the application at hand.

The practical involved trying to break the explainer. The idea was to move the meaningful splits of the explainer so that the resulting regions were impure, i.e. they contain many different classes from the black box model predictions. Moving the splits means the explainer doesn’t capture the black box model as well, as a mixture of points from several class predictions has been introduced into each region of the explainer. Based on these insights, it would be possible to manipulate the explainer with very impure hyper-rectangles. We found this was even more likely with the logistic regression model, as it has diagonal decision boundaries, while the explainer has only horizontal and vertical meaningful splits.
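The effect of moving the splits can be reproduced with a toy example. The sketch below is not the code used in the practical. The two stand-in black boxes, the helper names (`grid_points`, `gini`, `weighted_impurity`) and the use of Gini impurity as the purity measure are assumptions for illustration. It partitions the unit square into hyper-rectangles using axis-aligned splits and measures how mixed the black box predictions are inside each rectangle.

```python
from bisect import bisect_right
from collections import Counter


def axis_aligned_black_box(x, y):
    """Stand-in black box whose decision boundary is the vertical line x = 0.5."""
    return int(x > 0.5)


def diagonal_black_box(x, y):
    """Stand-in black box with a diagonal boundary (y = x), like a linear classifier."""
    return int(y > x)


def grid_points(n):
    """Centres of an n-by-n grid of cells covering the unit square."""
    coords = [(i + 0.5) / n for i in range(n)]
    return [(x, y) for x in coords for y in coords]


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def weighted_impurity(points, x_splits, y_splits, model):
    """Average Gini impurity of the hyper-rectangles defined by the sorted split lists."""
    cells = {}
    for x, y in points:
        # Index of the rectangle this point falls into along each axis
        key = (bisect_right(x_splits, x), bisect_right(y_splits, y))
        cells.setdefault(key, []).append(model(x, y))
    return sum(len(labels) / len(points) * gini(labels) for labels in cells.values())


if __name__ == "__main__":
    points = grid_points(20)
    print("split on the boundary:",
          weighted_impurity(points, [0.5], [], axis_aligned_black_box))
    print("split moved to x = 0.3:",
          weighted_impurity(points, [0.3], [], axis_aligned_black_box))
    print("diagonal boundary, 2x2 rectangles:",
          weighted_impurity(points, [0.5], [0.5], diagonal_black_box))
    print("diagonal boundary, 4x4 rectangles:",
          weighted_impurity(points, [0.25, 0.5, 0.75], [0.25, 0.5, 0.75], diagonal_black_box))
```

When the split sits on the axis-aligned boundary, every rectangle is pure. Moving it away mixes the classes. With the diagonal boundary, finer splits reduce the impurity in this toy setup, but the rectangles that straddle the diagonal remain mixed.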