ELISE Wrap up Event

This blog post is written by AI CDT student, Jonathan Erskine

I recently attended the ELISE Wrap up Event in Helsinki, marking the end of just one of many programs of research conducted under the ELLIS society, which “aims to strengthen Europe’s sovereignty in modern AI research by establishing a multi-centric AI research laboratory consisting of units and institutes distributed across Europe and Israel”.

This page does a good job of explaining ELISE and ELLIS if you want more information.

Here I summarise some of the talks from the two-day event (in varying detail). I also provide some useful contacts and potential sources of funding (you can skip to the bottom for these).

Robust ML Workshop

Peter Grünwald: ‘e’ is the new ‘p’

P-values are an important indicator of statistical significance when testing a hypothesis: a result is declared significant when the calculated p-value is smaller than some predefined threshold, typically $\alpha = 0.05$. This guarantees that the probability of a Type 1 Error (falsely rejecting a true null hypothesis) is at most 5%.

“p-hacking” is a malicious practice where statistical significance can be manufactured by, for example:

  • stopping the collection of data once you get P<0.05
  • analyzing many outcomes, but only reporting those with P<0.05
  • selectively adding or removing covariates
  • selectively excluding participants
  • etc.

Sometimes this is morally ambiguous. For example, imagine a medical trial where a new drug shows promising, but not statistically significant, results. Should the significance test fail, you can repeat the trial, pool the new data with the old, and keep going until you achieve the desired p-value. This can be prohibitively expensive, and it is hard to know whether you are p-hacking or simply hadn’t tested enough people to prove your hypothesis. This approach, called “optional stopping”, can violate Type 1 Error guarantees: every additional look at the data is another chance of a false positive, so the cumulative probability of falsely rejecting the null grows beyond your threshold $\alpha$.
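The inflation caused by optional stopping is easy to reproduce in simulation. The sketch below is not from the talk: it assumes Gaussian data with known unit variance, a two-sided z-test, and a hypothetical analyst who peeks after every batch of 10 observations and stops as soon as p < 0.05. The function names and the `batch` and `max_n` settings are illustrative choices.

```python
import math
import random


def two_sided_p(z: float) -> float:
    """Two-sided p-value of a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def peeking_false_positive_rate(trials=2000, batch=10, max_n=200,
                                alpha=0.05, seed=0):
    """Fraction of null experiments declared significant at any peek.

    The null hypothesis (mean 0, variance 1) is true in every trial, so
    each rejection is a Type 1 error.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        total, n = 0.0, 0
        while n < max_n:
            for _ in range(batch):
                total += rng.gauss(0.0, 1.0)
            n += batch
            z = total / math.sqrt(n)  # z-statistic for H0: mean = 0
            if two_sided_p(z) < alpha:
                rejections += 1  # stop as soon as the test "succeeds"
                break
    return rejections / trials


def single_look_false_positive_rate(trials=2000, n=200, alpha=0.05, seed=1):
    """Same test, but analysed once at a sample size fixed in advance."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        total = sum(rng.gauss(0.0, 1.0) for _ in range(n))
        if two_sided_p(total / math.sqrt(n)) < alpha:
            rejections += 1
    return rejections / trials


peeking_rate = peeking_false_positive_rate()
fixed_rate = single_look_false_positive_rate()
print(f"fixed sample size: {fixed_rate:.3f}, with peeking: {peeking_rate:.3f}")
```

Under these assumptions the fixed-sample rate should stay near the nominal 5%, while the rate with repeated peeking comes out several times larger.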

Peter described the theory of hypothesis testing based on the e-value, a notion of evidence that, unlike the p-value, allows for “effortlessly combining results from several studies in the common scenario where the decision to perform a new study may depend on previous outcomes”.

Unlike with the p-value, this proposed method is “safe under optional continuation with respect to Type 1 error”; no matter when the data collection and combination process is stopped, the Type 1 error probability is preserved. For singleton nulls, e-values coincide with Bayes factors.

In any case, general e-values can be used to construct Anytime-Valid Confidence Intervals (AVCIs), which are useful for A/B testing as “with a bad prior, AVCIs become wide rather than wrong”.

Compared to classical approaches, e-values and AVCIs need more data in the worst case, in exchange for allowing optional stopping without inflating the Type 1 error rate. On average, though, you can stop sooner.
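To make the idea concrete, here is a minimal sketch of testing with an e-value under continuous monitoring. It is an illustration under stated assumptions rather than the construction from the talk: a coin-flip experiment with null hypothesis P(heads) = 0.5, a simple alternative of 0.6, and the running likelihood ratio used as the e-value. The null is rejected as soon as the e-value reaches 1/α. The function names and settings are hypothetical.

```python
import random


def e_process_rejects(true_p, alt_p=0.6, alpha=0.05, max_n=500, rng=None):
    """Monitor a coin continuously and stop once the e-value reaches 1/alpha.

    H0: P(heads) = 0.5. The e-value is the running likelihood ratio of a
    simple alternative (alt_p) against H0.
    """
    if rng is None:
        rng = random.Random()
    e_value = 1.0
    for _ in range(max_n):
        heads = rng.random() < true_p
        e_value *= (alt_p if heads else 1.0 - alt_p) / 0.5
        if e_value >= 1.0 / alpha:
            return True  # reject H0 the moment the evidence is strong enough
    return False


def rejection_rate(true_p, trials=2000, seed=0):
    """Fraction of simulated experiments in which H0 is rejected."""
    rng = random.Random(seed)
    return sum(e_process_rejects(true_p, rng=rng) for _ in range(trials)) / trials


null_rate = rejection_rate(true_p=0.5)  # every rejection here is a Type 1 error
power = rejection_rate(true_p=0.6)      # rejections when the alternative holds
print(f"Type 1 error with continuous monitoring: {null_rate:.3f}")
print(f"Power when P(heads) = 0.6: {power:.3f}")
```

Even though the data are checked after every single flip, the false rejection rate under the null should stay at or below α (up to simulation noise), which is the property that a repeatedly checked p-value does not have.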

This is being adopted for online A/B testing but is more challenging for expensive domains, such as medical trials; you need to reserve more patients for your trial, but you won’t need them all – a challenging sell, but probability indicates that you should save time and effort in the majority of cases.

Other relevant literature pioneering this approach to significance testing includes Waudby-Smith and Ramdas, JRSS B, 2024.

There is an R package here for anyone who wants to play with Safe Anytime-Valid Inference.

Watch the full seminar here:

https://www.youtube.com/watch?v=PFLBWTeW0II

Tamara Broderick: Can dropping a little data change your conclusions – A robustness metric

arxiv.org

Tamara advocated the value of economics datasets as rich test beds for machine learning, highlighting that one can examine the data produced from economic trials with respect to robustness metrics and can come to vastly different conclusions than those published in the original papers.

Focusing in, she described a micro-credit experiment where economists ran randomised controlled trials on small communities, taking approximately 16,500 data points with the assumption that their findings would generalise to larger communities. But is this true?

When can I trust decisions made from data?

In a typical setup, you (1) run an analysis on a series of data, (2) come to some conclusion on that data, and (3) ultimately apply those decisions to downstream data which you hope is not so far out-of-distribution that your conclusions no longer apply.

Why do we care about dropping data?

Useful data analysis must be sensitive to some change in data, but certain types of sensitivity are concerning to us, for example, if removing some small fraction $\alpha$ of the data were to:

  • Change the sign of an effect
  • Change the significance of an effect
  • Generate a significant result of the opposite sign

Robustness metrics aim to tell us how much confidence we should have in our ability to generalise. If a conclusion can be overturned by dropping a small fraction of the data, this implies a low signal-to-noise ratio. This is where Tamara introduces her novel metric, the Approximate Maximum Influence Perturbation (AMIP), which helps to quantify this vulnerability to noise.

Can we drop one data point to flip the sign of our answer?

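In reality, this is very expensive to test for any dataset where the sample size N is large: dropping a single point already means creating N leave-one-out datasets and re-running your analysis on each, and the number of candidate subsets explodes once you allow several points to be dropped. Instead, we need an approximation.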

Let the Maximum Influence Perturbation be the largest possible change induced in the quantity of interest by dropping no more than 100α% of the data.

From the paper:

We will often be interested in the set that achieves the Maximum Influence Perturbation, so we call it the Most Influential Set.

And we will be interested in the minimum data proportion α ∈ [0,1] required to achieve a change of some size ∆ in the quantity of interest, so we call that α the Perturbation-Inducing Proportion. We report NA if no such α exists.

In general, to compute the Maximum Influence Perturbation for some α, we would need to enumerate every data subset that drops no more than 100α% of the original data. And, for each such subset, we would need to re-run our entire data analysis. If m is the greatest integer smaller than 100α, then the number of such subsets is larger than $\binom{N}{m}$. For N = 400 and m = 4, $\binom{N}{m} = 1.05\times10^9$. So computing the Maximum Influence Perturbation in even this simple case requires re-running our data analysis over 1 billion times. If each data analysis took 1 second, computing the Maximum Influence Perturbation would take over 33 years to compute. Indeed, the Maximum Influence Perturbation, Most Influential Set, and Perturbation-Inducing Proportion may all be computationally prohibitive even for relatively small analyses.
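The figures in the quoted passage can be checked directly. A short sketch using only Python’s standard library (the variable names are illustrative):

```python
import math

n_points, n_dropped = 400, 4
subsets = math.comb(n_points, n_dropped)  # ways to choose 4 points out of 400
seconds_per_year = 365.25 * 24 * 3600
years = subsets / seconds_per_year        # at one second per re-run

print(subsets)          # 1050739900
print(round(years, 1))  # 33.3
```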

Further definitions are described better in the paper, but suffice to say the approximation succeeds in identifying where analyses can be significantly affected by a minimal proportion of the data. For example, in the Oregon Medicaid study (Finkelstein et al., 2012), they identify a subset containing less than 1% of the original data that controls the sign of the effects of Medicaid on certain health outcomes. Dropping 10 data points takes the result from significant to non-significant.
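To give a flavour of how such an approximation can work, here is a minimal sketch for the slope of a simple linear regression. It is not the authors’ code and it uses synthetic data: it linearises the slope in per-point data weights, ranks the points by that first-order influence, and predicts the effect of dropping the most influential 1% without refitting. A single exact refit is then run as a check. All names and the data-generating settings are assumptions for illustration.

```python
import random


def fit_slope(xs, ys):
    """Ordinary least squares slope of y on x (with an intercept)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_influences(xs, ys):
    """First-order effect on the slope of up-weighting each data point.

    Dropping point i moves its weight from 1 to 0, so the approximate
    change in the slope is minus the i-th influence.
    """
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = fit_slope(xs, ys)
    return [(x - x_bar) * ((y - y_bar) - slope * (x - x_bar)) / sxx
            for x, y in zip(xs, ys)]


# Synthetic data (an assumption): weak positive effect, lots of noise.
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(1000)]
ys = [0.1 * x + rng.gauss(0.0, 1.0) for x in xs]

alpha = 0.01
m = round(alpha * len(xs))  # number of points we allow ourselves to drop
full_slope = fit_slope(xs, ys)
influences = slope_influences(xs, ys)

# To push the slope down as far as possible, drop the m points with the
# largest positive influence.
ranked = sorted(range(len(xs)), key=lambda i: influences[i], reverse=True)
dropped = set(ranked[:m])
approx_slope = full_slope - sum(influences[i] for i in dropped)

# One exact refit, to check the approximation.
kept_xs = [x for i, x in enumerate(xs) if i not in dropped]
kept_ys = [y for i, y in enumerate(ys) if i not in dropped]
refit_slope = fit_slope(kept_xs, kept_ys)

print(f"full: {full_slope:.3f}, approx after drop: {approx_slope:.3f}, "
      f"refit after drop: {refit_slope:.3f}")
```

The point of the sketch is the cost: ranking the influences needs one fit, whereas searching all subsets of 10 points out of 1000 by refitting would be hopeless.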

Code for the paper is available at:

https://github.com/rgiordan/AMIPPaper/blob/main/README.md

An R version of the AMIP metric is available:

https://github.com/maswiebe/metrics.git

Watch a version of this talk here:

https://www.youtube.com/watch?v=7eUrrQRpz2w

Cedric Archambeau: Beyond SHAP – Explaining probabilistic models with distributional values

Abstract from the paper:

A large branch of explainable machine learning is grounded in cooperative game theory. However, research indicates that game-theoretic explanations may mislead or be hard to interpret. We argue that often there is a critical mismatch between what one wishes to explain (e.g. the output of a classifier) and what current methods such as SHAP explain (e.g. the scalar probability of a class). This paper addresses such gap for probabilistic models by generalising cooperative games and value operators. We introduce the distributional values, random variables that track changes in the model output (e.g. flipping of the predicted class) and derive their analytic expressions for games with Gaussian, Bernoulli and categorical payoffs. We further establish several characterising properties, and show that our framework provides fine-grained and insightful explanations with case studies on vision and language models.

Cedric described how SHAP values can be reformulated as random variables on a simplex, shifting from the weight of individual players to a distribution of transition probabilities. Following this insight, they generate explanations on transition probabilities instead of individual classes, demonstrating their approach on several interesting case studies. This work is in its infancy and has plenty of opportunity for further investigation.
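For readers unfamiliar with the starting point, SHAP is built on Shapley values from cooperative game theory: each player’s value is its marginal contribution averaged over all orderings of the players. A minimal sketch for a toy three-player game follows. The payoff table is made up for illustration, and this is the classical scalar Shapley value, not the distributional values proposed in the paper.

```python
from itertools import permutations


def shapley_values(players, payoff):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += payoff(with_p) - payoff(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


# Hypothetical payoff table: think of three features and the model output
# obtained when only that subset of features is "present".
PAYOFFS = {
    frozenset(): 0.0,
    frozenset("a"): 0.1,
    frozenset("b"): 0.2,
    frozenset("c"): 0.0,
    frozenset("ab"): 0.6,
    frozenset("ac"): 0.1,
    frozenset("bc"): 0.3,
    frozenset("abc"): 0.8,
}

values = shapley_values(["a", "b", "c"], PAYOFFS.__getitem__)
print(values)
```

The values sum to the payoff of the full coalition (0.8 here), which is the efficiency property that makes them attractive as attributions. Each value is a single scalar per player, which is the limitation the talk set out to address.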

Semantic, Symbolic and Interpretable Machine Learning Workshop

Nada Lavrač: Learning representations for relational learning and literature-based discovery

This was a survey of types of representation learning, focusing on Nada’s area of expertise in propositionalisation and relational data, Bisociative Literature-Based Discovery, and interesting avenues of research in this direction.

Representation Learning

Deep learning, while powerful (accurate), raises concerns over interpretability. Nada takes a step back to survey different forms of representation learning.

Sparse, Symbolic, Propositionalisation:

  • These methods tend to be less accurate but are more interpretable.
  • Examples include propositionalization techniques that transform relational data into a propositional (flat) format.

Dense, Embeddings:

  • These methods involve creating dense vector representations, such as word embeddings, which are highly accurate but less interpretable.

Recent work focuses on unifying methods that combine the strengths of both.

Hybrid Methods:

  • Incorporate Sparse and Deep methods
  • DeepProp, PropDRM, propStar(?) – Methods discussed in their paper.

Representation learning for relational data can be achieved by:

  • Propositionalisation – transforming a relational database into a single-table representation. example: Wordification
  • Inductive logic programming
  • Semantic relational learning
  • Relational subgroup discovery (written by Nada and our own P. Flach)
  • Semantic subgroup discovery system, “Hedwig” that takes as input the training examples encoded in RDF, and constructs relational rules by effective top-down search of ontologies, also encoded as RDF triples.
  • Graph-based machine learning
    • data and ontologies are mapped to nodes and edges
    • In this example, gene ontologies are used as background knowledge for improving quality assurance of literature-based Gene Ontology Annotation
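Propositionalisation, the first item in the list above, can be sketched in a few lines. The example assumes a hypothetical two-table database (customers and their orders) and flattens the one-to-many relation into one row of aggregate features per customer. The tables and feature names are invented, and real systems such as Wordification construct their features differently.

```python
from collections import defaultdict

# Hypothetical relational data: one customer has many orders.
customers = [
    {"id": 1, "country": "FI"},
    {"id": 2, "country": "SI"},
]
orders = [
    {"customer_id": 1, "category": "books", "amount": 20.0},
    {"customer_id": 1, "category": "games", "amount": 60.0},
    {"customer_id": 2, "category": "books", "amount": 15.0},
]


def propositionalise(customers, orders):
    """Flatten the customer/order relation into one feature row per customer."""
    by_customer = defaultdict(list)
    for order in orders:
        by_customer[order["customer_id"]].append(order)

    categories = sorted({order["category"] for order in orders})
    table = []
    for customer in customers:
        own = by_customer[customer["id"]]
        row = {
            "id": customer["id"],
            "country": customer["country"],
            "n_orders": len(own),
            "total_amount": sum(o["amount"] for o in own),
        }
        # One boolean column per category: "has an order in this category".
        for category in categories:
            row[f"bought_{category}"] = any(o["category"] == category for o in own)
        table.append(row)
    return table


flat_table = propositionalise(customers, orders)
for row in flat_table:
    print(row)
```

The resulting single table can be fed to any standard (propositional) learner, and each column keeps a readable meaning, which is where the interpretability of this family of methods comes from.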

These slides, although a little out of date, talk about a lot of what I have noted here, plus a few other interesting methodologies.

The GitHub repo for their book contains lots of Jupyter notebook examples.

https://github.com/vpodpecan/representation_learning.git

Marco Gori: Unified approach to learning over time and logic reasoning

I unfortunately found this very difficult to follow, largely due to my lack of subject knowledge. I do think what Marco is proposing requires an open mind as he re-imagines learning systems which do not need to store data to learn, and presents time as an essential component of learning for truly intelligent “Collectionless AI”.

I won’t try and rewrite his talk here, but he has a full classroom series available on Google, which he might give you access to if you email him.

Conclusions:

  • Emphasising environmental interactions – collectionless AI which doesn’t record data
  • Time is the protagonist: higher degree of autonomy, focus of attention and consciousness
  • Learning theory inspired from theoretical physics & optimal control: Hamiltonian learning
  • Neuro-symbolic learning and reasoning over time: semantic latent fields and explicit semantics
  • Developmental stages and gradual knowledge acquisition

Contacts & Funding Sources

For Robust ML:

e-values, AVCIs:

Aaditya Ramdas at CMU

Peter Grünwald Hiring

For anyone who wants to do a Robust ML PhD, apply to work with Ayush Bharti: https://aalto.wd3.myworkdayjobs.com/aalto/job/Otaniemi-Espoo-Finland/Doctoral-Researcher-in-Statistical-Machine-Learning_R40167

If you know anyone working in edge computing who would like 60K to develop an enterprise solution, here is a link to the funding call: https://daiedge-1oc.fundingbox.com/ The open call starts on 29 August 2024.

If you’d like to receive monthly updates with new funding opportunities from Fundingbox, you can subscribe to their newsletter: https://share-eu1.hsforms.com/1RXq3TNh2Qce_utwh0gnT0wfegdm

Yoshua Bengio said he had fellowship funding but didn’t give out specific details, or I forgot to write them down… perhaps you can send him an email.

BIAS ’23 – Day 2: Huw Day Talk – Data Unethics Club

This blog post is written by CDT AI student Roussel Desmond Nzoyem

Let’s begin with a thought experiment. Imagine you are having a wonderful conversation with a long-time colleague. Towards the end of your conversation, they suggest an idea which you don’t have further time to explore. So you do what any of us would: you say, “email me the details”. When you get home, you receive an email from your colleague. But something is off. The writing in the email sounds different, far from how your friend normally expresses themselves. Who, or rather what, wrote the email?

When the limit between humans and artificial intelligence text generation becomes so blurred, don’t you wish you could tell whether a written text came from an artificial intelligence or from actual humans? What are the ethical concerns surrounding that?

Introduced by OpenAI in late 2022, ChatGPT continues its seemingly inevitable course in restructuring our societies. The second day of BIAS’23 was devoted to this impressive chatbot, from its fundamental principles to its applications and its implications. This was the platform for Mr Huw Day and his interactive talk titled Data Unethics Club.

Mr Day (soon to be a Dr employed by the JGI) is a PhD candidate at the University of Bristol. Although Mr Day is a mathematics PhD student, that is not what comes across on first impression. The first thing one notices is his passion for ethics. He loves that stuff, as evidenced by the various blogposts he writes for the Data Ethics Club. By the end of this post, I hope you will want to join the Data Ethics Club as well.

Mr Day introduced his audience to many activities, beginning with a little guessing game for warmup. The goal was telling whether short lines were generated by ChatGPT or a human being. For instance:

How would you like a whirlwind of romance that will inevitably end in heartbreak?

If you guessed human, you were right! That archetypical cheesy line was in fact generated by one of Mr Day’s friends. Perhaps surprisingly, it worked! You might be forgiven for guessing ChatGPT, especially since the other lines from the bot were incredibly human sounding.

The first big game introduced by Mr Day required a bit more collaboration than the warmup. The goal was to jailbreak GPT into doing tasks that its maker, OpenAI, wouldn’t normally allow. The attendees in the audience had to trick ChatGPT into providing a detailed recipe for Molotov cocktails. As Mr Day ran around the room with a microphone to quiz his entertained audience, it became clear that the prevalent strategy was to disguise the shady query with a story. One audience member imagined a fantasy movie script in which a sorcerer (Glankor) taught his apprentice (Boggins) the recipe for the deadliest of weapons (see Figure 2).

Figure 1 – Mr Day introducing the jailbreaking challenge.

Figure 2 – ChatGPT giving away the recipe for a Molotov cocktail (courtesy of Mr Kipp McAdam Freud)

For the second activity, Mr Day presented the audience with the first part of a paper’s abstract. Like the warmup activity, the goal was to guess which of the two proposed texts for the second half came from ChatGPT, and which one came from a human (presumably the same human that wrote the first half of the abstract). For instance, the first part of one abstract reads as follows (Shannon et al. 2023):

Reservoir computing (RC) promises to become as equally performing, more sample efficient, and easier to train than recurrent neural networks with tunable weights [1]. However, it is not understood what exactly makes a good reservoir. In March 2023, the largest connectome known to man has been studied and made openly available as an adjacency matrix [2].

Figure 3 – Identifying the second half of an abstract written by ChatGPT

As can be seen in Figure 3, Mr Day disclosed which proposal for the second part of the abstract ChatGPT was responsible for. For this particular example, Mr Day revealed an interesting clue he used to tell them apart: the acronym for Reservoir Computing (RC) is redefined, despite already being defined in the first half. No human researcher would normally do that!

A few other examples of abstracts were looked at, including Mr Day’s own work in progress towards his thesis, and the Data Ethics Club’s whitepaper, each time quizzing the audience to understand how they were able to spot ChatGPT. The answers ranged from very subjective like “the writing not feeling like a human’s” to quite objective like “the writing being too high-level, not expert enough”.

This led into the final activity of the talk, based on the game Spot the Liar! Our very own Mr Riku Green volunteered to share with the audience how he used ChatGPT in his daily life. The audience had to guess, based on questions asked to Mr Green, whether the outlandish task he described actually took place. Now, if you’ve spent a day with Mr Green, you’d know how obsessed he is with ChatGPT. So when Mr Green recounted he’d used ChatGPT to provide tech support to his father, the room guessed well that he was telling the truth. All that said, nobody could have guessed that Mr Green could use ChatGPT to write a breakup text.

Besides the deeper understanding of ChatGPT that the audience gained from this talk, one of the major takeaways from the activities was a set of tips and tell-tale signs of a ChatGPT production, and of a “liar” who uses it: repeated acronyms, too many adjectives, combining concepts which normally aren’t compatible, over-flattering language, and claiming some novelty which the author of the underlying work wouldn’t even think of claiming. These are all flags that should signal to the reader that the text they are engaging with might have been generated by an AI.

All these activities, along with the moral implications involved in each, served as the stepping stone for Mr Day to present the Data Ethics Club. This is a welcoming community of academics, enthusiasts, industry experts and more, who voice their ethical concerns and question the moral implications of AI. They boast the most comprehensive list of online resources, along with blog posts, on their website to get people started. They are based at the University of Bristol, but open to all, as stated on their website: https://dataethicsclub.com/. Although the games outlined above are not part of the activities they carry out during their bi-weekly hour-long Zoom meetings, they keep each of their gatherings fresh and engaging. In fact, Mr Day’s organizing team has been so successful that other companies (unnamed here due to confidentiality arrangements) are trying to replicate their model in-house. If you want to establish your own Data Ethics Club, look no further than the paper titled Data Ethics Club: Creating a collaborative space to discuss data ethics.

References:

Shannon, A., Green, R., Roberts, K. (2023). Insects In The Machine – Can tiny brains achieve big results in reservoir computing? Personal notes. Retrieved 8 September 2023.

BIAS ’23 – Day 1: Dr Kacper Sokol talk – The Difference Between Interactive AI and Interactive AI

This blog is written by CDT AI PhD student Beth Pearson

The first day of talks at the Bristol Interactive AI Summer School (BIAS) ended with a thought-provoking talk by Dr. Kacper Sokol on The Difference Between Interactive AI and Interactive AI. Kacper began by explaining that social sciences have decades’ worth of research on how humans reason and explain. Now, with an increasing demand for AI and ML systems to become more human-centered, with a focus on explainability, it makes sense to use insights from social sciences to guide the development of these models.

Humans often explain things in a contrastive and social manner, which has led to counterfactual explanations being introduced by AI and ML researchers. Counterfactuals are statements relating to what has not happened or is not the case, for example, “If I hadn’t taken a sip of this hot coffee, I wouldn’t have burned my tongue.” Counterfactual explanations have the advantage of being suitable for both technical and lay audiences; however, they only provide information about one choice that the model makes, so they can bias the recipient.
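A counterfactual explanation can be generated mechanically by searching for a small change to the input that flips the model’s decision. The sketch below is a toy illustration, not something shown in the talk: the loan “model”, its coefficients and the applicant are all hypothetical, and the search only varies one feature.

```python
def approve_loan(income, debt):
    """Toy stand-in for a trained model: approve when the score is positive."""
    return 0.5 * income - 1.0 * debt - 10.0 > 0


def income_counterfactual(income, debt, step=1.0, max_steps=1000):
    """Smallest income increase (in multiples of `step`) that flips a rejection."""
    for k in range(1, max_steps + 1):
        if approve_loan(income + k * step, debt):
            return income + k * step
    return None  # no counterfactual found within the search range


applicant = {"income": 30.0, "debt": 8.0}
rejected = not approve_loan(**applicant)
needed = income_counterfactual(**applicant)
print(f"If your income had been {needed} instead of {applicant['income']}, "
      "the loan would have been approved.")
```

Note that this explanation only mentions income. An equally valid counterfactual could lower the debt instead, which is arguably one way a single counterfactual can bias the recipient.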

Kacper then described his research focus on pediatric sepsis. Sepsis is a life-threatening condition that develops from an infection and is the third leading cause of death worldwide. Pediatric sepsis specifically refers to cases occurring in children. Sepsis is a particularly elusive disease because it can manifest differently in different people, and patients respond differently to treatments, making it challenging to identify the best treatment strategy for a specific patient. Kacper hopes that AI will be able to help solve this problem in this day and age.

Importantly, the AI being applied to the pediatric sepsis problem is interactive and aims to support and work alongside humans rather than replace them. It is crucial that the AI aligns with the current clinical workflow so that it can be easily adopted into hospitals and GP practices. Kacper highlights that this is particularly important for pediatricians as they have been highly skeptical of AI in the past. However, now that AI has proven successful in adult branches of medicine, they are starting to warm to the idea.

Pediatric sepsis comes with many challenges. Pediatric sepsis has less data available than adult sepsis, and there is rapid deterioration, meaning that early diagnosis is vital. Unfortunately, there are many diseases in children that mimic the symptoms of sepsis, making it not always easy to diagnose. One of the main treatments for sepsis is antibiotics; however, since children are a vulnerable population, we don’t want to administer antibiotics unnecessarily. Currently, it is estimated that 98% of children receive antibiotics unnecessarily, which is contributing to antimicrobial resistance and can cause drug toxicity.

AI has the potential to help with these challenges; however, the goal is to augment, not disrupt, the current workflow. Humans can have great intuition and can observe cues that lead to excellent decision-making, which is particularly valuable in medicine. An experiment was carried out on nurses in neonatal care, which showed that nurses were able to correctly predict which infants were developing life-threatening infections without having any knowledge of the blood test results. Despite being able to identify the disease, the nurses were unable to explain their judgment process. The goal is to add automation from AI but still retain certain key aspects of human decision-making.

How much and where the automation should take place is not a simple question, however. You could replace biased humans with algorithms, but algorithms can also be biased, so this wouldn’t necessarily improve anything. Another option would be to have algorithms propose decisions and have humans check them; however, this still requires humans to carry out mundane tasks. Would it really be better than no automation at all? Kacper then asks: if you can prove an AI model is capable of predicting better than a human, and a human decides to use their own judgment to override the model, could it be considered malpractice?

Another proposed solution for implementing interactive AI is to have humans make the decision, with the AI model presenting arguments for and against that decision to help the human decide whether to change their mind or not.

The talk ends by discussing how interactive AI may be deployed in real-life scenarios. Since the perfect integration of AI and humans doesn’t quite exist yet, Kacper suggests that clinical trials might be a good idea, where suggestions made by AI models are marked as ‘for research only’ to keep them separated from other clinical workflows.

BIAS ’23 – Day 3: Dr Daniel Schien talk – Sustainability of AI within global carbon emissions

This blog post is written by AI CDT student Phillip Sloan

After a great presentation by Dr Dandan Zhang, Dr Daniel Schien presented a keynote on the Carbon Footprint of AI within global carbon emissions of ICT. The presentation provided a reflection on AI’s role within climate change.

The keynote started by stating that the effects of climate change are becoming more noticeable. It’s understandable that we might get numb from the constant barrage of climate change reports in the news, but the threat of climate change is still present and it is one of the biggest challenges we face today. As engineers, we have a duty to reduce our impact where possible. The Intergovernmental Panel on Climate Change (IPCC) is trying to model the effects of global climate change, demonstrating many potential futures depending on how well we limit our carbon emissions. It has been agreed that we can no longer stop climate change, and the focus has changed to trying to limit its effect, with an aim to limit the global temperature increase to 2 degrees. The IPCC has modelled the impact until 2100, across various regions and a range of impact areas.

Currently, global emissions are approximately 50 gigatonnes of carbon dioxide equivalent (GtCO2e) per year, which needs to be reduced significantly. This is the total, including sectors such as energy production, agriculture and general industry. Many governments have legislated on carbon emissions, introducing CO2 emission standards for cars and vans, renewable energy directives, and land use and forestry regulation. The main goal is a 50% reduction in carbon emissions by 2030.

ICT’s share of global greenhouse gas (GHG) emissions is 2.3%, with data centres, where a lot of AI algorithms are run, creating a large proportion of these emissions. Do we need to worry about AI’s contribution to climate change? The keynote highlighted that 20-30% of all data centre energy consumption is related to AI, and looking at just the ChatGPT model, its energy consumption is equivalent to that of 175,000 households! These figures are expected to get worse, with the success of AI causing an increase in demand, further increasing AI’s energy consumption. The keynote highlighted that the impact of AI is not just from training and inference, but also from the construction of the data centres and equipment, such as graphics cards.

A conceptual model was presented, modelling the effects of ICT on carbon emissions. The model described three effects that ICT has on carbon consumption. These are direct effects, enabling (indirect) effects and systemic effects. Direct effects are related to the technology that is being developed: its production, use and disposal. Enabling effects are related to its application, providing induction and obsolescence effects. Systemic effects are related to behavioural and structural change from utilising these applications.

So, what can be done to reduce the environmental impact of AI? In the development of AI systems, efficiency improvements, such as using more energy-efficient models and hardware, reduce energy consumption and improve the carbon footprint. Using green energy also matters for your carbon footprint. Dr Schien notes that the UK has acted upon this, implementing regulation to promote wind and solar energy with a hope to decarbonise the electric grid. The average gCO2e/kWh has moved from around 250 down to 50, showing the UK government’s efforts to tackle climate change.
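The effect of grid intensity on a workload’s footprint is simple arithmetic: operational emissions are the energy used multiplied by the grid’s carbon intensity. A sketch using the two intensity figures quoted in this section and a hypothetical 1,000 kWh training run (the energy figure is an assumption, not a number from the talk):

```python
def emissions_kg(energy_kwh: float, intensity_g_per_kwh: float) -> float:
    """Operational emissions in kgCO2e for a given energy use and grid intensity."""
    return energy_kwh * intensity_g_per_kwh / 1000.0


TRAINING_ENERGY_KWH = 1000.0  # hypothetical workload, for illustration only

before = emissions_kg(TRAINING_ENERGY_KWH, 250.0)  # grid at ~250 gCO2e/kWh
after = emissions_kg(TRAINING_ENERGY_KWH, 50.0)    # grid at ~50 gCO2e/kWh
print(f"{before:.0f} kgCO2e -> {after:.0f} kgCO2e "
      f"({1 - after / before:.0%} lower for the same computation)")
```

This only covers operational emissions. The embodied emissions from building data centres and hardware, which the keynote also highlighted, are not captured by this calculation.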

Despite its significant energy consumption, AI can be used to make systems more efficient, reducing the energy consumption of other systems. For example, AI-powered applications can tell the power systems to switch to using the batteries during times when tariffs are higher (peak load shifting), or when the grid power usage reaches a certain power grid alternating current limit (AC limit).
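The decision logic described above can be written down as a simple rule. The sketch below is a hand-written stand-in for what an AI-powered controller would forecast or learn. The function name, the thresholds and the battery reserve are all hypothetical placeholders.

```python
def choose_power_source(tariff, grid_draw_kw, battery_charge_kwh,
                        peak_tariff=0.30, ac_limit_kw=50.0, reserve_kwh=2.0):
    """Return "battery" or "grid" for the next time slot.

    All thresholds are hypothetical placeholders.
    """
    if battery_charge_kwh <= reserve_kwh:
        return "grid"  # never drain the battery below its reserve
    if tariff >= peak_tariff:
        return "battery"  # peak load shifting: avoid expensive grid energy
    if grid_draw_kw >= ac_limit_kw:
        return "battery"  # stay under the grid connection's AC limit
    return "grid"


print(choose_power_source(tariff=0.35, grid_draw_kw=20.0, battery_charge_kwh=10.0))
print(choose_power_source(tariff=0.10, grid_draw_kw=20.0, battery_charge_kwh=10.0))
```

A learned controller would replace the fixed thresholds with forecasts of tariffs and load, but the two triggers (peak tariff and AC limit) are the ones named in the talk.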

During the Q&A, an interesting question was put forward: at what point should sustainability be considered? When developing a model, or further down the pipeline?

Dr Schien answered by mentioning that you should always consider which model to use. Can you avoid a deep learning model and use something simpler, like a linear regression or random forest model? You can also avoid waste in your models: reducing the number of layers or changing architectures would be useful. Generally, only using what you need is an important mindset for improving your AI carbon footprint. An important note was that a lot of efficiencies are now being coded into frequently used libraries, which is helpful for development as it is now automated. Finally, seeking to work for companies that are mindful of energy consumption and emissions will put pressure on firms to consider these to attract talented staff.

Dr Daniel Schien is a senior lecturer at the University of Bristol. His research is focused on improving our understanding of the environmental impact of information and communication technologies (ICT), and on reducing that impact. We would like to thank him for his thoughtful presentation on the effect of AI with regards to climate change, and the discussions it provoked.

BIAS ’23 – Day 3: Prof. Kerstin Eder talk – (Trustworthy Systems Laboratory, University of Bristol) The AI Verification Challenge

This blog post is written by AI CDT student, Isabella Degen

A summary of Prof. Kerstin Eder’s talk on the well-established procedures and practices of verification and validation (V&V) and how they relate to AI algorithms. The objective is to inspire the readers to apply better V&V processes to their AI research. 

Verification is the process used to gain confidence in the correctness of a system compared to its requirements and specifications. Validation is the process used to assess if the system behaves as intended in its target environment. A system can verify well, meaning it does what it was specified to do, and not validate well, meaning it does not behave as intended.

V&V are challenging for systems that fully or partially involve AI algorithms despite V&V being a well-established and formalised practice. Many AI algorithms are black boxes that offer no transparency about how the algorithm operates. They respond with multiple correct answers to similar or even the same input. AI algorithms are not deterministic by design. Ideally, they can handle new situations well without needing to be trained for all situations. Therefore, accurately and exhaustively listing all the requirements against which these algorithms need to be verified is practically impossible.

V&V methods for complex robotic systems like automated vehicles are well-established. Automated vehicles need to be capable of operating in an environment where unexpected situations occur. Various ISO standards (ISO 13485 – Medical Devices Quality Management, ISO 10218-1 – Robots and Robotic Devices, ISO 12207 – Systems and Software Engineering) describe different V&V practices required for software, systems and devices. These standards expect the use of multiple processes and practices to meet the required quality. No one practice covers the full extent of V&V; each practice has shortcomings. The three techniques for V&V are formal verification, simulation-based verification and experiments [3]. The image below arranges these techniques by how realistic and coverable they are, where coverability refers to how much of the system a technique can analyse [1].

The image shows the framework for corroborative V&V [1].

An approach for simulation-based testing is coverage-driven verification (CDV). A two-tiered test generation approach where abstract test sequences are computed first and then concretised has been shown to achieve a high level of automation [2]. It is important to note that coverage includes code coverage, structural coverage (e.g. employing Finite State Machines) and functional coverage (including requirements and situations).

The images show the CDV process (left) and its translation to an automated vehicle scenario (right) [2].
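As a rough illustration of the two-tiered idea, the following Python sketch first enumerates abstract test sequences from a small finite state machine, then concretises each abstract step with sampled parameter values, and finally measures transition coverage. The state machine, action names and parameter ranges are hypothetical and are not taken from [2]; this is a minimal sketch of the general technique rather than the authors' implementation.

```python
import random

# Hypothetical finite state machine for a robot-to-human handover task.
# Keys are states; values map an abstract action to the next state.
FSM = {
    "idle": {"human_request": "approach"},
    "approach": {"human_ready": "handover", "human_absent": "idle"},
    "handover": {"release": "idle"},
}

# Hypothetical parameter ranges used to concretise each abstract action.
PARAM_RANGES = {
    "human_request": {"delay_s": (0.0, 5.0)},
    "human_ready": {"hand_height_m": (0.8, 1.4)},
    "human_absent": {"timeout_s": (1.0, 10.0)},
    "release": {"grip_force_n": (0.5, 5.0)},
}


def abstract_sequences(start, max_len):
    """Tier 1: enumerate all (state, action) sequences of exactly max_len steps."""
    sequences = []

    def walk(state, path):
        if len(path) == max_len:
            sequences.append(list(path))
            return
        for action, next_state in FSM[state].items():
            path.append((state, action))
            walk(next_state, path)
            path.pop()

    walk(start, [])
    return sequences


def concretise(sequence, rng):
    """Tier 2: attach sampled parameter values to each abstract step."""
    test = []
    for _state, action in sequence:
        params = {name: rng.uniform(low, high)
                  for name, (low, high) in PARAM_RANGES[action].items()}
        test.append((action, params))
    return test


def transition_coverage(sequences):
    """Fraction of FSM transitions exercised by the given abstract sequences."""
    all_transitions = {(s, a) for s, actions in FSM.items() for a in actions}
    covered = {step for seq in sequences for step in seq}
    return len(covered) / len(all_transitions)


rng = random.Random(0)
abstract = abstract_sequences("idle", max_len=3)
tests = [concretise(seq, rng) for seq in abstract]
coverage = transition_coverage(abstract)
```

Keeping the abstract tier separate means coverage can be reasoned about on the small model, while the concrete tier can be re-sampled many times to vary the simulated situation.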

Belief-desire-intention (BDI) agents used as models can further generate tests. These agents achieve coverage that is higher than or equivalent to that of model-checking automata. The BDI agents can emulate the agency present in Human-Robot Interactions. However, the cost of learning a belief set has to be considered [3]. Similarly, software testing agents can be used to generate tests for simulation-based automated vehicle verification. Such an agency-directed approach is robust and efficient. It generates twice as many effective tests compared to pseudo-random test generation. Moreover, these agents are encoded to behave naturally without compromising the effectiveness of test generation [4].

The hope is that, inspired by these techniques used to test robotic systems, we will promote V&V to first-class citizens when designing and implementing AI algorithms. V&V for AI algorithms requires innovation and a creative combination of existing techniques like intelligent agency-based test generation. The reward will be increased trust in AI algorithms.

References:

[1] Webster, Matt, et al. “A corroborative approach to verification and validation of human–robot teams.” The International Journal of Robotics Research 39.1 (2020): 73-99. https://journals.sagepub.com/doi/full/10.1177/0278364919883338

[2] Araiza-Illan, Dejanira, et al. “Systematic and realistic testing in simulation of control code for robots in collaborative human-robot interactions.” Towards Autonomous Robotic Systems: 17th Annual Conference, TAROS 2016, Sheffield, UK, June 26–July 1, 2016, Proceedings 17. Springer International Publishing, 2016. https://link.springer.com/chapter/10.1007/978-3-319-40379-3_3 

[3] Araiza-Illan, Dejanira, Anthony G. Pipe, and Kerstin Eder. “Model-based test generation for robotic software: Automata versus belief-desire-intention agents.” arXiv preprint arXiv:1609.08439 (2016). https://arxiv.org/abs/1609.08439

[4] Chance, Greg, et al. “An agency-directed approach to test generation for simulation-based autonomous vehicle verification.” 2020 IEEE International Conference On Artificial Intelligence Testing (AITest). IEEE, 2020. https://arxiv.org/abs/1912.05434

 

 

ESSAI 2023 Summer School – Matt Clifford

This blog post is written by AI CDT student, Matt Clifford

ESSAI 2023 – https://essai.si/

A few of us from the CDT (me, Matt, along with Jonny and Rachael) attended the ESSAI summer school from the 24th to the 28th of July 2023. ESSAI is the first European summer school on Artificial Intelligence and was held in Ljubljana, Slovenia. There were a variety of interesting topics and classes on offer (https://essai.si/schedule/), but here I’ll share some of the classes that I attended. I’ll keep the information on each topic brief, but feel free to reach out to me if you would like to chat through any of the topics which might be useful to you or if you would like to know more!

AutoML – https://www.automl.org/

Optimise machine learning algorithm hyperparameters and neural architectures automatically by using various techniques (Bayesian optimisation etc.). Python packages for sklearn and pytorch: https://pypi.org/project/smac/

https://github.com/automl/Auto-PyTorch

Very useful when you want a more objective training approach which will save you time, computation and more importantly frustration!
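To give a feel for what automated hyperparameter search does, here is a minimal sketch using plain random search over a made-up objective. Random search is swapped in here as a simple stand-in for Bayesian optimisation, and none of this uses the SMAC or Auto-PyTorch APIs; the `validation_loss` function and the search ranges are hypothetical.

```python
import math
import random


def validation_loss(learning_rate, num_layers):
    """Hypothetical stand-in for training a model and returning its validation loss."""
    return (math.log10(learning_rate) + 2.0) ** 2 + 0.1 * (num_layers - 3) ** 2


def random_search(n_trials, seed=0):
    """Sample configurations at random and keep the best one seen so far."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(n_trials):
        config = {
            # Sample the learning rate log-uniformly between 1e-5 and 1.
            "learning_rate": 10 ** rng.uniform(-5, 0),
            "num_layers": rng.randint(1, 6),
        }
        loss = validation_loss(**config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


best_config, best_loss = random_search(n_trials=200)
```

Bayesian optimisation differs in that it fits a surrogate model to the losses observed so far and uses it to choose the next configuration, rather than sampling blindly.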

Learning Beyond Static Datasets – https://owll-lab.com/

Exploring mechanisms to mitigate catastrophic forgetting when learning a new task in ML.

Topics related to: transfer learning, active learning, continual learning, lifelong learning, curriculum learning, open world learning, knowledge distillation.

A nice survey paper to map out the whole landscape – https://www.sciencedirect.com/science/article/pii/S089360802300014X?via%3Dihub

Uncertainty Quantification

Adding uncertainty to a model (important with neural networks being so overly confident!). Methods can either be inherent (Bayesian NN etc.) or post hoc (calibration, ensembling, Monte-Carlo dropout) and can disentangle aleatoric and epistemic uncertainty measures.
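One common way to separate the two kinds of uncertainty with an ensemble is to split the entropy of the averaged prediction into the average entropy of the members (aleatoric) and the remainder, which reflects disagreement between members (epistemic). The sketch below assumes a classification setting; the member probabilities are made up for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def decompose_uncertainty(member_probs):
    """Split an ensemble's predictive uncertainty for a single input.

    member_probs: list of class-probability lists, one per ensemble member.
    Returns (total, aleatoric, epistemic), where total = aleatoric + epistemic.
    """
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    mean_probs = [sum(m[c] for m in member_probs) / n_members
                  for c in range(n_classes)]
    total = entropy(mean_probs)  # entropy of the averaged prediction
    aleatoric = sum(entropy(m) for m in member_probs) / n_members  # mean member entropy
    epistemic = total - aleatoric  # what remains is due to member disagreement
    return total, aleatoric, epistemic


# Members agree that the input is ambiguous: the uncertainty is aleatoric.
agree = decompose_uncertainty([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])

# Members are individually confident but disagree: mostly epistemic uncertainty.
disagree = decompose_uncertainty([[0.99, 0.01], [0.01, 0.99], [0.99, 0.01]])
```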

Fairness & Privacy –

https://aif360.readthedocs.io/en/latest/

https://fairlearn.org/

The president of Slovenia (plus her not so inconspicuous bodyguards) attended these talks which was a bit of a surprise!

Explored navigating the somewhat conflicting landscape of statistical fairness, which works by ensuring groups of people have the same model statistics. Picking which statistics is, however, not so easy, and it’s impossible to ensure all statistics match in real-life scenarios – https://arxiv.org/pdf/2304.06057.pdf .
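A tiny example shows how two group statistics can pull apart. The sketch below computes, for each group, the selection rate (the quantity compared under demographic parity) and the true positive rate (compared under equal opportunity). The data is invented, and the code is plain Python rather than the AIF360 or Fairlearn APIs.

```python
def group_rates(y_true, y_pred, groups):
    """Per-group selection rate and true positive rate for binary predictions.

    Assumes every group contains at least one positive label.
    """
    rates = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        positives = [i for i in idx if y_true[i] == 1]
        selection_rate = sum(y_pred[i] for i in idx) / len(idx)
        tpr = sum(y_pred[i] for i in positives) / len(positives)
        rates[g] = {"selection_rate": selection_rate, "tpr": tpr}
    return rates


# Invented data: group A has a higher base rate of positive labels than group B.
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]

rates = group_rates(y_true, y_pred, groups)
```

Here both groups are selected at the same rate, so the selection-rate statistic matches, yet the true positive rates differ because the base rates differ. Equalising one statistic does not equalise the other.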

Also looked at privacy through anonymity (k-anonymity, l-diversity, t-closeness) and differential privacy. I won’t go into details but thought I’d mention some of the main techniques currently used in academia and industry.
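For a flavour of the simplest of these, a dataset is k-anonymous if every combination of quasi-identifier values is shared by at least k records. The sketch below checks this on a handful of invented records; the field names are hypothetical.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


# Invented records; "diagnosis" is the sensitive attribute.
records = [
    {"age_band": "20-29", "postcode": "BS1", "diagnosis": "flu"},
    {"age_band": "20-29", "postcode": "BS1", "diagnosis": "asthma"},
    {"age_band": "30-39", "postcode": "BS2", "diagnosis": "flu"},
    {"age_band": "30-39", "postcode": "BS2", "diagnosis": "diabetes"},
    {"age_band": "30-39", "postcode": "BS3", "diagnosis": "flu"},
]

k_fine = k_anonymity(records, ["age_band", "postcode"])  # one record is unique
k_coarse = k_anonymity(records, ["age_band"])            # dropping postcode raises k
```

Generalising or suppressing quasi-identifiers raises k at the cost of detail, which is the basic trade-off these anonymity techniques manage.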

Again, let me know if you want to go into the details of anything that is useful or interesting to you!

Also, a side note, Slovenia is an amazingly beautiful country, and I can very much recommend it to anyone thinking of going! Here are a few photos:

 

BIAS 22 Review Day 1 – Daniel Bennett “Complexity and Embodiment in Human Computer Interaction”

This blog post is written/edited by CDT Students  Amarpal Sahota and Oliver Deane

This was a thought-provoking starting point and one that clearly has a large impact on human-computer interaction. Daniel stated that this is a line of research in psychology, cognitive science, and robotics that has run somewhat parallel to mainstream psychology.

One of the initiators of this was James J. Gibson. Over the last 70 years, Gibson and others did a lot of work on how we use resources outside of just the brain, in our environment and in our bodies, and coordinate all of these together to behave effectively. Daniel stated that through the lens of embodiment we start focusing on processes, interactions, and relations, and the dynamics that follow, and that this is primarily a change in how we model things.

To summarise, one could consider the traditional cognitive model as a linear system. First we sense the world, then we form a representation of that world in our brain. The representation then gets processed through a bunch of neatly defined modules, updates existing plans and intentions, and results in some representation of an action, which we then carry out. The embodied view is more complex, as we are not simply in the world but also a part of it. The world is changing constantly, and our behaviour and cognition are simply another physical process in this world.

At a high level, embodied approaches consider behaviour in the world as a kind of continual adjustment and adaptation, with most behaviours grounded in a kind of improvisatory, responsive quality. Daniel shared a good example of this from Lucy Suchman related to canoeing: you may have an idea of your plan as you look down the river (‘I need to stay left there, slow down over there’), but at execution time you have to adapt your plan.

Daniel stated that a lot of work has been done observing a wide range of human behaviours, from technology interaction, to manning air-traffic control centres and crewing ships. In all of these contexts it is argued that our embodied skills – our adaptation and our implicit skills of coordination with the mess of the situation as it plays out – are the most important factor in determining outcomes.

Human Computer Interaction is increasingly focused on complex behaviours. Daniel talked about the idea that we’re going to do more and more in augmented reality and virtual reality. Computing will be integrated to support a wide range of everyday behaviours, which are not conventionally “cognitive” – you’re not sitting and thinking and making only very small movements with your fingers on a keyboard.

Daniel has a particular interest in musical performance and coordination of musicians. His perspective is that musical performance with technology, technology supported sports training and gaming, particularly team multiplayer games, are cases where static models of cognition seem to break down. He believes modelling in terms of processes and synchronization has great power.

Daniel then spoke about how interaction effects are important in Human Computer Interaction. He gave the example that notifications influence a person to use their phone, while at the same time the more a person uses their phone, the more they cause notifications to appear. He posed the interesting question: how does one disentangle these two effects to find out the degree to which notifications influence us?

Daniel then spoke about how reciprocal, interaction-dominant effects also play a significant role in the organisation of our individual skilled behaviour. He gave us an overview of his own research, where he found evidence of interaction-dominant coordination processes in a simple skilful game task in which users are asked to control a cursor to herd sheep.

BIAS 22 – Review day 2 keynote – Prof. Liz Sonenberg: “Imperfectly rational, rationally imperfect, or perfectly irrational…”

Imperfectly rational, rationally imperfect, or perfectly irrational: challenges for human-centered AI keynote by Prof. Liz Sonenberg

This blog post is written/edited by CDT Students  Isabella Degen and Oliver Deane

Liz opened the second day of BIAS 22 with her thought-provoking and entertaining keynote speech about automatic decision-making aids. She demonstrated how we humans make perfectly irrational decisions and spoke about the implications of using Explainable Artificial Intelligence (XAI) for better decision-making. Liz’s talk mentioned a great body of research spanning psychology, mathematics, and computer science for which she kindly provides all the references here https://tinyurl.com/4njp563e.

Starting off, Liz presented research demonstrating how subtle influences in our life can change the decisions we make despite us thinking that we are making them completely rationally. What we believe is rational human decision-making is in fact littered with cognitive biases. Cognitive bias is when we create a subjective reality based on a pattern we perceive, regardless of how representative that pattern is of all the information. Anchoring is a type of cognitive bias that happens when a person’s decision is influenced by an anchor, such as a random number, even though the person knows that the number is random and has nothing to do with their decision. An example Liz shared is an experiment by Englich et al., who used irrelevant anchors to change experts’ decision-making. In the experiment, young judges were asked to determine the length of the sentence for a theft crime by throwing a dice. Unknown to the judges, the dice was rigged: for one group of judges it would throw high numbers, for the other it would throw low numbers. The judges knew that throwing a dice should not influence their decision. However, the result was that the group with the dice giving low numbers gave a 5-month sentence while the group with the dice giving high numbers gave an 8-month sentence.

This is not the only kind of cognitive bias. Human decision-making also suffers from framing bias, where the way in which data is presented can affect the decision we make, and from confirmation bias, where we tend to interpret new information as a confirmation of our existing beliefs without considering that we only ever observe a limited kind of information. With these examples Liz made us doubt how clearly and rationally we humans can make decisions.

The irrationality of humans is an interesting challenge to consider for researchers attempting to create intelligent systems that help us humans make better decisions. Should we copy the imperfect human rationality in intelligent agents, or should we make them more rational than humans? And what does that mean for interactions between humans and intelligent systems? Research shows that it is important that human operators have a sense of what the machine is doing to be able to interact with it. From accidents such as the partial meltdown of a nuclear reactor at Three Mile Island, we can learn how important it is to design systems in a way that does not overwhelm the human operator with information. The information presented should be just enough to enable an operator to make a high-quality decision. It should help the operator to know when they can trust the decision the machine made and when to interrupt. When designing these systems, we need to keep in mind that people suffer from biases such as automation bias. Automation bias happens when a human cannot make a decision based on the information the machine provides and instead decides to just go with the machine’s decision, knowing that the machine is more often right than the human. Sadly, this means that a human interacting with a machine might not be able to interrupt the machine at the right moment. We know that human decision-making is imperfectly rational. And while automation bias appears to be an error, it is in fact a rational decision in the context of the limited information and time available to the human operator.

One promise of XAI is to use explanations to counteract various cognitive biases and with that help a human operator to make better decisions together with an intelligent system. Liz made a thought-provoking analogy to the science of magic. Magicians use our limited memory and observation abilities to manipulate our feelings, deceive us, and make the impossible appear possible. A magician knows that the audience is trying to spot how the trick works. The audience, in turn, knows that the magician is trying to deceive them while they try to discover how the trick works. Magicians understand their audience well. They know what humans really do and exploit the limited resources they have. As in magic, human-centered AI systems ought to anticipate how perfectly irrationally we humans make decisions, so that they can enable us to make better decisions and counteract our biases.

BIAS 22 – Review day 2 talk – Dr Oliver Ray: “Knowledge-driven AI”

This blog post is written/edited by CDT Students  Daniel Collins and Matt Clifford


The second talk of day two was delivered by Dr Oliver Ray (University of Bristol), on the topic of human-in-the-loop machine learning using Inductive Logic Programming (ILP) and its application in cyber threat elucidation.

Cyber threat elucidation is the task of analysing network activity to identify ransomware attacks, and to better understand how they unfold. Ransomware is a type of malware which infects victims’ devices, encrypts their data, and demands money from them to restore access. Infection typically occurs through human error. For example, a person may be unwittingly tricked into downloading and running a “trojan” – malware that has been disguised as a legitimate and benign file. The executed ransomware encrypts data, and backups of that data, on the infected system, and the perpetrator can then demand a ransom payment for decryption services. However, ransomware does not always start encrypting data immediately. Instead, it may lie relatively dormant whilst it spreads to other networked systems, and spend time gathering sensitive information and creating backups of itself to block data recovery. If an attack can be identified at this stage or soon after it has started encrypting data, it can be removed before most of the data has been affected.

Ransomware is a persistent threat to cyber security, and each new attack can be developed to behave in unpredictable ways. Dr Ray outlined the need for better tools to prepare for new attacks – when faced with a new attack, there should be systems to help a user understand what is happening and what has happened already so that ransomware can be found and removed as quickly as possible, and relevant knowledge can be gained from the attack.

To identify and monitor threats, security experts may perform forensic analysis of Network Monitoring Systems (NMS) data from around the time of infection. This data exists in the form of network logs – relational databases containing a time-labelled record of events and activity occurring across the networked systems. However, there are very large amounts of log data, and most of it is associated with benign activity, unrelated to the threat, making it difficult to find examples of malicious activity. Further, in the case of new threats, there are few or no labelled examples of logs known to be related to an attack. Human knowledge and reasoning are therefore crucial for identifying relevant information in the logs.

ILP-based machine learning (ML) was then presented by Dr Ray as a promising alternative to more ‘popular’ traditional ML methods for differentiating ransomware activity from benign activity in large network logs. This is because ILP is better suited for working with relational data, an area where deep learning and traditional ML methods can struggle since they often require tabular or vectorisable data formats. ILP not only gives the ability to make predictions on relational data, but it also produces human-interpretable logic rules through which it is possible to uncover and learn about the system itself. This could provide valuable insights into how the infection logs are generated, and which features of the logs are important for identification, as opposed to guessing which features might be important.
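To make the idea of a human-interpretable rule over relational data more concrete, the sketch below evaluates one hand-written rule against a few invented log facts. It only illustrates the kind of rule an ILP system might output. The rule is not learned here, and the facts, relation names and rule are all hypothetical, not taken from the talk or from any real system.

```python
# Hypothetical relational log facts: (relation, subject, object, timestamp).
LOG = [
    ("connects", "host_a", "host_b", 10),
    ("writes", "host_b", "file_1.enc", 12),
    ("writes", "host_b", "file_2.enc", 13),
    ("connects", "host_c", "host_d", 20),
    ("writes", "host_d", "report.txt", 25),
]


def suspicious(host, log, window=5):
    """Hand-written rule, readable as:

    suspicious(H) :- connects(_, H, T1), writes(H, F, T2),
                     F ends with ".enc", 0 <= T2 - T1 <= window.
    """
    inbound = [t for rel, _src, dst, t in log
               if rel == "connects" and dst == host]
    enc_writes = [t for rel, src, obj, t in log
                  if rel == "writes" and src == host and obj.endswith(".enc")]
    return any(0 <= t2 - t1 <= window for t1 in inbound for t2 in enc_writes)


# Collect every host that appears as a subject or as the target of a connection.
hosts = sorted({s for _r, s, _o, _t in LOG}
               | {o for r, _s, o, _t in LOG if r == "connects"})
flagged = [h for h in hosts if suspicious(h, LOG)]
```

Because the rule is stated over relations between events rather than over a fixed feature vector, an expert can read it, question it, and edit it directly.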

Dr Ray went on to detail the results of his work with Dr Steve Moyle (Amplify Intelligence UK and Cyber Security Centre, University of Oxford), on a novel proof-of-concept for an ILP based “eXplanatory Interactive Relational Machine Learning” (XIRML) system called “Acuity”. This human-in-the-loop system allows ILP and cyber security experts to direct the cyber threat elucidation process, through interactive functionality for guided data-caching on large network logs, and hypothesis-shaping for rebutting or altering learned logic rules.

In his concluding remarks, Dr Ray shared his thoughts on the future of this technology. As he sees it, the goal is to develop safe, auditable systems that could be used in practice by domain experts alone, without the need for an ILP expert in the loop. To this end, he suggests that system usability and human-interpretable outputs are both crucial factors for the design of future systems.

BIAS 22 – Review Day 2 talk – Dr Nirav Ajmeri: “Ethics in Sociotechnical Systems”

This blog post is written/edited by CDT Students Jonathan Erskine and Jack Hanslope

Following from a great keynote by Liz Sonenberg, Dr Nirav Ajmeri presented a discussion on Ethics in Socio-Technical Systems (STS).

As is common practice in discussions on AI, we began by looking inwards to what kind of human behaviour we are trying to replicate – what aspect of intelligence have we defined as our objective? In this case it was the ability of machines to make ethical decisions. Dr. Ajmeri referred to Kantian and Aristotelian ethical frameworks which describe moral duty and virtuous behaviour to establish an ethical baseline, which led to the first main takeaway of the discussion:

We must be capable of expressing how humanity defines, quantifies, and measures ethics before discussing how we might synthesise ethical behaviour.

Dr. Ajmeri clarified that ethical systems must be robust to situations where there are “no good choices”. That is, when even a human might struggle to see the most ethical path forwards.  Keen to move away from the trolley problem, Nirav described a group of friends who can’t agree on a restaurant for their evening meal, expounding on the concepts of individual utility, rationality, and fairness to explain why science might fail to resolve the problem.

The mathematical solution might be a restaurant that none of them enjoy, and this could be the same restaurant for every future meal which they attend. From this example, the motivation behind well-defined ethics in socio-technical systems becomes clear: computers lack the ability to apply emotion when reasoning about the impact of their decisions, leading to the second lesson which we took from this talk:

Ethical integration of AI into society necessitates the design of socio-technical systems which can artificially navigate “ethical gridlock”.
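The restaurant example can be sketched in a few lines. Assuming each friend assigns a numeric utility to each restaurant (the numbers below are invented), picking the option with the highest total utility selects a restaurant that is nobody's favourite, and it will keep doing so as long as the utilities stay the same.

```python
# Invented utilities: each friend loves one restaurant and finds "diner" merely tolerable.
UTILITY = {
    "friend_a": {"pizza": 10, "sushi": 1, "curry": 2, "diner": 5},
    "friend_b": {"pizza": 1, "sushi": 10, "curry": 2, "diner": 5},
    "friend_c": {"pizza": 2, "sushi": 1, "curry": 10, "diner": 5},
}


def utilitarian_choice(utility):
    """Pick the restaurant that maximises the sum of everyone's utilities."""
    restaurants = list(next(iter(utility.values())))
    return max(restaurants, key=lambda r: sum(u[r] for u in utility.values()))


def favourites(utility):
    """Each person's single most preferred restaurant."""
    return {person: max(u, key=u.get) for person, u in utility.items()}


choice = utilitarian_choice(UTILITY)
nobody_gets_favourite = choice not in favourites(UTILITY).values()
```

A fairer scheme might rotate between favourites over repeated meals, but that requires reasoning about history and fairness rather than a single static optimisation.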

Dr. Ajmeri then described the potential of multiagent systems research for designing ethical systems by incorporating agents’ value preferences (ethical requirements) and associated negotiation techniques. This led to a good debate on the merits and flaws of attempting to incorporate emotion into socio-technical systems, with questions such as:

Can the concept of emotion be heuristically defined to enable pseudo-emotional decision making in circumstances when there is no clear virtuous outcome?

Is any attempt to incorporate synthetic emotion inherently deceitful?

These questions were interesting precisely because they couldn’t be answered, but the methods described by Nirav did, in the authors’ opinion, describe a system which could achieve what was required of it – to handle ethically challenging situations in a fair manner.

What must come next is the validation of these systems. Nirav suggested that the automated handling of information with respect to the (now not-so-recent) GDPR regulations would provide a good test bed, and prompted the audience to consider what this implementation might involve.

The end of this talk marked the halfway point of the BIAS summer school, with plenty of great talks and discussions still to come. We would like to thank Dr. Nirav Ajmeri for this discussion, which sits comfortably in the wheelhouse of problems which the Interactive AI CDT has set out to solve.