BMVA Symposium 2024

This blog post is written by AI CDT student, Phillip Sloan

I had the opportunity to go to the British Machine Vision Association 2024 Symposium, which took place at the British Computer Society in London on the 17th of January, 2024. The symposium was chaired by Dr. Michael Wray from the University of Bristol, Dr. Davide Moltisanti from the University of Bath, and Dr. Tengda Han from the University of Oxford.

The day kicked off with talks from the invited speakers, the first being Professor Hilde Kuehne from the University of Bonn and MIT-IBM Watson AI Lab. Her presentation covered vision-language understanding for video. She started with an introduction to the field, how it began and how it has adapted over time, before moving on to the current work that she and her students have been doing, including the paper “MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge” by Wei Lin.

Her final remarks outlined potential issues for evaluation within the field. When the field was more focused on classification, simple labels could easily be judged right or wrong. Now that the field has moved to vision-language retrieval, however, the ground truth might not actually be the best or most relevant caption contained within the dataset, which is a hurdle that must be overcome.
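To make the retrieval evaluation issue concrete, here is a minimal sketch of a Recall@K calculation. It was not shown in the talk; the similarity scores, the `recall_at_k` helper and the annotated indices are all made up for illustration.

```python
def recall_at_k(scores, ground_truth, k):
    """Fraction of queries whose annotated caption appears in the top-k ranking."""
    hits = 0
    for query_scores, gt_index in zip(scores, ground_truth):
        # Rank caption indices by descending similarity to the query video.
        ranked = sorted(range(len(query_scores)),
                        key=lambda i: query_scores[i], reverse=True)
        if gt_index in ranked[:k]:
            hits += 1
    return hits / len(scores)

# Hypothetical similarity scores: rows are videos, columns are captions.
scores = [
    [0.9, 0.1, 0.2],  # video 0: the annotated caption (index 0) ranks first
    [0.2, 0.7, 0.8],  # video 1: caption 2 outranks the annotated caption 1
]
ground_truth = [0, 1]  # the single caption annotated as correct for each video

r1 = recall_at_k(scores, ground_truth, k=1)
r2 = recall_at_k(scores, ground_truth, k=2)
print(r1, r2)
```

If caption 2 happens to describe video 1 just as well as the annotated caption, Recall@1 still records a miss, because only one caption per video counts as correct.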

The second invited speaker, Professor Frank Keller from the University of Edinburgh, gave a very interesting talk on visual story generation, a domain where a coherent narrative is constructed to describe a sequence of images, often related to the characters within the images. He broke his talk down into three sections, first introducing the field more concretely before going on to explain two different areas: characters in visual stories and planning in visual stories.

He emphasised that the characters within a story are important, and so character detection and grounding are important in order to generate a fluent story. To help improve this aspect, Prof. Keller and his students introduced a dataset called VIST-Character that contains character groundings and visual and textual character co-reference chains. To help with planning the stories, Prof. Keller explained that their current methods utilise a blueprint, which focuses on localising characters in the text and images before relating them together. These blueprints are used as a guide to generate the story.

He explained that the domain is more difficult than image captioning because stories involve characters and require a fluent sequence of text. This makes current NLP metrics such as BLEU poor measures for the task, which is concerned with generating interesting, coherent and grounded stories rather than exact matches to the ground truth. His research used human evaluators, which is an interesting way to add humans to the loop.
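As a rough illustration of why overlap-based metrics struggle here, the sketch below computes a clipped n-gram precision, which is one ingredient of BLEU rather than the full metric (there is no brevity penalty and no averaging over n-gram orders). The example sentences and the `ngram_precision` helper are my own, not from the talk.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate sentence against one reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return overlap / total

reference = "the girl walked to the beach with her dog"
verbatim = "the girl walked to the beach with her dog"
paraphrase = "a young woman strolled along the shore beside her puppy"

print(ngram_precision(verbatim, reference, 1), ngram_precision(verbatim, reference, 2))
print(ngram_precision(paraphrase, reference, 1), ngram_precision(paraphrase, reference, 2))
```

The paraphrase tells essentially the same story but shares only "the" and "her" with the reference, so it scores poorly on unigrams and zero on bigrams.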

Following Prof. Keller’s talk we had a break for poster sessions, before coming back for talks from a select few people who brought posters to the symposium, including talks on the explainability of autonomous driving and on evaluating the reliability of LLMs in the face of adversarial attacks.

After lunch we had talks from the remaining two invited speakers. Professor Andrew Zisserman from the University of Oxford presented research on training visual language models to generate audio descriptions, helping people who are blind or partially sighted to enjoy movies.

The talk started with a brief introduction to and explanation of the field, and then outlined the currently available datasets, explaining that they were not sufficient. A new dataset was therefore created from AudioVault by processing the audio to provide audio descriptions and subtitles.

The talk walked us through an overview of a basic model architecture. Its limitations were pointed out, including the fact that character names were often not used (the model often fell back on “he” or “it”) and descriptions were often incomplete. Prof. Zisserman explained that, to combat these limitations, they took two research directions: improving “the who”, by providing supplementary information about the characters within the film, and improving “the what”, by improving the model’s ability to provide better context using pretrained video-language models.

Finally, he discussed how evaluation measures such as CIDEr are not fit for the purpose of audio description generation, explaining that large language models are starting to be used in the domain as an evaluation tool.

The second talk of the afternoon was related to vision-language with limited and no supervision and was presented by Dr. Yuki Asano from the University of Amsterdam, who asked the question: “Why care about Self-supervised Learning ideas in the age of CLIP et al?”

He presented three works undertaken by him and his team. The first was “Similarities of Unimodal representations across language and vision”, which demonstrated a model that uncoupled image-language pairs and trained them in an unsupervised fashion to reach 75% of the performance of the CLIP model. The second topic was “Localisation in visual language models”, a task that vision-language models are not traditionally good at. His team’s solution was to unlock localisation abilities in frozen VLMs by adding a lightweight module called the positional insert module (PIN).

The final part of the talk was on pretraining an image encoder with a single, richly detailed video. Their model, called Dora (discover and track), has the high-level idea of tracking multiple objects across time and enforcing invariance of features across time. They evaluated their model against DINO, finding that it performed better on various datasets.

After a coffee break, we had some shorter talks from people presenting posters at the event, including a radiology report generation presentation which was particularly relevant to me. CXR-IRGen was proposed, a diffusion model used to generate extra image-report pairs, which could potentially help alleviate the lack of data within the field. Kevin Flanagan, a fellow CDT member, also presented his research into learning temporal sentence grounding from narrated egovideos, showcasing his method called CliMer, which merges clips from rough narration timestamps and trains in a contrastive manner.
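For readers unfamiliar with contrastive training, here is a minimal sketch of a generic InfoNCE-style loss over clip-sentence similarities. It is not the CliMer objective, which I have not reproduced here; the similarity values, the `info_nce` helper and the temperature are assumptions made purely for illustration.

```python
import math

def info_nce(similarities, temperature=0.1):
    """Mean InfoNCE loss for a square similarity matrix.

    Entry [i][j] scores clip i against sentence j; matching pairs are
    assumed to lie on the diagonal.
    """
    losses = []
    for i, row in enumerate(similarities):
        logits = [s / temperature for s in row]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching sentence.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Hypothetical similarities between two clips and two narration sentences.
aligned = [[0.9, 0.1], [0.2, 0.8]]     # matching pairs score highest
misaligned = [[0.1, 0.9], [0.8, 0.2]]  # mismatched pairs score highest

print(info_nce(aligned), info_nce(misaligned))
```

The loss is small when each clip is most similar to its own sentence and large otherwise, which is the signal that pulls matching pairs together and pushes mismatched pairs apart during training.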

Throughout the day we were encouraged to use Padlet to put our thoughts and questions down. After the talks had concluded there was a final informal Q&A session on the future of the vision-language domain, which used our Padlet responses as talking points. We discussed points including the need for better evaluation metrics (which was a big theme in a lot of the talks), the role of academia in the age of large language models, and utilising NLP to make vision models explainable.

A very interesting and thought-provoking day! There were several people working within medical image analysis, so it was great to network and discuss ideas. Thank you to the speakers and people who presented for their contributions, and to the chairs and organisers of the event for making it possible!

