Apple researchers achieve breakthroughs in multimodal AI as company ramps up investments

Mar 15, 2024 | Technology

Apple researchers have developed new methods for training large language models on both text and images, enabling more powerful and flexible AI systems, in what could be a significant advance for artificial intelligence and for future Apple products.

The work, described in a research paper titled “MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training” that was quietly posted to arxiv.org this week, demonstrates how carefully combining different types of training data and model architectures can lead to state-of-the-art performance on a range of AI benchmarks.

“We demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art few-shot results across multiple benchmarks,” the researchers explain. By training on this diverse blend of visual and linguistic data, the MM1 models excel at tasks like image captioning, visual question answering, and natural language inference.
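
To make the idea concrete, here is a minimal sketch (not code from the paper) of the kind of weighted sampling such a data mix implies: each pre-training batch is drawn from one of three source pools according to fixed mixture weights. The pool names and the 45/45/10 weights below are illustrative placeholders, not figures reported by Apple.

```python
import random
from collections import Counter

# Illustrative sketch only -- not Apple's training code. It shows weighted
# sampling over the three data types the MM1 paper says must be carefully
# mixed. The mixture weights are placeholders, not values from the paper.
DATA_MIX = {
    "image_caption": 0.45,            # (image, caption) pairs
    "interleaved_image_text": 0.45,   # web documents with images inline
    "text_only": 0.10,                # plain text, to retain language ability
}

def sample_source(rng: random.Random) -> str:
    """Pick which pool the next pre-training batch is drawn from."""
    names = list(DATA_MIX)
    return rng.choices(names, weights=[DATA_MIX[n] for n in names], k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    counts = Counter(sample_source(rng) for _ in range(10_000))
    print(counts)  # observed proportions approach the mixture weights
```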

Scaling visual components is key

The researchers also found that the choice of image encoder and the resolution of input images had a major impact on model performance. “We show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance,” they said. This suggests that continued scaling and refinement of the visual components of these multimodal models will be key to unlocking further gains.
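
As a rough illustration of why resolution and token count are linked, the sketch below assumes a standard ViT-style encoder that slices an image into fixed-size square patches, each of which becomes one visual token fed to the language model. The patch size of 14 is an assumption borrowed from common ViT variants, not a detail given in the article.

```python
# Back-of-the-envelope sketch (assumed ViT-style patching, patch size 14):
# token count grows quadratically with input resolution, which is one reason
# scaling image resolution has such a large effect on compute and quality.
def image_token_count(resolution: int, patch_size: int = 14) -> int:
    """Number of visual tokens a square (resolution x resolution) image yields."""
    patches_per_side = resolution // patch_size
    return patches_per_side ** 2

for res in (224, 336, 448):
    print(f"{res}px -> {image_token_count(res)} image tokens")
# 224px -> 256, 336px -> 576, 448px -> 1024
```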
