Ai2 achieved this by having human annotators describe the images in the model’s training data set in excruciating detail, across multiple pages of text. Rather than typing these descriptions, the annotators were asked to speak them aloud. Ai2 then used AI techniques to convert their speech into data, which made the training process much quicker and reduced the computing power required.
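As a rough illustration, the transcription step could look something like the sketch below, which assumes an off-the-shelf open-source speech-recognition model (here, openai/whisper-small loaded through Hugging Face's transformers library); the article does not specify which tools Ai2 actually used.

```python
# Minimal sketch of a speech-to-text annotation pipeline. The model choice
# (openai/whisper-small) and file layout are assumptions for illustration only.
from transformers import pipeline

# Load an automatic-speech-recognition pipeline from Hugging Face.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def transcribe_annotation(audio_path: str) -> str:
    """Convert one spoken image description into text for the training set."""
    result = asr(audio_path)
    return result["text"]

# Each (image, transcript) pair becomes one densely described training example.
caption = transcribe_annotation("annotations/image_0001_description.wav")
print(caption)
```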
These techniques could prove really useful if we want to meaningfully govern the data that we use for AI development, says Yacine Jernite, who is the machine learning and society lead at Hugging Face, and was not involved in the research.
“It makes sense that in general, training on higher-quality data can lower the compute costs,” says Percy Liang, the director of the Stanford Center for Research on Foundation Models, who also did not participate in the research.
Another impressive capability is that the model can “point” at things, meaning it can analyze elements of an image by identifying the pixels that answer queries.
In a demo shared with MIT Technology Review, Ai2 researchers took a photo outside their office of the local Seattle marina and asked the model to identify various elements of the image, such as deck chairs. The model successfully described what the image contained, counted the deck chairs, and accurately pointed to other things in the image as the researchers asked. It was not perfect, however. It could not locate a specific parking lot, for example.
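To give a sense of what "pointing" output can look like in practice, the hypothetical sketch below parses point coordinates out of a model's text answer and maps them to pixels. The XML-style tag format and percentage-based coordinates are assumptions for illustration, not details confirmed by Ai2.

```python
# Hypothetical sketch: consume "pointing" output from a vision-language model.
# Assumes the model answers a query such as "Point to the deck chairs." with tags
# like <point x="61.5" y="40.6" alt="deck chair">, where x and y are percentages
# of the image width and height. This format is an assumption, not Ai2's spec.
import re

def extract_points(model_output: str, image_width: int, image_height: int):
    """Convert point tags in the model's text output into pixel coordinates."""
    pattern = r'<point x="([\d.]+)" y="([\d.]+)"[^>]*>'
    points = []
    for x_pct, y_pct in re.findall(pattern, model_output):
        px = int(float(x_pct) / 100 * image_width)
        py = int(float(y_pct) / 100 * image_height)
        points.append((px, py))
    return points

# Example: two deck chairs located in a 1920x1080 photo of the marina.
answer = ('<point x="61.5" y="40.6" alt="deck chair"></point>'
          '<point x="72.1" y="43.0" alt="deck chair"></point>')
print(extract_points(answer, 1920, 1080))  # -> [(1180, 438), (1384, 464)]
```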