As someone who wants to interpret and explain decisions from deep learning models, I like to highlight difficult datasets as a subject for study. My current challenge is Iwana et al.'s “Judging a Book By its Cover”, which introduced a 200k+ image, multi-feature dataset of book covers from Amazon and posed a genre classification challenge. The original tasks are very challenging for a convolutional neural network. My own attempts with a ResNet-50 fine-tuned from ImageNet only slightly beat the published top result, reaching 0.306 top-1 accuracy. The training procedure used SGD with warm restarts, discriminative fine-tuning, a cyclical learning rate decay schedule, progressive image resizing, and data augmentation at test time.
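For concreteness, a minimal PyTorch sketch of that kind of recipe might look like the following. It is an illustration under assumptions, not my exact configuration: the learning rates, layer groupings, restart period, and training loop are placeholders, and progressive resizing and test-time augmentation are left out entirely.

```python
import torch
from torch import nn
from torchvision import models

# Sketch only: ResNet-50 pretrained on ImageNet, discriminative fine-tuning
# via per-group learning rates, and SGD with cosine-annealed warm restarts.
# All hyperparameters here are illustrative, not the values I actually used.

NUM_GENRES = 30  # the "Judging a Book By its Cover" Task 1 split has 30 genres

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_GENRES)

# Discriminative fine-tuning: earlier blocks get smaller learning rates than
# the freshly initialised head. The stem (conv1/bn1) is left out, i.e. frozen.
param_groups = [
    {"params": list(model.layer1.parameters()) + list(model.layer2.parameters()), "lr": 1e-4},
    {"params": list(model.layer3.parameters()) + list(model.layer4.parameters()), "lr": 1e-3},
    {"params": model.fc.parameters(), "lr": 1e-2},
]
optimizer = torch.optim.SGD(param_groups, momentum=0.9, weight_decay=1e-4)

# SGD with warm restarts: cosine annealing that resets every T_0 epochs,
# with the period doubling after each restart.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2
)
criterion = nn.CrossEntropyLoss()

def train_one_epoch(loader, epoch):
    model.train()
    for step, (images, labels) in enumerate(loader):
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        # Fractional epoch index lets the schedule anneal within the epoch.
        scheduler.step(epoch + step / len(loader))
```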
This is a t-SNE visualisation of my model's test-set performance on a sample of the dataset:
The spikes represent roughly 30% of the images; the remaining 70% collapse into the central cluster, where labels are not well separated. t-SNE was run with perplexity=15.
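The projection itself is straightforward to reproduce. The sketch below assumes the penultimate-layer features of the fine-tuned model have already been extracted for a test-set sample; the file names are hypothetical and the plotting is simplified.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Sketch of the t-SNE projection. `features` is assumed to be an
# (n_samples, 2048) array of penultimate-layer ResNet-50 activations on a
# test-set sample, and `labels` the corresponding genre ids.
features = np.load("test_sample_features.npy")  # hypothetical file
labels = np.load("test_sample_labels.npy")      # hypothetical file

embedding = TSNE(
    n_components=2,
    perplexity=15,   # same perplexity as the visualisation above
    init="pca",
    random_state=0,
).fit_transform(features)

plt.figure(figsize=(8, 8))
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=4, cmap="tab20")
plt.title("t-SNE of ResNet-50 penultimate features (perplexity=15)")
plt.savefig("book_cover_tsne.png", dpi=200)
```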
The visualisation suggests that the convolutional inductive bias only works for the small number of naturalistic covers in the dataset (the spikes in the image), but fails to find any genre-specific similarity across the many varieties of plain human-designed text, fonts and graphic design. The pattern might also be partly explained by different forms of imbalance in the dataset. This visualisation and theory are worth more exploration.

Testing multimodal models with a Text+Vision inductive bias on this dataset might shed some light on this: for example, evaluating and visualising contrastive language-image pretraining (CLIP) in inference mode. The CLIP paper claims that it can do OCR; can it classify the recognised text as well? Here we would evaluate CLIP's zero-shot performance on classifying the text in the book cover image by genre, scoring each image against the question "Is this a picture of a <genre>?" for every genre.
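A zero-shot evaluation along those lines could look like the sketch below, using the Hugging Face transformers CLIP implementation. The genre list and checkpoint name are assumptions for illustration; the prompt follows the question above, with CLIP scoring each cover against one prompt per genre rather than literally answering a question.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Zero-shot genre classification sketch with CLIP. The genre list here is
# illustrative, not the full 30-class label set from the dataset.
GENRES = ["romance", "science fiction", "cookbook", "travel", "history"]
PROMPTS = [f"Is this a picture of a {g} book?" for g in GENRES]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def classify_cover(path: str) -> str:
    """Return the genre whose prompt CLIP scores highest for this cover."""
    image = Image.open(path).convert("RGB")
    inputs = processor(text=PROMPTS, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image holds the image's similarity to each text prompt.
    probs = outputs.logits_per_image.softmax(dim=-1)
    return GENRES[probs.argmax().item()]

print(classify_cover("example_cover.jpg"))  # hypothetical image path
```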
There's an interpretability project here: visualise the multimodal embeddings activated by this task and use them to explain why it works better, if it does. This would be one entry point for work on visualising and explaining large language models, since visualisation often feels more tractable with multimodal tasks.
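One possible starting point for that project: project CLIP's image and text embeddings into the same 2D map and check whether the genre prompts land near the covers they are supposed to describe. The sketch below assumes the embeddings have already been computed (e.g. via CLIPModel's get_image_features and get_text_features) and saved to the hypothetical files named here.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Sketch: joint t-SNE of CLIP image embeddings and genre-prompt text
# embeddings. `image_embeds` (n_covers, 512) and `text_embeds`
# (n_genres, 512) are assumed to be precomputed and L2-normalised.
image_embeds = np.load("clip_image_embeds.npy")  # hypothetical file
text_embeds = np.load("clip_text_embeds.npy")    # hypothetical file

joint = np.concatenate([image_embeds, text_embeds], axis=0)
coords = TSNE(n_components=2, perplexity=15, init="pca",
              random_state=0).fit_transform(joint)

n_img = len(image_embeds)
plt.scatter(coords[:n_img, 0], coords[:n_img, 1], s=4, label="covers")
plt.scatter(coords[n_img:, 0], coords[n_img:, 1],
            marker="*", s=120, label="genre prompts")
plt.legend()
plt.savefig("clip_joint_embedding_tsne.png", dpi=200)
```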