Language models use word-level embeddings that are trained using text using a pre-train, fine-tune training and evaluation regime. In this presentation, we will see how the embeddings can be enriched with visual knowledge as they are pre-trained and fine-tuned on multiple linguistic tasks. Knowledge of python is needed; experience with torch and huggingface helps, but is not required.