I took the GPT-4-captioned dataset found here: https://civarchive.com/models/281464?modelVersionId=322747, made by steffangund, and duplicated the entire zip file. I then re-tagged every photo in the duplicate set using WD14 tagging conventions and trained the LoRA on both sets combined, so the same images were paired with short one- and two-word descriptors as well as more verbose phrases. (I also added a number of screen grabs from the movies Teen Spirit and Neon Demon.)
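For anyone who wants to replicate the dataset prep, here's a minimal sketch of the duplication step: each image ends up in the training folder twice, once paired with its verbose GPT-4 caption and once with a WD14-style tag string. The function name, folder layout, and the caption dictionaries are my own illustration, not part of the original workflow; it assumes the common convention of a `.txt` caption file sitting next to each image.

```python
from pathlib import Path
import shutil

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

def build_dual_caption_set(images_dir, gpt4_captions, wd14_tags, out_dir):
    """Duplicate every image so it appears twice in the training set:
    once with its verbose GPT-4 caption, once with short WD14 tags.

    gpt4_captions / wd14_tags map image filenames to caption strings
    (hypothetical inputs for this sketch)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for img in sorted(Path(images_dir).iterdir()):
        if img.suffix.lower() not in IMAGE_EXTS:
            continue
        for variant, caption in (("gpt4", gpt4_captions[img.name]),
                                 ("wd14", wd14_tags[img.name])):
            stem = f"{img.stem}_{variant}"
            # Copy the image and write the matching sidecar caption file.
            shutil.copy(img, out / f"{stem}{img.suffix}")
            (out / f"{stem}.txt").write_text(caption, encoding="utf-8")
```

Point your LoRA trainer at the output folder and it sees each image under both captioning styles.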
The result, hopefully, is a more responsive model that listens better to your txt-to-img prompt, however you phrase it (so long as it's in English).