ImageCaption Implementation

by Diptin Dahal


Posted on December 9, 2019 at 3:50 PM



ImageCaption for the Ricetta App

The ImageCaption model was built with high-level TensorFlow APIs such as tf.keras and eager execution, which were used to construct and train the deep learning models. First, the MSCOCO dataset of general images was used to train the model: a subset of the images was preprocessed and cached using the InceptionV3 model, the encoder-decoder model was trained on that subset, and the trained model was then used to generate captions for new images.
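For context, here is a minimal sketch (assuming TensorFlow 2.x and its standard tf.keras.applications API; not the original project code) of how InceptionV3 can be set up as the feature extractor described above:

    import tensorflow as tf

    # Load InceptionV3 pre-trained on ImageNet, dropping the final
    # classification layer so the model outputs the last convolutional
    # feature map rather than class probabilities.
    image_model = tf.keras.applications.InceptionV3(
        include_top=False, weights='imagenet')
    feature_extractor = tf.keras.Model(
        inputs=image_model.input,
        outputs=image_model.layers[-1].output)

Each (299, 299) image passed through this extractor yields an 8x8x2048 feature map, which is what the encoder-decoder consumes in place of raw pixels.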

The construction of the ImageCaption model began in Google Colab, where the Runtime Type was switched to GPU, which increased the total storage space available to 358 GB. However, all downloaded data and trained weights are removed every time the session times out in the Colab interface. Training for 20 epochs on 30,000 captions, drawn from a set of more than 82,000 images, took around 4 hours. (I also suffered several time-outs during training and lost my trained model multiple times, so the total time spent was roughly 3x that.)
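As an aside, a one-line check (illustrative only, using TensorFlow's tf.test helper) confirms whether the Colab runtime actually exposes a GPU before committing to a long training run:

    import tensorflow as tf

    # Prints something like '/device:GPU:0' on a GPU runtime,
    # or an empty string if no GPU is attached.
    print(tf.test.gpu_device_name())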

As for the details of the process: the MSCOCO images were resized to (299, 299), the input size InceptionV3 expects. The extracted features were then pickled into a dictionary keyed by image. The captions were tokenized, and only the 5,000 most frequent words were kept for final processing; every other word was mapped to an unknown token to save memory and processing time. A word --> index mapping was then created. The data was further split into training and testing sets, and the model was trained and validated. Finally, the food images obtained from Find Protein were passed through the model and captions were generated.
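To make those steps concrete, here is a hedged sketch of the resizing and caption-tokenization stages (names such as load_image and train_captions are illustrative assumptions, not taken from the project's code):

    import tensorflow as tf

    def load_image(image_path):
        # Resize to (299, 299) and scale pixel values the way
        # InceptionV3 expects.
        img = tf.io.read_file(image_path)
        img = tf.image.decode_jpeg(img, channels=3)
        img = tf.image.resize(img, (299, 299))
        img = tf.keras.applications.inception_v3.preprocess_input(img)
        return img, image_path

    # Keep only the 5,000 most frequent words; every other word is
    # mapped to the <unk> (unknown) token.
    tokenizer = tf.keras.preprocessing.text.Tokenizer(
        num_words=5000, oov_token='<unk>')
    tokenizer.fit_on_texts(train_captions)  # train_captions: a list of caption strings
    train_seqs = tokenizer.texts_to_sequences(train_captions)

    # The word --> index mapping mentioned above.
    word_index = tokenizer.word_index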

