by Diptin Dahal
Posted on December 9, 2019 at 3:50 PM
ImageCaption for the Ricetta App
The ImageCaption model was built with high-level TensorFlow APIs such as tf.keras and eager execution, which were used to build and train the deep learning models. First, the MSCOCO dataset of general images was used to train the model. InceptionV3 was used to preprocess and cache a subset of the images, and the encoder-decoder model was then trained on that subset. Finally, the trained model was used to generate captions for new images.
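As a rough sketch of the feature-extraction step (the exact model wiring in the project may differ), InceptionV3 can be loaded without its classification head so the output of its last convolutional layer serves as the cached image features fed to the encoder-decoder:

```python
import tensorflow as tf

# Load InceptionV3 pretrained on ImageNet, without the classification
# head, so the last layer's output can be cached as image features.
image_model = tf.keras.applications.InceptionV3(
    include_top=False, weights='imagenet')

# Wrap input and output in a Model so each image can be pushed through
# once and its features cached for training.
image_features_extract_model = tf.keras.Model(
    image_model.input, image_model.layers[-1].output)
```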
Construction of the ImageCaption model began in Google Colab, where the runtime type was switched to GPU, increasing the total storage space available to 358 GB. However, all downloaded data and trained models are removed every time a Colab session times out. Overall, training for 20 epochs on 30,000 captions, drawn from more than 82,000 images, took around 4 hours. (I also suffered several time-outs during training and lost my trained model multiple times, so the total time spent was roughly 3x that.)
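One way to blunt the timeout problem (not something the original run did; just a sketch under the assumption that checkpoints are written to Google Drive) is to checkpoint the encoder, decoder, and optimizer after each epoch with tf.train.Checkpoint, so a fresh session can resume from the last saved state:

```python
import tensorflow as tf
from google.colab import drive

# Mount Google Drive so saved checkpoints survive a Colab session timeout.
drive.mount('/content/drive')

# Placeholder models and optimizer; the project's actual encoder and
# decoder objects would be used here instead.
encoder = tf.keras.Sequential([tf.keras.layers.Dense(256)])
decoder = tf.keras.Sequential([tf.keras.layers.Dense(5001)])
optimizer = tf.keras.optimizers.Adam()

ckpt = tf.train.Checkpoint(encoder=encoder, decoder=decoder,
                           optimizer=optimizer)
manager = tf.train.CheckpointManager(
    ckpt, '/content/drive/My Drive/image_caption_ckpts', max_to_keep=3)

# Restore the latest checkpoint (if any) before training, and call
# manager.save() at the end of every epoch.
if manager.latest_checkpoint:
    ckpt.restore(manager.latest_checkpoint)
```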
As for the details of the process, the MSCOCO images were resized to 299×299, the input size InceptionV3 expects, and the processed image features were pickled into a dictionary with one entry per image.
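A minimal sketch of that resizing step (assuming JPEG inputs; the actual pipeline may differ in details) looks like this:

```python
import tensorflow as tf

def load_image(image_path):
    # Read and decode the image, resize it to the 299x299 input that
    # InceptionV3 expects, and apply Inception's own pixel scaling.
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path
```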
The captions were tokenized, and only around 5,000 words were kept for final processing; every other word was marked as unknown to save memory and processing time. A word → index mapping was then created. The data was further divided into training and testing sets, and the model was trained and verified. Finally, the food images obtained from Find Protein were passed through the model and captions were generated.
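The tokenization step can be sketched with tf.keras's Tokenizer, capping the vocabulary at 5,000 words and routing everything else to an unknown token (the sample captions below are made up for illustration; the real ones come from the MSCOCO annotations):

```python
import tensorflow as tf

# Made-up sample captions for illustration only.
captions = ['<start> a plate of pasta with tomato sauce <end>',
            '<start> a bowl of fresh green salad <end>']

# Keep only the 5,000 most frequent words; all others become '<unk>'.
# The filters drop punctuation but keep '<' and '>' so that the
# <start>/<end> markers survive tokenization.
top_k = 5000
tokenizer = tf.keras.preprocessing.text.Tokenizer(
    num_words=top_k, oov_token='<unk>',
    filters='!"#$%&()*+.,-/:;=?@[\\]^_`{|}~ ')
tokenizer.fit_on_texts(captions)

# tokenizer.word_index now holds the word -> index mapping; the padded
# integer sequences below are what the decoder trains on.
sequences = tokenizer.texts_to_sequences(captions)
padded = tf.keras.preprocessing.sequence.pad_sequences(
    sequences, padding='post')
```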