Scene text recognition is a challenging task for the research community, especially for scripts with diacritical marks such as Vietnamese. In this paper, two different convolutional network architectures for recognising Vietnamese text in natural scenes are presented. Experiments are conducted to compare the performance of the two networks in reading Vietnamese restaurant signs. The experimental results show that the deeper network outperforms the other in both recognition accuracy and computational time.

