/continuation/
I would say that kanji homophones are one more thing the program can get wrong. For instance, the other day I was watching a video in which 激情 was misinterpreted as 劇場 (both are read げきじょう). However, one thing to consider is that, despite Japanese's notoriety for having an exorbitant number of homophones listed in the dictionary, not all of them are equally common. In fact, most of those kanji compounds dwell predominantly in books, especially ones centered on serious topics. In an article on, say, history, the vast majority of content words are likely to be two-kanji compounds. That will not be the case for colloquial speech, though, nor even for romance novels, for that matter. Those are likely to have more native Japanese words and Western loanwords, with the kanji compounds limited to the most common ones. Going back to my previous example, according to jisho.org, 劇場 is a common word, while 激情 isn't. I guess whether a given word is deemed common might be one thing the developers take into consideration. Another might be the immediate context.
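To make that guess a bit more concrete, here is a toy sketch in Python of how a "common word" flag and the nearby context could be combined to choose between homophones. I have no idea how the actual program works, so everything here is an assumption for illustration: the `CANDIDATES` table, the cue words, the `pick_homophone` function, and the weights are all made up.

```python
# Hypothetical toy data: the candidates for one reading, each with a
# "common" flag (in the spirit of the label jisho.org shows) and a few
# context cue words. The cue words are invented for this example.
CANDIDATES = {
    "げきじょう": [
        {"word": "劇場", "common": True, "cues": {"映画", "舞台", "観客"}},
        {"word": "激情", "common": False, "cues": {"怒り", "感情", "駆られる"}},
    ],
}


def pick_homophone(reading, context_words, common_bonus=1.0, cue_weight=2.0):
    """Return the highest-scoring candidate for a reading, or None if unknown."""
    candidates = CANDIDATES.get(reading)
    if not candidates:
        return None
    context = set(context_words)

    def score(candidate):
        # Common words get a head start (assumed weight).
        prior = common_bonus if candidate["common"] else 0.0
        # Each cue word found in the nearby context adds to the score.
        overlap = len(candidate["cues"] & context)
        return prior + cue_weight * overlap

    return max(candidates, key=score)["word"]


if __name__ == "__main__":
    print(pick_homophone("げきじょう", []))        # no context available
    print(pick_homophone("げきじょう", ["怒り"]))  # context mentions anger
```

With no usable context, the common word wins by default, which would match the kind of mistake I saw in that video. A single matching cue word is enough to tip the choice toward the rarer word, but only because of the weights I picked arbitrarily.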
Of course, the program's performance depends not least on the quality of the source audio and the speed of narration.