Jesse
Do auto-generated Japanese subtitles usually get the kanji right?
I was thinking that if it just generates the subtitles from what is being said, the technology wouldn't really be advanced enough to understand the context and would therefore be quite inaccurate, right? I mean, if they have achieved that, it's really impressive... But so many words are read the same way with different kanji, so I'm just checking that YouTube's subtitles aren't too inaccurate 😅
Oct 29, 2019 9:41 PM
Comments · 2
/continuation/

I would say that kanji homophones are one extra thing the program can get wrong. For instance, the other day I was watching a video in which 激情 was misinterpreted as 抇栮 (both are げきじょう). However, one thing to consider is that despite Japanese's notoriety for having an exorbitant number of homophones listed in the dictionary, not all of them are equally common. In fact, most of those kanji compounds dwell predominantly in books, especially those centered around serious topics. In an article on, say, history, the vast majority of content words are likely to be two-kanji compounds. That will not be the case for colloquial speech though, nor even romance novels for that matter. Those are likely to have more native Japanese and Western words, with the number of kanji compounds limited to the most common ones. Going back to my previous example, according to jisho.org, 抇栮 is a common word, while 激情 isn't. I guess whether or not a given word is deemed to be common might be one thing that is taken into consideration by the developers. Another thing might be the nearest context.
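
To make that guess a bit more concrete, here is a minimal sketch in Python of picking a spelling by how common it is, with an optional bonus from context. This is only an illustration of the idea, not how YouTube actually does it: the `CANDIDATES` table, the frequency numbers and the `pick_spelling` function are all made up for the example.

```python
# Hypothetical table: reading -> candidate spellings with invented frequency counts.
# The numbers are placeholders, not real corpus statistics.
CANDIDATES = {
    "げきじょう": {"抇栮": 900, "激情": 40},
}


def pick_spelling(reading, context_boost=None):
    """Return the candidate with the highest score (frequency + context bonus)."""
    options = CANDIDATES.get(reading)
    if not options:
        return reading  # unknown reading: leave it in kana
    boost = context_boost or {}
    return max(options, key=lambda word: options[word] + boost.get(word, 0))


# With no context, the more common word wins.
print(pick_spelling("げきじょう"))
# A strong enough (hypothetical) context bonus can flip the choice.
print(pick_spelling("げきじょう", {"激情": 1000}))
```

With frequency alone, the rarer word can never win, which would explain the kind of mistake I saw in that video.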

Of course, the program's performance depends not least on the quality of the source audio and the speed of narration.
October 30, 2019
Hello Jesse!

If you're wondering whether or not you can use auto-generated subtitles to improve your listening comprehension, I can tell you that those are of great help. Of course, they aren't 100% accurate but then again, English ones aren't perfect either.

I tried watching a 2-minute excerpt from a random video. Here's the link: https://www.youtube.com/watch?v=yWVWyfzEdIA I spotted like 10 mistakes in the first 2 minutes, which isn't many. Most of the mistakes are misinterpretations of sounds though, not wrong kanji chosen to disambiguate homophones. Here are some examples of what the program got wrong:

ă‚ąă‚Šăƒ­ăƒ©ă€€ă€misinterpreted as ćˆïŒˆă‚ïŒ‰ă†ăƒ‰ăƒ©ă€‘
ć‘ŒïŒˆă‚ˆïŒ‰ăłăŸă™ă€€ă€èȘ­ïŒˆă‚ˆïŒ‰ăżăŸă™ă€‘
çŸ­ïŒˆăżă˜ă‹ïŒ‰ăă—ăŠă€€ă€æș€ïŒˆăżïŒ‰ăĄéš ïŒˆă‹ăïŒ‰ă—お】
長ăȘăŒïŒ‰ă•ă€€ă€é«˜ïŒˆăŸă‹ïŒ‰ă•ă€‘
éŸłïŒˆăŠăšïŒ‰ă€€ă€ć€§æ°žé ïŒˆăŠăŠăˆă„ăˆă‚“ïŒ‰ă€‘-way off target!
ć’ŒéŸłïŒˆă‚ăŠă‚“ïŒ‰ă€€ă€ăƒŻă‚Žăƒłă€‘
There's also a similar number of mistakes I haven't mentioned here.

So generally it's doing pretty well. A couple of years back, auto-generated Japanese subtitles seemed to be blurting out a random jumble of kanji and kana characters, which made no semblance of sense and was totally useless. But now most of the mistakes are of a kind one can get with any other language. On that note, I remember once not being able to make out a word in an English video. I resorted to auto-generated subtitles, which showed something like "a 'less affair' approach" or "a 'less a fare' approach" (I don't remember which one). After a good deal of guesswork and trial and error, I was able to figure it out. It was "a laissez-faire approach." Most of the above examples would require the same kind of guesswork. Still, auto-generated subtitles are very helpful in the sense that they narrow down the list of possible options for you to consider.

/to be continued/
October 30, 2019