Core ML & NLP for Japanese

Hi everyone,

I have a question: I need to train a model with Create ML that performs tokenization, part-of-speech tagging, and lemmatization. Is it better to train a single model (I don’t even know if that’s possible)? Or is it better to train three separate models: one for tokenization, one for part-of-speech tagging, and one for lemmatization?

Thanks

@clapollo Can you please help with this when you get a chance? Thank you - much appreciated! :]

Hi @rufy,

If you are using Create ML, then I’m not sure it’s possible to create a single model that handles all three tasks. Create ML isn’t super flexible – it’s designed to simplify specific tasks rather than be an all-purpose framework. For example, you can create an MLWordTagger to handle part-of-speech tagging, but that’s all it can do.

Apple’s Natural Language framework supports tasks like tokenization, lemmatization, and part-of-speech tagging for many languages, including Japanese. (Here’s a link to its supported languages: https://developer.apple.com/documentation/naturallanguage/nllanguage) I don’t know if it has full feature support for every language on that list, so you’ll have to test, but that’s definitely the first place I would look for those tasks.
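To get a quick feel for what the framework gives you out of the box, here’s a small sketch using NLTagger on a Japanese sentence. It asks for the lexical class (part of speech) and the lemma of each word token; for languages without full support for a scheme, those tags simply come back nil, which makes this an easy way to test what’s actually available. (The sample sentence is just a placeholder.)

```swift
import NaturalLanguage

let text = "これはペンです"  // "This is a pen."

// Request the schemes we care about: part of speech and lemmas.
let tagger = NLTagger(tagSchemes: [.lexicalClass, .lemma])
tagger.string = text

var tokens: [String] = []
tagger.enumerateTags(in: text.startIndex..<text.endIndex,
                     unit: .word,
                     scheme: .lexicalClass,
                     options: [.omitWhitespace]) { tag, tokenRange in
    let token = String(text[tokenRange])
    // Look up the lemma for the same token range; nil if unsupported for this language.
    let (lemma, _) = tagger.tag(at: tokenRange.lowerBound, unit: .word, scheme: .lemma)
    print(token, tag?.rawValue ?? "-", lemma?.rawValue ?? "-")
    tokens.append(token)
    return true  // keep enumerating
}
```

Even if .lexicalClass or .lemma turns out not to be supported for Japanese, the word segmentation itself works, which is already useful since Japanese doesn’t delimit words with spaces.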

Now, as a general ML question about whether it makes sense to build a single model to handle all three tasks: I’d say… maybe. It’s been shown that training a model to perform more than one task can help its performance on all of them, so it’s certainly a viable option. But in many cases, you’ll want to do some of these things outside of a neural net.

For example, tokenization is usually a pre-processing step performed on the text before passing the extracted tokens to a neural net for more advanced tasks. And lemmatization is often just a step in the tokenization process: extract the tokens from the sentence, remove “stop words” (words like “a” and “the” that don’t add much information to a sentence), lemmatize where possible, and replace infrequent or unknown terms with a special Unknown token. This all serves to reduce the vocabulary set you need to work with, which is important because NLP is difficult when dealing with infrequently seen or completely unknown words.
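To make that pipeline concrete, here’s a minimal sketch in plain Swift, using a whitespace-tokenized English sentence for illustration (real Japanese text would need a proper word segmenter first, since Japanese has no spaces). The stop-word list, lemma table, and vocabulary are toy values invented for this example.

```swift
// Toy resources for the example – a real pipeline would load much larger ones.
let stopWords: Set<String> = ["a", "an", "the"]
let lemmas = ["running": "run", "cats": "cat", "better": "good"]
let knownVocabulary: Set<String> = ["run", "cat", "dog", "good"]

func preprocess(_ sentence: String) -> [String] {
    sentence
        .lowercased()
        .split(separator: " ")
        .map(String.init)
        .filter { !stopWords.contains($0) }                   // remove stop words
        .map { lemmas[$0] ?? $0 }                             // lemmatize where possible
        .map { knownVocabulary.contains($0) ? $0 : "<UNK>" }  // unknown -> Unknown token
}

print(preprocess("The cats running past a zebra"))
// → ["cat", "run", "<UNK>", "<UNK>"]
```

Each stage shrinks the vocabulary the downstream model has to cope with: stop words disappear, inflected forms collapse onto their lemmas, and anything rare or unseen is folded into a single `<UNK>` token.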

I hope that helps!

-Chris

:thinking: Thank you very much for your answer. I will keep your suggestions in mind.

This topic was automatically closed after 166 days. New replies are no longer allowed.