I have a simple problem, but I don't know if machine learning can solve it. Put as simply as possible: this is a string in Arabic, "الرحمن", and this is the same string with glyphs (diacritics), "الرَّحْمَنِ".

What I want to do is teach the machine around 100k words (in sentences) with glyphs, so that when I give it the string "الرحمن", it automatically adds the marks to produce "الرَّحْمَنِ". The reason I want to use machine learning is that, in Arabic, the glyphs change according to the word's position in the sentence. For example, it can become "الرَّحْمَنُ", etc.

Now, I know that machine learning makes the machine predict, but can it predict the glyphs letter by letter? Bear in mind that the words I feed to the machine must come in sentences, so I will have two columns (in Excel, for example): one with the plain sentence and the other with the glyphs added. I don't know if this is doable, as I am only good at coding games and software. But if someone can tell me that it is, I would be very grateful, and it would be enough for me to go and look for the answer. Thank you.

What I have tried:

Programming, but it's difficult to program the grammar.
Updated 25-Jan-21 7:45am
Comments
Richard MacCutchan 13-Jan-21 5:15am
   
I would guess that you just need some tables of letters to do the translations. The letters need to be in each form, beginning, middle and end of word. You then need to analyse the word on a letter by letter basis rather than as a full word.

1 solution

ML typically doesn't do "substitutions". You would need to "classify" a word / letter based on context and then make the substitution.
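As a rough illustration of "classify a letter based on context, then substitute", here is a minimal Python sketch. It is not a real diacritizer: the function names (letter_labels, train, diacritize) are made up for this example, the "model" is just a lookup of the most frequent marks seen for each (previous letter, letter, next letter) context, and it assumes the marks are stored as Unicode combining characters after their base letter. Case endings that depend on the word's position in the sentence would need much wider context than this.

```python
import unicodedata
from collections import Counter, defaultdict

# Example word built from escapes so the code points are unambiguous
# (alef, lam, ra+shadda+fatha, hah+sukun, meem+fatha, noon+kasra).
WORD = "\u0627\u0644\u0631\u0651\u064E\u062D\u0652\u0645\u064E\u0646\u0650"
BARE = "\u0627\u0644\u0631\u062D\u0645\u0646"

def letter_labels(vocalized):
    """Split a vocalized string into (base letter, attached marks) pairs."""
    pairs = []
    for ch in vocalized:
        if unicodedata.combining(ch) and pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)   # attach mark to previous letter
        else:
            pairs.append((ch, ""))           # new base letter, no marks yet
    return pairs

def context(letters, i):
    """Hypothetical context feature: previous letter, letter, next letter."""
    prev = letters[i - 1] if i > 0 else "^"
    nxt = letters[i + 1] if i + 1 < len(letters) else "$"
    return (prev, letters[i], nxt)

def train(vocalized_words):
    """Count which marks follow each context and keep the most frequent."""
    counts = defaultdict(Counter)
    for word in vocalized_words:
        pairs = letter_labels(word)
        letters = [base for base, _ in pairs]
        for i, (_, marks) in enumerate(pairs):
            counts[context(letters, i)][marks] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def diacritize(model, bare):
    """Classify each letter by context, then substitute letter + marks."""
    letters = list(bare)
    return "".join(ch + model.get(context(letters, i), "")
                   for i, ch in enumerate(letters))

model = train([WORD])
print(diacritize(model, BARE) == WORD)
```

The point is only the shape of the problem: one label (a string of marks, possibly empty) per letter, chosen from context. A learned classifier would replace the lookup table.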

If you've worked with "spelling checkers", you would know that even Word has trouble with context.

You, in effect, want to feed in words that are "misspelled" and have ML "correct them".

You therefore have to identify every "unique" phrase the word can appear in, with enough other words (and "distance" restrictions) so that ML can figure out the proper context (i.e. grammar).

You could create training and testing sets by extracting excerpts containing your "100k" words from a large literary work, using some pattern like n words before, the word, and m words after, while accounting for "breaks" (i.e. punctuation).
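The "n words before, the word, m words after" pattern could be sketched in Python roughly like this. The function name context_windows, the default window sizes, and the particular set of punctuation marks treated as breaks are assumptions for illustration only; the idea is simply that a window never crosses a punctuation break.

```python
import re

# Assumed set of "breaks": common Latin punctuation plus the Arabic comma,
# semicolon and question mark (U+060C, U+061B, U+061F).
BREAKS = re.compile(r"[.!?;:,\u060C\u061B\u061F]+")

def context_windows(text, n_before=2, m_after=2):
    """Return (words before, word, words after) for every word in the text.

    Windows are built inside each punctuation-delimited segment, so the
    context of a word never reaches across a break.
    """
    windows = []
    for segment in BREAKS.split(text):
        words = segment.split()
        for i, word in enumerate(words):
            before = words[max(0, i - n_before):i]
            after = words[i + 1:i + 1 + m_after]
            windows.append((before, word, after))
    return windows

# Placeholder tokens stand in for real words here.
for before, word, after in context_windows("a b c, d e", 1, 1):
    print(before, word, after)
```

Each tuple would become one sample: the bare words are the input, and the vocalized form of the middle word is the label.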

Then you have to "classify" all your samples.

When running the tests, you strip the diacritics from the test set, run the result through the model, then verify that the output matches the original.

(And don't train on your final test set.)
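The test procedure could look something like the following Python sketch. The names strip_diacritics, split_train_test and sentence_accuracy are hypothetical, and predict stands for whatever model you end up with (any function from a bare string to a vocalized string). It assumes the diacritics are Unicode combining marks, and it compares after NFC normalization so that two marks on the same letter typed in a different order still count as equal.

```python
import random
import unicodedata

def strip_diacritics(text):
    """Remove every combining mark, leaving only the base letters."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))

def split_train_test(sentences, test_fraction=0.1, seed=0):
    """Shuffle once and hold out a test set that is never used for training."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def same_text(a, b):
    """Compare after NFC normalization, which puts marks in canonical order."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def sentence_accuracy(predict, test_sentences):
    """Strip the test sentences, re-diacritize them, and count exact matches."""
    correct = sum(same_text(predict(strip_diacritics(s)), s)
                  for s in test_sentences)
    return correct / len(test_sentences)
```

Exact sentence match is a strict measure; counting wrong letters instead would give a finer-grained score.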
   

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



