Robin Project - Not Only A Super Parser

yilin_yang

2.63/5 (12 votes)

Mar 17, 2008

CPOL

16 min read

18242

230

Robin implements ANN method into parser technology which ends the age of parser generators

Download Robin.zip - 290.21 KB
Web space for Robin Project
Robin at Sourceforge.net (Please check svn for newest update)

Preface

Robin is a kind of bird, nothing special at all.

There was an algorithm named Round-Robin which is widely used in designing Operating System.

I take this word from the Round-Robin idea, as I assume this kind of bird (I have never seen this bird
actually) has some periodical habit, maybe sings at 6:00am everyday... I truly can not find other explanation
of Round-Robin algorithm, and please forgive me if I got bad guess...

The reality is too complicated; only way to build simulation of the reality is to make it regulated, such as the
Robin bird singing at 6:00am everyday. From this point of view, things are simplified reasonable and then you
got a very beautiful view of the massed-up world - then you may find that it has its perfect regularity; it's quite
suppressing and even out of your imaginary.

And then, you see back to yourself... you understand that complication is just illusion... for example, you may
find it hard to understand why human beings have the ability of learning, speaking and thinking, until one day,
you build a machine which has same ability (Turing machine), then you understand why and you understand
yourself... for some point of view, not all point of views, we're some kind of machines... but no worry, we
still have something plus...

Real Parallel

Since this was a long way, there were many forks.

I was always trying to make the real Artificial Neuron Network which is not simply nodes, threshold, and feedbacks.

To be honest, I know few about this classic ANN theory. Because I do not feel it has any means to solve the
language problem. It may be used to recognize the human face, or controlling the body movement, I can not
see it is possible to make the ANN system learning human languages.

So I'm working on other ways to try to solve the problem with other visions which I will talk about later. I did
try to simulate the whole brain of neuron cells communication and even living status which currently I can even
not think that is possible. But yes, I did this and failed.

There was a big problem when I was trying to simulate the real cells: I can not make them working as they
were in the real world. Why? In the real world, they work side by side, they each has own time line. I can
not truly simulate them with the method of threads (threading programming), I can not accept too much
unpredicted matters because it is already too complicated. I have no choice other than trying make them as
Robins, while I can not accept they sing one another... that's too conflict.

Until one day, 2 years ago, I found the solution, I called this True Paralleling, and this is the previous
implementation of Robin framework. Simply saying, parallel working has conditions: if changing same value
within same period does not matter, these running bodies which are assumed to be running in parallel are
qualified to be parallel.

Based on this method, I found it is quite possible to write a compiler which is able to simulate human language
processing with human brain which is composed with neuron cells and all cells are running in parallel. If
simulating human neuron cells brings up too many factors, why not reassemble the compiler with part of
human neuron ability to make better?

Actually, this idea returns the beginning point, trying to let the computer recognize human language, if this is
still not near target, let the computer recognize all computer language easier and let the computer recognize
human language-like computer language, which means, runtime designable language or super parser

Parsers and Machine Translation

This is where Robin project came from.

If you know compilation, you know compilers. Very basically, modern compilers (since FORTRAN) are
made of three parts, the Tokenizer, the Parser and the Code Generator. Many people combine the
Tokenizer and the Parser together to be Parser that makes sense as well. They do some similar jobs.

Why similar has to be two? No, it is only technical solution to the hard problem. In most advanced languages
(BASIC or C or something), which are in fact some kind of English variation, has the natured that the
words (identifiers) are separated by space chars, or ASCII(32). A tokenizer can be easily introduced to
separate words into TOKENs which are simpler to be used by Parser level.

Human languages do not all follow this manner. For example, you don't need to separate two words of
Chinese characters when you read Chinese sentence. Please do not say that you will not code in Chinese, I
do not code in Chinese as well, but does parser have this ability to separate the Chinese words?

Modern parsers can not so far as I know. Something special can do this; the technology is called NLP,
Natural Language Processing. NLP parsers are basically super parsers. They recognize characters and tag
sentences into syntax structures. But however, this is some kind of NP (very hard problem). If all these NLP
parsers can work at our expectation, we have already seen products in the market place.

This is not the fault of researchers, this problem is too hard and it is the same level as Machine Translation. (If
the NLP problem is solved, you will see many Machine Translation products that can work as human
translators, however, unfortunately, I did not see this kind of product for many years, and many times, I got
joke-like translation result, very funny.)

So we just say NLP is more complicated parser which uses rules as common parsers to recognize characters
and words and syntax.

Modern parsers follow rules to recognize syntax and generate result trees. The rules are in the form of BNF,
Backus-Naur Form. There are two kinds of parsing orders: from bottom to top (LR parsing) and from top
to bottom (LL parsing). For most situations, we use LALR parsing which means "from bottom to top parsing
with looking ahead of one token. Why? it is very hard to determine why rule to use if not giving more
information.

Parsers are too complicated, modern parsers are actually built with programs, saying generated by
programs. These generators are call compiler-compilers. You may find such Yacc or GNU Bison as instance.
Almost every compiler is build with tools like Yacc or Bison, whatever C/C++ or BASIC.

Hard Rock

Modern parsers are programs generated by programs. This means, the parsers can not change after their
own compilation, which also means, the language that the parser can parse has no chance to chance unless
the parser changes, which also means, when the parser is compiled, the language is fixed and can not add or
remove or modify any part of the language. Then you see C program is C program, BASIC is BASIC.

What's the problem?

No problem, if you don't want to write BASIC code in .C file. As if you are not hungry, food means
nothing to you.

But what if you're hungry?

What if you want to add new grammars to the current parser and you are not the writer of the parser,
what if you want to change your currently working programming language either
in compilation time or run time?

For example, you would have to do this: (C# code)

class Door
{
void Open()
{
}
void Main()
{
Door door = new Door();
door.Open();
}
}

When you want to do this:

class Door
{
void Open()
{
}
void Main()
{
door is an instance of Door;
Open door;
}
}

You're not allowed by C# parser, and even C# inventor.

No ability of changing the language rules, no ability to take any step to make human understand human better
not even be able to talk about making computer understand human better.

Is it possible to make parsers to dynamically learn and change syntax rules?

The answer is theoretically YES. And in the reality, that means too complicated to make it. The writer of the
compilation books used the word "too complicated" to describe this issue. I did not fully understand what
he referred to, however, until I started programming a real parser many years ago, I saw the problem. It was
very near to impossible. Why? Because our knowledge of human language and human brain (neurology)
was quite not enough to complete
this task by the time the book was written. That was about 40 or 50 years ago, and during the past
decades, there was no significant advance in this science and technology field. We're still using Yacc, maybe
a better Yacc to generate code of parser, or even write the parser by hand.

Coding by hand, that is still viable way than no way at all. However, things change while time elapse;
now we have idea to solve the problem better.

Robin theory

Robin starts new path to solve the problem, what problem?

The problem that we can not write program in the form that we like which is more human-like but not be able to be
understood by parsers; The problem that a modern parser has no way to ascend to a human language parser.

Why it had to be generated? Because of complication. Why complicated? Because not having fully understanding of
the human brain? Do we fully understand now?

No, but much better than ever. We know neuron encoding, we know sequence matching, we know neuron cooperation,
we know main and micro loops (and more that are not written on the books which are current only the knowledge
of the boy).

Actually, Robin is proof of this hypothesis.

How does Robin work?

Robin uses BNF as rules (in XML form), Robin generates result trees. But Robin is different...

Robin assumes that patterns are recognized by artificial neurons: Neurons process signals which are sent from main loop and
micro loops. When a neuron accepts a signal, it reports to next neuron which is connected. The most important thing is that
all neuron are working simultaneously (side by side), which makes all pattern matching working simultaneously.

This means, for a given input signal, the Robin will response as much as possible in every of its parts that can accept this
signal. A certain signal may have different "meanings" in various situation and different patterns; Robin gets them all instead
of selecting "the correct choice" which is actually not possible if you do not look ahead.

It is not allowed to look ahead in Robin. Yes, no look ahead and signal by signal (a character is a signal by this meaning), this
is LR(0) parser. Every language (computer language) is LR(0) language, but not every parser is LR(0) parser. Parsers have
to look ahead to know which is correct pattern to match, and actually no common parser can be real LR(0).

Robin is true LR(0) parser since it does not need to do any look ahead action. However the cost is, Robin does not care
about which is correct pattern. When you do a deep research on any language, you would see that trying to find the correct
pattern is not possible or even not good at all. Why? Because you may get bad guess and that's very common issue.

The how does Robin get correct pattern? Robin just does not get correct pattern at parsing time.

Robin records every possible pattern, in the form of trees. What's more important, these artificial neurons (implemented as
Matchers) send signals to micro loops to inform other neurons. This forms a chain of reaction. And as result, you would get
the Reducing like action comparing with the classical parser.

Robin's rule is actually Reduce-first rule. Reduce-first rule means searching for every possible meaning of the same signal
before looking at next signal. Reduce-first rule is actually reflection of micro-loops in neurology.

Then how to deal with shifting? or saying how to move to next signal?
Shifting action is scheduled and will be performed at the end of all patterns finished their jobs. This is the idea of Robin or
Real Parallel: these patterns can work side by side only if it does not matter in order.

This is like time is zoomed-in, in very sub level of time; pattern matching works in order, but this order never generates bad
result at the given circumstance. We get them running in parallel.

During the signal matching and re-generating process, we record every useful information to build trees dynamical.
And finally we got a set of trees at the end of input signal serial. That is all possible understanding of the given
input serial (input text).

Final results are generated after the end of input, this means, there would be no stable result generated
while reading in signals (characters). This behavior is very similar to LL(k) parsing which is the other
very hard problem and only LL(1) parsing is solved well for now. Robin takes a work-around, no need for
LL(k) problem solution, it is solution to LL(k) problem already.

Now you see what I mean, and you may guess what the next step is.

The next step is to trim the trees to eliminate inter-operation steps.

If in the end, you got a single tree on every level (no level with more than one trees), this means, you got
unique meaning of this input text, no ambiguous at all. Otherwise, some problem in BNF definition that you
describe the language, or maybe the input text has double meanings.(there is situation that not these two
reasons, like some special dangling, this means Robin has problem in implementation, that's something
should be improved).

Expression and Prefixing

A simple Robin framework has two simple native rules: 1. Reduce-First, 2. Late-Tree-Trimming.

This is not enough for more complicated situations, such as parsing an expression.

Why parsing expression is so important?

Actually, human language is not complicated as computer languages in some manner.

For example, most complicated human sentence has no more braces "{}" than three levels, which means
the sentence is in plain and not in nested form very deeply.

Human can not accept this kind of nesting even in long articles. However, this kind of code is very common for
programmers.

Nesting also has corresponding mechanism in the human neurology. That's prefixing.
Prefixing means a set of signals that are always sent before the main signal. They form prefix to that signal,
to make a meaning scope.

Assuming that you're blind and how can you read this expression easily?
(((1+2)*3+(4+5)*6)/7)
It is quite hard to match the parenthesis even you have eyes focusing on it, and also hard to ensure there is no mistake at all.

Classical parsers solve this problem by introducing symbol stacks. This solution actually breaks up parsers and tokenizes
since tokenizes do not require stacks at all. This solution also limited the parsers to have it select the correct patterns
which is not theoretically possible (in the reality, this means, no way to make sure whether it is good or bad guess,
however, you can just guess an answer).

Robin solves this problem by using prefixes of signals.
Prefixing is very similar as stacks but no need to make sure good guess.
All current working matchers can not work with new prefixed signal, and new matchers should be generated for the prefixed
signals. And this actually means, baking up the whole parser (where is necessary) until the prefixing stage is over.
When adding new prefix, a new level of baking up starts right away. You will see how this work if you just type a complicated
expression into Robin input text field. Words can not be better than experience of your own.

Dynamically generating of Matchers is simulation of neuron pattern duplicating. Along the main loop, there are many copies of
same pattern in the form of neuron cell sequence. For example, there may be 10 or more sequences to match "ABC".
You can not feel this because all sequences generates the same result signal ABC (this is token, not chars). Neurons would
encode "ABC" into a single signal. However this how-to is still unknown yet. We just do this simulation with computer
implementation, and no need to care about how neuron works.

Robin implementation

The implementation of Robin is quite tricky. Frankly, it is coded rather than designed.
There is no way to design something that I myself even can not fully understand yet. I have to try many times
on a single condition to find out what's the problem and then think out what's the solution.

By introducing real paralleling technology, you will see many levels of for loops. Too many levels, some even has 8 levels.

Why so many levels? To synchronize patterns and micro-loops, to ensure a signal can traverse to every pattern to avoid pattern
selection, and finally, to make sure trees are correctly trimmed. Besides, there are also many recursive methods, most about
trimming.

Since no other explanation is better than the code itself - code is self-explained,
please read code for design and implementation details.

Conclusion

10 days of coding work is done to complete this period. I feel much better solving this problem than having it as heavy
load for past 14 years. This is good news for me.

Robin is still at its starting stage. I would like to give GPL or LGPL license to this project, so that more people
can join and help developing.

Robin currently needs to confirm with programming languages such as BASIC/C or more by given BNF and examine
its output trees. An language converter can be made for programming languages. A well defined programming language
can also use this ability to provide compilation time syntax extending (like the example of doors).

Then the following step, we try to capture micro-loop signals and even main-loop signals to make dynamic pattern learning.

This way we will make Robin a great ability to learn language. The mechanism is still complicated which I will talk
about in future. Just to inform that this is quite possible and near to target.

This means, Robin will be not only a super parser...

The next step, we not capture any trees, we redirect output signal sequence back to input entry.

We keep learning ability in runtime and not in compilation time. This way, Robin captures its own output
signal sequence which is result of input or even no input sequence, it process the result sequence again
and again... this is primitive implementation of a real think machine... I will not give proof here, just direction.

No proof is better than implementation, however, that just something which will happen in future and not now.

By Yilin (yyl_20020115@yahoo.com.cn)

[20080317-01:13]