|
Write a Wikipedia article. That'll make it true.
I can't imagine any kind of reader/parser which doesn't tokenize by pulling.
|
That's not exactly what a pull parser is.
A pull parser parses one small step at a time before returning control to the caller.
while(reader.read()) {...}
You call it like that, and inside the loop you check the nodeType() and the value() and such to get information about the node at the current location.
Microsoft built one for XML in .NET called the XmlReader - you've probably used a derivative of it before, if not directly, then indirectly by way of another XML facility like XPath or the DOM.
Newtonsoft has one for JSON, but I don't like it, personally.
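The calling pattern described above can be sketched with a toy pull reader. This is a hypothetical minimal example, not XmlReader itself: `ToyReader`, `NodeType`, and `countNumbers` are invented names, and it tokenizes a flat string rather than XML, but the shape of the loop - call `read()`, then inspect `nodeType()` and `value()` - is the same.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// A toy pull "reader" (hypothetical, not Microsoft's XmlReader): it yields
// one token per read() call and returns control to the caller in between.
enum class NodeType { None, Number, Name };

class ToyReader {
    std::string src_;
    std::size_t pos_ = 0;
    NodeType type_ = NodeType::None;
    std::string value_;
public:
    explicit ToyReader(std::string src) : src_(std::move(src)) {}

    // Advance to the next token; return false at end of input.
    bool read() {
        while (pos_ < src_.size() && !std::isalnum((unsigned char)src_[pos_])) ++pos_;
        if (pos_ >= src_.size()) { type_ = NodeType::None; return false; }
        std::size_t start = pos_;
        bool digit = std::isdigit((unsigned char)src_[pos_]) != 0;
        while (pos_ < src_.size() && std::isalnum((unsigned char)src_[pos_])) ++pos_;
        value_ = src_.substr(start, pos_ - start);
        type_ = digit ? NodeType::Number : NodeType::Name;
        return true;
    }
    NodeType nodeType() const { return type_; }
    const std::string& value() const { return value_; }
};

// Typical caller: drive the loop yourself, inspecting the current node
// at each step - the while(reader.read()) pattern described above.
inline int countNumbers(const std::string& text) {
    ToyReader reader(text);
    int n = 0;
    while (reader.read())
        if (reader.nodeType() == NodeType::Number) ++n;
    return n;
}
```

The key property of the pull style is visible here: the reader does one small step and stops, so the caller decides when (and whether) to advance.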
Real programmers use butterflies
|
Yeah, I do that, but it's at a higher level. So -- for instance -- when my loader finds an array of Widgets, it iterates all the Widgets in that array, loading each into the database.
|
Yeah, I build that kind of stuff on top of the pull parser. In my "Diet JSON and a Coke" article I go into that - constructing queries out of navigation and data-extraction elements.
You basically build queries and then feed them to the reader, and they drive the reader for you (in fact, it's more efficient than calling read() yourself).
Real programmers use butterflies
|
I don't query or search, I simply iterate tokens until I reach the start of an array of objects I'm interested in.
Then I iterate those objects.
That way, I read each file only once.
For the most part, each of the files I'm reading is just one array of objects and I load the whole thing into one database table.
Only the most recent files I'm working with contain multiple arrays containing different types of objects -- and each type of object gets thrown at a different database table.
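The single-pass approach described above can be sketched as follows. This is a hedged illustration, not the poster's actual loader: `countObjectsInArray` is an invented helper, it scans an in-memory string rather than a token stream, and a real loader would hand each object to the database instead of counting. It also does no full JSON validation (a brace inside a string value would confuse it); it does just enough structural work to walk the one array of interest.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Sketch: skip ahead until the named array starts, then iterate the
// objects inside it, counting one per top-level '{'. Reads the input
// exactly once; everything outside the array is passed over unparsed.
inline int countObjectsInArray(const std::string& json, const std::string& name) {
    std::string key = "\"" + name + "\"";
    std::size_t pos = json.find(key);        // 1. find the field name
    if (pos == std::string::npos) return 0;
    pos = json.find('[', pos + key.size());  // 2. find the array it opens
    if (pos == std::string::npos) return 0;
    int depth = 0, objects = 0;
    for (std::size_t i = pos + 1; i < json.size(); ++i) {
        char c = json[i];
        if (c == '{') { if (depth == 0) ++objects; ++depth; }
        else if (c == '}') --depth;
        else if (c == ']' && depth == 0) break;  // end of the array
    }
    return objects;
}
```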
|
I made my parser with selective bulk loading of machine-generated JSON in mind. When you search, it does partial parsing and no normalization, allowing it to find what you're after fast at the expense of some of the well-formedness checking (but like I said, it's geared for machine-generated dumps).
Not that it matters in a .NET environment, but my parser also will not use memory to hold anything you didn't explicitly request, which means you only need enough bytes to scan the file, plus storage for your results. I often do queries with about 256 bytes of RAM to work with. It doesn't even compare field names or undecorate strings in memory - it does that right off the input source (usually a disk, a socket, or a string).
My latest codebase will even allow you to stream value elements (field values and array members), so you can read massive BLOB values in the document. Gigabytes.
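The "compare right off the input source" idea above can be illustrated like this. `matchName` is a hypothetical helper (not the author's actual API), and `std::istream` stands in for the disk/socket/string source: the expected name is compared one character at a time against the stream, so the document's field name is never buffered in memory and a mismatch stops the comparison early.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Compare an expected JSON field name against the stream character by
// character, without ever holding the document's copy of the name in
// memory. Assumes the stream is positioned just after the opening quote.
inline bool matchName(std::istream& in, const std::string& expected) {
    for (char want : expected) {
        int got = in.get();
        if (got == EOF || (char)got != want) return false;  // mismatch: bail early
    }
    return in.peek() == '"';  // the name must end exactly here (closing quote)
}
```

The final `peek()` check is what rejects a field like `names` when you asked for `name` - a detail a flat memcmp on a buffer would also need, but here it costs no storage.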
Real programmers use butterflies
modified 24-Dec-20 17:51pm.
|
honey the codewitch wrote: selective bulk loading
Yup.
honey the codewitch wrote: machine generated JSON
Yup.
honey the codewitch wrote: partial parsing
Supported.
honey the codewitch wrote: no normalization
That's up to a higher level to determine.
honey the codewitch wrote: at the expense of some of the well formedness checking
Basically none.
honey the codewitch wrote: It doesn't even compare field names
Why would it? That's up to a higher level to determine.
honey the codewitch wrote: undecorate strings in memory
Unquote? Unescape? I do that as late as possible, not until I know I want the value.
Bear in mind also that the underlying reader/tokenizer (?) is not used only for JSON, but for CSV as well.
 ___________________________________
|              Loader               |
|___________________________________|
| JSONenumerator  |  CSVenumerator  |
|_________________|_________________|
| JSONtokenizer   |  CSVtokenizer   |   <- unquoting and unescaping happen here, as appropriate
|_________________|_________________|
|      STREAMtokenizer (base)       |
|===================================|
|            TextReader             |
|===================================|
|
Everything you're describing tells me that, because of your abstraction, you're loading strings into memory and operating on them in memory. Because your higher level determines these things, it only operates on the strings after the fact. I am not. Now, for .NET that doesn't matter. For an 8kB Arduino it does. The point is, our parsers are fundamentally different in that respect.
Also, when you said normalization is for a higher level to determine, you misunderstood me. I parse no numbers, no strings, nothing, unless you actually request it. That's what I mean by no normalization. Based on what you're telling me about your architecture, I suspect - am almost certain - that you are normalizing unconditionally at the parser level. I do not parse every field or value I encounter. I skip over most of them; they never get turned into anything in value space.
Literally, most of the time I'm advancing like this:
while(m_source.currentChar() != stop) { m_source.advance(); } // stop is some context-sensitive stopping character
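A runnable version of the loop above, with `std::istream` standing in for `m_source` (the names here are stand-ins, not the author's actual classes): advance one character at a time until a context-sensitive stop character, consuming and discarding everything before it without storing any of it.

```cpp
#include <cassert>
#include <sstream>

// Skip forward to the next occurrence of `stop`, one character at a time.
// Nothing skipped over is buffered or parsed - it's simply discarded.
// Returns true if found (stream positioned just past the stop character),
// false if end of input was reached first.
inline bool skipTo(std::istream& in, char stop) {
    int c;
    while ((c = in.get()) != EOF)
        if ((char)c == stop) return true;
    return false;
}
```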
Real programmers use butterflies
|
honey the codewitch wrote: our parsers are fundamentally different
Yes.
I suppose the biggest conceptual difference between ours is that I needed to write a fairly general loader utility which could read a "script" and perform the tasks, not write several purpose-built utilities -- one for each file to be loaded. The ability to have it support CSV (and XML) as well as JSON was an afterthought.
honey the codewitch wrote: I parse no numbers, no strings, nothing, unless you actually request it.
Well, mine too. It does have to tokenize so it knows when it finds something you want it to parse, but nothing more than that until it finds a requested array.
If the script being run says, "if you find the start of an array named 'Widgets', then do this with it", then the parser has to know "I just found an array named 'Widgets'".
honey the codewitch wrote: you are normalizing unconditionally at the parser level
Well, I suppose so, insofar as I make values (or names) out of every token, but at that point they're just strings -- name/value pairs with a type -- they're not parsed.
I throw only the strings we want at SQL Server, and it handles any conversions to numeric or other types; the loader has no say in that.
The loader has no say in data normalization either, it's just passing values as SQL parameters.
Again, I want nearly every value in the file to go to the database, so of course I wind up with every value and throw them all at SQL Server.
It may be a misunderstanding of terms, but in my opinion, no actual "parsing" is done until the (string) values arrive at SQL Server -- that's where the determinations of which name/value pairs go where, what SQL datatype they should be, etc. happen. The loader utility has no knowledge of any of that.
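The script-driven dispatch described above ("if you find the start of an array named 'Widgets', then do this with it") reduces, conceptually, to a table mapping array names to handlers. The sketch below is entirely hypothetical - `Dispatcher`, `Handler`, and `onArrayStart` are invented for illustration and are not the poster's actual loader - but it shows the shape: the tokenizer reports "I just found an array named X", and the loader either fires the matching handler or skips the array unparsed.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Handler invoked when a requested array is found; in a real loader it
// would iterate the array's objects and pass each to the database.
using Handler = std::function<void(const std::string& arrayName)>;

struct Dispatcher {
    std::map<std::string, Handler> handlers;     // the "script": name -> action
    std::vector<std::string> unhandled;          // arrays the script doesn't mention

    // Called by the tokenizer whenever it reaches the start of a named array.
    void onArrayStart(const std::string& name) {
        auto it = handlers.find(name);
        if (it != handlers.end()) it->second(name);
        else unhandled.push_back(name);          // not requested: skip past it
    }
};
```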
|
I'm using parsing in the traditional CS sense of imposing structure on a lexical stream based on patterns in said stream.
Real programmers use butterflies
|
PIEBALDconsult wrote: Well, mine too. It does have to tokenize so it knows when it finds something you want it to parse,
I have other ways of finding something. I switch to a fast matching algorithm where I basically look for a quote, as if the document were a flat stream of characters and not a hierarchically ordered structure of logical JSON elements. That's what I mean by partial parsing, and part of what I mean by denormalized searching/scanning.
It ignores swaths of the document until it finds what you want. For example:
reader.skipToField("name",JsonReader::Forward);
This performs the type of flat match that I'm talking about.
reader.skipToField("name",JsonReader::Siblings);
This performs a partially flat and partially structured match, looking for name on this level of the object hierarchy.
reader.skipToField("name",JsonReader::Descendants);
This does a nearly flat match, but basically counts '{' and '}' so it knows when to stop searching.
I've simplified the explanation of what I've done, but that's the gist. I also don't load strings into memory at all when comparing them. I compare one character at a time straight off the "disk" so I never know the whole field name unless it's the one I'm after.
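The Descendants-style match above can be sketched as follows. This is a simplified stand-in for the author's JsonReader, not its real implementation: it works on an in-memory string instead of a stream, has none of the escape handling, and can be fooled by a brace or quote inside a string value. What it does show is the "nearly flat" scan: no structure is tracked except a count of '{' and '}', which tells the search when it has left the current subtree and should give up.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Scan forward from `pos` for a quoted field name, counting braces so the
// search stops at the end of the current object's subtree. On success,
// `pos` is left just past the matched name.
inline bool skipToFieldInSubtree(const std::string& json, std::size_t& pos,
                                 const std::string& name) {
    std::string key = "\"" + name + "\"";
    int depth = 0;
    for (; pos < json.size(); ++pos) {
        char c = json[pos];
        if (c == '{') ++depth;
        else if (c == '}') {
            if (--depth < 0) return false;  // left the subtree: stop searching
        } else if (c == '"' && json.compare(pos, key.size(), key) == 0) {
            pos += key.size();              // positioned just past the field name
            return true;
        }
    }
    return false;
}
```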
Real programmers use butterflies
|
I actually didn't have a clue what it was, but I'm a total noob so I don't count anyway
modified 3-Jun-21 21:01pm.
|
...in a break with tradition, I'm going to thank my friends and relatives for their Christmas absence.
|
Well thanks for your presence here!
If you can't laugh at yourself - ask me and I will do it for you.
|
The break must be thanking them
|
One of those pages with so many ads, pop-ups, notifications and so on that after 30 seconds it was still jumping around (unreadable), so I gave up.
I got the gist of it though. I wonder how the CPU copes with a 30-minute dishwasher cycle?
|
uBlock.
Works wonders!
"I have no idea what I did, but I'm taking full credit for it." - ThisOldTony
"Common sense is so rare these days, it should be classified as a super power" - Random T-shirt
AntiTwitter: @DalekDave is now a follower!
|
Name it a Console and it will sell!
(if that is of any consolation to you)
|
One of those products you hadn't heard of before now and now you know why.
I'm not sure how many cookies it makes to be happy, but so far it's not 27.
JaxCoder.com
|
Consoles nowadays are so power hungry, if you just take out the fans and heatsink you'll end up with the same product...
|
I keep a stone by the front door all December to throw at Carol Singers - I call it my Jingle Bell Rock.
"I have no idea what I did, but I'm taking full credit for it." - ThisOldTony
"Common sense is so rare these days, it should be classified as a super power" - Random T-shirt
AntiTwitter: @DalekDave is now a follower!
|
When they sing you can sling. You'd have to be stoned to stand out in the cold and bother strangers all night. I sedimentary that, too.
Ravings en masse^
"The difference between genius and stupidity is that genius has its limits." - Albert Einstein
"If you are searching for perfection in others, then you seek disappointment. If you seek perfection in yourself, then you will find failure." - Balboos HaGadol Mar 2010
modified 22-Dec-20 11:08am.
|
That of quartz is a fine idea, though you may be charged with a salt, ore not.
"the debugger doesn't tell me anything because this code compiles just fine" - random QA comment
"Facebook is where you tell lies to your friends. Twitter is where you tell the truth to strangers." - chriselst
"I don't drink any more... then again, I don't drink any less." - Mike Mullikins uncle
|
Expect a Christmas visit from the Cole Porter!
Freedom is the freedom to say that two plus two make four. If that is granted, all else follows.
-- 6079 Smith W.
|