|
Is your wife happily married?
|
|
|
|
|
That's not just married women.
Bastard Programmer from Hell
"If you just follow the bacon Eddy, wherever it leads you, then you won't have to think about politics." -- Some Bell.
|
|
|
|
|
All husbands need to remember these 2 things:
0) Happy wife, happy life!
1) You can either be right, or you can be happy.
Kelly Herald
Software Developer
|
|
|
|
|
Hi,
I just stumbled on something and thought I would share it. A while back I mentioned that I was working on a CCC analyzer/solver in my spare time; it's a side project and I haven't finished it. Since I will have some free time towards the end of this year I am picking up the project again. As part of my project I am analyzing the crossword puzzles using a skip-gram word embedding with over 2 trillion tokens (3M vocab) evaluated in 500 dimensions. The embedding is trained on parts of the English Gigaword corpus, the Wikipedia dump and most of the news/science articles from 2011-2017. (Yes, a lot of data!)
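For the curious, training such an embedding looks roughly like the gensim sketch below. gensim is just an illustrative choice here, and the corpus path and every hyperparameter except the 500 dimensions are placeholders rather than my actual settings.

```python
# Minimal skip-gram training sketch (illustrative only).
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Hypothetical pre-tokenized corpus: one sentence per line.
sentences = LineSentence("corpus.txt")

model = Word2Vec(
    sentences,
    vector_size=500,  # 500 dimensions, as described above
    sg=1,             # 1 = skip-gram (0 would be CBOW)
    window=5,
    min_count=5,
    workers=8,
)
model.save("gigaword_wiki.w2v")
```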
One of my unit tests checks the 100 most common nouns in the English language for certain characteristics.
[Top 10 correlations for Government]
governments 0.723813
minister 0.618532
administration 0.60618
federal 0.595554
governmental 0.587466
cabinet 0.584909
public 0.583068
ministry 0.579487
officials 0.572555
whitlam 0.565244
I like to think that I have a good grasp of the English language. However, last night I noticed something that stood out: a word relation that seemed unusual. The word 'Whitlam' was showing up as very highly related to the word 'government'. I'd never heard the word before, so I looked up the definition. It's not a word at all, it's a person's name; so how could the world's population of 7.9 billion be using it at such a high frequency in the context of 'government'?
The Spearman[^] and Pearson[^] correlations were so high... it could only mean that the word was being used directly next to the word 'government'. So I needed to find out how this bias had occurred and where it was coming from. Then I found it: Whitlam Government[^]. There are 434 articles on Wikipedia[^] with this phrase, and a quick investigation shows that there are over 80,000 indexed web pages using it.
Interesting situation... since I have historic Wikipedia dumps and also news articles from prior years, I can look for this bias in earlier data. I generated an embedding representing the year 2013 and 'Whitlam' scores much, much lower. So it seems people are using this phrase much more today than in years past.
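The year-over-year check is easy to reproduce in spirit; here is a minimal sketch (the gensim/SciPy calls and the model file names are illustrative stand-ins, not my actual pipeline):

```python
# Compare how strongly 'whitlam' correlates with 'government'
# in two year-stamped embeddings (file names are hypothetical).
from gensim.models import Word2Vec
from scipy.stats import pearsonr, spearmanr

def relatedness(model, a, b):
    va, vb = model.wv[a], model.wv[b]
    return pearsonr(va, vb)[0], spearmanr(va, vb)[0]

m2013 = Word2Vec.load("embedding_2013.w2v")
m2017 = Word2Vec.load("embedding_2017.w2v")

print("2013:", relatedness(m2013, "whitlam", "government"))
print("2017:", relatedness(m2017, "whitlam", "government"))
```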
So this got me thinking... as an offensive IW attack against NationY, which is known to be using NLP to study TopicX, it should be quite easy to distort and manipulate the outcome. In fact, you can easily calculate just how many words/articles would be needed to increase the rank/correlation.
As a defensive measure, it should be quite easy to monitor (from an omniscient internet viewpoint) words and phrases used by the population that begin to deviate from the current Zipfian distribution[^].
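As a toy illustration of that defensive idea (the single-exponent Zipf fit and the threshold are gross simplifications):

```python
# Flag words whose observed frequency sits far above the frequency
# Zipf's law predicts for their rank (f ~ C / rank).
from collections import Counter

def zipf_deviations(counts: Counter, threshold=3.0):
    ranked = counts.most_common()
    C = ranked[0][1]  # crude estimate of the Zipf constant from the top word
    flagged = []
    for rank, (word, freq) in enumerate(ranked, start=1):
        expected = C / rank
        if freq > threshold * expected:
            flagged.append((word, freq / expected))
    return flagged
```

In practice you would fit the exponent properly and compare snapshots over time rather than a single corpus.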
Wikipedia is not a reliable data source, I would recommend avoiding it for important NLP research.
Best Wishes,
-David Delaune
|
|
|
|
|
Wikipedia also needs to be heavily discounted for anything that involves political opinion.
|
|
|
|
|
Anything using text as a source is hard to process
I never worked directly with any of that, and frankly I didn't understand any of the more technical terms (Zipfian?!?), but I did once work on a project with people who did, and I picked up a few things (hopefully correctly).
From what they explained at the time, if my memory serves, they filtered that kind of "temporarily important information" using normalization, word-appearance rate and a temporal sliding window. As I remember, the algorithm was something like this: for a certain period of time (the temporal sliding window), calculate the rate of increase/decrease in the count of the target word (the word-appearance rate) compared to a previous period, and let that inversely affect the normalization (if the count increases, it has a negative effect on the total count, and the faster the growth, the bigger the impact). Then move the time window forward and repeat.
What happened was that spikes in words due to temporary increases in usage (for example, due to news articles) were smoothed out, while at the same time the overall count of the word would not grow significantly.
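If I understood them correctly, it was something like this toy sketch (the names and the exact damping formula are my guesses, not their code):

```python
# Count words per time window; the faster a word grows compared to
# the previous window, the less of its count is allowed through.
from collections import Counter

def damped_counts(windows):
    """windows: one list of tokens per time period, oldest first."""
    total = Counter()
    prev = Counter()
    for tokens in windows:
        cur = Counter(tokens)
        for word, n in cur.items():
            growth = n / prev.get(word, 1)
            # A spiking word only contributes at roughly its old rate.
            total[word] += n / growth if growth > 1 else n
        prev = cur
    return total
```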
I hope I made some sense and that I did not just write something that is a complete lie.
|
|
|
|
|
|
Randor wrote: It's interesting to track social interests over time
Yes. But most web sites seem to end up with a curve similar to CodeProject's, and the question becomes how long that tail is.
Also interesting is that big gray rectangle on Stack Overflow's map. I had to look it up: it is Wyoming. Either it has no data or no data was produced. Both are equally strange.
Randor wrote: Something happened on November 9th[^] (probably here in the Lounge)
Sorry, I think I was offline that day and missed it. I went back into the Lounge and couldn't find anything I could connect with that date (but I am not that smart). For what it's worth, there is a C# tag on the link you sent, and a general web search returns the launch of new features for C#.
|
|
|
|
|
Statistics covering only the USA are of limited interest outside the USA.
Unless it is a US-only phenomenon. So maybe we should leave CodeProject, GitHub and StackOverflow to the USAians, and make something different for the rest of the world.
|
|
|
|
|
Don't mess with me north man. Today is Saturday and Saturday is wine day. Besides, after my third glass I become a black belt in Kung Fu.
|
|
|
|
|
We needed to rotate the display for our commercial gadget I'm working on.
My GFX library doesn't have that facility.
I added a general rotation capability by creating a "view" you can overlay onto any draw destination (a bitmap, a display device, etc.) and then draw to at the rotation and center you give it.
Doing it that way allowed me to slip the rotation transformation in under the existing code that draws the fonts and bitmaps, jpegs and everything else. =)
It's not fast, but I can make it faster if I need to.
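In pseudocode-ish Python (GFX itself is C++, and none of these names are the real API, this is purely a sketch of the idea), the view boils down to rotating coordinates about a center before handing pixels to the real target:

```python
import math

class RotatedView:
    """Wraps any draw destination; existing draw code stays unchanged."""
    def __init__(self, target, degrees, cx, cy):
        self.target = target
        self.cx, self.cy = cx, cy
        rad = math.radians(degrees)
        self.cos, self.sin = math.cos(rad), math.sin(rad)

    def set_pixel(self, x, y, color):
        # Rotate (x, y) about the center, then draw on the real target.
        dx, dy = x - self.cx, y - self.cy
        rx = self.cx + dx * self.cos - dy * self.sin
        ry = self.cy + dx * self.sin + dy * self.cos
        self.target.set_pixel(round(rx), round(ry), color)
```

Because the view exposes the same drawing interface as any other destination, the font/bitmap/jpeg code draws through it without knowing anything about the rotation.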
Gosh, I'm so happy with GFX. The design is really standing up to my attempts to improve it and use it. I haven't had to gut anything over its lifecycle, though that's not to say all the changes have been trivial. Still, the overarching design is holding up, and I'm thrilled about that.
Real programmers use butterflies
|
|
|
|
|
Have you ever considered consolidating your posts like "the old new thing" does?
Bastard Programmer from Hell
"If you just follow the bacon Eddy, wherever it leads you, then you won't have to think about politics." -- Some Bell.
|
|
|
|
|
https://9gag.com/gag/a81YpqZ[^]
back to looking for that other 9gag post....
Charlie Gilley
“They who can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety.” BF, 1759
Has never been more appropriate.
|
|
|
|
|
That's a W95 user. Anyone using W10 knows better.
Bastard Programmer from Hell
"If you just follow the bacon Eddy, wherever it leads you, then you won't have to think about politics." -- Some Bell.
|
|
|
|
|
Yup. But I swear, the Microsoft clowns are so far behind it's comical. Example: those clowns updated some drivers on my laptop, even though the setting was "don't update drivers (your stability may suffer)". The update reset things to the defaults and they updated the drivers anyway.
Many BSODs later....
Have you ever seen the QR code on the BSOD screen? It's supposed to take you to help. Try it on your phone sometime. It will get you to a page that says "You are in a helicopter." Factually correct, but totally useless.
I have NEVER seen repair fix anything.
Required reference below. I really think Ron White could work with this as a segue into "You can't fix stupid."
Quote: A helicopter with a pilot and a single passenger was flying around above Seattle when a malfunction disabled all of the aircraft's navigation and communications equipment.
Due to the darkness and haze, the pilot could not determine the helicopter's position and course to get back to the airport.
The pilot saw a tall building with lights on and flew toward it. The pilot had the passenger draw a handwritten sign reading "WHERE AM I?" and hold it up for the building's occupants to see.
People in the building quickly responded to the aircraft, drew a large sign, and held it in a building window.
Their sign said, "YOU ARE IN A HELICOPTER."
The pilot smiled, waved, looked at his map, determined the course to steer to SEATAC airport, and landed safely.
After they were on the ground, the passenger asked the pilot how the "YOU ARE IN A HELICOPTER" sign helped determine their position.
The pilot responded, "I knew that had to be the Microsoft support building, they gave me a technically correct but entirely useless answer."
Charlie Gilley
“They who can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety.” BF, 1759
Has never been more appropriate.
|
|
|
|
|
|
I Never Thought I'd Live to be a Hundred[^]
"I have no idea what I did, but I'm taking full credit for it." - ThisOldTony
"Common sense is so rare these days, it should be classified as a super power" - Random T-shirt
AntiTwitter: @DalekDave is now a follower!
|
|
|
|
|
Excellent tune
The less you need, the more you have.
Even a blind squirrel gets a nut...occasionally.
JaxCoder.com
|
|
|
|
|
Saw them live in 1970; they were awesome.
R.I.P. GE
The less you need, the more you have.
Even a blind squirrel gets a nut...occasionally.
JaxCoder.com
|
|
|
|
|
When I was in the Navy, we had a record player in the shop. Someone had the "Days of Future Passed" (?) album, and in a 12 hour shift, it got played 6 times, at least.
Not to be trollish about it, but the overexposure made me come to HATE the Moody Blues, "Nights in White Satin" particularly. To this day, whenever they come on the radio, I immediately find another channel.
I am sorry to see anyone die, and that certainly applies to Graeme Edge. All of his problems are over. It is the fans left behind who suffer the loss.
|
|
|
|
|
Earlier I said I had trouble understanding the family of bottom-up parse table building algorithms.
I don't want to claim I suddenly know it all, but I just figured out something that I think ANTLR did based on it.
Regular Expression engines work by matching characters in text using state machines.
These bottom-up parsers work by matching rules (like Foobar -> "bar" "baz") and tokens that they gather onto a stack. (I originally wrote "tokens" instead of "rules" and crossed it out, because that's how I sort of figured it worked before.)
Anyway, I don't really care that I understand, say, LALR table building a bit better now. But regarding ANTLR: the gentleman who invented it used state machines to recognize and match possible disjunctions for LL(k) parsing in an incoming token stream. When a grammar is ambiguous at k=1 but not at k=3, the difference can be made up using recognizer state machines for the additional lookahead. These would be built in a very similar way to my existing code that builds LR(1) parsers.
LL(k) has always been an elusive beast for me where k>1. I don't understand the math, nor did I get the concepts as they were presented to me.
But perhaps I have something here, after all this time now.
If I can get to LL(k), it opens up the possibility of writing grammars to parse languages like C# without resorting to custom parsing code for parts of the grammar, or to GLR, which is awful to use in practice.
Anyway, woo! I have so much work in front of me now between upcoming CP contributions and actual work that I won't be able to get to this for a bit, but it gives me time for it to sort of crystalize in my head before I start writing code.
Real programmers use butterflies
|
|
|
|
|
Can you please also give a simple introduction to the subject - for single-celled organisms like me?
Or do you think we all here are somehow connected to your thoughts / activities in a magical way?
Thanks
|
|
|
|
|
Sorry, it's a result of trying to keep my post brief and compress a lot of information into a little space.
Tokenizing, or lexing, text uses state machines to run regular expressions over an input text stream. It marks "lexemes" in the text, like IntegerLiteral or Keyword or StringLiteral. In the end it tags all the text with a symbol id indicating what it is (keyword, int literal, or whatever) - these are returned as a series (stream) of tokens.
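For example, a toy lexer along these lines (the symbol names are invented for the example):

```python
import re

TOKEN_SPEC = [
    ("Keyword",        r"\b(?:if|else|while)\b"),
    ("Identifier",     r"[A-Za-z_]\w*"),
    ("IntegerLiteral", r"\d+"),
    ("StringLiteral",  r'"[^"]*"'),
    ("Whitespace",     r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def lex(text):
    for m in MASTER.finditer(text):
        if m.lastgroup != "Whitespace":   # skip whitespace tokens
            yield (m.lastgroup, m.group())

print(list(lex('if x 42 "hi"')))
# [('Keyword', 'if'), ('Identifier', 'x'),
#  ('IntegerLiteral', '42'), ('StringLiteral', '"hi"')]
```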
These tokens are used by parsers, which use a special kind of state machine that drives a stack as well as an input cursor. The input cursor this time is over tokens instead of text.
Parsers, to put it simply, impose a hierarchical structure over that stream of tokens, based on a grammar. This yields a parse tree.
Some parsers (the LL family of parsers) use leftmost derivation and therefore construct the trees from the top down, starting at the root node in the grammar and working toward the leaves by eating input tokens.
Some parsers (the LR family of parsers) use rightmost derivation and therefore construct the trees from the leaves to the root, matching a series of tokens (or previously reduced rules), and then replacing those with the rule that represents them until the root rule is reduced. Recursively you can create a tree with this as well.
LL(1) means LL parsing with 1 token of lookahead. LL(0) parsers are not practical as they can't match anything significant (but LR(0) can).
LL(2) is more difficult because the extra disjunctions cause exponential growth in your parse tables. Think of it like the difference between the number of guesses you'd have to make to brute-force a 1-digit vs. a 2-digit PIN. It's like that.
LL(k) means k is arbitrary and usually dictated by whatever the grammar demands.
I think I can achieve LL(k) by using some of the algorithm from my LR parser construction.
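To make LL(1) concrete, here is a toy recursive descent parser over a made-up grammar (Expr -> Term (('+'|'-') Term)*, Term -> INT | '(' Expr ')'). The single lookahead token, self.peek, is the only thing it ever consults before deciding which rule to apply:

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = iter(tokens)
        self.peek = next(self.tokens, None)  # the one lookahead token

    def eat(self):
        tok, self.peek = self.peek, next(self.tokens, None)
        return tok

    def expr(self):
        node = self.term()
        while self.peek in ("+", "-"):       # LL(1): one peek decides
            node = (self.eat(), node, self.term())
        return node

    def term(self):
        if self.peek == "(":
            self.eat()
            node = self.expr()
            assert self.eat() == ")"
            return node
        return ("int", self.eat())

print(Parser(["1", "+", "(", "2", "-", "3", ")"]).expr())
# ('+', ('int', '1'), ('-', ('int', '2'), ('int', '3')))
```

A table-driven LL(1) parser encodes the same decisions in a parse table instead of call sites; LL(k) grows that table with the extra lookahead.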
Real programmers use butterflies
|
|
|
|
|
To be honest: I'd say 95% of the people here can't follow these posts, even if they smile along.
Now you have tried to explain it with LL(1), LL(2), LL(k), your area of expertise... and I claim once again that 95% here still do not understand...
This is the kind of story that needs a lot of knowledge to follow; one needs a lot of prior knowledge to understand your posts.
And of course there are always goofs who say yes to everything without understanding it.
Sorry
PLEASE
Keep going with your articles, and maybe do an article for beginners, like 'parsing for noobs, step by step', or some other 'lightweight' introduction to parsing.
All the best, you are great!
|
|
|
|
|