This site does plagiarism checking. I can upload my research and the site compares it with 150 billion websites, files, and research papers. The site shows whether I copied paragraphs or sentences from the internet, and shows the source of each copy.

I want to develop the same idea, but for Arabic. I want to know how to search that number of sites, how to store the content, and how to compare my paragraph with the paragraphs on all those websites across the WWW.
Updated 12-Nov-12 6:46am
joshrduncan2012 12-Nov-12 11:45am
What have you tried so far?
Nelek 12-Nov-12 11:53am
Good luck, it won't be easy.
Sergey Alexandrovich Kryukov 12-Nov-12 14:00pm
If you want to check for exact match of code fragment, this is one thing, but plagiarism... It would give too many false negatives and false positives. Citations are legal if proper attribution is done, and big fragments of text could be plagiarized by introducing tiny differences...

1 solution


Solution 1

This is literally a big task. The answer is you can't do it with the strategy you suggest (of checking all websites):
Downloading what we have now will take a long time. Even if you take a download-and-check, download-and-check approach, by the time you have finished, a lot of stuff will have changed and will need to be checked again. Worse, more stuff will be added than you are going to be able to download in the same time, so it will take effectively an infinite amount of time to process.

You therefore need to work smart rather than hard. First, you can cut the problem domain down: only check sites related to your topic. You can use something like the Google Custom Search API to lower the number of sites you need to check. Secondly, you could also use something like Google's API to search for text taken from the potentially plagiarised article; the more unusual the text the better, or look at the abstracts. You could employ a heuristic approach to improve performance and results, but that will get complicated. Unfortunately, Google has restricted use of its API without paying[^]. Even with Google, you'd be hard pressed to get 100% coverage or 100% accuracy (especially if the article has been re-worded).
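As a rough illustration of the "more unusual the text the better" idea, here is a minimal Python sketch that picks the sentences with the rarest wording from an article, so that only those are sent to a search service. The function names, the naive sentence splitter, and the rarity score are all my own assumptions for illustration; no real search API is called here.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter on '.', '!', '?' and the Arabic question mark (U+061F)."""
    parts = re.split(r"[.!?\u061F]+", text)
    return [p.strip() for p in parts if p.strip()]


def pick_query_sentences(text, k=3, min_words=5):
    """Return up to k sentences whose words are rarest within the document.

    Hypothetical heuristic: a sentence scores higher when its words occur
    less often elsewhere in the same document.
    """
    sentences = split_sentences(text)
    counts = Counter(w for s in sentences for w in s.lower().split())

    def rarity(sentence):
        words = sentence.lower().split()
        # Average inverse frequency: higher means more unusual wording.
        return sum(1.0 / counts[w] for w in words) / len(words)

    # Very short sentences make poor search queries, so skip them.
    candidates = [s for s in sentences if len(s.split()) >= min_words]
    return sorted(candidates, key=rarity, reverse=True)[:k]
```

Each returned sentence could then be submitted as a quoted phrase to whichever search service you end up paying for. A real system would probably score rarity against a large reference corpus rather than the document itself.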

Finally, you could look at the existing plagiarism checkers (e.g.[^]). This would take the hard work out of your hands entirely, but you'd also lose the interesting part of your project.
Sergey Alexandrovich Kryukov 12-Nov-12 14:02pm
You are talking about detecting matching text fragments, but that hardly helps to fight plagiarism. It would give too many false negatives and false positives. Citations are legal if proper attribution is given, and big fragments of text could be plagiarized by introducing tiny differences. Besides, there are no criteria for finding out which text is the original and which is plagiarized.
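To make the "tiny differences" point concrete, here is a small Python sketch of one common way to tolerate small edits: compare sets of overlapping word n-grams (shingles) with Jaccard similarity instead of looking for exact matches. This is only an illustration under my own assumptions (whitespace tokenisation, 3-word shingles); it is not how any particular checker works, and it does nothing about attribution or about which text came first.

```python
def shingles(text, n=3):
    """Return the set of overlapping n-word tuples in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


original = "the quick brown fox jumps over the lazy dog near the river"
reworded = "the quick brown fox leaps over the lazy dog near the river"
unrelated = "stock prices fell sharply after the quarterly earnings report"

# One changed word only disturbs the shingles that contain it,
# so the reworded copy still scores well above the unrelated text.
print(jaccard(shingles(original), shingles(reworded)))
print(jaccard(shingles(original), shingles(unrelated)))
```

Heavier rewording (synonyms throughout, reordered sentences) will still defeat this, which is exactly the false-negative problem described above.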
Keith Barrow 12-Nov-12 14:12pm
If you read my text, I'm basically telling him brute force is effectively impossible, and with anything else he is likely to achieve inaccurate results. At no point did I say this was going to produce reliable results. I expect a lone dev, however talented, is unlikely to solve this problem.
Sergey Alexandrovich Kryukov 12-Nov-12 14:36pm
I basically agree with that. My note was not meant to contradict your assessment; it's just another aspect to take into account. Even if you do code the text matching, it would not solve the problem.
nagiub2007 13-Nov-12 4:04am
@Keith Barrow I will look into this API, thanks a lot
Keith Barrow 13-Nov-12 5:01am
IMO you should try and find a simpler problem if this is a University project. Sites like Turnitin will have teams of specialised developers running complex algorithms, and they still won't be 100% accurate. I *really* dislike discouraging developers like this, but it is important to screen problems first to see whether your goals are feasible. In my view, a plagiarism checker is going to be too hard. Even my suggestions are only the very tip of the iceberg: they only bring down the scale of the challenge, whilst increasing the probability of missing plagiarised articles in the process.
nagiub2007 13-Nov-12 5:47am
Thanks a lot, but it isn't a University project; it's business related,
and I really want to develop this site even if I need to purchase an API or services to do it.
By the way, it seems to be a great challenge for me.
My problem is to know how these sites compare my uploaded file,
how they get all the documents and sites at comparison time, and how it is all done in a reasonable time. How?
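One plausible answer to the "in a reasonable time" part, offered only as a guess at the general technique and not as a description of how Turnitin or any other site actually works: fingerprint every stored document ahead of time and keep an inverted index from fingerprint to document, so that an upload is compared only against documents that share at least one fingerprint. The class and function names below are hypothetical, and a real system would keep the index in a database rather than in memory.

```python
import hashlib
from collections import Counter, defaultdict


def fingerprints(text, n=3):
    """Hash every overlapping n-word window to a 64-bit integer."""
    words = text.lower().split()
    result = set()
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i + n])
        digest = hashlib.sha1(gram.encode("utf-8")).digest()
        result.add(int.from_bytes(digest[:8], "big"))
    return result


class FingerprintIndex:
    """In-memory inverted index: fingerprint -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Index a stored document under each of its fingerprints."""
        for fp in fingerprints(text):
            self.postings[fp].add(doc_id)

    def query(self, text, min_overlap=0.3):
        """Return (doc_id, fraction of the query's fingerprints found), best first."""
        fps = fingerprints(text)
        if not fps:
            return []
        hits = Counter()
        for fp in fps:
            for doc_id in self.postings.get(fp, ()):
                hits[doc_id] += 1
        results = [(d, c / len(fps)) for d, c in hits.items() if c / len(fps) >= min_overlap]
        return sorted(results, key=lambda r: r[1], reverse=True)
```

Usage would look like `index.add("doc-1", stored_text)` for each crawled document, then `index.query(uploaded_text)` for the upload. The lookup cost depends on the size of the upload, not on the number of stored documents, which is the point of indexing in advance; the crawling and storage problem from Solution 1 remains.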
Keith Barrow 13-Nov-12 5:57am
How? If I knew that, I'd set up a plagiarism-checking website :).
nagiub2007 13-Nov-12 5:59am
Thanks for your efforts
nagiub2007 13-Nov-12 4:06am
@Sergey Alexandrovich Kryukov
There are websites that do it, like Turnitin.
As I mentioned in my question, I want to know how these websites do it.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
