The site Turnitin does plagiarism checking: I can upload my research, and the site compares it with 150 billion websites, files, and research papers. It shows whether I copied paragraphs or sentences from the internet and gives the source of each copied passage.

I want to develop the same idea, but for Arabic. I want to know how to search that many sites, how to store the data, and how to compare my paragraphs with the paragraphs on all these websites across the WWW.
Updated 12-Nov-12 5:46am
Comments
joshrduncan2012 12-Nov-12 11:45am    
What have you tried so far?
Nelek 12-Nov-12 11:53am    
Good luck, it won't be easy.
Sergey Alexandrovich Kryukov 12-Nov-12 14:00pm    
If you want to check for exact match of code fragment, this is one thing, but plagiarism... It would give too many false negatives and false positives. Citations are legal if proper attribution is done, and big fragments of text could be plagiarized by introducing tiny differences...
--SA

1 solution

This is literally a big task. The answer is that you can't do it with the strategy you suggest (checking all websites):
Downloading what exists now will take a long time. Even if you take a download-and-check, download-and-check approach, by the time you have finished, a lot of content will have changed and will need to be checked again. Worse, more content will be added than you can download in the same time, so processing would take effectively infinite time.
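
To put a rough number on that, here is a back-of-envelope sketch. Only the 150 billion page count comes from the question; the average page size and the bandwidth are assumed values picked for illustration, and the function name `crawl_years` is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def crawl_years(page_count, avg_page_bytes, bandwidth_bytes_per_s):
    """Years needed to download every page once, ignoring parsing and re-crawls."""
    total_bytes = page_count * avg_page_bytes
    return total_bytes / bandwidth_bytes_per_s / SECONDS_PER_YEAR

years = crawl_years(
    page_count=150_000_000_000,         # figure quoted in the question
    avg_page_bytes=100_000,             # ASSUMPTION: about 100 KB per page
    bandwidth_bytes_per_s=125_000_000,  # ASSUMPTION: a saturated 1 Gbit/s link
)
print(f"One full pass: about {years:.1f} years")
```

Under these assumptions a single pass takes years, and that is before any page has changed or any new page has been added.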

You therefore need to work smart rather than hard. First, you can cut the problem domain down: only check sites related to your topic. You can use something like the Google Custom Search API to lower the number of sites you need to check. Second, you could use something like Google's API to search for text taken from the potentially plagiarised article (the more unusual the text the better), or look at the abstracts. You could employ a heuristic approach to improve performance and results, but that will get complicated. Unfortunately, Google restricts use of its API unless you pay. Even with Google, you'd be hard pressed to get 100% coverage or 100% accuracy (especially if the article has been re-worded).
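
One possible way to pick "unusual" text to search for is sketched below. It is only an illustrative heuristic, not how any existing checker works: it scores each sentence by how rare its words are within the submitted document itself and returns the top few as quoted phrase queries. The function name `pick_queries` and its parameters are hypothetical, and the actual search API call is left out because its details are not covered here.

```python
import re
from collections import Counter

def pick_queries(text, k=3, min_words=5):
    """Return up to k sentences whose words are rarest within the text."""
    # Split on Latin and Arabic sentence punctuation (U+061F is the Arabic question mark).
    sentences = [s.strip() for s in re.split(r"[.!?\u061F]+", text) if s.strip()]
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    freq = Counter(w for words in tokenized for w in words)

    scored = []
    for sentence, words in zip(sentences, tokenized):
        if len(words) < min_words:
            continue  # very short sentences would match too many pages
        # Mean inverse frequency: 1.0 when every word occurs once in the text.
        rarity = sum(1 / freq[w] for w in words) / len(words)
        scored.append((rarity, sentence))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Quote each sentence so a search engine treats it as an exact phrase.
    return [f'"{s}"' for _, s in scored[:k]]
```

Sending only a handful of such queries per document keeps the number of API calls small, at the cost of missing copied passages that happen to use common wording.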

Finally, you could look at the existing plagiarism checkers (e.g. http://www.duplichecker.com/). This would take the hard work out of your hands entirely, but you'd also lose the interesting part of your project.
 
Comments
Sergey Alexandrovich Kryukov 12-Nov-12 14:02pm    
You are talking about detecting matching text fragments, but that hardly helps to fight plagiarism. It would give too many false negatives and false positives. Citations are legal if proper attribution is given, and big fragments of text can be plagiarized by introducing tiny differences. Besides, there is no criterion for deciding which text is the original and which is plagiarized.
--SA
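
The "tiny differences" point can be illustrated with a minimal sketch that compares overlapping word triples (shingles) using Jaccard similarity instead of exact string equality. This is a swapped-in, textbook technique, not something proposed in this thread; the function names are hypothetical, and stripping combining marks (which removes Arabic diacritics) is an assumed normalization step. It does nothing about attribution or about deciding which text is the original.

```python
import re
import unicodedata

def normalize(text):
    """Lowercase, strip combining marks (e.g. Arabic diacritics), split into words."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return re.findall(r"\w+", stripped)

def shingles(text, n=3):
    """Set of all runs of n consecutive words in the text."""
    words = normalize(text)
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=3):
    """Shared shingles divided by all distinct shingles; 0.0 if neither text has any."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

original = "the quick brown fox jumps over the lazy dog near the river bank"
reworded = "the quick brown fox leaps over the lazy dog near the river bank"
print(original == reworded)         # exact match fails on a one-word change
print(jaccard(original, reworded))  # similarity stays well above zero
```

A score like this still needs a threshold chosen by hand, so false positives and false negatives remain.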
Keith Barrow 12-Nov-12 14:12pm    
If you read my text, I'm basically telling him brute force is effectively impossible, and that with anything else he is likely to achieve inaccurate results. At no point did I say this was going to produce reliable results. I expect a lone dev, however talented, is unlikely to solve this problem.
Sergey Alexandrovich Kryukov 12-Nov-12 14:36pm    
I basically agree with that. My note was not meant to contradict your assessment; it's just another aspect to take into account. Even if you do code the text matching, it would not solve the problem.
--SA
nagiub2007 13-Nov-12 4:04am    
@Keith Barrow I will look into this API, thanks a lot.
Keith Barrow 13-Nov-12 5:01am    
IMO you should try to find a simpler problem if this is a university project. Sites like Turnitin will have teams of specialised developers running complex algorithms, and they still won't be 100% accurate. I *really* dislike discouraging developers like this, but it is important to screen problems first to see if your goals are feasible. In my view, a plagiarism checker is going to be too hard. Even my suggestions are only the very tip of the iceberg: they only bring down the scale of the challenge, whilst increasing the probability of missing plagiarised articles in the process.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


