Hi people. I am a beginner in R.
I have the following problem: I have many PDF files containing technical reports in Portuguese, written in natural language by many different authors. How can I develop an intelligent system that identifies the author name(s) from a small set of input keywords that closely match the work they have done?

For example, I know that I can read the text into R with the following lines of code (where yyyyyyyyyyyyyy is the URL or drive path of my PDF file, e.g. XXX.pdf):

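install.packages("pdftools")
library(pdftools)
# mode = "wb" keeps the binary PDF intact when downloading on Windows
download.file("yyyyyyyyyyyyyy/XXX.pdf", "./XXX.pdf", mode = "wb")
# pdf_text() returns a character vector with one string per page
text <- pdf_text("./XXX.pdf")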
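
Since the question is about many PDF files rather than one, the same call could be wrapped in a small loop. The following is only a sketch under assumptions: `read_reports` is a hypothetical helper name, the reports are assumed to sit in one local folder, and each file's pages are simply pasted together.

```r
# Hypothetical helper: read every PDF in a folder into a data frame
# with one row per file. Assumes the pdftools package is installed.
read_reports <- function(dir) {
  files <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE,
                      ignore.case = TRUE)
  # pdf_text() gives one string per page; collapse them into one per file
  texts <- vapply(
    files,
    function(f) paste(pdftools::pdf_text(f), collapse = "\n"),
    character(1),
    USE.NAMES = FALSE
  )
  data.frame(file = basename(files), text = texts, stringsAsFactors = FALSE)
}
```

Calling `read_reports("./reports")` (a placeholder path) would then return the text of every report in that folder, ready for further processing.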

I know that I will need to apply NLP (Natural Language Processing) from here, but what is the best way to do this? Will I need to use an ontology?
Then, once the text has been processed into a structured form, how can I develop an intelligent system that identifies the author name(s) from a small set of input keywords that closely match their work?
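
One possible starting point that needs no ontology is plain TF-IDF weighting with cosine similarity, which can be done in base R. The sketch below is an illustration under assumptions, not a finished system: the toy corpus, the author names, and the helpers `tokenize`, `count_terms` and `rank_authors` are all hypothetical, and the author of each report is assumed to be known already. The toy text is deliberately accent-free; real Portuguese text would also need accent and stop-word handling.

```r
# Hypothetical toy corpus: in practice `text` would come from pdf_text()
# and `author` from a lookup table or the report's front page.
docs <- data.frame(
  author = c("Ana Silva", "Bruno Costa", "Ana Silva", "Carla Souza"),
  text = c("ensaio de solda em tanque e duto",
           "projeto de ponte em concreto armado",
           "solda de duto e ensaio de tanque",
           "drenagem de solo e barragem de terra"),
  stringsAsFactors = FALSE
)

# Lowercase, split on non-letters, drop very short words ("de", "em", "e")
tokenize <- function(x) {
  tk <- unlist(strsplit(tolower(x), "[^[:alpha:]]+"))
  tk[nchar(tk) > 2]
}

# Count how often each vocabulary term occurs in a token vector
count_terms <- function(tk, vocab) as.numeric(table(factor(tk, levels = vocab)))

tokens <- lapply(docs$text, tokenize)
vocab  <- sort(unique(unlist(tokens)))
tf     <- t(vapply(tokens, count_terms, numeric(length(vocab)), vocab = vocab))
idf    <- log(nrow(tf) / colSums(tf > 0)) + 1   # rarer terms weigh more
tfidf  <- sweep(tf, 2, idf, `*`)

# Score each document against the keywords, keep each author's best document
rank_authors <- function(keywords) {
  q <- count_terms(tokenize(keywords), vocab) * idf
  if (all(q == 0)) return(setNames(numeric(0), character(0)))  # no known term
  sims <- as.vector(tfidf %*% q) / (sqrt(rowSums(tfidf^2)) * sqrt(sum(q^2)))
  best <- tapply(sims, docs$author, max)
  sort(setNames(as.vector(best), names(best)), decreasing = TRUE)
}
```

With this toy data, `rank_authors(c("solda", "tanque"))` puts "Ana Silva" first, because only her two reports contain those words. Taking each author's best-scoring document is one design choice among several; summing or averaging scores per author would be equally easy to try.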

Thanks for any help

What I have tried:

I tried reading the natural-language text from a PDF report and it looks OK, but after that I don't know how to proceed.
Comments
Gerry Schmitz 5-Mar-19 13:32pm
   
If the PDF is encrypted, none of this does you any good. And R, in this case, looks like a sledgehammer to kill a flea. You haven't even figured out "what" identifies an "author". Once you do that, a simple "text reader" will probably do. NLP?!
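
Both points in this comment can be checked before any text processing. As a rough sketch: `pdftools::pdf_info()` returns a list that includes an `encrypted` flag and a `keys` list of document metadata, which may hold an `Author` entry if the report was saved with one. The helper name `author_from_info` is hypothetical, and many PDFs carry no author metadata at all, in which case the name has to come from the text or from a separate list.

```r
# Hypothetical helper: take the list returned by pdftools::pdf_info(path)
# and return the author recorded in the PDF metadata, or NA if the file
# is flagged as encrypted or no usable Author key is present.
author_from_info <- function(info) {
  if (isTRUE(info$encrypted)) return(NA_character_)
  a <- info$keys[["Author"]]
  if (is.null(a) || !nzchar(trimws(a))) return(NA_character_)
  trimws(a)
}
```

For a real file this would be called as `author_from_info(pdftools::pdf_info("./XXX.pdf"))`.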

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



