How can I extract structured text from free unstructured text

Question

Experts!

The CT reports of a hospital (example below) were written as free, unstructured text. We need to extract data from these reports into a structured data table, for example:

Hemorrhage = Yes / No
Hydrocephalus = Yes / No
etc.

My question is: if you have ever tried something similar, what approach should I use?

Technique: Axial images through the brain were acquired from skull base to the vertex with 5 mm
slice thickness. Images were reviewed in brain, subdural and bone window settings.
Findings: There are bilateral areas of low attenuation in periventricular and subcortical white matter,
nonspecific but most compatible with microvascular changes. Cortical sulci and basilar cisterns are
normal in size and configuration. There is disproportionate ventriculomegaly involving lateral and
third ventricles primarily. There is no evidence of obstructing mass lesion. There is no intra or
extraaxial fluid collection. There is no parenchymal hemorrhage or mass lesion. There is no
evidence of acute transcortical infarction. There is no transtentorial herniation or midline shift. There
are bilateral cavernous internal carotid and vertebral arterial calcifications.
Visualized paranasal sinuses are normal. Visualized mastoid air cells and orbits are normal. Patient
is status post bilateral cataract removal surgery. Soft tissues of the scalp are normal. There is no
evidence of osseous fracture or aggressive appearing osseous lesion.
Impression:
Hydrocephalus without evidence of obstructing mass lesion. Acute hydrocephalus cannot be
excluded since there are no prior studies available for comparison. Extensive chronic white matter
changes may mask transependymal CSF edema. Correlate with short-term followup to exclude
acute hydrocephalus. Correlate with clinical symptoms to exclude normal pressure hydrocephalus.

IMPORTANT NOTE

[1] This is dummy data, NOT real patient data.
[2] This project is for research / training purposes, NOT primary care.

Posted 16-Sep-14 0:33am

Mohamed Kamal

Updated 16-Sep-14 3:22am

v2

2 solutions

Solution 1

~~At first glance I'd suggest something akin to map/reduce - e.g. the "word count" examples then feeding this into a "what do these words imply".~~

It gets more complex when you have to decide whether a phrase is positive or negative. For example, if the consultant writes "The patient had no obvious signs of concussion", is that Concussion Yes or Concussion No?

What you are going to need to do is parse the text into sentences or phrases, then run a process (I'd suggest in parallel) that takes each indicator word or phrase and looks for it in every sentence. You also need a process that finds "negations", so that a match preceded by "no" or "without" is not counted as a positive finding.
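The parsing and negation steps described above can be sketched roughly as follows. This is a minimal illustration in Python, not a validated clinical tool: the `INDICATORS` vocabulary, the `NEGATION` and `UNCERTAIN` cue lists, and all function names are assumptions made up for this example, and the sentence splitter is deliberately naive.

```python
import re

# Hypothetical vocabulary: output column -> indicator phrases (illustrative only).
INDICATORS = {
    "hemorrhage": ["hemorrhage", "haemorrhage"],
    "hydrocephalus": ["hydrocephalus"],
    "fracture": ["fracture"],
}

# Hypothetical cue lists; a real project would need a clinician-reviewed set.
NEGATION = re.compile(r"\b(no|not|without)\b")
UNCERTAIN = re.compile(r"\b(cannot be excluded|to exclude|may|possible)\b")


def split_sentences(text):
    """Naive splitter: collapse whitespace, then break after each period."""
    flat = " ".join(text.split())
    return [s for s in re.split(r"(?<=\.)\s+", flat) if s]


def classify(sentence, phrase):
    """Return 'affirmed', 'negated', 'uncertain', or None if phrase is absent."""
    lowered = sentence.lower()
    position = lowered.find(phrase)
    if position < 0:
        return None
    if UNCERTAIN.search(lowered):
        return "uncertain"
    # Only cues that appear before the phrase count as negating it.
    if NEGATION.search(lowered[:position]):
        return "negated"
    return "affirmed"


def extract(text):
    """Produce one pending record per (sentence, field) hit for later review."""
    records = []
    for sentence in split_sentences(text):
        for field, phrases in INDICATORS.items():
            for phrase in phrases:
                status = classify(sentence, phrase)
                if status is not None:
                    records.append(
                        {"field": field, "status": status, "sentence": sentence}
                    )
                    break  # one record per field per sentence
    return records
```

With these assumed cue lists, "There is no parenchymal hemorrhage or mass lesion." comes out as negated, "Hydrocephalus without evidence of obstructing mass lesion." as affirmed (the "without" follows the term, so it is not treated as negating it), and "Acute hydrocephalus cannot be excluded ..." as uncertain. The "uncertain" bucket is exactly the kind of record that should go to a human rather than be forced into Yes or No.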

Store these as pending records, then display the text with the phrases highlighted and get a clinician to approve or alter each item it has found.

You will also find that clinicians have a small, well-defined vocabulary, so once a phrase has been decoded and checked by the clinician it can be found again in other files and processed accordingly.
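As a rough sketch of the pending-record and reuse idea, assuming decisions are keyed by the whole normalized sentence (a simplification of "phrase") and kept in memory rather than in a database: the `ReviewStore` class, its method names, and the `**` highlight marker are all hypothetical.

```python
import re


def highlight(sentence, phrase, marker="**"):
    """Wrap every case-insensitive occurrence of phrase in marker for display."""
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    return pattern.sub(lambda m: f"{marker}{m.group(0)}{marker}", sentence)


def normalize(sentence):
    """Lowercase, collapse whitespace, and drop a trailing period."""
    return " ".join(sentence.lower().split()).rstrip(".")


class ReviewStore:
    """Keeps clinician-approved decisions and a queue of items awaiting review."""

    def __init__(self):
        self.approved = {}  # (normalized sentence, field) -> approved value
        self.pending = []   # records waiting for a clinician

    def lookup_or_queue(self, sentence, field, phrase):
        """Return a previously approved value, or queue the sentence and return None."""
        key = (normalize(sentence), field)
        if key in self.approved:
            return self.approved[key]
        self.pending.append(
            {
                "sentence": sentence,
                "field": field,
                "display": highlight(sentence, phrase),
            }
        )
        return None

    def approve(self, sentence, field, value):
        """Record the clinician's decision and remove matching pending items."""
        key = (normalize(sentence), field)
        self.approved[key] = value
        self.pending = [
            p for p in self.pending
            if (normalize(p["sentence"]), p["field"]) != key
        ]
```

The first time a sentence is seen it lands in `pending` with the matched phrase highlighted; after `approve(...)` the same wording in another report is answered from `approved` without asking again. A real system would persist both collections, for example in the database mentioned in the comments.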

Posted 16-Sep-14 0:46am

Duncan Edwards Jones

Updated 16-Sep-14 3:20am

v2

Comments

Mohamed Kamal 16-Sep-14 6:49am

I'm thinking of displaying and highlighting ambiguous phrases like the one you mentioned so that the operator can choose an interpretation, then storing the pattern in a database. What do you think, @duncan?

Duncan Edwards Jones 16-Sep-14 7:12am

Yes - it is going to need manual oversight/intervention. What sort of volume of records are you talking - 10s, or 1000s?

Duncan Edwards Jones 16-Sep-14 9:56am

I've updated my idea a bit - mapReduce is probably overkill

Mohamed Kamal 16-Sep-14 14:30pm

thanks a lot

Sergey Alexandrovich Kryukov 16-Sep-14 10:13am

Well, for notions like "no obvious signs of concussion", fuzzy set theory exists.

But the problem is different. The real problem is that the consultant may write something unrelated to the notion of "concussion" and still extremely important; none of the classifiers may foresee it.
At the present-day technology level, semantic analysis of natural language is nearly hopeless, but here the problem is even more difficult.
—SA

Mohamed Kamal 16-Sep-14 14:30pm

thanks a lot

Sergey Alexandrovich Kryukov 16-Sep-14 15:54pm

I wish I could really help more... :-)
—SA

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

BillWoodruff · Accepted Answer · 2014-09-16T02:46:00

Solution 2

There are legal (liability), as well as scientific, reasons one should not attempt to automate extracting summary data from clinical reports like this one. Such automated extracted data used by medical staff without the training to appreciate the subtleties involved could lead to negligent, or fatal, patient care.

What you propose goes far beyond "extracting structured data": you are proposing to extract clinically significant "meaning" from complex data, a task that is at the "frontier" of Artificial Intelligence.

The correct strategy would be to have a form for the clinician (probably a neurologist, in this case) to fill out in which they give estimated percentages of probability for presence/absence of discrete pathologies.
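One way to picture such a form as a data structure is sketched below. The class names, field names, and the 0 to 100 percent scale are assumptions for illustration; nothing here comes from an actual clinical system.

```python
from dataclasses import dataclass, field


@dataclass
class FindingEstimate:
    """One pathology with the clinician's own probability estimate."""

    pathology: str
    probability_present: float  # percent, 0-100, entered by the clinician
    note: str = ""

    def __post_init__(self):
        if not 0.0 <= self.probability_present <= 100.0:
            raise ValueError("probability_present must be between 0 and 100")


@dataclass
class ReportForm:
    """Structured companion to one free-text report (hypothetical layout)."""

    report_id: str
    findings: list = field(default_factory=list)

    def add(self, pathology, probability_present, note=""):
        self.findings.append(FindingEstimate(pathology, probability_present, note))

    def as_row(self):
        """Flatten to one table row: pathology -> estimated probability."""
        return {f.pathology: f.probability_present for f in self.findings}
```

The point of the design is that the numbers are entered by the person who wrote the report, so the qualifications in the free text are preserved as probabilities instead of being guessed by software.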

Words like "Impression," and "Correlate with," are there for a reason: to qualify the assertions that follow as tentative, and to indicate that the findings/observations need to be interpreted after further specific investigations.

I suggest you re-design your project.

Posted 16-Sep-14 2:46am

BillWoodruff

Updated 16-Sep-14 2:48am

v2

Comments

George Jonsson 16-Sep-14 8:49am

My 5.
I was going to ask which hospital this is meant for, so I can make sure to avoid it.

BillWoodruff 16-Sep-14 8:52am

Thanks, George ! I'd avoid that hospital, also :)

Duncan Edwards Jones 16-Sep-14 8:53am

My guess would be that this is probably for research/training purposes, not primary care...

George Jonsson 16-Sep-14 9:06am

One can always hope. :)

Mohamed Kamal 16-Sep-14 9:13am

Dear all, it is exactly as you said: this is for research/training purposes, not primary care. So far it is a virtual, home-made project with no patients involved at all. The advice to "redesign" my project does not apply, because the "project" does not exist yet. I used the tm library in R and it gave me somewhat acceptable results, but I was hoping for something more accurate. Can you help me?

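R code below; the variable "crude" holds the report text.

library(tm)
# Build a corpus with one document per element of the character vector `crude`
corpus <- Corpus(VectorSource(crude))
# Count term occurrences: rows are terms, columns are documents
dtm <- TermDocumentMatrix(corpus)
mat <- as.data.frame(as.matrix(dtm))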

BillWoodruff 16-Sep-14 10:39am

Hi Mohamed, My response was not meant to imply it wasn't valuable for you to think about this kind of data, and the limits of what might be extracted from it into a summary form useful for some highly-trained persons. imho, the only hope for such summary extraction would be what has been called, in the past, an "expert system." This is a good opportunity for you to become acquainted with computational complexity theory, and the "P vs. NP problem:"

http://en.wikipedia.org/wiki/NP-complete

cheers, Bill

Mohamed Kamal 16-Sep-14 14:30pm

thanks a lot

Sergey Alexandrovich Kryukov 16-Sep-14 10:06am

Good points, my 5.
Mohamed's argument above (research/training purposes only) is valid to some extent, but it simply means that this is not a "good project" in general. Not only is interpreting a professional's vague phrases nearly hopeless, but even the goal is very questionable: it would mean stripping the freely written text of its valuable information, its essence, and turning it into meaningless yes/no classifiers.
See also Solution 1 and comments.
—SA