Experts !

The CT reports of a hospital (example below) were written as free, unstructured text. We need to extract data from these reports into a structured data table, for example:

Hemorrhage = Yes / No
Hydrocephalus = Yes / No

My question is: if you have ever tried something similar, what approach should I use?

Technique: Axial images through the brain were acquired from skull base to the vertex with 5 mm
slice thickness. Images were reviewed in brain, subdural and bone window settings.
Findings: There are bilateral areas of low attenuation in periventricular and subcortical white matter,
nonspecific but most compatible with microvascular changes. Cortical sulci and basilar cisterns are
normal in size and configuration. There is disproportionate ventriculomegaly involving lateral and
third ventricles primarily. There is no evidence of obstructing mass lesion. There is no intra or
extraaxial fluid collection. There is no parenchymal hemorrhage or mass lesion. There is no
evidence of acute transcortical infarction. There is no transtentorial herniation or midline shift. There
are bilateral cavernous internal carotid and vertebral arterial calcifications.
Visualized paranasal sinuses are normal. Visualized mastoid air cells and orbits are normal. Patient
is status post bilateral cataract removal surgery. Soft tissues of the scalp are normal. There is no
evidence of osseous fracture or aggressive appearing osseous lesion.
Hydrocephalus without evidence of obstructing mass lesion. Acute hydrocephalus cannot be
excluded since there are no prior studies available for comparison. Extensive chronic white matter
changes may mask transependymal CSF edema. Correlate with short-term followup to exclude
acute hydrocephalus. Correlate with clinical symptoms to exclude normal pressure hydrocephalus.


[1] This is dummy data, NOT real patient data.
[2] This project is for research / training purposes, NOT primary care.
Updated 16-Sep-14 4:22am

At first glance I'd suggest something akin to map/reduce, e.g. the "word count" examples, then feeding the result into a "what do these words imply" step.

It gets more complex when you have to decide whether a phrase is positive or negative, e.g. if the consultant writes "The patient had no obvious signs of concussion", is that Concussion Yes or Concussion No?

You will need to parse the text into sentences or phrases, then run a process (I'd suggest in parallel) that takes each indicator word or phrase and looks for it in the sentence. You also need a process to find "negations".
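As a rough illustration of the sentence / indicator / negation steps just described, here is a minimal Python sketch (standard library only). The indicator vocabulary, the three negation cues and the rule "a cue anywhere before the phrase negates it" are all assumptions made for the example, not a validated clinical method; the names `INDICATORS`, `NEGATION`, `split_sentences` and `find_mentions` are hypothetical.

```python
import re

# Hypothetical indicator vocabulary (field -> phrases); a real one would come from clinicians.
INDICATORS = {
    "hemorrhage": ["hemorrhage"],
    "hydrocephalus": ["hydrocephalus"],
    "concussion": ["concussion"],
}

# Assumed negation cues; deliberately tiny for the example.
NEGATION = re.compile(r"\b(no|not|without)\b")


def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?' (naive; breaks on abbreviations)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_mentions(text, indicators=INDICATORS):
    """Return one record per indicator phrase found, flagged as negated or not."""
    mentions = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        for field, phrases in indicators.items():
            for phrase in phrases:
                pos = lowered.find(phrase)
                if pos == -1:
                    continue
                # Crude rule: a cue before the phrase, in the same sentence, negates it.
                negated = NEGATION.search(lowered[:pos]) is not None
                mentions.append(
                    {"field": field, "sentence": sentence, "negated": negated}
                )
    return mentions


if __name__ == "__main__":
    sample = (
        "There is no parenchymal hemorrhage or mass lesion. "
        "Hydrocephalus without evidence of obstructing mass lesion."
    )
    for m in find_mentions(sample):
        print(m["field"], "No" if m["negated"] else "Yes", "<-", m["sentence"])
```

Each sentence is handled independently, so the outer loop could be farmed out to parallel workers if the volume justifies it. The negation rule will misfire on sentences such as "to exclude acute hydrocephalus", which is one reason the results need human review.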

Store these as pending records, then display the text with the phrases highlighted and get a clinician to approve or alter each thing it has found.
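The pending-record idea could look something like the following Python sketch. The record fields, the status values ("pending", "approved", "altered") and the `[[ ]]` highlight markers are assumptions chosen for illustration; `PendingFinding`, `highlight` and `review` are hypothetical names, and a real system would persist the records in a database.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingFinding:
    field: str                   # e.g. "hemorrhage"
    phrase: str                  # the phrase that triggered the finding
    sentence: str                # sentence in which it was found
    proposed: str                # value proposed by the program: "Yes" or "No"
    status: str = "pending"      # becomes "approved" or "altered" after review
    final: Optional[str] = None  # value confirmed by the clinician


def highlight(sentence, phrase, open_mark="[[", close_mark="]]"):
    """Wrap every case-insensitive occurrence of phrase in plain-text markers."""
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    return pattern.sub(lambda m: open_mark + m.group(0) + close_mark, sentence)


def review(finding, clinician_value):
    """Record the clinician's decision and whether it changed the proposal."""
    finding.final = clinician_value
    finding.status = "approved" if clinician_value == finding.proposed else "altered"
    return finding


if __name__ == "__main__":
    f = PendingFinding(
        field="hemorrhage",
        phrase="hemorrhage",
        sentence="There is no parenchymal hemorrhage or mass lesion.",
        proposed="No",
    )
    print(highlight(f.sentence, f.phrase))  # what the clinician would be shown
    print(review(f, "No"))
```

Keeping both `proposed` and `final` means you can later measure how often the clinician had to overrule the program.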

You will also find that clinicians have a small, well-defined vocabulary, so once a phrase has been decoded and checked by a clinician it can be found again in other files and processed accordingly.
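Reusing phrases that a clinician has already checked could be sketched like this in Python. It assumes exact matching after lower-casing and whitespace collapsing, which is only a starting point; `ApprovedPhrases` and `normalize` are hypothetical names, and the in-memory dictionary stands in for whatever store you actually use.

```python
import re


def normalize(text):
    """Lower-case and collapse runs of whitespace so line wraps do not matter."""
    return re.sub(r"\s+", " ", text.strip().lower())


class ApprovedPhrases:
    """In-memory store of phrases a clinician has already decoded."""

    def __init__(self):
        self._decisions = {}  # normalized phrase -> (field, value)

    def approve(self, phrase, field, value):
        self._decisions[normalize(phrase)] = (field, value)

    def apply(self, text):
        """Return {field: value} for every approved phrase present in text.

        If several approved phrases map to the same field, the one stored
        last wins; a real system should flag such conflicts for review.
        """
        normalized = normalize(text)
        results = {}
        for phrase, (field, value) in self._decisions.items():
            if phrase in normalized:
                results[field] = value
        return results


if __name__ == "__main__":
    store = ApprovedPhrases()
    store.approve("no parenchymal hemorrhage", "hemorrhage", "No")
    new_report = "There is no parenchymal\nhemorrhage or mass lesion."
    print(store.apply(new_report))
```

Anything in a new report that matches no approved phrase would still go through the pending-record review.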
Mohamed Kamal 16-Sep-14 6:49am    
I'm thinking of displaying / highlighting ambiguous phrases like the one you mentioned so that the operator can choose the right interpretation, then storing the pattern in a database. What do you think? @duncan
Duncan Edwards Jones 16-Sep-14 7:12am    
Yes - it is going to need manual oversight/intervention. What sort of volume of records are you talking - 10s, or 1000s?
Duncan Edwards Jones 16-Sep-14 9:56am    
I've updated my idea a bit - mapReduce is probably overkill
Mohamed Kamal 16-Sep-14 14:30pm    
thanks a lot
Sergey Alexandrovich Kryukov 16-Sep-14 10:13am    
Well, for notions such as "no obvious signs of concussion", fuzzy set theory exists.

But the problem is different. The real problem is that the consultant may write something unrelated to the notion of "concussion" that is still extremely important; no classifier can foresee it.
At the present-day technology level, semantic analysis of natural language is nearly hopeless, but here the problem is even more difficult.
There are legal (liability) as well as scientific reasons why one should not attempt to automate the extraction of summary data from clinical reports like this one. Automatically extracted data, used by medical staff without the training to appreciate the subtleties involved, could lead to negligent, or even fatal, patient care.

What you propose goes far beyond "extracting structured data": you are proposing to extract clinically significant "meaning" from complex data, a task that is at the "frontier" of Artificial Intelligence.

The correct strategy would be to have a form for the clinician (probably a neurologist, in this case) to fill out in which they give estimated percentages of probability for presence/absence of discrete pathologies.
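To make the form idea concrete, here is one possible shape for the record such a form would produce, as a small Python sketch. The field names, the 0-100 percentage scale check and the classes `PathologyEstimate` and `ReportForm` are assumptions for illustration only; which pathologies appear on the form is a clinical decision.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PathologyEstimate:
    pathology: str          # e.g. "hydrocephalus"
    probability_pct: float  # clinician's estimated probability of presence, 0-100

    def __post_init__(self):
        if not 0 <= self.probability_pct <= 100:
            raise ValueError(
                f"probability for {self.pathology!r} must be between 0 and 100"
            )


@dataclass
class ReportForm:
    report_id: str  # hypothetical identifier of the CT report
    estimates: List[PathologyEstimate] = field(default_factory=list)

    def add(self, pathology, probability_pct):
        self.estimates.append(PathologyEstimate(pathology, probability_pct))

    def as_row(self):
        """Flatten the form into one table row: pathology -> percentage."""
        return {e.pathology: e.probability_pct for e in self.estimates}


if __name__ == "__main__":
    form = ReportForm("demo-1")
    form.add("hydrocephalus", 80)
    form.add("hemorrhage", 5)
    print(form.as_row())
```

Because the clinician enters the numbers directly, nothing has to be inferred from free text, and the row can go straight into the structured table.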

Words like "Impression," and "Correlate with," are there for a reason: to qualify the assertions that follow as tentative, and to indicate that the findings/observations need to be interpreted after further specific investigations.
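To show why a bare Yes / No loses this qualification, here is a small Python sketch that separates tentative statements from plain assertions. The cue list is an assumption drawn from the wording of the sample report, not a clinical standard, and `HEDGE` and `classify_sentence` are hypothetical names.

```python
import re

# Assumed hedging cues, taken from the wording of the sample report.
HEDGE = re.compile(
    r"\b(cannot be excluded|correlate with|to exclude|may|most compatible with)\b"
)


def classify_sentence(sentence, term):
    """Return None if term is absent, 'tentative' if hedged, else 'asserted'."""
    lowered = sentence.lower()
    if term.lower() not in lowered:
        return None
    if HEDGE.search(lowered):
        return "tentative"
    return "asserted"


if __name__ == "__main__":
    sentences = [
        "Hydrocephalus without evidence of obstructing mass lesion.",
        "Acute hydrocephalus cannot be excluded since there are no prior "
        "studies available for comparison.",
        "Correlate with clinical symptoms to exclude normal pressure hydrocephalus.",
    ]
    for s in sentences:
        print(classify_sentence(s, "hydrocephalus"), "<-", s)
```

In the sample report, two of the three sentences mentioning hydrocephalus in the closing paragraph are hedged, so a single "Hydrocephalus = Yes" column would hide most of what the radiologist actually wrote.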

I suggest you re-design your project.
George Jonsson 16-Sep-14 8:49am    
My 5.
I was going to ask which hospital this is meant for, so I can make sure to avoid it.
BillWoodruff 16-Sep-14 8:52am    
Thanks, George ! I'd avoid that hospital, also :)
Duncan Edwards Jones 16-Sep-14 8:53am    
My guess would be that this is probably for research/training purposes, not primary care...
George Jonsson 16-Sep-14 9:06am    
One can always hope. :)
Mohamed Kamal 16-Sep-14 9:13am    
Dear all, it is exactly as you said: this is for research/training purposes, not primary care. So far it is a virtual, home-made project with no patients involved at all. The advice to "redesign" my project does not apply, since the "project" does not exist yet. I used the tm library in R and it gave me somewhat acceptable results, but I was hoping for something more accurate. Can you help me?

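R code below; the variable "crude" holds the report text:

library(tm)  # text-mining package that provides Corpus() and VectorSource()
corpus <- Corpus(VectorSource(crude))  # one document per element of "crude"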

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
