Read non-ASCII characters of TIFF tag values in Java

Question

I am trying to read different tag values (like tags 259 (Compression), 33432 (Copyright), 306 (DateTime), 315 (Artist) etc.) from a TIFF image in Java 11.

What I have tried:

I tried with ImageIO like the following:

Java

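// javax.imageio.plugins.tiff is available from Java 9 onwards
import java.io.File;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.metadata.IIOMetadata;
import javax.imageio.plugins.tiff.TIFFDirectory;
import javax.imageio.plugins.tiff.TIFFField;
import javax.imageio.stream.ImageInputStream;

File tiffFile = new File(tiffFileName);

ImageInputStream input = ImageIO.createImageInputStream(tiffFile);
ImageReader reader = ImageIO.getImageReaders(input).next();

reader.setInput(input);
IIOMetadata metadata = reader.getImageMetadata(0);

TIFFDirectory ifd = TIFFDirectory.createFromMetadata(metadata);
TIFFField myTag = ifd.getTIFFField(33432); // 33432 = Copyright
String tagString = myTag.getAsString(0);
// problem here

// Map German umlauts and sharp s to their ASCII equivalents
//String[][] replacements = { { "ä", "ae" }, { "ü", "ue" }, { "ö", "oe" }};
String[][] replacements = {{"\u00C4", "Ae"}, {"\u00DC", "Ue"}, {"\u00D6", "Oe"},
        {"\u00E4", "ae"}, {"\u00FC", "ue"}, {"\u00F6", "oe"}, {"\u00DF", "ss"}};

for (String[] replacement : replacements) {
    // replace() does a literal replacement; replaceAll() would treat the key as a regex
    tagString = tagString.replace(replacement[0], replacement[1]);
}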

But it does not give the exact value of the tag. For non-ASCII values (ö, ü, ä etc.), question marks replace the real characters. TIFFField.getAsString(0) returns values like Universit�t, but I want Universität.

Can anyone tell me how to get the byte values of the tag and then decode them as UTF-8 to get the exact tag values?

Suggestions for an alternative Java library for reading TIFF images are also welcome. I just need to read the exact tag values, including non-ASCII characters.

Posted 6-Nov-20 0:27am

Member 12213239

Updated 11-Nov-20 2:35am

v6

Comments

Richard MacCutchan 6-Nov-20 7:32am

The values are correct, it is your display code that is producing the strange characters. You need to know the language that is being used in the text and adjust your display font to match it.

Member 12213239 6-Nov-20 8:58am

Any idea how to handle the display font?

Richard MacCutchan 6-Nov-20 11:20am

That depends on how you are displaying the results.

Member 12213239 6-Nov-20 11:45am

I want to replace the umlauts (ä, ö, and ü) with equivalent characters like ae, oe and ue. My problem here is that TIFFField.getAsString(0) returns values like Universit�t, not the exact value Universität. Can you tell me specifically how to get the exact value, including the umlaut?

Richard MacCutchan 6-Nov-20 12:02pm

No, they do not return "Universit�t", that is produced by you trying to display a character in a font that has no equivalent for that character's value. You need to examine the character's actual value. It is no use trying to print it and hoping for the best. Look at the Character Map application in the Windows Accessories folder on the start menu. That will show you what characters are equivalent in different language fonts.

Member 12213239 6-Nov-20 12:29pm

I am not printing the values here. When I debug the code (in IntelliJ IDEA), it shows Universit�t, not the exact value Universität. I just need to read the tag value and replace the umlauts with equivalent characters (like ae for ä, ue for ü) in the string. If I can't read the umlaut, I can't replace it with the equivalent value. Can you give any hints on how to read the exact umlaut here?

Richard MacCutchan 6-Nov-20 12:38pm

Stop looking at them as displayed, and look at the actual numeric value of the character, that is what determines what will be displayed. For example, in Unicode the character ä has the value 0x00E4. And if your display font set uses a different mapping then you will get whatever character is at that value in the font set. All of this information can be found in the Character Map application I referred to above.
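A minimal sketch of what inspecting the numeric values could look like in Java. The class and method names (`CharDump`, `dump`, `decodeAsAscii`) are made up for illustration, and the byte 0x84 is only an assumed example value. It prints each UTF-16 code unit of a string in hex, and shows that decoding a byte above 0x7F as US-ASCII yields U+FFFD, the Unicode replacement character. If a dump of the tag string shows U+FFFD rather than U+00E4, the original byte was already replaced during decoding and cannot be recovered from the String itself.

```java
import java.nio.charset.StandardCharsets;

public class CharDump {
    /** Returns every UTF-16 code unit of s as "U+XXXX", separated by spaces. */
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    /** Decodes raw bytes as US-ASCII; bytes above 0x7F cannot be mapped. */
    static String decodeAsAscii(byte[] raw) {
        return new String(raw, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // A correctly decoded a-umlaut is the single code unit U+00E4.
        System.out.println(dump("t\u00E4t"));

        // Hypothetical raw bytes: 't', 0x84, 't' (0x84 is not valid ASCII).
        byte[] raw = {0x74, (byte) 0x84, 0x74};
        System.out.println(dump(decodeAsAscii(raw)));
    }
}
```

Calling `dump(tagString)` on the value returned by `getAsString(0)` would show which of the two cases applies.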

Richard MacCutchan 7-Nov-20 3:45am

A string is just a sequence of numeric character codes. ASCII characters fit into a single 8-bit byte, and Java stores Unicode text as 16-bit UTF-16 code units. I suggest you get a book on computer basics and learn how data is stored and manipulated.

Member 12213239 8-Nov-20 11:51am

I am new to Java programming. Can you help, please?

Richard MacCutchan 8-Nov-20 12:05pm

This has nothing to do with Java, it is about understanding computers and how data is stored and manipulated, and what each byte, or sequence of bytes, may represent. If you do not understand the basics you are going to struggle more and more.

Member 12213239 8-Nov-20 14:00pm

I got your point. Here I am using replaceAll() to replace the umlauts in the string. But myTag.getAsString(0) is not returning the exact value. What am I missing here? How can I manipulate the string differently?

Richard MacCutchan 9-Nov-20 3:52am

What do you mean by "But myTag.getAsString(0) is not returning the exact value"? I cannot guess what is happening in your system.

Member 12213239 9-Nov-20 5:18am

Please have a look at my updated code above. Here myTag.getAsString(0) returns values like Universit�t, but the exact value is Universität. How can I replace the umlaut if I don't get the exact value? Can you please tell me how to access the byte values and replace the umlauts with the equivalent values, like ae for ä, ue for ü and so on?

Richard MacCutchan 9-Nov-20 5:21am

At the risk of repeating myself ad nauseam: look at the actual values of each character in the returned data.

Member 12213239 9-Nov-20 16:58pm

you mean like this ? String[][] umlautReplacements = { { "\u00C4", "Ae" }, { "\u00DC", "Ue" }, { "\u00D6", "Oe" }, { "\u00E4", "ae" }, { "\u00FC", "ue" }, { "\u00F6", "oe" }, { "\u00DF", "ss" } };

Richard MacCutchan 10-Nov-20 3:27am

Yes.

Member 12213239 9-Nov-20 17:00pm

I checked each character using its Unicode (hexadecimal) value, but it shows the same result.

Richard MacCutchan 10-Nov-20 3:33am

Sorry, I have no idea what that means. I have just tested your code and it works correctly.

Member 12213239 8-Nov-20 14:33pm

I tried to convert the string into a byte array and replace the umlaut, but it is not working.

Member 12213239 8-Nov-20 15:31pm

Can you give me a few hints?

1 solution

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Patrice T · Answer 1 · 2020-11-09T14:03:00

Quote:
Can anyone tell me how to get byte values of the tag, then decode it with utf-8 to get the exact tag values ?

First, you need to understand that before Unicode (in the DOS era), the codes between 128 and 255 were used for special characters, with code pages to handle different character sets.
ASCII Code - The extended ASCII table
One of the reasons TIFF uses this is that TIFF was created before Unicode/UTF existed; at the time, a way was needed to encode non-ASCII characters.
- So to know what was read, you need to display it as hexadecimal.
Your read is probably: 55 6E 69 76 65 72 73 69 74 84 74; in DOS code pages such as CP437, ä is encoded as 84.
- You need to understand how your data is encoded and then call a function that converts it to the encoding your app uses.
- If you want to update this data, you will need to do the conversion in reverse.

In your case, you probably need a conversion from CP437 to UTF-8.
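A rough sketch of these steps in Java, assuming the raw tag bytes are the ones guessed above and that the JDK in use provides the `IBM437` charset (the Java name for CP437; it is not one of the charsets every JVM is guaranteed to support). The class and method names (`TagDecode`, `toHex`, `decode`, `transliterate`) are invented for this example, and how the raw bytes are obtained from the TIFF file is left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TagDecode {
    /** Formats bytes as space-separated upper-case hex, e.g. "84 74". */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Decodes raw bytes using the named charset (assumed to be supported). */
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    /** Replaces German umlauts and sharp s with ASCII equivalents. */
    static String transliterate(String s) {
        String[][] replacements = {{"\u00C4", "Ae"}, {"\u00DC", "Ue"}, {"\u00D6", "Oe"},
                {"\u00E4", "ae"}, {"\u00FC", "ue"}, {"\u00F6", "oe"}, {"\u00DF", "ss"}};
        for (String[] replacement : replacements) {
            s = s.replace(replacement[0], replacement[1]);
        }
        return s;
    }

    public static void main(String[] args) {
        // Assumed raw tag bytes, taken from the hex guess in the answer.
        byte[] raw = {0x55, 0x6E, 0x69, 0x76, 0x65, 0x72, 0x73, 0x69, 0x74, (byte) 0x84, 0x74};
        System.out.println(toHex(raw));

        // Decode with the assumed source encoding, then transliterate.
        String text = decode(raw, "IBM437");
        System.out.println(transliterate(text));

        // The reverse direction: encode the decoded text as UTF-8 bytes.
        System.out.println(toHex(text.getBytes(StandardCharsets.UTF_8)));
    }
}
```

If the hex dump of the real file shows a different byte for ä (for example E4, or the pair C3 A4), the charset name passed to `decode` would have to change accordingly, for instance to `ISO-8859-1` or `UTF-8`.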