How do I solve this error

Question

Python

import nltk
import random
#from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from nltk.classify import ClassifierI
from statistics import mode
from nltk.tokenize import word_tokenize



class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf
    
short_pos = open("positive.txt","r").read()
short_neg = open("negative.txt","r").read()

I am getting this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 4645: ordinal not in range(128)

How can I fix this?

What I have tried:

I have tried changing the file, and it works. However, with positive.txt it's not working.

Posted 27-Jan-18 23:00pm

Member 13647869

Updated 27-Jan-18 23:58pm

Comments

Kornfeld Eliyahu Peter 28-Jan-18 5:11am

Work with unicode or local?

Member 13647869 28-Jan-18 5:21am

I am not quite sure what your question means. Could you please explain?

2 solutions

Solution 2

The problem is that you are trying to open a file whose text is not encoded in ASCII... Unless you tell Python which encoding to use, open() falls back to the platform's preferred encoding (ASCII in your environment, judging by the error message) and fails on the first byte it cannot decode...
Add the 'encoding' parameter to your open() call to solve the problem...
2. Built-in Functions — Python 3.6.4 documentation[^]
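
A minimal sketch of that suggestion, assuming the file really is UTF-8. The file name sample_utf8.txt and its contents are made up for the demonstration and stand in for positive.txt:

```python
import os
import tempfile

# Hypothetical sample file; it stands in for positive.txt.
path = os.path.join(tempfile.mkdtemp(), "sample_utf8.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("una película fantástica\n")

# Passing encoding= makes decoding independent of the platform default.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

print(text)
```

If the guess is wrong, read() raises UnicodeDecodeError, which is what happened later in this thread with negative.txt.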

Posted 27-Jan-18 23:54pm

Kornfeld Eliyahu Peter

Comments

Member 13647869 28-Jan-18 5:59am

OK, please bear with me, I am new to all of this. As you said, I have to use the encoding parameter. I thought I should use utf-8, but it didn't work:

I wrote this:
short_pos = open("positive.txt", "r",encoding='utf-8').read()
short_neg = open("negative.txt","r",encoding='utf-8').read()

and got this error:
short_neg = open("negative.txt","r",encoding='utf-8').read()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 3118: invalid start byte

Kornfeld Eliyahu Peter 28-Jan-18 6:06am

If I understand correctly, 'positive.txt' opens with utf-8, but 'negative.txt' does not?!
It seems your files are encoded differently (maybe they come from different sources...)...
As there is no foolproof way to determine the encoding of a text file, you have to resort to try/except...
You have to set up a list of possible encodings and try each of them until one succeeds or the list is exhausted...
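
A rough sketch of that try-each-encoding idea. The function name read_with_fallback and the candidate list are assumptions chosen for illustration; the sample bytes are invented, but they contain the same 0x97 byte that the traceback above complains about:

```python
def read_with_fallback(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes raw without error."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked")


# 0x97 is not a valid UTF-8 start byte, but it is a long dash in Windows-1252.
sample = b"slow \x97 and dull"
text, used = read_with_fallback(sample)
print(used, text)
```

For a real file, the bytes would come from open("negative.txt", "rb").read(). Keep latin-1 last, since it accepts every byte sequence. A successful decode only shows that the bytes are valid in that encoding, not that the guess is correct, so check the result by eye.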

Member 13647869 28-Jan-18 6:09am

The files were given to me, so I do not know which encoding was used. I changed only positive.txt to see the difference, that is, to compare my assumption with the original behaviour. I wanted to see whether reading positive.txt as utf-8 would work, so I left negative.txt the same.

Member 13647869 28-Jan-18 7:39am

Thank you, Peter, for your help!!!!!!

Kornfeld Eliyahu Peter 28-Jan-18 7:51am

You are welcome!

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

OriginalGriff · Accepted Answer · 2018-01-27T23:26:00

Solution 1

The ASCII character set only accepts values in the range 0 to 127 inclusive: your byte value of 0xF3 - a hexadecimal value equivalent to 243 in decimal - is outside that range and cannot be translated to an ASCII character.
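
As a small illustration of that range check (a sketch, not taken from the original code), the single byte 0xF3 can be decoded in isolation. Latin-1 is used here only as an example of a one-byte encoding that does cover it:

```python
byte_value = 0xF3
print(byte_value)  # 243

try:
    bytes([byte_value]).decode("ascii")
    decoded_ok = True
except UnicodeDecodeError as exc:
    # Same kind of error as in the question: the byte is above 127.
    decoded_ok = False
    print(exc)

# In Latin-1 the same byte is a valid character.
as_latin1 = bytes([byte_value]).decode("latin-1")
print(as_latin1)
```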

You are trying to read data as text, but the file does not contain the "right data" - I'd suggest you check the file content, and probably read it as binary data instead of text: Working with Binary Data in Python | DevDungeon[^]

Posted 27-Jan-18 23:26pm

OriginalGriff

Comments

Member 13647869 28-Jan-18 5:38am

The file contains text and numbers, so if I read it as binary data, it wouldn't work, would it? As in, I can't use the split lines function and so on.

OriginalGriff 28-Jan-18 5:59am

But "straight text" is not all that it holds - you need to look at it closely and find out whether the problem is your assumption (that it's just text) or the wrong encoding. As Peter says, if it's Unicode you need to read it as that, or it falls back to the much more limited ASCII set. But ... character 0xF3 in Unicode is ó - an accented "o" - and given where you are that's probably not a likely character to get in a string!

Look at the data files, and work out what you need to do with them.

Member 13647869 28-Jan-18 6:02am

It's not just text; as I said, it contains text AND numbers. What I am trying to do is use this file to create classifiers for sentiment analysis.

OriginalGriff 28-Jan-18 6:12am

That's not a distinction you should be making. You are assuming that "Number 98765" is "text and numbers" and it isn't - it's all text, as the number 98765 is stored in the file as a sequence of readable characters. If it was stored in the file as a number, it would be stored as four bytes: 0x00, 0x01, 0x81, 0xCD - each occupying a "single character space".

"Text" is "anything human readable".
"Binary" is "anything machine readable, but probably not immediately human readable".
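
To make the text/binary distinction concrete, here is a short sketch that stores 98765 both ways. The big-endian, 4-byte unsigned layout is an assumption chosen to match the byte order quoted above:

```python
import struct

as_text = "98765".encode("ascii")     # five bytes, one per digit character
as_binary = struct.pack(">I", 98765)  # four bytes, big-endian unsigned int

print(as_text)
print(as_binary.hex())
```

The text form is readable in any editor; the binary form contains bytes such as 0x00 and 0x81 that are not printable ASCII.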

You need to examine your files, and see exactly what they contain. Starting with a Hex Editor is a good beginning - anything under 0x20 or above 0x7F may indicate it's a binary file (except for 0x0A and 0x0D)
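
If no hex editor is at hand, a few lines of Python can do a similar scan. This is only a sketch of the rule of thumb above; the helper name suspicious_bytes and the sample bytes are invented for the example:

```python
def suspicious_bytes(raw):
    """Return (offset, byte) pairs for bytes below 0x20 or above 0x7F, ignoring LF and CR."""
    allowed = {0x0A, 0x0D}
    return [(i, b) for i, b in enumerate(raw)
            if (b < 0x20 or b > 0x7F) and b not in allowed]


# Invented sample containing the two bytes mentioned in this thread.
sample = b"plain line\r\ncaf\xf3 \x97 end\n"
for offset, value in suspicious_bytes(sample):
    print(f"offset {offset}: 0x{value:02X}")
```

For a real file, pass open("negative.txt", "rb").read(). A few bytes above 0x7F in otherwise readable text usually point to a non-ASCII text encoding, not to binary data.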

Member 13647869 28-Jan-18 6:17am

That was my original assumption, which is why the code reads the file in "r" mode (short_pos = open("positive.txt","r").read()), but then the ASCII codec gave me that error.
I am very confused, as every person I ask gives me a different answer.

Member 13647869 28-Jan-18 6:19am

What if I post the file, or at least part of it?

OriginalGriff 28-Jan-18 6:25am

That's probably because we can't see your data!
So we know it's not ASCII...

Copy the whole file to Dropbox, and post the link. Don't edit it: what you may do is change the encoding with the editor you use and that means we don't look at the same file.

Member 13647869 28-Jan-18 6:26am

OK, just give me a few minutes! I will create one link for the full code, and one for the data.

Member 13647869 28-Jan-18 6:36am

Data:
https://www.dropbox.com/sh/qrrzdnt6aayphlh/AADQVUpymD75qReuaIGL7MUGa?dl=0

Code:

https://www.dropbox.com/sh/4qmj43g3ub2of9g/AAB-gLeE2EkpckG8-qbXqpVoa?dl=0

OriginalGriff 28-Jan-18 7:00am

OK: Positive.txt is UTF-8.
Negative? Dunno, you didn't give us that one...

Member 13647869 28-Jan-18 7:06am

Perfect! While I share the negative.txt file, could you please advise me how to change my code to accommodate the utf-8 encoding? I am using Python 3.6.4.

Member 13647869 28-Jan-18 7:07am

https://www.dropbox.com/sh/qrrzdnt6aayphlh/AADQVUpymD75qReuaIGL7MUGa?dl=0

OriginalGriff 28-Jan-18 7:27am

Negative.txt looks like UTF-8 as well - and reads fine as UTF-8 in C#.
I'd go back to where you got the files from and ask them.
I've read it as UTF-8, and written it back as UTF-8 so it gets a BOM:
https://www.dropbox.com/s/duymuob45rykmh4/negativeA.txt?dl=0
Have a look at that and see if it makes it work as utf-8.

You already have the code to do that - you showed it in your comment to Peter!
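
One detail worth knowing when reading a file that has a BOM: Python's "utf-8" codec keeps the BOM as a leading U+FEFF character, while "utf-8-sig" strips it. The sketch below uses a made-up temporary file, not the negativeA.txt from the link:

```python
import os
import tempfile

# Hypothetical file written with a BOM, standing in for the re-saved file.
path = os.path.join(tempfile.mkdtemp(), "with_bom.txt")
with open(path, "w", encoding="utf-8-sig") as f:  # utf-8-sig writes the BOM first
    f.write("not bad at all\n")

with open(path, "r", encoding="utf-8") as f:
    plain = f.read()      # BOM survives as a leading U+FEFF

with open(path, "r", encoding="utf-8-sig") as f:
    stripped = f.read()   # BOM removed

print(repr(plain))
print(repr(stripped))
```

If the first token of the file matters, for example when the text is passed to word_tokenize, reading with "utf-8-sig" avoids a stray U+FEFF attached to the first word.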

Member 13647869 28-Jan-18 7:32am

I was not sure if the code was correct, let me try and see

Member 13647869 28-Jan-18 7:39am

THANK YOU VERY MUCH. I AM VERY GRATEFUL!

OriginalGriff 28-Jan-18 7:52am

You're welcome!