I am running the code below:
Python
# Import modules
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

"""
This module is a string prediction model using LSTM.
It takes a file of strings composed of digits from 0 to 9 and splits them into input and target sequences.
The input sequence is the first five characters and the target sequence is the last five characters shifted by one position.
The model learns to predict the next character in the sequence given the previous five characters.
"""

# Define some constants
VOCAB_SIZE = 10 # number of possible tokens (digits from 0 to 9)
EMBED_SIZE = 32 # size of the embedding vectors
RNN_UNITS = 32 # size of the LSTM output vectors
BATCH_SIZE = 20 # number of sequences to process in each batch

# Import pandas
import pandas as pd

# Define the file path and name
file_path = "C:\\Users\\PC-1\\Desktop\\stringpred.txt"

# Read the file into a DataFrame using pandas.read_csv function
df = pd.read_csv(file_path, header=None)

# Convert the strings in the DataFrame to numeric values by removing the spaces and using pd.to_numeric function
df = df.apply(lambda x: pd.to_numeric(x.str.replace(" ", "")))

# Convert the DataFrame to a numpy array using df.values attribute
arrays = df.values

# Define a function to split the arrays into input and target sequences
def split_sequences(arrays):
    """
    This function splits each array into an input sequence and a target sequence.
    The input sequence is the first five characters and the target sequence is the last five characters shifted by one position.
    """
    # Initialize empty lists to store the input and target sequences
    input_sequences = []
    target_sequences = []

    # Loop over each array in the list
    for a in arrays:
        # Slice the array into input and target sequences
        input_sequence = a[:-1]
        target_sequence = a[1:]

        # Append the sequences to the corresponding lists
        input_sequences.append(input_sequence)
        target_sequences.append(target_sequence)

    return input_sequences, target_sequences

# Split the arrays into input and target sequences using split_sequences function
input_sequences, target_sequences = split_sequences(arrays)

# Split the data into training and testing sets with a ratio of 0.8:0.2 using train_test_split function from sklearn module
X_train, X_test, y_train, y_test = train_test_split(input_sequences, target_sequences, test_size=0.2, random_state=42)

# Reshape the input and target sequences into two-dimensional arrays using np.reshape function from numpy module
X_train = np.reshape(X_train, (-1, 5))
y_train = np.reshape(y_train, (-1, 5))
X_test = np.reshape(X_test, (-1, 5))
y_test = np.reshape(y_test, (-1, 5))

# Add some padding cells to the X_test and y_test arrays until they are divisible by 5 using np.pad function from numpy module
X_test = np.pad(X_test, (0, 5 - len(X_test) % 5), mode="constant")
y_test = np.pad(y_test, (0, 5 - len(y_test) % 5), mode="constant")

# Convert the input and target arrays to numpy arrays of float32 data type using np.asarray function from numpy module
X_train = np.asarray(X_train, dtype=np.float32)
y_train = np.asarray(y_train, dtype=np.float32)
X_test = np.asarray(X_test, dtype=np.float32)
y_test = np.asarray(y_test, dtype=np.float32)

# Define a function to generate a new string given a seed string
def generate_string(seed, model, subarrays):
    """
    This function generates a new string given a seed string using the trained model.
    It predicts the probabilities for the next token using the model and samples from them or takes the most likely token.
    It updates the seed array with the new token and repeats this process for six positions in the sequence.
    It returns the generated string as a concatenation of the tokens.
    """
    # Convert the seed string to an array of tokens
    seed_array = np.array([int(c) for c in seed])

    # Initialize an empty list to store the generated tokens
    output_array = []

    # Loop for six positions in the sequence
    for i in range(6):
        # Predict the probabilities for the next token using the model
        # Loop over the subarrays and concatenate the results
        probs = np.concatenate([model.predict(sub) for sub in subarrays], axis=0)

        # Sample from the probabilities or take the most likely token
        # Here we use sampling for more diversity, but you can change it as you like
        next_token = np.random.choice(VOCAB_SIZE, p=probs[0, -1])

        # Append the token to the output list
        output_array.append(next_token)

        # Update the seed array with the new token
        seed_array = np.append(seed_array[1:], next_token)

    # Convert the output list to a string and return it
    output_string = "".join(map(str, output_array))
    return output_string

# Read the file into a DataFrame using pandas.read_csv function
df = pd.read_csv(file_path, header=None)

# Convert the strings in the DataFrame to numeric values by removing the spaces and using pd.to_numeric function
df = df.apply(lambda x: pd.to_numeric(x.str.replace(" ", "")))

# Convert the DataFrame to a numpy array using df.values attribute
arrays = df.values

# Define a function to split the arrays into input and target sequences
def split_sequences(arrays):
    """
    This function splits each array into an input sequence and a target sequence.
    The input sequence is the first five characters and the target sequence is the last five characters shifted by one position.
    """
    # Initialize empty lists to store the input and target sequences
    input_sequences = []
    target_sequences = []

    # Loop over each array in the list
    for a in arrays:
        # Slice the array into input and target sequences
        input_sequence = a[:-1]
        target_sequence = a[1:]

        # Append the sequences to the corresponding lists
        input_sequences.append(input_sequence)
        target_sequences.append(target_sequence)

    return input_sequences, target_sequences

# Define a function to generate a new string given a seed string
def generate_string(seed, model, subarrays):
    """
    This function generates a new string given a seed string using the trained model.
    It predicts the probabilities for the next token using the model and samples from them or takes the most likely token.
    It updates the seed array with the new token and repeats this process for six positions in the sequence.
    It returns the generated string as a concatenation of the tokens.
    """
    # Convert the seed string to an array of tokens
    seed_array = np.array([int(c) for c in seed])

    # Initialize an empty list to store the generated tokens
    output_array = []

    # Loop for six positions in the sequence
    for i in range(6):
        # Predict the probabilities for the next token using the model
        # Loop over the subarrays and concatenate the results
        probs = np.concatenate([model.predict(sub) for sub in subarrays], axis=0)

        # Sample from the probabilities or take the most likely token
        # Here we use sampling for more diversity, but you can change it as you like
        next_token = np.random.choice(VOCAB_SIZE, p=probs[0, -1])

        # Append the token to the output list
        output_array.append(next_token)

        # Update the seed array with the new token
        seed_array = np.append(seed_array[1:], next_token)

    # Convert the output list to a string and return it
    output_string = "".join(map(str, output_array))

    return output_string

# Read and convert the strings from the file using read_strings function
# arrays = read_strings(file_path)

# Split the arrays into input and target sequences using split_sequences function
input_sequences, target_sequences = split_sequences(arrays)

# Split the data into training and testing sets with a ratio of 0.8:0.2 using train_test_split function from sklearn module
X_train, X_test, y_train, y_test = train_test_split(input_sequences, target_sequences, test_size=0.2, random_state=42)

# Reshape the input and target sequences into two-dimensional arrays using np.reshape function from numpy module
X_train = np.reshape(X_train, (-1, 5))
y_train = np.reshape(y_train, (-1, 5))
X_test = np.reshape(X_test, (-1, 5))
y_test = np.reshape(y_test, (-1, 5))

# Add some padding cells to the X_test array until it is divisible by 5 using np.pad function from numpy module
X_test = np.pad(X_test, (0, 5 - len(X_test) % 5), mode="constant")
y_test = np.pad(y_test, (0, 5 - len(y_test) % 5), mode="constant")

# Convert the input and target arrays to numpy arrays of float32 data type using np.asarray and astype functions from numpy module
X_train = np.asarray(X_train, dtype=np.float32)
y_train = np.asarray(y_train, dtype=np.float32)
X_test = np.asarray(X_test, dtype=np.float32)
y_test = np.asarray(y_test, dtype=np.float32)

# Split the X_test array into subarrays of size 5 using np.array_split function from numpy module
subarrays = np.array_split(X_test, len(X_test) / 5)

# Define the model architecture using keras.Sequential class from tensorflow module
model = keras.Sequential([
    # Embedding layer that maps tokens to vectors using layers.Embedding class from tensorflow module
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_SIZE),
    # LSTM layer that processes the embedded vectors using layers.LSTM class from tensorflow module
    layers.LSTM(units=RNN_UNITS, return_sequences=True),
    # Dense layer that outputs probabilities over tokens using layers.Dense class from tensorflow module
    layers.Dense(units=VOCAB_SIZE, activation="softmax")
])

# Compile the model with loss and optimizer using model.compile method from tensorflow module
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

# Train the model for some epochs using model.fit method from tensorflow module
model.fit(X_train, y_train, batch_size=BATCH_SIZE, epochs=10)

# Test the generate_string function with some seed strings
print(generate_string("55420", model, subarrays))
print(generate_string("13120", model, subarrays))
print(generate_string("25050", model, subarrays))


Initially, I got this error message every time I ran the code (three times in a row):
Traceback (most recent call last):
  File "C:/Users/PC-1/Desktop/String Predict ver03-A-1.py", line 182, in <module>
    arrays = read_strings(file_path)
NameError: name 'read_strings' is not defined


That refers to this line here:
Python
arrays = read_strings(file_path)


...so I turned that line into a comment so that it would not break the execution, then ran the code again.

Now it gives me this error message:
Python
Epoch 1/10
Traceback (most recent call last):
  File "C:/Users/PC-1/Desktop/String Predict ver03-A-1.py", line 223, in <module>
    model.fit(X_train, y_train, batch_size=BATCH_SIZE, epochs=10)
  File "C:\Users\PC-1\AppData\Local\Programs\Python\Python311\Lib\site-packages\keras\src\utils\traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\PC-1\AppData\Local\Programs\Python\Python311\Lib\site-packages\keras\src\engine\training.py", line 1754, in fit
    raise ValueError(
ValueError: Unexpected result of `train_function` (Empty logs). This could be due to issues in input pipeline that resulted in an empty dataset. Otherwise, please use `Model.compile(..., run_eagerly=True)`, or `tf.config.run_functions_eagerly(True)` for more information of where went wrong, or file a issue/bug to `tf.keras`.


I am at my wits' end here. Can anyone tell me what to fix?

If it helps clarify my problem, the code is meant to solve this particular programming problem:

Create Python source code that predicts the next unique string to appear, based on a list of six-character strings whose characters are digits from 0 to 5, stored in the Windows text file "stringpred.txt". As an example of what the list of strings looks like, refer to the section below:
...
5 5 4 2 0 5
5 4 1 4 5 5
4 4 4 2 2 0
1 3 1 2 0 1
1 2 4 4 5 5
3 2 1 4 5 5
5 1 5 2 5 4
0 1 5 5 5 4
3 3 1 5 3 5
5 3 3 4 3 5
0 5 3 3 0 2
3 3 0 3 5 1
5 2 2 5 4 0
3 4 3 5 2 3
4 5 2 3 4 5
3 0 4 4 5 5
2 1 2 4 5 5
4 3 0 0 1 5
4 3 2 2 2 4
2 5 0 5 0 3
3 5 1 3 4 4
...

Format the output as:
   "The next predicted string will be: "

As an example:
3 0 4 4 5 5
2 1 2 4 5 5
4 3 0 0 1 5
4 3 2 2 2 4
2 5 0 5 0 3

The next predicted string will be: 3 5 1 3 4 4


If this is really hard to solve, which other forum sites can I go to for help with this roadblock?

What I have tried:

Well, I tried turning this line into a comment:
Python
arrays = read_strings(file_path)


...in the hope that it would not break the execution. But I still can't get it to run.
1 solution

Read the error message:
Error
NameError: name 'read_strings' is not defined
And a quick check of your code shows there is nothing (function or otherwise) called read_strings in that code.
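
For illustration only, here is a minimal sketch of what a function with that name might look like. The original read_strings is not shown anywhere in the question, so the body below is an assumption: it supposes the file holds one row of space-separated digits per line (as in the sample data) and skips blank or non-numeric lines.

```python
import os
import tempfile
from pathlib import Path


def read_strings(file_path):
    """Hypothetical helper: return one list of ints per non-empty line.

    Assumes each line holds space-separated digits, e.g. "5 5 4 2 0 5".
    Lines that are blank or contain non-digit tokens are skipped.
    """
    rows = []
    for line in Path(file_path).read_text().splitlines():
        tokens = line.split()
        if not tokens or not all(t.isdigit() for t in tokens):
            continue
        rows.append([int(t) for t in tokens])
    return rows


# Small demonstration with a temporary file standing in for stringpred.txt
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("5 5 4 2 0 5\n\n0 1 5 5 5 4\n")

demo_rows = read_strings(tmp.name)
os.remove(tmp.name)
print(demo_rows)  # [[5, 5, 4, 2, 0, 5], [0, 1, 5, 5, 5, 4]]
```

Each row keeps all six tokens, so slicing a row with [:-1] and [1:] yields five-element input and target sequences.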

I'd suggest that you go back to where you copy'n'pasted that code from and find the source of the functions it calls to make it work.
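
As for the second error in the question (the ValueError about empty logs), one possible explanation, offered as a sketch rather than a confirmed diagnosis: the file contains no commas, so pd.read_csv with its default separator would read each line as a single column, and removing the spaces before pd.to_numeric then collapses every row into one number. The plain-Python stand-in below mimics that step (the names lines and collapsed are made up for illustration) and shows that slicing a one-element row leaves nothing to train on.

```python
# Two sample rows from the question, as they appear in the text file
lines = ["5 5 4 2 0 5", "0 1 5 5 5 4"]

# Stand-in for str.replace(" ", "") followed by a numeric conversion:
# every row collapses to ONE value (and a leading zero is lost)
collapsed = [[int(line.replace(" ", ""))] for line in lines]
print(collapsed)  # [[554205], [15554]]

# split_sequences slices each row with [:-1] and [1:];
# on a one-element row both slices are empty
empty_inputs = [row[:-1] for row in collapsed]
empty_targets = [row[1:] for row in collapsed]
print(empty_inputs, empty_targets)  # [[], []] [[], []]

# Splitting on whitespace instead keeps six tokens per row,
# so the same slices give five-element sequences
tokens = [[int(t) for t in line.split()] for line in lines]
inputs = [row[:-1] for row in tokens]
targets = [row[1:] for row in tokens]
print(inputs[0], targets[0])  # [5, 5, 4, 2, 0] [5, 4, 2, 0, 5]
```

If the real data behaves the same way, X_train and y_train would end up with zero rows after the reshape to (-1, 5), which is consistent with the "empty dataset" hint in the error message. Printing X_train.shape just before model.fit would confirm or rule this out.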
 
This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


