Added 3 Python files in reference to the task mentioned in TODO for … #30


Merged

merged 1 commit into FreedomIntelligence:master on Apr 20, 2024

Conversation

rohan12345a
Contributor

GPTModel.py: Added a new Python file GPTModel.py which implements a text classification model based on the GPT architecture. This file contains the necessary code to preprocess text data, train the GPT model, and make predictions on new text inputs.
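For a sense of the approach, a minimal sketch (assuming the Hugging Face transformers API; not the actual contents of GPTModel.py):

import torch
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

def predict(texts):
    # Tokenize a batch of raw strings and return predicted class indices
    enc = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return logits.argmax(dim=-1).tolist()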

ensemble_method.py: Introduced a new Python file ensemble_method.py that implements an ensemble learning method. This method combines predictions from multiple base models to improve overall performance and robustness.
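As an illustration of the idea (a sketch, not the file's code), a simple majority vote over base-model predictions looks like:

import numpy as np

def majority_vote(predictions):
    # predictions: list of 1-D integer label arrays, one per base model
    stacked = np.stack(predictions)  # shape (n_models, n_samples)
    # pick the most frequent label in each column
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, stacked)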

XLNetTransformer.py: Implemented a transformer-based feature transformer in XLNetTransformer.py. This file contains a class that preprocesses text data using the XLNet tokenizer, encoding it into input IDs suitable for the XLNet model.
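Sketched below under assumed names (the class name and the xlnet-base-cased checkpoint are illustrative, not taken from the file):

from transformers import XLNetTokenizer

class XLNetFeatureTransformer:
    def __init__(self, max_length=128):
        self.tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
        self.max_length = max_length

    def transform(self, texts):
        # Encode raw strings into padded, truncated input IDs for XLNet
        enc = self.tokenizer(list(texts), padding="max_length", truncation=True,
                             max_length=self.max_length, return_tensors="pt")
        return enc["input_ids"]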

These additions address the TODO tasks specified in the issues tab, enhancing the project's functionality and providing more options for text classification and model ensemble techniques.

…Ensemble_strategy, GPTModel and XLNetTransformer.
Collaborator

@NirantK left a comment


I can't quite put my finger on why: This PR feels a bit AI Generated. Nevertheless, I've added suggestions and if AI can do that — why not.

    return predicted_labels

def main():
    data_path = "/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv"
Collaborator

Can you please avoid hard coding the paths?
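For example (a sketch, not part of the PR), the path could be passed in as a CLI argument instead:

import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-path", required=True, help="Path to IMDB Dataset.csv")
    args = parser.parse_args()
    texts, labels = load_imdb_dataset(args.data_path)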

Comment on lines +16 to +22
def clean(text):
    # strip HTML line breaks before punctuation cleanup
    for token in ["<br/>", "<br>"]:
        text = re.sub(token, " ", text)

    # collapse ASCII and full-width punctuation runs into single spaces
    text = re.sub(r"[\s+\.\!\/_,$%^*()\(\)<>+\"\[\]\-\?;:\'{}`]+|[+——!,。?、~@#¥%……&*()]+", " ", text)

    return text.lower()
Collaborator

Given that this clean is being re-used everywhere, do you want to refactor this out into a separate file?
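e.g. something along these lines (module name assumed):

# preprocessing.py holds the shared clean(); each script then just imports it:
from preprocessing import clean

texts = df['review'].apply(clean)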

Comment on lines +25 to +29
def load_imdb_dataset(data_path, nrows=100):
    df = pd.read_csv(data_path, nrows=nrows)
    texts = df['review'].apply(clean)
    labels = df['sentiment']
    return texts, labels
Collaborator

Perhaps a good idea to not tie to this to new data loaders and use/upgrade the existing ones?
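One hypothetical way to do that (the keyword-argument column names are an assumption, not existing project API): generalize the loader so it can replace per-dataset copies rather than add another one.

import pandas as pd

def load_csv_dataset(data_path, text_col="review", label_col="sentiment",
                     nrows=None, preprocess=None):
    # One configurable loader instead of a new hard-wired one per dataset
    df = pd.read_csv(data_path, nrows=nrows)
    texts = df[text_col].apply(preprocess) if preprocess is not None else df[text_col]
    return texts, df[label_col]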

@wabyking merged commit c3dfd26 into FreedomIntelligence:master on Apr 20, 2024