Text classifier
Author: m | 2025-04-24
In the user’s queries.

    import re
    from collections import Counter
    from pathlib import Path

    import pandas as pd

    def get_tag(url):
        # Extract the first path component after docs.ray.io/en/master/
        return re.findall(r"docs\.ray\.io/en/master/([^/]+)", url)[0].split("#")[0]

    # Load data (ROOT_DIR is defined elsewhere in the project)
    df = pd.read_json(Path(ROOT_DIR, "datasets", "embedding_qa.json"))
    df["tag"] = df.source.map(get_tag)
    df["section"] = df.source.map(lambda source: source.split("/")[-1])
    df["text"] = df["section"] + " " + df["question"]
    df.sample(n=5)

    # Map only what we want to keep
    tags_to_keep = ["rllib", "tune", "train", "cluster", "ray-core", "data", "serve", "ray-observability"]
    df["tag"] = df.tag.apply(lambda x: x if x in tags_to_keep else "other")
    Counter(df.tag)

    Counter({'rllib': 1269, 'tune': 979, 'train': 697, 'cluster': 690, 'data': 652, 'ray-core': 557, 'other': 406, 'serve': 302, 'ray-observability': 175})

Preprocessing

We'll start by creating some preprocessing functions to better represent our data. For example, our documentation has many variables that are camel cased (ex. RayDeepSpeedStrategy). When a tokenizer is used on these, we often lose the individual tokens that we know to be useful and, instead, random subtokens are created.

Note: we didn't omnisciently know to create these unique preprocessing functions! This is all a result of methodical iteration. We train a model → view incorrect data points → view how the data was represented (ex. subtokenization) → update preprocessing → iterate ↺

    import re
    from transformers import BertTokenizer

    # Assumption: the tokenizer is not defined in this excerpt; an uncased BERT
    # tokenizer is consistent with the lowercased output shown below.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    def split_camel_case_in_sentences(sentences):
        def split_camel_case_word(word):
            return re.sub("([a-z0-9])([A-Z])", r"\1 \2", word)
        processed_sentences = []
        for sentence in sentences:
            processed_words = []
            for word in sentence.split():
                processed_words.extend(split_camel_case_word(word).split())
            processed_sentences.append(" ".join(processed_words))
        return processed_sentences

    def preprocess(texts):
        # The original pattern was lost in extraction. This reconstruction is an
        # assumption: put a space before punctuation that directly follows a word.
        texts = [re.sub(r'(?<=\w)([?.,!])(?!\s)', r' \1', text) for text in texts]
        texts = [text.replace("_", " ").replace("-", " ").replace("#", " ").replace(".html", "").replace(".", " ") for text in texts]
        texts = split_camel_case_in_sentences(texts)  # camelcase
        texts = [tokenizer.tokenize(text) for text in texts]  # subtokens
        texts = [" ".join(word for word in text) for text in texts]
        return texts

    print(preprocess(["RayDeepSpeedStrategy"]))
    print(preprocess(["What is the default batch_size for map_batches?"]))

    ['ray deep speed strategy']
    ['what is the default batch size for map batch ##es ?']

Training

Now we’re going to train a simple logistic regression model that will predict the tag given the input text.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import FunctionTransformer

    # Train classifier
    from rag.rerank import preprocess  # for pickle
    reranker = Pipeline([
        ("preprocess", FunctionTransformer(preprocess)),
        ("vectorizer", TfidfVectorizer(lowercase=True)),
        # multi_class is deprecated in scikit-learn >= 1.5
        ("classifier", LogisticRegression(multi_class="multinomial", solver="lbfgs"))
    ])
    reranker.fit(train_df["text"].tolist(), train_df["tag"].tolist())

Note: we also trained a BERT classifier and, while its performance was better than our logistic classifier's, these large networks suffer from overconfidence, so we can't use a threshold-based approach as we do below.
And without the threshold approach (where we only rerank when the reranker is truly confident), the quality score of our application does not improve.

    # Inference
    question = "training with deepspeed"
    custom_predict([question], classifier=reranker)[0]

    'train'

We're now ready to evaluate our trained reranking model. We're going to use a custom prediction function that will predict "other" unless the probability of the highest class is above a certain threshold.

    def custom_predict(inputs, classifier, threshold=0.3, other_label="other"):
        y_pred = []
        for item in classifier.predict_proba(inputs):
            prob = max(item)        # probability of the most likely class
            index = item.argmax()   # position of that class in classifier.classes_
            if prob >= threshold:
                pred = classifier.classes_[index]
            else:
                pred = other_label  # not confident enough: fall back to "other"
            y_pred.append(pred)
        return y_pred

    # Evaluation
    metrics = {}
    y_test

Text classification is the automatic process of predicting one or more categories given a piece of text. For example, predicting whether an email is legitimate or spam. Thanks to Gmail’s spam classifier, I don’t see or hear from spammy emails!

Figure: Spam classification

Other than spam detection, text classifiers can be used to determine sentiment in social media texts, predict categories of news articles, parse and segment unstructured documents, flag the highly talked about fake news articles, and more.

Text classifiers work by leveraging signals in the text to "guess" the most appropriate classification. For example, in a sentiment classification task, occurrences of certain words or phrases, like slow, problem, wouldn't and not, can bias the classifier to predict negative sentiment.

The nice thing about text classification is that you have a range of options in terms of what approaches you could use, from unsupervised rules-based approaches to supervised approaches such as Naive Bayes, SVMs, CRFs and Deep Learning.

In this article, we are going to learn how to build and evaluate a text classifier using logistic regression on a news categorization problem.
The problem, while not extremely hard, is not as straightforward as making a binary prediction (yes/no, spam/ham).

Here’s the full source code with accompanying dataset for this tutorial. Note that this is a fairly long tutorial, and I would suggest that you break it down into several sessions so that you completely grasp the concepts.

The HuffPost Dataset

The dataset that we will be using for this tutorial is from Kaggle. It contains news articles from Huffington Post (HuffPost) from 2014-2018, as seen below. This dataset has about 125,000 articles and 31 different categories.

Figure 1: Articles distribution from 2014-2018

Now let’s look at the category distribution of these articles (Figure 2). Notice that politics has the largest number of articles and education has the smallest, numbering in the hundreds. So, nothing surprising in the category distribution, other than that we have far fewer articles to learn from in categories outside POLITICS.

Figure 2: Number of articles per category

Now, let’s take a quick peek at the dataset (Figure 3).

Figure 3: Sneak peek of the news dataset

Notice that the fields we have in order to learn a classifier that predicts the category include headline, short_description, link and authors.

The Challenge

As mentioned earlier, the problem that we are going to be tackling is to predict the category of news articles (as seen in Figure 3), using only the description, headline and the URL of the articles. Without the actual content of the article itself, the data that we have for learning is actually pretty sparse, a problem you may encounter in the real world. But let’s see if we can still learn from it reasonably well. We will not use the author field because we want to test the classifier on articles from a different news organization, specifically CNN.

In this tutorial, we will use the Logistic Regression algorithm to implement the classifier. In my experience, I have found Logistic Regression to be very
Is 0.87 and MRR is 0.75, a significant jump. Now we have about 87% of the primary categories appearing within the top 3 predicted categories. In addition, more of the PRIMARY categories are appearing at position 1. This is good news! In Figure 9, you will see how well the model performs with different feature weighting methods and text fields.

Figure 9: Experimentation with different combinations of feature weighting and text fields

There are several observations that can be made from the results in Figure 9:

- tf-idf based weighting outperforms binary and count based schemes
- count based feature weighting is no better than binary weighting
- Sparsity has a lot to do with how poorly the model performs. The richer the text field, the better the overall performance of the classifier.

Prediction on CNN articles

Now, the fun part! Let’s test it on articles from a different news source than HuffPost. Let’s look at how the classifier does on articles from CNN. We will predict the top 2 categories.

- A crime-related story. Predicted: politics, crime
- An entertainment-related story. Predicted: entertainment, style
- Another entertainment-related story. Predicted: entertainment, style
- Exercise in space. Predicted: science, healthy living

Overall, not bad, huh? The predicted categories make a lot of sense. Note that in the above predictions, we used the headline text. To further improve the predictions, we can enrich the text with the URL tokens and description.

Saving Logistic Regression Model

Once we have fully developed the model, we want to use it later on unseen documents. Doing this is actually straightforward with sklearn. First, we have to save the transformer to later encode/vectorize any unseen document. Next, we also need to save the trained model so that it can make predictions using the weight vectors. Here’s how you do it:

Saving SKLearn Model & Transformer

    import pickle

    model_path = "../models/model.pkl"
    transformer_path = "../models/transformer.pkl"

    # We need to save both the transformer and the model:
    # the transformer to encode/vectorize per our settings, the model to predict.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(transformer_path, "wb") as f:
        pickle.dump(transformer, f)

Loading Model & Transformer for Reuse

    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(transformer_path, "rb") as f:
        loaded_transformer = pickle.load(f)

    test_features = loaded_transformer.transform(["President Trump AND THE impeachment story !!!"])
    get_top_k_predictions(loaded_model, test_features, 2)

Over to you

Here’s the full source code with the accompanying dataset for this tutorial. I hope this article has given you the confidence to implement your very own high-accuracy text classifier.

Keep in mind that text classification is an art as much as it is a science. Your creativity when it comes to text preprocessing, evaluation and feature representation will determine the success of your classifier. A one-size-fits-all approach is rare. What works for this news categorization task may very well be inadequate for something like bug detection in source code.

An exercise for you: right now, we are at 87% accuracy. How can we improve the accuracy further? What else would you try? Leave a comment below with what you tried and how well it worked. Aim for 90-95% accuracy and let us all know what worked!

Hints:

- Curate additional features
- Perform feature selection
- Tweak model parameters
- Try balancing the number of articles per category

See Also: How to Build a Text Classifier that Delivers?
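The loading example calls get_top_k_predictions, which is not defined in this excerpt. The block below is a minimal sketch of what such a helper might look like, assuming the model exposes predict_proba and classes_ the way scikit-learn classifiers do. It also shows how the top-k accuracy and MRR figures quoted in this walkthrough can be computed from its output. ToyModel, top_k_accuracy and mean_reciprocal_rank are hypothetical names used only for illustration, not the tutorial's own code.

```python
def get_top_k_predictions(model, features, k):
    """Return, for each input row, the k class labels with the highest probability."""
    top_k = []
    for probabilities in model.predict_proba(features):
        # Pair every label with its probability and sort from most to least likely.
        ranked = sorted(zip(model.classes_, probabilities),
                        key=lambda pair: pair[1], reverse=True)
        top_k.append([label for label, _ in ranked[:k]])
    return top_k


def top_k_accuracy(true_labels, ranked_predictions):
    """Fraction of items whose true label appears anywhere in its top-k list."""
    hits = sum(1 for truth, ranked in zip(true_labels, ranked_predictions)
               if truth in ranked)
    return hits / len(true_labels)


def mean_reciprocal_rank(true_labels, ranked_predictions):
    """Average of 1/rank of the true label; a miss contributes 0."""
    total = 0.0
    for truth, ranked in zip(true_labels, ranked_predictions):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(true_labels)


class ToyModel:
    """Hypothetical stand-in exposing only the two attributes used above."""
    classes_ = ["CRIME", "ENTERTAINMENT", "POLITICS"]

    def predict_proba(self, features):
        # For the demo, each feature row already holds the class probabilities.
        return [list(row) for row in features]


toy_features = [[0.30, 0.05, 0.65], [0.10, 0.70, 0.20], [0.20, 0.45, 0.35]]
true_labels = ["POLITICS", "POLITICS", "CRIME"]

predictions = get_top_k_predictions(ToyModel(), toy_features, k=2)
accuracy = top_k_accuracy(true_labels, predictions)
mrr = mean_reciprocal_rank(true_labels, predictions)
print(predictions, accuracy, mrr)
```

In the toy data the true label is ranked first once, second once, and missed once, so the reciprocal ranks are 1, 0.5 and 0. Because the helper only relies on predict_proba and classes_, the same function should work unchanged with a fitted scikit-learn LogisticRegression.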
Or not by artificial intelligence is very unclear. Depending on how confident it is that AI produced the text, it labels the text as "very unlikely" (less than a 10% chance), "unlikely" (10%-45%), "unclear if it is" (45%-90%), "possibly" (90%-98%) or "likely" (more than 98%).

We put AI Text Classifier to the test using a document written by ChatGPT, and it made the correct call.

Clearly, it is one of the top-tier ChatGPT plagiarism checkers.

AI Text Classifier pricing plans

AI Text Classifier can detect instances of AI-assisted plagiarism free of charge.

See our in-depth article on AI Text Classifier if you want to learn more.

Originality.ai

Originality.ai is a useful resource for identifying instances of plagiarism and other types of automatically generated content. Artificially produced writing is now commonplace. Artificial intelligence (AI) has advanced to the point where it can generate professionally written articles from scratch in seconds. Software such as Originality.ai has made it possible to detect instances of AI plagiarism.

"AI detection alone is not enough to handle the wave of ChatGPT or Bard content. Our updated free Chrome extension lets you see… 1. Writers write 2. Writers' contributions 3. Originality report (with accurate AI prediction) pic.twitter.com/mFhsAyTOJw" - Jonathan Gillham (@JonGillhams), February 8, 2023

Having a tool that is this simple to operate is a great convenience. You only need to:

- Paste the copied text into its AI scanner.
- Enter the URL of the page you want to check.

It is one of the most widely used ChatGPT plagiarism checkers.

Originality.ai pricing plans

If you use Originality.ai, you need to buy credits: $0.01 per credit, and 1 credit scans 100 words.

GPTZero

GPTZero was created by Edward Tian, a senior at Princeton University. This free tool for teachers can recognize more than 98% of work created by ChatGPT. Several more detection programs, including GPTZero, have appeared since ChatGPT's debut. Tech & Learning reports that Tian has described in detail the creation, the operation

AI to identify objects from images. It also tells the location and size of the objects identified.
Available in: Block Coding, Python Coding
Mode: Stage Mode
WiFi Required: No
Compatible Hardware in Block Coding: evive, Quarky, Arduino Uno, Arduino Mega, Arduino Nano, ESP32, T-Watch, Boffin, micro:bit, TECbits, LEGO EV3, LEGO Boost, LEGO WeDo 2.0, Go DFA, None
Compatible Hardware in Python: Quarky, None
Object Declaration in Python: od = ObjectDetection()
Extension Category: Artificial Intelligence

Introduction

Object detection is used to locate and identify multiple objects in digital photographs. It is a computer vision technique that helps to detect objects as well as classify them. The object class may appear once or several times in the image. For example, in the following image, object detection helps us locate the objects and classify them according to the known set of objects.

One of the applications of object detection is self-driving vehicles, which detect objects in real time and act accordingly.

The object detection extension in PictoBlox allows you to detect the following 90 objects:

    ID  OBJECT (PAPER)   SUPER CATEGORY
    1   person           person
    2   bicycle          vehicle
    3   car              vehicle
    4   motorcycle       vehicle
    5   airplane         vehicle
    6   bus              vehicle
    7   train            vehicle
    8   truck            vehicle
    9   boat             vehicle
    10  traffic light    outdoor
    11  fire hydrant     outdoor
    12  street sign      outdoor
    13  stop sign        outdoor
    14  parking meter    outdoor
    15  bench            outdoor
    16  bird             animal
    17  cat              animal
    18  dog              animal
    19  horse            animal
    20  sheep            animal
    21  cow              animal
    22  elephant         animal
    23  bear             animal
    24  zebra            animal
    25  giraffe          animal
    26  hat              accessory
    27  backpack         accessory
    28  umbrella         accessory
    29  shoe             accessory
    30  eye glasses      accessory
    31  handbag          accessory
    32  tie              accessory
    33  suitcase         accessory
    34  frisbee          sports
    35  skis             sports
    36  snowboard        sports
    37  sports ball      sports
    38  kite             sports
    39  baseball bat     sports
    40  baseball glove   sports
    41  skateboard       sports
    42  surfboard        sports
    43  tennis racket    sports
    44  bottle           kitchen
    45  plate            kitchen
    46  wine glass       kitchen
    47  cup              kitchen
    48  fork             kitchen
    49  knife            kitchen
    50  spoon            kitchen
    51  bowl             kitchen
    52  banana           food
    53  apple            food
    54  sandwich         food
    55  orange           food
    56  broccoli         food
    57  carrot           food
    58  hot dog          food
    59  pizza            food
    60  donut            food
    61  cake             food
    62  chair            furniture
    63  couch            furniture
    64  potted plant     furniture
    65  bed              furniture
    66  mirror           furniture
    67  dining table     furniture
    68  window           furniture
    69  desk             furniture
    70  toilet           furniture
    71  door             furniture
    72  tv               electronic
    73  laptop           electronic
    74  mouse            electronic
    75  remote           electronic
    76  keyboard         electronic
    77  cell phone       electronic
    78  microwave        appliance
    79  oven             appliance
    80  toaster          appliance
    81  sink             appliance
    82  refrigerator     appliance
    83  blender          appliance
    84  book             indoor
    85  clock            indoor
    86  vase             indoor
    87  scissors         indoor
    88  teddy bear       indoor
    89  hair drier       indoor
    90  toothbrush       indoor
    91  hair brush       indoor

Accessing Object Detection in Block Coding

Following is the process to add Object Detection capability to a PictoBlox project.

1. Open PictoBlox and create a new file.
2. Select the appropriate coding environment.
3. Next, click on the Add Extension button and add the Object Detection extension.

The object detection models will be downloaded, which
Pit ChatGPT against Google Search in a YouTube video. And now, one year down the line, it feels like this is just the beginning of the AI age, and a lot of new product discovery is yet to be made.

AI Classifier Launched to Detect AI-written Text

ChatGPT quickly rose to fame, and it was especially good at creative tasks such as writing academic papers, composing marketing emails, and even creating misinformation campaigns. With the surge in AI-written text on the web came an urgent need for AI plagiarism detectors and text checkers.

So, two months after ChatGPT’s launch, OpenAI released an official AI Classifier tool to help people distinguish between AI-written and human-written text.

However, in July 2023, OpenAI quietly shut down the service, citing a low rate of accuracy. If you’re in need of such a tool, you can check our list of best AI plagiarism checkers. However, in there, we have clearly mentioned that AI-powered plagiarism tools frequently give false positives and inconsistent results.

Thus, the effort to correctly identify AI-written text still continues, even a year after ChatGPT’s release.

ChatGPT Plus Subscription Launched

In February 2023, about two months after ChatGPT's release, OpenAI decided it was time to cash in on the hype and build a loyal, paying community. So, it launched its first subscription plan, called ChatGPT Plus, for $20 per month.

When it first launched, ChatGPT Plus allowed users to access the chatbot even during peak times, with faster response times. In addition, ChatGPT Plus users would get early access to new features and improvements in the coming months. Initially, the subscription plan was available to customers in the US only, and it was later expanded to users around the world.

ChatGPT API Released for Developers

Come March 2023, the company finally released the ChatGPT API, giving developers access to the powerful capabilities of
Of times the PRIMARY category appeared in the top 3 predicted categories divided by the total number of categorization tasks.

MRR

Unlike accuracy, MRR takes the rank of the first correct answer into consideration (in our case, the rank of the correctly predicted PRIMARY category). The formula for MRR is as follows:

    MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_{i}}

Figure 5: MRR formula

where Q refers to all the classification tasks in our test set and rank_{i} is the position of the correctly predicted category for task i. The closer the correctly predicted category is to position 1, the higher the MRR. Since we are using the top 3 predictions, MRR will give us a sense of where the PRIMARY category sits in the ranks. If the rank of the PRIMARY category is 2 on average, then the MRR would be ~0.5, and at 3 it would be ~0.3. We want to get the PRIMARY category higher up in the ranks.

Building the classifier

Now it’s finally time to build the classifier! Note that we will be using the LogisticRegression module from sklearn.

Make Necessary Imports

Start with the imports.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

Read dataset and create text field variations

Next, we will be creating different variations of the text we will use to train the classifier. This is to see how adding more content to each field helps with the classification task. Notice that we create a field using only the description, description + headline, and description + headline + url (tokenized).
    import logging
    import re

    def tokenize_url(url: str):
        # The prefix removed here was lost in extraction; the HuffPost entry
        # prefix below is an assumption.
        url = url.replace("https://www.huffingtonpost.com/entry/", "")
        # Turn every run of non-word characters or underscores into one space
        url = re.sub(r"(\W|_)+", " ", url)
        return url

    # read dataset
    df = pd.read_json("../data/news_category_dataset.json", lines=True)

    # create tokenized URL field
    df['tokenized_url'] = df['link'].apply(tokenize_url)

    # just the description
    df['text_desc'] = df['short_description']

    # description + headline
    df['text_desc_headline'] = df['short_description'] + ' ' + df['headline']

    # description + headline + tokenized url
    df['text_desc_headline_url'] = df['short_description'] + ' ' + df['headline'] + " " + df['tokenized_url']

Split dataset for training and testing

Next, we will create a train/test split of our dataset, where 25% of the dataset (the train_test_split default) will be used for testing based on our evaluation strategy and the remainder will be used for training the classifier.

    def extract_features(df, field, training_data, testing_data, type="binary"):
        """Extract features using different methods"""
        logging.info("Extracting features and creating vocabulary...")

        if "binary" in type:
            # BINARY FEATURE REPRESENTATION
            vectorizer = CountVectorizer(binary=True, max_df=0.95)
        elif "counts" in type:
            # COUNT BASED FEATURE REPRESENTATION
            vectorizer = CountVectorizer(binary=False, max_df=0.95)
        else:
            # TF-IDF BASED FEATURE REPRESENTATION
            vectorizer = TfidfVectorizer(use_idf=True, max_df=0.95)

        # Learn the vocabulary from the training set only
        vectorizer.fit(training_data[field].values)
        train_feature_set = vectorizer.transform(training_data[field].values)
        test_feature_set = vectorizer.transform(testing_data[field].values)
        return train_feature_set, test_feature_set, vectorizer

    # GET A TRAIN TEST SPLIT (set seed for consistent results)
    training_data, testing_data = train_test_split(df, random_state=2000)

    # GET LABELS
    Y_train = training_data['category'].values
    Y_test = testing_data['category'].values

    # GET FEATURES (field and feature_rep are supplied by the caller)
    X_train, X_test, feature_transformer = extract_features(df, field, training_data, testing_data, type=feature_rep)

Prepare features

Earlier, we talked about feature representation and different feature weighting schemes. `extract_features(…)` above is where we extract the different types of features based on the weighting schemes.

First, note that `vectorizer.fit(…)` in the above code snippet creates a vocabulary based on the training set. Next, `vectorizer.transform(…)` takes in any text (test or unseen texts) and transforms it according to the vocabulary of the training set, limiting the words by the specified count restrictions (`min_df`, `max_df`) and applying necessary stop words if specified. It returns a term-document matrix where each column in the matrix represents a word in the vocabulary.
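To make the fit/transform split concrete, here is a simplified from-scratch sketch of the idea in plain Python. It is not scikit-learn's implementation: the tokenizer is a bare whitespace split, there is no max_df/min_df filtering or stop-word handling, and the function names fit_vocabulary and transform are made up for illustration. It shows the vocabulary being learned from training texts only, and unseen words in new texts being dropped.

```python
from collections import Counter


def tokenize(text):
    # Simplified tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def fit_vocabulary(train_texts):
    """Map each distinct training token to a column index (sorted for a stable order)."""
    tokens = sorted({tok for text in train_texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(texts, vocabulary, binary=False):
    """Encode texts as rows of counts (or 0/1 flags) over the fitted vocabulary."""
    rows = []
    for text in texts:
        # Tokens that were never seen during fitting are ignored.
        counts = Counter(tok for tok in tokenize(text) if tok in vocabulary)
        row = [0] * len(vocabulary)
        for tok, n in counts.items():
            row[vocabulary[tok]] = 1 if binary else n
        rows.append(row)
    return rows


train_texts = ["senate passes budget bill", "new budget for schools"]
vocabulary = fit_vocabulary(train_texts)

unseen = ["budget budget vote"]  # "vote" is not in the training vocabulary
count_rows = transform(unseen, vocabulary)
binary_rows = transform(unseen, vocabulary, binary=True)
print(vocabulary)
print(count_rows, binary_rows)
```

The count and binary rows differ only in the "budget" column, which mirrors the difference between the count and binary representations in extract_features. A tf-idf representation would additionally rescale each column by how rare the word is across the training documents.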
Best for: General text-extraction use cases that require low latency and high capacity.
Key features: Pre-built features like image labeling, face & landmark detection, OCR, safe search.

Document AI

Enterprise Document OCR
Best for: Digitize text from documents (PDFs, scanned documents as images, or Microsoft DocX files).
Key features: Extract text in 200+ languages, 50 handwritten languages. Add-ons to recognize math formulas, styles, etc.

Document AI Workbench
Best for: Extract, classify and split any documents with generative AI (foundational models).
Key features: Custom Extractor: uses foundational models to quickly create parsers without extensive data labeling or training. Custom classifier and document splitter for efficient processing.

Pretrained models
Best for: Text and field extraction from domain-specific documents.
Key features: Text extraction and digitization across a variety of procurement, lending, identity and contractual documents.

How It Works

To understand and process documents, use Document AI. For images, we recommend using Cloud Vision. Both give you access to pre-trained ML models that you can deploy as-is through APIs or uptrain. You can also train your own custom models from scratch with AutoML - no ML expertise needed.
First 1000 units every month are free when you use Cloud Vision or Document OCR - try it with a simple API call.

How Cloud Vision recognizes and classifies images

Demo

See Document OCR in action with your own documents. Try the Document AI API with a simple drag-and-drop.

Common Uses

- Build an end-to-end document solution
- Image tagging, processing and search

Use Cloud Vision API and AutoML to tag and process images

Image tagging is also referred to as image labeling. Cloud Vision API can identify and label general objects, landmarks, locations, logos, activities, animal species, products, and more in an image. Once the images are tagged with the detected labels, image search, processing and management are automated and easier.

If you need targeted custom labels, use Cloud AutoML to train a custom ML model.

To use Google OCR technologies on premise, use OCR On-Prem, available in the Cloud Marketplace.

Deploy Cloud Vision API

- Deploy in console: Event-driven image processing using Cloud Functions and Cloud Vision
- Skills Boost labs: Image processing
- How-to guides: Cloud Vision API
In the user’s queries.1def get_tag(url):2 return re.findall(r"docs\.ray\.io/en/master/([^/]+)", url)[0].split("#")[0]34# Load data5from pathlib import Path6df = pd.read_json(Path(ROOT_DIR, "datasets", "embedding_qa.json"))7df["tag"] = df.source.map(get_tag)8df["section"] = df.source.map(lambda source: source.split("/")[-1])9df["text"] = df["section"] + " " + df["question"]10df.sample(n=5)rerank-df1# Map only what we want to keep2tags_to_keep = ["rllib", "tune", "train", "cluster", "ray-core", "data", "serve", "ray-observability"]3df["tag"] = df.tag.apply(lambda x: x if x in tags_to_keep else "other")4Counter(df.tag)Counter({'rllib': 1269, 'tune': 979, 'train': 697, 'cluster': 690, 'data': 652, 'ray-core': 557, 'other': 406, 'serve': 302, 'ray-observability': 175})LinkPreprocessingWe'll start by creating some preprocessing functions to better represent our data. For example, our documentation has many variables that are camel cased (ex. RayDeepSpeedStrategy). When a tokenizer is used on this, we often lose the individual tokens that we know to be useful and, instead, random subtokens are created.Note: we didn't omnisciently know to create these unique preprocessing functions! This is all a result of methodical iteration. We train a model → view incorrect data points → view how the data was represented (ex. 
subtokenization) → update preprocessing → iterate ↺1import re2from transformers import BertTokenizer34def split_camel_case_in_sentences(sentences):5 def split_camel_case_word(word):6 return re.sub("([a-z0-9])([A-Z])", r"\1 \2", word)7 processed_sentences = []8 for sentence in sentences:9 processed_words = [] 10 for word in sentence.split():11 processed_words.extend(split_camel_case_word(word).split())12 processed_sentences.append(" ".join(processed_words))13 return processed_sentences1415def preprocess(texts):16 texts = [re.sub(r'(?, r' \1', text) for text in texts]17 texts = [text.replace("_", " ").replace("-", " ").replace("#", " ").replace(".html", "").replace(".", " ") for text in texts]18 texts = split_camel_case_in_sentences(texts) # camelcase19 texts = [tokenizer.tokenize(text) for text in texts] # subtokens20 texts = [" ".join(word for word in text) for text in texts]21 return texts2223print (preprocess(["RayDeepSpeedStrategy"]))24print (preprocess(["What is the default batch_size for map_batches?"]))['ray deep speed strategy']['what is the default batch size for map batch ##es ?']LinkTrainingNow we’re going to train a simple logistic regression model that will predict the tag given the input text.1from sklearn.feature_extraction.text import TfidfVectorizer2from sklearn.linear_model import LogisticRegression3from sklearn.pipeline import Pipeline4from sklearn.preprocessing import FunctionTransformer56# Train classifier7from rag.rerank import preprocess # for pickle8reranker = Pipeline([9 ("preprocess", FunctionTransformer(preprocess)),10 ("vectorizer", TfidfVectorizer(lowercase=True)),11 ("classifier", LogisticRegression(multi_class="multinomial", solver="lbfgs"))12])13reranker.fit(train_df["text"].tolist(), train_df["tag"].tolist())Note: we also trained a BERT classifier and while performance was better than our logistic classifier, these large networks suffer from overconfidence and we can't use a threshold based approach as we do below. 
And without the threshold approach (where we only rerank when the reranker is truly confident), then the quality score of our application does not improve.1# Inference2question = "training with deepspeed"3custom_predict([question], classifier=reranker)[0]'train'We're now ready to evaluate our trained reranking model. We're going to use a custom prediction function that will predict “other” unless the probability of the highest class is above a certain threshold.1def custom_predict(inputs, classifier, threshold=0.3, other_label="other"):2 y_pred = []3 for item in classifier.predict_proba(inputs):4 prob = max(item)5 index = item.argmax()6 if prob >= threshold:7 pred = classifier.classes_[index]8 else:9 pred = other_label10 y_pred.append(pred)11 return y_pred1213# Evaluation14metrics = {}15y_test
2025-04-16Text classification is the automatic process of predicting one or more categories given a piece of text. For example, predicting if an email is legit or spammy. Thanks to Gmail’s spam classifier, I don’t see or hear from spammy emails! Spam classificationOther than spam detection, text classifiers can be used to determine sentiment in social media texts, predict categories of news articles, parse and segment unstructured documents, flag the highly talked about fake news articles and more.Text classifiers work by leveraging signals in the text to “guess” the most appropriate classification. For example, in a sentiment classification task, occurrences of certain words or phrases, like slow,problem,wouldn't and not can bias the classifier to predict negative sentiment. The nice thing about text classification is that you have a range of options in terms of what approaches you could use. From unsupervised rules-based approaches to more supervised approaches such as Naive Bayes, SVMs, CRFs and Deep Learning.In this article, we are going to learn how to build and evaluate a text classifier using logistic regression on a news categorization problem. The problem while not extremely hard, is not as straightforward as making a binary prediction (yes/no, spam/ham).Here’s the full source code with accompanying dataset for this tutorial. Note that this is a fairly long tutorial and I would suggest that you break it down to several sessions so that you completely grasp the concepts. The HuffPost DatasetThe dataset that we will be using for this tutorial is from Kaggle. It contains news articles from Huffington Post (HuffPost) from 2014-2018 as seen below. This data set has about ~125,000 articles and 31 different categories. Figure 1: Articles distribution from 2014-2018Now let’s look at the category distribution of these articles (Figure 2). Notice that politics has the most number of articles and education has the lowest number of articles ranging in the hundreds. 
So, nothing surprising in the category distribution, other than that we have far fewer articles to learn from in categories outside POLITICS.

Figure 2: Number of articles per category

Now, let's take a quick peek at the dataset (Figure 3).

Figure 3: Sneak peek of the news dataset

Notice that the fields available for learning a classifier that predicts the category include headline, short_description, link and authors.

The Challenge

As mentioned earlier, the problem that we are going to tackle is predicting the category of news articles (as seen in Figure 3), using only the description, headline and the url of the articles. Without the actual content of the article itself, the data that we have for learning is actually pretty sparse, a problem you may encounter in the real world. But let's see if we can still learn from it reasonably well. We will not use the author field because we want to test the classifier on articles from a different news organization, specifically CNN.

In this tutorial, we will use the Logistic Regression algorithm to implement the classifier. In my experience, I have found Logistic Regression to be very
2025-04-09

Is 0.87 and MRR is 0.75, a significant jump. Now we have about 87% of the primary categories appearing within the top 3 predicted categories. In addition, more of the PRIMARY categories are appearing at position 1. This is good news! In Figure 9, you will see how well the model performs with different feature weighting methods and text fields.

Figure 9: Experimentation with different combinations of feature weighting and text fields

There are several observations that can be made from the results in Figure 9:

- tf-idf based weighting outperforms binary & count based schemes
- count based feature weighting is no better than binary weighting
- Sparsity has a lot to do with how poorly the model performs. The richer the text field, the better the overall performance of the classifier.

Prediction on CNN articles

Now, the fun part! Let's test it on articles from a different news source than HuffPost. Let's see how the classifier does on articles from CNN. We will predict the top 2 categories.

A crime-related story [ see article ]
Predicted: politics, crime

Entertainment related story [ see article ]
Predicted: entertainment, style

Another entertainment-related story [ see article ]
Predicted: entertainment, style

Exercise in space [ see article ]
Predicted: science, healthy living

Overall, not bad, huh? The predicted categories make a lot of sense. Note that in the above predictions, we used the headline text. To further improve the predictions, we can enrich the text with the URL tokens and description.

Saving Logistic Regression Model

Once we have fully developed the model, we want to use it later on unseen documents. Doing this is actually straightforward with sklearn. First, we have to save the transformer to later encode/vectorize any unseen document. Next, we also need to save the trained model so that it can make predictions using the weight vectors.
Here's how you do it:

Saving SKLearn Model & Transformer

    import pickle

    model_path = "../models/model.pkl"
    transformer_path = "../models/transformer.pkl"

    # we need to save both the transformer & model
    # transformer to encode/vectorize per our settings
    # model to predict
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(transformer_path, "wb") as f:
        pickle.dump(transformer, f)

Loading Model & Transformer for Reuse

    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(transformer_path, "rb") as f:
        loaded_transformer = pickle.load(f)

    # encode the unseen text with the saved transformer, then predict the top 2 categories
    test_features = loaded_transformer.transform(["President Trump AND THE impeachment story !!!"])
    get_top_k_predictions(loaded_model, test_features, 2)

Over to you

Here's the full source code with the accompanying dataset for this tutorial. I hope this article has given you the confidence to implement your very own high-accuracy text classifier.

Keep in mind that text classification is an art as much as it is a science. Your creativity when it comes to text preprocessing, evaluation and feature representation will determine the success of your classifier. A one-size-fits-all approach is rare. What works for this news categorization task may very well be inadequate for something like bug detection in source code.

An exercise for you:

Right now, we are at 87% accuracy. How can we improve the accuracy further? What else would you try? Leave a comment below with what you tried, and how well it worked. Aim for 90-95% accuracy and let us all know what worked!

Hints:

- Curate additional features
- Perform feature selection
- Tweak model parameters
- Try balancing number of articles per category

See Also: How to Build a Text Classifier that Delivers?
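The top-k predictions and the accuracy and MRR figures discussed above can be illustrated with a small, dependency-free sketch. It is not the tutorial's own `get_top_k_predictions`, which is not shown in this excerpt. The function names `top_k_labels`, `top_k_accuracy` and `mean_reciprocal_rank`, as well as the toy probabilities, are hypothetical. The sketch assumes the standard MRR definition, with a reciprocal rank of 0 when the true label is missing from the top k.

```python
from typing import List, Sequence


def top_k_labels(probs: Sequence[Sequence[float]],
                 classes: Sequence[str], k: int) -> List[List[str]]:
    """Return the k most probable labels for each row, best first."""
    ranked = []
    for row in probs:
        # Indices of the classes, sorted by descending probability
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        ranked.append([classes[i] for i in order[:k]])
    return ranked


def top_k_accuracy(ranked: Sequence[Sequence[str]],
                   truths: Sequence[str]) -> float:
    """Fraction of examples whose true label appears in the top-k list."""
    hits = sum(1 for preds, truth in zip(ranked, truths) if truth in preds)
    return hits / len(truths)


def mean_reciprocal_rank(ranked: Sequence[Sequence[str]],
                         truths: Sequence[str]) -> float:
    """Average of 1/rank of the true label (0 if it is not in the list)."""
    total = 0.0
    for preds, truth in zip(ranked, truths):
        if truth in preds:
            total += 1.0 / (list(preds).index(truth) + 1)
    return total / len(truths)


# Toy data (made up for illustration): 3 classes, 3 examples
classes = ["CRIME", "ENTERTAINMENT", "POLITICS"]
probs = [
    [0.2, 0.1, 0.7],  # true label ranked first
    [0.5, 0.3, 0.2],  # true label ranked second
    [0.3, 0.6, 0.1],  # true label not in the top 2
]
truths = ["POLITICS", "ENTERTAINMENT", "POLITICS"]

ranked = top_k_labels(probs, classes, k=2)
print(ranked)
print(top_k_accuracy(ranked, truths))
print(mean_reciprocal_rank(ranked, truths))
```

With a fitted scikit-learn classifier, the same functions could be fed `model.predict_proba(features)` and `model.classes_`. Accuracy only asks whether the true category shows up in the top k, while MRR also rewards placing it closer to position 1, which is why the two numbers can differ.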
2025-04-23