diff --git a/README.md b/README.md index 82e496d76..1b9d4dc40 100644 --- a/README.md +++ b/README.md @@ -70,6 +70,7 @@ Please share your story by answering 1 quick question * Variable Creation * Variable Selection * Datetime Features +* Text Features * Time Series * Preprocessing * Scaling @@ -146,6 +147,9 @@ Please share your story by answering 1 quick question * DatetimeFeatures * DatetimeSubtraction * DatetimeOrdinal + +### Text Features + * TextFeatures ### Time Series * LagFeatures diff --git a/docs/api_doc/index.rst b/docs/api_doc/index.rst index 4e09a1a31..2a11913fc 100644 --- a/docs/api_doc/index.rst +++ b/docs/api_doc/index.rst @@ -25,6 +25,7 @@ Creation creation/index datetime/index + text/index Selection diff --git a/docs/api_doc/text/TextFeatures.rst b/docs/api_doc/text/TextFeatures.rst new file mode 100644 index 000000000..7b2b4f76f --- /dev/null +++ b/docs/api_doc/text/TextFeatures.rst @@ -0,0 +1,6 @@ +TextFeatures +============ + +.. autoclass:: feature_engine.text.TextFeatures + :members: + diff --git a/docs/api_doc/text/index.rst b/docs/api_doc/text/index.rst new file mode 100644 index 000000000..f87392fdd --- /dev/null +++ b/docs/api_doc/text/index.rst @@ -0,0 +1,13 @@ +.. -*- mode: rst -*- + +Text Features +============= + +Feature-engine's text transformers extract numerical features from text/string +variables. + +.. 
toctree:: + :maxdepth: 1 + + TextFeatures + diff --git a/docs/index.rst b/docs/index.rst index a04f8d4bb..371827505 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -65,6 +65,7 @@ Feature-engine includes transformers for: - Creation of new features - Feature selection - Datetime features +- Text features - Time series - Preprocessing - Scaling @@ -260,6 +261,11 @@ extract many new features from the date and time parts of the datetime variable: - :doc:`api_doc/datetime/DatetimeSubtraction`: computes subtractions between datetime variables - :doc:`api_doc/datetime/DatetimeOrdinal`: converts datetime variables into ordinal numbers +Text: +~~~~~ + +- :doc:`api_doc/text/TextFeatures`: extracts numerical features from text/string variables + Feature Selection: ~~~~~~~~~~~~~~~~~~ diff --git a/docs/user_guide/index.rst b/docs/user_guide/index.rst index c786e77e1..52c33a8f4 100644 --- a/docs/user_guide/index.rst +++ b/docs/user_guide/index.rst @@ -28,6 +28,7 @@ Creation creation/index datetime/index + text/index Selection diff --git a/docs/user_guide/text/TextFeatures.rst b/docs/user_guide/text/TextFeatures.rst new file mode 100644 index 000000000..84d1b4e22 --- /dev/null +++ b/docs/user_guide/text/TextFeatures.rst @@ -0,0 +1,365 @@ +.. _text_features: + +.. currentmodule:: feature_engine.text + +Extracting Features from Text +============================= + +Short pieces of text are often found among the variables in our datasets. For example, +in insurance, a text variable can describe the circumstances of an accident. Customer +feedback is also stored as a text variable. + +While text data as such can't be used to train machine learning models, we can extract +a lot of numerical information from these texts, which can provide predictive features +to train machine learning models. + +Feature-engine allows you to quickly extract numerical features from short pieces of +text, to complement your predictive models. 
These features aim to capture a piece of
+text's complexity by looking at some statistical parameters of the text, such as the
+text length, the average word length, the number of words and unique words used, the
+number of sentences, and so on.
+
+:class:`TextFeatures()` extracts many numerical features from text out-of-the-box.
+
+TextFeatures
+------------
+
+:class:`TextFeatures()` extracts numerical features from text/string variables.
+This transformer is useful for extracting basic text statistics that can be used
+as features in machine learning models. Users must explicitly specify which columns
+contain text data via the `variables` parameter.
+
+Unlike scikit-learn's CountVectorizer or TfidfVectorizer, which create sparse matrices,
+:class:`TextFeatures()` extracts metadata features that remain in DataFrame format and
+can be easily combined with other Feature-engine or scikit-learn transformers in a
+pipeline.
+
+Text Features
+-------------
+
+:class:`TextFeatures()` can extract the following features from a text piece:
+
+- **char_count**: Number of characters in the text, excluding whitespace
+- **word_count**: Number of words (whitespace-separated tokens)
+- **sentence_count**: Number of sentences (based on .!?
punctuation)
+- **avg_word_length**: Average word length (length of the stripped text, including
+  internal spaces and punctuation, divided by the word count)
+- **digit_count**: Number of digit characters
+- **letter_count**: Number of alphabetic characters (a-z, A-Z)
+- **uppercase_count**: Number of uppercase letters
+- **lowercase_count**: Number of lowercase letters
+- **special_char_count**: Number of special characters (non-alphanumeric)
+- **whitespace_count**: Number of whitespace characters
+- **whitespace_ratio**: Ratio of whitespace to total characters
+- **digit_ratio**: Ratio of digits to non-whitespace characters
+- **uppercase_ratio**: Ratio of uppercase letters to non-whitespace characters
+- **has_digits**: Binary indicator if text contains digits
+- **has_uppercase**: Binary indicator if text contains uppercase
+- **is_empty**: Binary indicator if text is empty
+- **starts_with_uppercase**: Binary indicator if text starts with uppercase
+- **ends_with_punctuation**: Binary indicator if text ends with .!?
+- **unique_word_count**: Number of unique words (case-insensitive)
+- **lexical_diversity**: Ratio of total words to unique words (1.0 means no word is
+  repeated; higher values indicate more repetition)
+
+The **number of sentences** is inferred by :class:`TextFeatures()` by counting blocks of
+sentence-ending punctuation (., !, ?) as a proxy for sentence boundaries. This means that
+multiple consecutive punctuation marks (e.g., "!!!" or "??") are counted as a single
+sentence ending, which avoids overestimating the count in emphatic text.
+
+However, this is still a simple heuristic. It won't handle edge cases like abbreviations
+(e.g., 'Dr.', 'U.S.', 'e.g.', 'i.e.'), which are counted as sentence endings and inflate
+the count, or text without sentence-ending punctuation, which returns a count of 0.
+
+The features **number of unique words** and **lexical diversity** are intended to
+capture the complexity of the text. Simpler texts have few unique words and tend to
+repeat them. More complex texts use a wider array of words and tend not to repeat them.
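To make these two statistics concrete, here is a small standalone sketch in plain pandas, mirroring the string operations the transformer applies internally:

```python
import pandas as pd

# Two short texts: one repeats a word, one does not.
s = pd.Series([
    "Not great. Would not recommend.",
    "OK for the price.",
])

# Words are whitespace-separated tokens; trailing punctuation stays
# attached to the token (so 'great.' and 'great' would differ).
word_count = s.str.strip().str.split().str.len()

# Unique words are counted case-insensitively ("Not" and "not" match).
unique_word_count = s.str.lower().str.split().apply(set).str.len()

# Lexical diversity as implemented by the transformer relates total words
# to unique words: 1.0 means no repetition, larger means more repetition.
lexical_diversity = word_count / unique_word_count

print(word_count.tolist())         # [5, 4]
print(unique_word_count.tolist())  # [4, 4]
print(lexical_diversity.tolist())  # [1.25, 1.0]
```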
+Hence, more complex texts show a greater number of unique words and a lexical
+diversity ratio close to 1, whereas repetitive texts show higher ratios.
+
+Handling missing values
+-----------------------
+
+By default, :class:`TextFeatures()` ignores missing values (`missing_values='ignore'`):
+NaNs are treated as empty strings, and the numerical features are calculated
+accordingly (e.g., the word count and character count will be 0), as shown in the
+following example. If you prefer the transformer to raise an error when it encounters
+missing data, set the parameter to `'raise'` instead.
+
+.. code:: python
+
+    import pandas as pd
+    import numpy as np
+    from feature_engine.text import TextFeatures
+
+    # Create sample data with NaN
+    X = pd.DataFrame({
+        'text': ['Hello', np.nan, 'World']
+    })
+
+    # Set up the transformer (defaults to ignore missing values)
+    tf = TextFeatures(
+        variables=['text'],
+        features=['char_count']
+    )
+
+    # Transform
+    X_transformed = tf.fit_transform(X)
+
+    print(X_transformed)
+
+In the resulting dataframe, we see that the row with NaN returned 0 in the character
+count:
+
+.. code-block:: none
+
+        text  text_char_count
+    0  Hello                5
+    1    NaN                0
+    2  World                5
+
+Python demo
+-----------
+
+In this section, we'll show how to use :class:`TextFeatures()`.
+Let's create a dataframe with text data:
+
+.. code:: python
+
+    import pandas as pd
+    from feature_engine.text import TextFeatures
+
+    # Create sample data
+    X = pd.DataFrame({
+        'review': [
+            'This product is AMAZING! Best purchase ever.',
+            'Not great. Would not recommend.',
+            'OK for the price. 3 out of 5 stars.',
+            'TERRIBLE!!! DO NOT BUY!',
+        ],
+        'title': [
+            'Great Product',
+            'Disappointed',
+            'Average',
+            'Awful',
+        ]
+    })
+
+    print(X)
+
+The input dataframe looks like this:
+
+.. code-block:: none
+
+                                             review          title
+    0  This product is AMAZING! Best purchase ever.  Great Product
+    1               Not great. Would not recommend.   
Disappointed
+    2           OK for the price. 3 out of 5 stars.        Average
+    3                       TERRIBLE!!! DO NOT BUY!          Awful
+
+Now let's extract 5 specific text features: the number of words, the number of
+characters, the number of sentences, whether the text has digits, and the ratio of
+uppercase letters to non-whitespace characters:
+
+.. code:: python
+
+    # Set up the transformer with specific features
+    tf = TextFeatures(
+        variables=['review'],
+        features=[
+            'word_count',
+            'char_count',
+            'sentence_count',
+            'has_digits',
+            'uppercase_ratio',
+        ])
+
+    # Fit and transform
+    X_transformed = tf.fit_transform(X)
+
+    print(X_transformed)
+
+In the following output, we see the resulting dataframe containing the numerical
+features extracted from the pieces of text:
+
+.. code-block:: none
+
+                                             review          title  review_word_count  review_char_count
+    0  This product is AMAZING! Best purchase ever.  Great Product                  7                 38
+    1               Not great. Would not recommend.   Disappointed                  5                 27
+    2           OK for the price. 3 out of 5 stars.        Average                  9                 27
+    3                       TERRIBLE!!! DO NOT BUY!          Awful                  4                 20
+
+       review_sentence_count  review_has_digits  review_uppercase_ratio
+    0                      2                  0                0.236842
+    1                      2                  0                0.074074
+    2                      2                  1                0.074074
+    3                      2                  0                0.800000
+
+Extracting all features
+~~~~~~~~~~~~~~~~~~~~~~~
+
+By default, if no text features are specified, all available features will be extracted:
+
+.. code:: python
+
+    # Extract all features from a single text column
+    tf = TextFeatures(variables=['review'])
+    X_transformed = tf.fit_transform(X)
+
+    print(X_transformed.head())
+
+The output dataframe contains all 20 text features extracted from the `review` column:
+
+.. code-block:: none
+
+                                             review          title  review_char_count  review_word_count
+    0  This product is AMAZING! Best purchase ever.  Great Product                 38                  7
+    1               Not great. Would not recommend.   Disappointed                 27                  5
+    2           OK for the price. 3 out of 5 stars.        Average                 27                  9
+    3                       TERRIBLE!!! DO NOT BUY!          
Awful 20 4 + + review_sentence_count review_avg_word_length review_digit_count review_letter_count + 0 2 6.285714 0 36 + 1 2 6.200000 0 25 + 2 2 3.888889 2 23 + 3 2 5.750000 0 16 + + review_uppercase_count review_lowercase_count review_special_char_count review_whitespace_count + 0 9 27 2 6 + 1 2 23 2 4 + 2 2 21 2 8 + 3 16 0 4 3 + + review_whitespace_ratio review_digit_ratio review_uppercase_ratio review_has_digits + 0 0.136364 0.000000 0.236842 0 + 1 0.129032 0.000000 0.074074 0 + 2 0.228571 0.074074 0.074074 1 + 3 0.130435 0.000000 0.800000 0 + + review_has_uppercase review_is_empty review_starts_with_uppercase review_ends_with_punctuation + 0 1 0 1 1 + 1 1 0 1 1 + 2 1 0 1 1 + 3 1 0 1 1 + + review_unique_word_count review_lexical_diversity + 0 7 1.0 + 1 4 1.25 + 2 9 1.0 + 3 4 1.0 + +Dropping original columns +~~~~~~~~~~~~~~~~~~~~~~~~~ + +You can drop the original text columns after extracting features, by setting the +parameter `drop_original` to `True`: + +.. code:: python + + tf = TextFeatures( + variables=['review'], + features=['word_count', 'char_count'], + drop_original=True + ) + + X_transformed = tf.fit_transform(X) + + print(X_transformed) + +The original `'review'` column has been removed, and only the `'title'` column and the +extracted features remain: + +.. code-block:: none + + title review_word_count review_char_count + 0 Great Product 7 38 + 1 Disappointed 5 27 + 2 Average 9 27 + 3 Awful 4 20 + +Combining with scikit-learn Bag-of-Words +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In most NLP tasks, it is common to use bag-of-words (e.g., `CountVectorizer`) or TF-IDF +(e.g., `TfidfVectorizer`) to represent the text. :class:`TextFeatures()` can be used +alongside these transformers to provide additional metadata that might improve model +performance. + +In the following example, we compare a baseline model using only TF-IDF with a model +that combines TF-IDF and :class:`TextFeatures()` metadata: + +.. 
code:: python + + import pandas as pd + from sklearn.datasets import fetch_20newsgroups + from sklearn.model_selection import train_test_split + + from feature_engine.text import TextFeatures + + # Load and split data + data = fetch_20newsgroups(subset='train', categories=['sci.space', 'rec.sport.hockey']) + df = pd.DataFrame({'text': data.data, 'target': data.target}) + X_train, X_test, y_train, y_test = train_test_split( + df[['text']], df['target'], test_size=0.3, random_state=42 + ) + + print(X_train.head()) + +The input dataframe contains the raw text of newsgroup posts: + +.. code-block:: none + + text + 562 From: xxx@yyy.zzz (John Smith)\nSubject: Re:... + 459 From: aaa@bbb.ccc (Jane Doe)\nSubject: Shutt... + 21 From: ddd@eee.fff\nSubject: Space Station Fr... + 892 From: ggg@hhh.iii\nSubject: NHL Scores\nOrga... + 317 From: jjj@kkk.lll (Bob Wilson)\nSubject: Re:... + +Now let's set up two pipelines to compare a baseline model using only TF-IDF with a +model that combines TF-IDF and :class:`TextFeatures()` metadata: + +.. code:: python + + from sklearn.pipeline import Pipeline + from sklearn.feature_extraction.text import TfidfVectorizer + from sklearn.compose import ColumnTransformer + from sklearn.linear_model import LogisticRegression + from sklearn.preprocessing import StandardScaler + + # 1. Baseline: TF-IDF only + tfidf_pipe = Pipeline([ + ('vec', ColumnTransformer([ + ('tfidf', TfidfVectorizer(max_features=500), 'text') + ])), + ('clf', LogisticRegression()) + ]) + tfidf_pipe.fit(X_train, y_train) + print(f"TF-IDF Accuracy: {tfidf_pipe.score(X_test, y_test):.3f}") + + # 2. 
Combined: TextFeatures + TF-IDF
+    combined_pipe = Pipeline([
+        ('features', ColumnTransformer([
+            # drop the raw text column so only numeric metadata is passed on;
+            # the list selector hands the transformer a DataFrame, not a Series
+            ('text_meta', TextFeatures(variables=['text'], drop_original=True), ['text']),
+            ('tfidf', TfidfVectorizer(max_features=500), 'text')
+        ])),
+        # with_mean=False: the stacked output is sparse and cannot be centered
+        ('scaler', StandardScaler(with_mean=False)),
+        ('clf', LogisticRegression())
+    ])
+    combined_pipe.fit(X_train, y_train)
+    print(f"Combined Accuracy: {combined_pipe.score(X_test, y_test):.3f}")
+
+Below we see the accuracy of a model trained using only the bag of words, compared
+with a model trained using both the bag of words and the additional metadata:
+
+.. code-block:: none
+
+    TF-IDF Accuracy: 0.957
+    Combined Accuracy: 0.963
+
+By adding statistical metadata through :class:`TextFeatures()`, we provided the model
+with information about text length, complexity, and style that is not explicitly
+captured by a word-count-based approach like TF-IDF, leading to a small but noticeable
+improvement in performance.
diff --git a/docs/user_guide/text/index.rst b/docs/user_guide/text/index.rst
new file mode 100644
index 000000000..0a7ce55bb
--- /dev/null
+++ b/docs/user_guide/text/index.rst
@@ -0,0 +1,18 @@
+.. -*- mode: rst -*-
+
+Text Feature Extraction
+=======================
+
+Feature-engine's text module includes transformers to extract numerical features
+from text/string variables.
+
+Text feature extraction is useful for machine learning problems where you have
+text data but want to derive numerical statistics without, or in addition to,
+creating sparse bag-of-words or TF-IDF representations.
+
+**Transformers**
+
+.. toctree::
+   :maxdepth: 1
+
+   TextFeatures
diff --git a/feature_engine/text/__init__.py b/feature_engine/text/__init__.py
new file mode 100644
index 000000000..14626b79c
--- /dev/null
+++ b/feature_engine/text/__init__.py
@@ -0,0 +1,9 @@
+"""
+The module text includes classes to extract features from text/string variables.
+""" + +from .text_features import TextFeatures + +__all__ = [ + "TextFeatures", +] diff --git a/feature_engine/text/text_features.py b/feature_engine/text/text_features.py new file mode 100644 index 000000000..8299863ff --- /dev/null +++ b/feature_engine/text/text_features.py @@ -0,0 +1,332 @@ +# Authors: Ankit Hemant Lade (contributor) +# License: BSD 3 clause +from typing import List, Optional, Union, cast + +import pandas as pd +from sklearn.base import BaseEstimator, TransformerMixin +from sklearn.utils.validation import check_is_fitted + +from feature_engine._base_transformers.mixins import GetFeatureNamesOutMixin +from feature_engine._check_init_parameters.check_init_input_params import ( + _check_param_drop_original, + _check_param_missing_values, +) +from feature_engine.dataframe_checks import ( + _check_optional_contains_na, + _check_X_matches_training_df, + check_X, +) + +# Available text features and their computation functions +TEXT_FEATURES = { + "char_count": lambda x: x.str.replace(r"\s+", "", regex=True).str.len(), + "word_count": lambda x: x.str.strip().str.split().str.len(), + "sentence_count": lambda x: x.str.count(r"[.!?]+"), + "avg_word_length": lambda x: x.str.strip().str.len() + / x.str.strip().str.split().str.len(), + "digit_count": lambda x: x.str.count(r"\d"), + "letter_count": lambda x: x.str.count(r"[a-zA-Z]"), + "uppercase_count": lambda x: x.str.count(r"[A-Z]"), + "lowercase_count": lambda x: x.str.count(r"[a-z]"), + "special_char_count": lambda x: x.str.count(r"[^a-zA-Z0-9\s]"), + "whitespace_count": lambda x: x.str.count(r"\s"), + "whitespace_ratio": lambda x: x.str.count(r"\s") / x.str.len().replace(0, 1), + "digit_ratio": lambda x: x.str.count(r"\d") + / x.str.replace(r"\s+", "", regex=True).str.len().replace(0, 1), + "uppercase_ratio": lambda x: x.str.count(r"[A-Z]") + / x.str.replace(r"\s+", "", regex=True).str.len().replace(0, 1), + "has_digits": lambda x: x.str.contains(r"\d", regex=True).astype(int), + "has_uppercase": lambda 
x: x.str.contains(r"[A-Z]", regex=True).astype(int),
+    "is_empty": lambda x: (x.str.len() == 0).astype(int),
+    "starts_with_uppercase": lambda x: x.str.match(r"^[A-Z]").astype(int),
+    "ends_with_punctuation": lambda x: x.str.match(r".*[.!?]$").astype(int),
+    "unique_word_count": lambda x: (x.str.lower().str.split().apply(set).str.len()),
+    "lexical_diversity": lambda x: x.str.strip().str.split().str.len()
+    / x.str.lower().str.split().apply(set).str.len(),
+}
+
+
+class TextFeatures(TransformerMixin, BaseEstimator, GetFeatureNamesOutMixin):
+    """
+    TextFeatures() extracts numerical features from text/string variables. This
+    transformer is useful for extracting basic text statistics that can be used
+    as features in machine learning models.
+
+    A list with the text variables must be passed as an argument.
+
+    More details in the :ref:`User Guide <text_features>`.
+
+    Parameters
+    ----------
+    variables: string, list
+        The list of text/string variables to extract features from.
+
+    features: list, default=None
+        List of text features to extract. Available features are:
+
+        - 'char_count': Number of characters, excluding whitespace
+        - 'word_count': Number of words (whitespace-separated tokens)
+        - 'sentence_count': Number of sentences (based on .!? punctuation)
+        - 'avg_word_length': Average word length (stripped text length divided
+          by word count)
+        - 'digit_count': Number of digit characters
+        - 'letter_count': Number of alphabetic characters (a-z, A-Z)
+        - 'uppercase_count': Number of uppercase letters
+        - 'lowercase_count': Number of lowercase letters
+        - 'special_char_count': Number of special characters (non-alphanumeric)
+        - 'whitespace_count': Number of whitespace characters
+        - 'whitespace_ratio': Ratio of whitespace to total characters
+        - 'digit_ratio': Ratio of digits to non-whitespace characters
+        - 'uppercase_ratio': Ratio of uppercase letters to non-whitespace characters
+        - 'has_digits': Binary indicator if text contains digits
+        - 'has_uppercase': Binary indicator if text contains uppercase
+        - 'is_empty': Binary indicator if text is empty
+        - 'starts_with_uppercase': Binary indicator if text starts with uppercase
+        - 'ends_with_punctuation': Binary indicator if text ends with .!?
+        - 'unique_word_count': Number of unique words (case-insensitive)
+        - 'lexical_diversity': Ratio of total words to unique words
+
+        If None, extracts all available features.
+
+    missing_values: string, default='ignore'
+        If 'ignore', NaNs will be filled with an empty string before feature
+        extraction. If 'raise', the transformer will raise an error if missing data
+        is found.
+
+    drop_original: bool, default=False
+        Whether to drop the original text columns after transformation.
+
+    Attributes
+    ----------
+    variables_:
+        The list of text variables that will be transformed.
+
+    features_:
+        The list of features that will be extracted.
+
+    feature_names_in_:
+        List with the names of features seen during fit.
+
+    n_features_in_:
+        The number of features in the train set used in fit.
+
+    Methods
+    -------
+    fit:
+        This transformer does not learn parameters.
+
+    fit_transform:
+        Fit to data, then transform it.
+
+    transform:
+        Extract text features and add them to the dataframe.
+
+    get_feature_names_out:
+        Get output feature names for transformation.
+ + See Also + -------- + feature_engine.encoding.StringSimilarityEncoder : + Encodes categorical variables based on string similarity. + + Examples + -------- + + >>> import pandas as pd + >>> from feature_engine.text import TextFeatures + >>> X = pd.DataFrame({ + ... 'text': ['Hello World!', 'Python is GREAT.', 'ML rocks 123'] + ... }) + >>> tf = TextFeatures( + ... variables=['text'], + ... features=['char_count', 'word_count', 'has_digits'] + ... ) + >>> tf.fit(X) + TextFeatures(features=['char_count', 'word_count', 'has_digits'], + variables=['text']) + >>> X = tf.transform(X) + >>> pd.options.display.max_columns = 10 + >>> print(X) + text text_char_count text_word_count text_has_digits + 0 Hello World! 11 2 0 + 1 Python is GREAT. 14 3 0 + 2 ML rocks 123 10 3 1 + """ + + def __init__( + self, + variables: Union[str, List[str]], + features: Optional[List[str]] = None, + missing_values: str = "ignore", + drop_original: bool = False, + ) -> None: + + # Validate variables + if isinstance(variables, str): + variables = [variables] + if not isinstance(variables, list) or not all( + isinstance(v, str) for v in variables + ): + raise ValueError( + "variables must be a string or a list of strings. " + f"Got {type(variables).__name__} instead." + ) + + # Validate features + if features is not None: + if not isinstance(features, list) or not all( + isinstance(f, str) for f in features + ): + raise ValueError( + "features must be None or a list of strings. " + f"Got {type(features).__name__} instead." + ) + invalid_features = set(features) - set(TEXT_FEATURES.keys()) + if invalid_features: + raise ValueError( + f"Invalid features: {invalid_features}. 
" + f"Available features are: {list(TEXT_FEATURES.keys())}" + ) + + _check_param_drop_original(drop_original) + _check_param_missing_values(missing_values) + + self.variables = variables + self.features = features + self.missing_values = missing_values + self.drop_original = drop_original + + def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None): + """ + This transformer does not learn any parameters. + + Parameters + ---------- + X: pandas dataframe of shape = [n_samples, n_features] + The training input samples. Can be the entire dataframe, not just the + variables to transform. + + y: pandas Series, or np.array. Defaults to None. + The target. It is not needed in this transformer. You can pass y or None. + """ + + # check input dataframe + X = check_X(X) + + # Validate user-specified variables exist + missing = set(self.variables) - set(X.columns) + if missing: + raise ValueError(f"Variables {missing} are not present in the dataframe.") + + # Validate that the variables are object or string + non_text = [ + col + for col in self.variables + if not ( + pd.api.types.is_string_dtype(X[col]) + or pd.api.types.is_object_dtype(X[col]) + ) + ] + if non_text: + raise ValueError( + f"Variables {non_text} are not object or string. " + "Please provide text variables only." + ) + + self.variables_ = self.variables + + # check if dataset contains na + if self.missing_values == "raise": + _check_optional_contains_na(X, cast(list[Union[str, int]], self.variables_)) + + # Set features to extract + if self.features is None: + self.features_ = list(TEXT_FEATURES.keys()) + else: + self.features_ = self.features + + # save input features + self.feature_names_in_ = X.columns.tolist() + + # save train set shape + self.n_features_in_ = X.shape[1] + + return self + + def transform(self, X: pd.DataFrame) -> pd.DataFrame: + """ + Extract text features and add them to the dataframe. 
+ + Parameters + ---------- + X: pandas dataframe of shape = [n_samples, n_features] + The data to transform. + + Returns + ------- + X_new: Pandas dataframe + The dataframe with the original columns plus the new text features. + """ + + # Check method fit has been called + check_is_fitted(self) + + # check that input is a dataframe + X = check_X(X) + + # Check if input data contains same number of columns as dataframe used to fit. + _check_X_matches_training_df(X, self.n_features_in_) + + # check if dataset contains na + if self.missing_values == "raise": + _check_optional_contains_na(X, cast(list[Union[str, int]], self.variables_)) + else: + X[self.variables_] = X[self.variables_].fillna("") + + # reorder variables to match train set + X = X[self.feature_names_in_] + + # Extract features for each text variable + for var in self.variables_: + for feature_name in self.features_: + new_col_name = f"{var}_{feature_name}" + feature_func = TEXT_FEATURES[feature_name] + X[new_col_name] = feature_func(X[var]) + + # Fill any NaN values resulting from computation with 0 + X[new_col_name] = X[new_col_name].fillna(0) + + if self.drop_original: + X = X.drop(columns=self.variables_) + + return X + + def get_feature_names_out(self, input_features=None) -> List[str]: + """ + Get output feature names for transformation. + + Parameters + ---------- + input_features : array-like of str or None, default=None + Input features. If ``None``, uses ``feature_names_in_``. + + Returns + ------- + feature_names_out : list of str + Output feature names. 
+ """ + check_is_fitted(self) + + # Start with original features + if self.drop_original: + feature_names = [ + f for f in self.feature_names_in_ if f not in self.variables_ + ] + else: + feature_names = list(self.feature_names_in_) + + # Add new text feature names + for var in self.variables_: + for feature_name in self.features_: + feature_names.append(f"{var}_{feature_name}") + + return feature_names diff --git a/tests/test_text/__init__.py b/tests/test_text/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/tests/test_text/test_text_features.py b/tests/test_text/test_text_features.py new file mode 100644 index 000000000..ba906885f --- /dev/null +++ b/tests/test_text/test_text_features.py @@ -0,0 +1,780 @@ +import pandas as pd +import pytest + +from feature_engine.text import TextFeatures +from feature_engine.text.text_features import TEXT_FEATURES + +# ============================================================================== +# INIT TESTS +# ============================================================================== + + +@pytest.mark.parametrize( + "invalid_variables", + [ + 123, + True, + [1, 2], + ["text", 123], + {"text": 1}, + ], +) +def test_invalid_variables_raises_error(invalid_variables): + with pytest.raises(ValueError, match="variables must be a string or a list of"): + TextFeatures(variables=invalid_variables) + + +@pytest.mark.parametrize( + "invalid_features, err_msg", + [ + ("some_string", "features must be"), + ([1, 2], "features must be"), + (123, "features must be"), + (True, "features must be"), + (["some_string", True], "features must be"), + ({"some_string": 1}, "features must be"), + (["invalid_feature"], "Invalid features"), + (["char_count", "invalid_feature"], "Invalid features"), + ], +) +def test_invalid_features_raises_error(invalid_features, err_msg): + with pytest.raises(ValueError, match=err_msg): + TextFeatures(variables=["text"], features=invalid_features) + + +# 
============================================================================== +# FIT TESTS +# ============================================================================== + + +@pytest.mark.parametrize( + "variables, features", + [ + ("text", None), + (["string"], ["char_count"]), + (["text", "string"], ["sentence_count", "avg_word_length"]), + ], +) +def test_fit_stores_attributes(variables, features): + X = pd.DataFrame({"text": ["Hello"], "string": ["Bye"]}) + transformer = TextFeatures(variables=variables, features=features) + transformer.fit(X) + + assert ( + transformer.variables_ == variables + if isinstance(variables, list) + else transformer.variables_ == [variables] + ) + assert ( + transformer.features_ == list(TEXT_FEATURES.keys()) + if features is None + else transformer.features_ == features + ) + assert transformer.feature_names_in_ == ["text", "string"] + assert transformer.n_features_in_ == 2 + + +def test_missing_variable_raises_error(): + X = pd.DataFrame({"text": ["Hello"]}) + transformer = TextFeatures(variables=["nonexistent"]) + with pytest.raises(ValueError, match="not present in the dataframe"): + transformer.fit(X) + + +@pytest.mark.parametrize("variables", ["Age", "Marks", "dob"]) +def test_no_text_columns_raises_error(df_vartypes, variables): + transformer = TextFeatures(variables=variables) + with pytest.raises(ValueError, match="not object or string"): + transformer.fit(df_vartypes) + + +def test_nan_handling_raise_error_fit(df_na): + transformer = TextFeatures( + variables=["City"], features=["char_count"], missing_values="raise" + ) + msg = "`missing_values='ignore'` when initialising this transformer" + with pytest.raises(ValueError, match=msg): + transformer.fit(df_na) + + +# ============================================================================== +# TRANSFORM TESTS - GENERAL +# ============================================================================== + + +def test_transform_on_new_data(): + X_train = 
pd.DataFrame({"text": ["Hello World", "Foo Bar"]}) + X_test = pd.DataFrame({"text": ["New Data", "Test 123"]}) + + transformer = TextFeatures( + variables=["text"], features=["char_count", "has_digits"] + ) + transformer.fit(X_train) + X_tr = transformer.transform(X_test) + + assert X_tr["text_char_count"].tolist() == [7, 7] + assert X_tr["text_has_digits"].tolist() == [0, 1] + + +def test_nan_handling_raise_error_transform(): + X_train = pd.DataFrame({"text": ["Hello", "World"]}) + X_test = pd.DataFrame({"text": ["Hello", None, "World"]}) + transformer = TextFeatures( + variables=["text"], features=["char_count"], missing_values="raise" + ) + transformer.fit(X_train) + msg = "`missing_values='ignore'` when initialising this transformer" + with pytest.raises(ValueError, match=msg): + transformer.transform(X_test) + + +def test_nan_handling(): + X = pd.DataFrame({"text": ["Hello", None, "World"]}) + transformer = TextFeatures(variables=["text"], features=["char_count"]) + X_tr = transformer.fit_transform(X) + + # NaN should be filled with empty string, resulting in char_count of 0 + assert X_tr["text_char_count"].tolist() == [5, 0, 5] + + +def test_default_all_features(): + """Test extracting all features with default parameters.""" + X = pd.DataFrame({"text": ["Hello World!", "Python 123", "AI"]}) + transformer = TextFeatures(variables=["text"]) + X_tr = transformer.fit_transform(X) + + # Spot check a few features to ensure they were added and computed + assert X_tr["text_char_count"].tolist() == [11, 9, 2] + assert X_tr["text_word_count"].tolist() == [2, 2, 1] + assert X_tr["text_digit_count"].tolist() == [0, 3, 0] + + +def test_specific_features(): + """Test extracting specific features only.""" + X = pd.DataFrame({"text": ["Hello", "World"]}) + transformer = TextFeatures( + variables=["text"], features=["char_count", "word_count"] + ) + X_tr = transformer.fit_transform(X) + + # Check only specified features are extracted + assert X_tr.columns.tolist() == 
["text", "text_char_count", "text_word_count"] + + +def test_specific_variables(): + """Test extracting features from specific variables only.""" + X = pd.DataFrame( + {"text1": ["Hello", "World"], "text2": ["Foo", "Bar"], "numeric": [1, 2]} + ) + transformer = TextFeatures(variables=["text1"], features=["char_count"]) + X_tr = transformer.fit_transform(X) + + # Only text1 should have features extracted + assert X_tr.columns.tolist() == ["text1", "text2", "numeric", "text1_char_count"] + + +def test_drop_original(): + """Test drop_original parameter.""" + X = pd.DataFrame({"text": ["Hello", "World"], "other": [1, 2]}) + transformer = TextFeatures( + variables=["text"], features=["char_count"], drop_original=True + ) + X_tr = transformer.fit_transform(X) + + assert X_tr.columns.tolist() == ["other", "text_char_count"] + + +def test_string_variable_input(): + """Test that passing a single string variable works (auto-converted to list).""" + X = pd.DataFrame({"text": ["Hello", "World"], "other": ["A", "B"]}) + transformer = TextFeatures(variables="text", features=["char_count"]) + X_tr = transformer.fit_transform(X) + + assert transformer.variables_ == ["text"] + assert X_tr.columns.tolist() == ["text", "other", "text_char_count"] + assert X_tr["text_char_count"].tolist() == [5, 5] + + +def test_multiple_text_columns(): + """Test extracting features from multiple text columns.""" + X = pd.DataFrame({"a": ["Hello", "World"], "b": ["Foo", "Bar"]}) + transformer = TextFeatures( + variables=["a", "b"], features=["char_count", "word_count"] + ) + X_tr = transformer.fit_transform(X) + + assert X_tr.columns.tolist() == [ + "a", + "b", + "a_char_count", + "a_word_count", + "b_char_count", + "b_word_count", + ] + + +# ============================================================================== +# TRANSFORM - TEST TEXT FEATURES +# ============================================================================== + + +@pytest.fixture(scope="module") +def df_text(): + df = 
+        pd.DataFrame(
+            {
+                "text": [
+                    "Hello World!",
+                    "HELLO",
+                    "12345",
+                    "e.g. i.e.",
+                    "   ",
+                    " trailing ",
+                    "abc...",
+                    "",
+                    None,
+                    "A? B! C.",
+                    "HeLLo",
+                    "Hi! @#",
+                    "A1b2 C3d4!@#$",
+                    "???",
+                    "i.e., this is wrong",
+                    "Is 1 > 2? No, 100%!",
+                    "Hello. World",
+                    "Hello. World.",
+                    "Hello... World!?!",
+                    "This is a proper sentence containing "
+                    "supercalifragilisticexpialidocious and exceptionally long words.",
+                ]
+            }
+        )
+    return df
+
+
+def test_whitespace_features(df_text):
+    text_features = ["whitespace_count", "whitespace_ratio"]
+    transformer = TextFeatures(variables=["text"], features=text_features)
+    X_tr = transformer.fit_transform(df_text)
+    assert X_tr["text_whitespace_count"].tolist() == [
+        1, 0, 0, 1, 3, 2, 0, 0, 0, 2, 0, 1, 1, 0, 3, 5, 1, 1, 1, 10,
+    ]
+    assert X_tr["text_whitespace_ratio"].tolist() == [
+        0.08333333333333333, 0.0, 0.0, 0.1111111111111111, 1.0, 0.2, 0.0,
+        0.0, 0.0, 0.25, 0.0, 0.16666666666666666, 0.07692307692307693, 0.0,
+        0.15789473684210525, 0.2631578947368421, 0.08333333333333333,
+        0.07692307692307693, 0.058823529411764705, 0.09900990099009901,
+    ]
+
+
+def test_digit_features(df_text):
+    transformer = TextFeatures(
+        variables=["text"], features=["digit_count", "digit_ratio", "has_digits"]
+    )
+    X_tr = transformer.fit_transform(df_text)
+    assert X_tr["text_digit_count"].tolist() == [
+        0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 5, 0, 0, 0, 0,
+    ]
+    assert X_tr["text_digit_ratio"].tolist() == [
+        0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
+        0.3333333333333333, 0.0, 0.0, 0.35714285714285715, 0.0, 0.0, 0.0, 0.0,
+    ]
+    assert X_tr["text_has_digits"].tolist() == [
+        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
+    ]
+
+
+def test_uppercase_features(df_text):
+    transformer = TextFeatures(
+        variables=["text"],
+        features=[
+            "uppercase_count",
+            "uppercase_ratio",
+            "has_uppercase",
+            "starts_with_uppercase",
+        ],
+    )
+    X_tr = transformer.fit_transform(df_text)
+    assert X_tr["text_uppercase_count"].tolist() == [
+        2, 5, 0, 0, 0, 0, 0, 0, 0, 3, 3, 1, 2, 0, 0, 2, 2, 2, 2, 1,
+    ]
+    assert X_tr["text_uppercase_ratio"].tolist() == [
+        0.18181818181818182, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5,
+        0.6, 0.2, 0.16666666666666666, 0.0, 0.0, 0.14285714285714285,
+        0.18181818181818182, 0.16666666666666666, 0.125, 0.01098901098901099,
+    ]
+    assert X_tr["text_has_uppercase"].tolist() == [
+        1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1,
+    ]
+    assert X_tr["text_starts_with_uppercase"].tolist() == [
+        1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1,
+    ]
+
+
+def test_punctuation_features(df_text):
+    transformer = TextFeatures(
+        variables=["text"], features=["special_char_count", "ends_with_punctuation"]
+    )
+    X_tr = transformer.fit_transform(df_text)
+    assert X_tr["text_special_char_count"].tolist() == [
+        1, 0, 0, 4, 0, 0, 3, 0, 0, 3, 0, 3, 4, 3, 3, 5, 1, 2, 6, 1,
+    ]
+    assert X_tr["text_ends_with_punctuation"].tolist() == [
+        1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1,
+    ]
+
+
+def test_word_features(df_text):
+    transformer = TextFeatures(
+        variables=["text"],
+        features=[
+            "word_count",
+            "unique_word_count",
+            "lexical_diversity",
+            "avg_word_length",
+        ],
+    )
+    X_tr = transformer.fit_transform(df_text)
+    assert X_tr["text_word_count"].tolist() == [
+        2, 1, 1, 2, 0, 1, 1, 0, 0, 3, 1, 2, 2, 1, 4, 6, 2, 2, 2, 11,
+    ]
+    assert X_tr["text_unique_word_count"].tolist() == [
+        2, 1, 1, 2, 0, 1, 1, 0, 0, 3, 1, 2, 2, 1, 4, 6, 2, 2, 2, 11,
+    ]
+    assert \
+        X_tr["text_lexical_diversity"].tolist() == [
+            1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0,
+            1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
+        ]
+    assert X_tr["text_avg_word_length"].tolist() == [
+        6.0, 5.0, 5.0, 4.5, 0.0, 8.0, 6.0, 0.0, 0.0, 2.6666666666666665,
+        5.0, 3.0, 6.5, 3.0, 4.75, 3.1666666666666665, 6.0, 6.5, 8.5,
+        9.181818181818182,
+    ]
+
+
+def test_basic_features(df_text):
+    transformer = TextFeatures(
+        variables=["text"],
+        features=[
+            "char_count",
+            "sentence_count",
+            "letter_count",
+            "lowercase_count",
+            "is_empty",
+        ],
+    )
+    X_tr = transformer.fit_transform(df_text)
+    assert X_tr["text_char_count"].tolist() == [
+        11, 5, 5, 8, 0, 8, 6, 0, 0, 6, 5, 5, 12, 3, 16, 14, 11, 12, 16, 91,
+    ]
+    assert X_tr["text_sentence_count"].tolist() == [
+        1, 0, 0, 4, 0, 0, 1, 0, 0, 3, 0, 1, 1, 1, 2, 2, 1, 2, 2, 1,
+    ]
+    assert X_tr["text_letter_count"].tolist() == [
+        10, 5, 0, 4, 0, 8, 3, 0, 0, 3, 5, 2, 4, 0, 13, 4, 10, 10, 10, 90,
+    ]
+    assert X_tr["text_lowercase_count"].tolist() == [
+        8, 0, 0, 4, 0, 8, 3, 0, 0, 0, 2, 1, 2, 0, 13, 2, 8, 8, 8, 89,
+    ]
+    assert X_tr["text_is_empty"].tolist() == [
+        0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+    ]
+
+
+# ==============================================================================
+# OTHER METHOD TESTS
+# ==============================================================================
+
+
+def test_get_feature_names_out():
+    X = pd.DataFrame({"text": ["Hello"], "other": [1]})
+    transformer = TextFeatures(
+        variables=["text"], features=["char_count", "word_count"]
+    )
+    transformer.fit(X)
+
+    feature_names = transformer.get_feature_names_out()
+    expected_features = ["text", "other", "text_char_count", "text_word_count"]
+    assert \
+        feature_names == expected_features
+
+
+def test_get_feature_names_out_with_drop():
+    """Test get_feature_names_out with drop_original=True."""
+    X = pd.DataFrame({"text": ["Hello"], "other": [1]})
+    transformer = TextFeatures(
+        variables=["text"], features=["char_count"], drop_original=True
+    )
+    transformer.fit(X)
+
+    feature_names = transformer.get_feature_names_out()
+    expected_features = ["other", "text_char_count"]
+    assert feature_names == expected_features
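A note for reviewers reading the expected values above: they imply that `char_count` counts non-whitespace characters only (e.g. `"Hello World!"` → 11, not 12). A minimal, dependency-free sketch of the feature logic these tests exercise — an illustration consistent with the expected values, not the transformer's actual implementation — might look like:

```python
# Hypothetical stand-ins for three of the features tested above.
# None/NaN is treated as empty text, matching test_nan_handling.

def char_count(text):
    """Count non-whitespace characters."""
    return sum(not ch.isspace() for ch in (text or ""))

def word_count(text):
    """Count whitespace-delimited words."""
    return len((text or "").split())

def has_digits(text):
    """1 if the text contains any digit, else 0."""
    return int(any(ch.isdigit() for ch in (text or "")))

rows = ["Hello World!", "Python 123", "AI", None]
print([(char_count(t), word_count(t), has_digits(t)) for t in rows])
# [(11, 2, 0), (9, 2, 1), (2, 1, 0), (0, 0, 0)]
```

These three helpers reproduce the values asserted in `test_default_all_features` and `test_nan_handling`; the remaining features (ratios, casing, punctuation) follow the same pattern of simple per-string counts.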