[ENH] Add TextFeatures transformer for text feature extraction #880
ankitlade12 wants to merge 10 commits into feature-engine:main from
Conversation
ankitlade12
commented
Jan 8, 2026
- Add TextFeatures class to extract features from text columns
- Support for features: char_count, word_count, digit_count, uppercase_count, etc.
- Add comprehensive tests with pytest parametrize
- Add user guide documentation
solegalli
left a comment
Hi @ankitlade12
Thanks a lot!
Function-wise, I'd say this transformer is ready. I made a few suggestions regarding how to optimize the feature creation functions. Let me know if they make sense.
Other than that, we need the various docs files and we'll be good to go :)
Thanks again!
We need to rebase main so the 2 remaining tests pass.
solegalli
left a comment
Hi @ankitlade12
I am very sorry for the delayed review. I am travelling till end of April, so I am a bit slower than usual.
I think, for the first version of the transformer, let's require the user to pass the names of the text variables. They can pass one or more variables if there is more than one text column.
Other than that, we need to add the transformer to the docs/index file, the readme, and docs/api, and adjust the tests and the demo to the new functionality. Then it is good to merge.
Thank you very much for this great addition.
| x.str.lower().str.split().apply(
|     lambda s: len(set(s)) if isinstance(s, list) else 0
| ) / x.str.split().str.len()
| ).fillna(0),
Following the above, would this not be better:
x.str.strip().str.split().str.len() / x.str.lower().str.split().apply(set).str.len()
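For example, on a toy Series (the example data is just illustrative, and I'm ignoring NaN handling here):

import pandas as pd

x = pd.Series(["the the the", "Hello World hello", ""])

# total word count, trimming leading/trailing whitespace first
word_count = x.str.strip().str.split().str.len()               # 3, 3, 0

# unique word count, case-insensitive, without the lambda
unique_count = x.str.lower().str.split().apply(set).str.len()  # 1, 2, 0

# the ratio can then be formed either way round; the empty string gives
# 0 / 0, which pandas returns as NaN, so a final fillna(0) is still needed
lexical_diversity = unique_count / word_count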
| >>> X = tf.transform(X)
| >>> X
|            text  text_char_count  text_word_count  text_has_digits
| 0  Hello World!               12                2                0
We might need to adjust this example after updating the feature calculations
| def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None):
|     """
|     Stores feature names and validates that the specified variables are
|     present and are of string/object type.

Suggested change:
| -     present and are of string/object type.
| def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None):
|     """
|     Stores feature names and validates that the specified variables are

Suggested change:
| -     Stores feature names and validates that the specified variables are
| +     This transformer does not learn any parameters.
| Parameters
| ----------
| X: pandas dataframe of shape = [n_samples, n_features]
|     The training input samples.
Suggested change:
| -     The training input samples.
| +     The training input samples. Can be the entire dataframe, not just the
| +     variables to transform.
| y: pandas Series, or np.array. Defaults to None.
|     It is not needed in this transformer. You can pass y or None.
| Returns
We don't include the Returns section in the docstring of the fit method.
| X = X[self.feature_names_in_]
|
| # Fill NaN with empty string for feature extraction
| X[self.variables_] = X[self.variables_].fillna("")
In the user guide we mention that TextFeatures will raise an error on NA by default, and that to override this we need to set missing_values to ignore. But we don't have this functionality here.
We need to either add it, or remove the part about missing data from the docs.
To follow feature-engine's traditional behaviour, we should probably include it. But I see that handling NA under the hood could be more practical.
Anyhow, if we decide to add it, I would set missing_values to ignore by default and give the user the option to set it to raise if they want an error on NA.
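A rough sketch of what I have in mind (the class and parameter wiring below are illustrative only, not the final implementation):

import pandas as pd


class TextFeaturesSketch:
    """Illustrative skeleton only, to show the missing_values behaviour."""

    def __init__(self, variables, features=None, missing_values="ignore"):
        if missing_values not in ("raise", "ignore"):
            raise ValueError(
                "missing_values takes only values 'raise' or 'ignore'. "
                f"Got {missing_values} instead."
            )
        self.variables = variables
        self.features = features
        self.missing_values = missing_values

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()
        if self.missing_values == "raise" and X[self.variables].isnull().any().any():
            raise ValueError(
                "Some of the text variables contain NaN. Set missing_values to "
                "'ignore' to replace NaN with an empty string instead."
            )
        # with 'ignore' (the default), fill NaN so the string methods don't fail
        X[self.variables] = X[self.variables].fillna("")
        # ... feature extraction as in the current transform ...
        return X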
| assert "text_digit_count" in X_tr.columns | ||
|
|
||
| # Check char_count | ||
| assert X_tr["text_char_count"].tolist() == [12, 10, 2] |
There was a problem hiding this comment.
I think with this logic we do not need the 3 previous asserts that check whether the columns are present.
| from feature_engine.text import TextFeatures
|
|
| def test_default_all_features():
We need to test that all features are created as expected. We only test 3.
I would also take the opportunity to test edge cases, like all caps, all digits, 3 different punctuation marks one after the other, the i.e. example, and so on.
In fact, given the complexity of this transformer, i.e., it makes a lot of calculations, I think it would be much better to have 1 test per feature, just to make sure that the result is the one we expect and we don't get surprises like trailing whitespace.
Ok, going further down the tests file I can see that the individual features are tested below, one way or another.
Could I ask you to refactor this file:
- split the file in sections: first init tests, then fit tests, then transform tests, then other method tests.
- test that each extracted feature returns the expected result in one section, one test after the other, and if possible following the order of TEXT_FEATURES
That would help with review and future maintenance a lot :)
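Something like this for each feature, as a sketch (the expected char_count values assume a plain len() of the raw string; if we end up stripping whitespace first, they would need adjusting):

import pandas as pd

from feature_engine.text import TextFeatures


# ---- transform: one test per extracted feature, in TEXT_FEATURES order ----

def test_char_count():
    # values assume char_count is a plain len() of the raw string
    X = pd.DataFrame({"text": ["Hello World!", "", "123!"]})
    X_tr = TextFeatures(variables=["text"], features=["char_count"]).fit_transform(X)
    assert X_tr["text_char_count"].tolist() == [12, 0, 4]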
| def test_empty_string_handling():
|     """Test handling of empty strings."""
I wonder if instead of having a dedicated test for empty strings and only testing that 2 features are generated correctly, it would be better to just add an empty string to the tests that test every individual feature.
| def test_letter_count():
|     """Test letter count feature."""
|     X = pd.DataFrame({"text": ["Hello 123", "WORLD!", "abc..."]})
Like here, I would add the empty string, maybe even NaN, and maybe an edge case with non-alphanumeric characters. I would also add one example with trailing whitespace, to cover all edge cases.
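For example (the values for the extra rows assume NaN is filled with an empty string, and that whitespace and punctuation don't count as letters):

import numpy as np
import pandas as pd

from feature_engine.text import TextFeatures


def test_letter_count():
    # last three rows are the edge cases: empty string, trailing whitespace, NaN
    X = pd.DataFrame({"text": ["Hello 123", "WORLD!", "abc...", "", "  Hi  ", np.nan]})
    X_tr = TextFeatures(variables=["text"], features=["letter_count"]).fit_transform(X)
    assert X_tr["text_letter_count"].tolist() == [5, 5, 3, 0, 2, 0]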
| assert X_tr["text_letter_count"].tolist() == [5, 5, 3] | ||
|
|
||
|
|
||
| def test_uppercase_features(): |
There was a problem hiding this comment.
I would probably bring the uppercase ratio into this test, so we have all uppercase features together, and add NaN, an empty string, and also combinations of letters, numbers and symbols, just to make sure the extraction works as intended.
| def test_sentence_count():
|     """Test sentence counting."""
|     X = pd.DataFrame({"text": ["Hello. World!", "One sentence", "A? B! C."]})
I would add "???" or ".?!" to test edge cases, and also "i.e., this is wrong" or something along those lines.
| def test_unique_word_features():
|     """Test unique word features."""
|     X = pd.DataFrame({"text": ["the the the", "a b c", "x"]})
I think this covers everything, but in the past, very simple examples didn't let me see errors that only showed up with more complex inputs. So I think we should also include a sentence where all words are different, for a lexical diversity of 1 and the maximum number of unique words, for example. I would combine the use of ?, >, and numbers in the example texts, to make sure everything works fine.
| def test_invalid_feature_raises_error():
|     """Test that invalid feature name raises ValueError."""
|     with pytest.raises(ValueError, match="Invalid features"):
Here I would use pytest parametrize and pass various invalid inputs, like an invalid string, a boolean, or a number.
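Something like this (I dropped the match argument because the wrong-name and wrong-type cases may raise different messages):

import pytest

from feature_engine.text import TextFeatures


@pytest.mark.parametrize(
    "features",
    ["not_a_real_feature", True, 3.14, ["char_count", "not_a_real_feature"]],
)
def test_invalid_features_raise_error(features):
    with pytest.raises(ValueError):
        TextFeatures(variables=["text"], features=features)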
| def test_invalid_variables_raises_error():
|     """Test that invalid variables parameter raises ValueError."""
|     with pytest.raises(ValueError, match="variables must be"):
|         TextFeatures(variables=123)
Same here, I would use pytest parametrize to pass multiple invalid inputs.
| assert hasattr(transformer, "variables_")
| assert hasattr(transformer, "features_")
| assert hasattr(transformer, "feature_names_in_")
| assert hasattr(transformer, "n_features_in_")
It is important to test that the attributes contain the expected values as well.
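For example, assuming a fit on a dataframe with a single text column and features=["char_count"], and assuming feature_names_in_ is stored as a plain list:

import pandas as pd

from feature_engine.text import TextFeatures

X = pd.DataFrame({"text": ["Hello World!", "abc"]})
transformer = TextFeatures(variables=["text"], features=["char_count"]).fit(X)

assert transformer.variables_ == ["text"]
assert transformer.features_ == ["char_count"]
assert transformer.feature_names_in_ == ["text"]   # assuming a list, not an array
assert transformer.n_features_in_ == 1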
| transformer.fit(X)
|
| feature_names = transformer.get_feature_names_out()
| assert "text" in feature_names
I think it would be better to test feature_names against the expected list.
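i.e. something along these lines (the exact list depends on the features requested in that test; I'm assuming variables=["text"] with char_count and word_count, and that the original columns come first in the output):

import pandas as pd

from feature_engine.text import TextFeatures

X = pd.DataFrame({"text": ["Hello World!", "abc"]})
transformer = TextFeatures(
    variables=["text"], features=["char_count", "word_count"]
).fit(X)

assert transformer.get_feature_names_out() == [
    "text",
    "text_char_count",
    "text_word_count",
]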
| )
| transformer.fit(X)
|
| feature_names = transformer.get_feature_names_out()
I think it would be better to test feature_names against the expected list.
| transformer = TextFeatures(variables="text", features=["char_count"])
| X_tr = transformer.fit_transform(X)
|
| assert "text_char_count" in X_tr.columns
For this test, I would test that variables_ is a list containing the string only:
assert transformer.variables_ == ['text']
| def test_invalid_features_type_raises_error():
|     """Test that invalid features type raises ValueError."""
|     with pytest.raises(ValueError, match="features must be"):
|         TextFeatures(variables=["text"], features="char_count")
char_count is so similar to the feature name that at first sight it is difficult to see why this would raise an error. I would use a clearly different string, and also use pytest parametrize to pass more values other than strings, like a boolean.
| )
| X_tr = transformer.fit_transform(X)
|
| assert "a_char_count" in X_tr.columns
I am still not convinced that testing that columns are present is the right approach.
| def test_punctuation_features():
|     """Test punctuation-related features."""
|     X = pd.DataFrame({"text": ["Hello.", "World", "Hi!"]})
It would be great to have proper sentences: for example, one text with 2 sentences, one with punctuation in the middle but not at the end, one text with 2 sentences with punctuation both in the middle and at the end, and also an example with multiple consecutive punctuation marks.
| )
| X_tr = transformer.fit_transform(X)
|
| assert X_tr["text_digit_ratio"].tolist() == [0.5, 0.0]
I would probably move the digit ratio to the digits test and the uppercase ratio to the uppercase test, and rename this one for whitespace.
| def test_avg_word_length():
|     """Test average word length feature."""
|     X = pd.DataFrame({"text": ["ab cd", "a"]})
It would be good to include examples with more than a few letters in addition to these, and proper sentences.
| def test_lowercase_count():
|     """Test lowercase count feature."""
|     X = pd.DataFrame({"text": ["Hello", "WORLD"]})
We should probably include examples with numbers and special characters to cover all angles.
| def test_variables_list_non_strings_raises_error():
|     """Test that a list of non-string variables raises ValueError."""
|     with pytest.raises(ValueError, match="variables must be"):
I would use parametrize to pass a number as well as a list of numbers, and some other invalid inputs.
| TextFeatures(variables=[1, 2])
|
|
| def test_features_list_non_strings_raises_error():
Once we implement parametrize in a test that also covers this input, we won't need this test anymore.
| TextFeatures(variables=["text"], features=[1, 2]) | ||
|
|
||
|
|
||
| def test_more_tags(): |
There was a problem hiding this comment.
We normally don't test the tags. I would remove these 2 tests. They are useful when testing compatibility with sklearn, but since this transformer requires text, I don't imagine that will be possible.
solegalli
left a comment
Hey @ankitlade12
Thank you so much for the changes and apologies for the delayed response. I was without internet till yesterday.
This PR is looking really good.
When reading the user guide, I noted that some of the outputs were not what I expected, and I think that's due to trailing whitespace.
I suggest we remove whitespace before calculating some of the features, because intuitively a user may count the letters themselves, see that the outputs don't match (like I did), and lose trust in the transformer.
After we modify the functionality, we should also update the user guide and the example in the docstring.
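Something as simple as this before the counts should do, as a sketch (variable names are illustrative):

import pandas as pd

X = pd.DataFrame({"text": ["  Hello  ", "Hello World!  "]})

# strip leading/trailing whitespace (and fill NaN) before counting,
# so "  Hello  " gives char_count 5 and word_count 1
x = X["text"].fillna("").str.strip()

char_count = x.str.len()              # 5, 12
word_count = x.str.split().str.len()  # 1, 2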
In addition, it would be great if we could tidy the test file a bit: divide it into sections and put the tests for the different features all together, to smooth the review process and also make maintaining the class easier going forward.
In the tests, I would also try to include as many edge cases as possible, to make sure we have all angles covered and we get the result we expect in all circumstances: all caps, all lower case, all numbers, empty strings, NaN, etc., for all features (whenever relevant).
Once we have a test per feature, we could see what is the best way to test the remaining functionality, like testing that all features are created, or only some, or extracting from only 1 text column, or 2 text columns, because there could be a bit of repetition there that we might not need. I am not sure what's best.
I would also use pytest parametrize when testing that the transformer raises errors, to test more than 1 invalid input at a time.
Thanks a lot for the great work. I feel that we are almost ready, and we could manage to make a new release after we merge this PR :)