Skip to content

Comments

[ENH] Add TextFeatures transformer for text feature extraction#880

Open
ankitlade12 wants to merge 10 commits intofeature-engine:mainfrom
ankitlade12:add-text-features
Open

[ENH] Add TextFeatures transformer for text feature extraction#880
ankitlade12 wants to merge 10 commits intofeature-engine:mainfrom
ankitlade12:add-text-features

Conversation

@ankitlade12
Copy link
Contributor

  • Add TextFeatures class to extract features from text columns
  • Support for features: char_count, word_count, digit_count, uppercase_count, etc.
  • Add comprehensive tests with pytest parametrize
  • Add user guide documentation

Copy link
Collaborator

@solegalli solegalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ankitlade12

Thanks a lot!

This transformer, function-wise, I'd say it's ready. I made a few suggestions regarding how to optimize the feature creation functions. Let me know if they make sense.

Other than that, we need the various docs file and we'll be good to go :)

Thanks again!

@ankitlade12 ankitlade12 requested a review from solegalli January 23, 2026 16:40
@solegalli
Copy link
Collaborator

We need to rebase main so the 2 remaining tests pass.

Copy link
Collaborator

@solegalli solegalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ankitlade12

I am very sorry for the delayed review. I am travelling till end of April, so I am a bit slower than usual.

I think, for the first version of the transformer, let's enforce the user to pass the names of the text variables. They can pass one or more variables in case there are more than one text column.

Other than that, we need to add the tranformer in the docs/index file, in the readme, and in the docs/api, and adjust the tests and the demo to the newer functionality. Then it is good to merge.

Thank you very much for this great addition.

x.str.lower().str.split().apply(
lambda s: len(set(s)) if isinstance(s, list) else 0
) / x.str.split().str.len()
).fillna(0),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the above, would this not be better:

x.str.strip().str.split().str.len() / x.str.lower().str.split().apply(set).str.len()

>>> X = tf.transform(X)
>>> X
text text_char_count text_word_count text_has_digits
0 Hello World! 12 2 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need to adjust this example after updating the feature calculations

def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None):
"""
Stores feature names and validates that the specified variables are
present and are of string/object type.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
present and are of string/object type.


def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None):
"""
Stores feature names and validates that the specified variables are
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Stores feature names and validates that the specified variables are
This transformer does not learn any parameters.

Parameters
----------
X: pandas dataframe of shape = [n_samples, n_features]
The training input samples.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The training input samples.
The training input samples. Can be the entire dataframe, not just the
variables to transform.

y: pandas Series, or np.array. Defaults to None.
It is not needed in this transformer. You can pass y or None.
Returns
Copy link
Collaborator

@solegalli solegalli Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't include the return in the fit method

X = X[self.feature_names_in_]

# Fill NaN with empty string for feature extraction
X[self.variables_] = X[self.variables_].fillna("")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the user guide we mention that TextFeatures will raise an error on NA by default, and that to override this error, we need to set missing_values to ignore. But we don't have this functionality here.

We need to either add it, or remove the part on missing data on the docs.

To follow with feature-engine's traditional behaviour, we should probably include it. But I see that handling na under the hood could be more practical.

Anyhow, if we decide to add it, I would set missing_values: to ignore by default and give the option to the user to set it to raise if they want to raise NA.

assert "text_digit_count" in X_tr.columns

# Check char_count
assert X_tr["text_char_count"].tolist() == [12, 10, 2]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think with this logic we do not need the 3 previous assert of whether the column is present.

from feature_engine.text import TextFeatures


def test_default_all_features():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to text all features are created as expected. We only test 3.

I would also take the opportunity to test edge cases, like all caps, all digits, 3 different punctuation marks one after the other, the i.e., example, and so on.

In fact, I think that given the complexity of this transformer, i.e., makes a lots of calculations, it would be much better to have 1 test per feature, just to make sure that the result is the one we expect, and we dpn't get surprises like trailing whitespace.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, going further down the tests file I can see that the individual features are tested below, one way or another.

Could I ask to refactor this file:

  • split the file in sections: first init tests, then fit tests, then transform tests, then other method tests.
  • test that each extracted feature returns the expected result in one section, one test after the other, and if possible following the order of TEXT_FEATURES

That would help with review and future maintenance a lot :)



def test_empty_string_handling():
"""Test handling of empty strings."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if instead of having a dedicated test for empty strings and only testing that 2 features are generated correctly, it would be better to just add an empty string to the tests that test every individual feature.


def test_letter_count():
"""Test letter count feature."""
X = pd.DataFrame({"text": ["Hello 123", "WORLD!", "abc..."]})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like, here, I would add the empty string, and maybe even NAN, and maybe an edge case with non alpha numeric characters. I would also add 1 example with trailing whitespace. To cover all edge cases.

assert X_tr["text_letter_count"].tolist() == [5, 5, 3]


def test_uppercase_features():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably bring the uppercase ratio to this test, so we have all upper case features together, and add nan, empty string, and also combinations of letters and numbers and symbols or both, just to make sure the extractions works as intended.


def test_sentence_count():
"""Test sentence counting."""
X = pd.DataFrame({"text": ["Hello. World!", "One sentence", "A? B! C."]})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add ??? or .?! to test edge cases and also "i.e., this is wrong" or something on those lines


def test_unique_word_features():
"""Test unique word features."""
X = pd.DataFrame({"text": ["the the the", "a b c", "x"]})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this covers everything, but in the past, having very simple examples didn't allow me to see some errors that would arise with more complex features. So I think we should include a sentence with all different words as well for a lexical diversity of 1 and maximum unique words, for example. I would combine the use of ? > and numbers in the example texts, to make sure everything works fine.


def test_invalid_feature_raises_error():
"""Test that invalid feature name raises ValueError."""
with pytest.raises(ValueError, match="Invalid features"):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I would use pytest fixture and past various invalid inputs, like an invalid text, or a boolean or a number.

def test_invalid_variables_raises_error():
"""Test that invalid variables parameter raises ValueError."""
with pytest.raises(ValueError, match="variables must be"):
TextFeatures(variables=123)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, I would use pytest to pass multiple invalid inputs.

assert hasattr(transformer, "variables_")
assert hasattr(transformer, "features_")
assert hasattr(transformer, "feature_names_in_")
assert hasattr(transformer, "n_features_in_")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is important to test that the attributes contain the expected values as well.

transformer.fit(X)

feature_names = transformer.get_feature_names_out()
assert "text" in feature_names
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to test feature_names against the expected list.

)
transformer.fit(X)

feature_names = transformer.get_feature_names_out()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to test feature_names against the expected list.

transformer = TextFeatures(variables="text", features=["char_count"])
X_tr = transformer.fit_transform(X)

assert "text_char_count" in X_tr.columns
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this test, I would test that variables_ is a list containing the string only:

assert transformer.variables_ == ['text']

def test_invalid_features_type_raises_error():
"""Test that invalid features type raises ValueError."""
with pytest.raises(ValueError, match="features must be"):
TextFeatures(variables=["text"], features="char_count")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

char_count is so similar to the feature name, that at first sight it is difficult to see why this would raise an error. I would use a more different string, and also use pytest parametrize to pass more values, other than texts, like a boolean.

)
X_tr = transformer.fit_transform(X)

assert "a_char_count" in X_tr.columns
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still not convinced that testing that columns are present is the right approach.


def test_punctuation_features():
"""Test punctuation-related features."""
X = pd.DataFrame({"text": ["Hello.", "World", "Hi!"]})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to have proper sentences, for example one text with 2 sentences, one with punctuation in the middle but not at the end. One text with 2 sentences both with punctuation, on the middle and at the end. And also an example with multiple punctuation.

)
X_tr = transformer.fit_transform(X)

assert X_tr["text_digit_ratio"].tolist() == [0.5, 0.0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably pass the digit ratio to the digits test, the uppercase to the uppercase test, and rename this one for whitespace.


def test_avg_word_length():
"""Test average word length feature."""
X = pd.DataFrame({"text": ["ab cd", "a"]})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to include examples with more than a few letters in addition to these, and proper sentences.


def test_lowercase_count():
"""Test lowercase count feature."""
X = pd.DataFrame({"text": ["Hello", "WORLD"]})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably include examples with numbers and special characters to cover all angles.


def test_variables_list_non_strings_raises_error():
"""Test that a list of non-string variables raises ValueError."""
with pytest.raises(ValueError, match="variables must be"):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would use parametrize to pass a number as well as a list of numbers, and some other invalid inputs.

TextFeatures(variables=[1, 2])


def test_features_list_non_strings_raises_error():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After we implement parametrize in a test about that also tests this input, then we don't need this test anymore.

TextFeatures(variables=["text"], features=[1, 2])


def test_more_tags():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We normally don't test the tags. I would remove these 2 tests. They are useful when testing compatibility with sklearn, but since, this transformer requires texts, I don't imagine that that will be possible.

Copy link
Collaborator

@solegalli solegalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ankitlade12

Thank you so much for the changes and apologies for the delayed response. I was without internet till yesterday.

This PR is looking really good.

When reading the user guide, I noted that some of the outputs where not what I expected, and I think that's due to trailing whitespaces.

I suggest we remove whitespaces before the calculation of some of the features, because intuitively, a user may count letters and then see that the outputs don't match (like I did) and lose trust in the transformer.

After we modify the functionality, we should also update the user guide and the example in the docstring.

In addition, it would be great if we could tidy the test file a bit: divide it in sections, put the tests for the different features all together, to smooth the review process and also make maintaining the class easier looking forward.

In the tests, I would also try and include as many edge cases as possible, to make sure we have all angles covered and we get the result we expect in all circumstances. Like all caps, all lower, all numbers, empty strings, nan, etc, for all features (whenever relevant).

Once we have a test per feature, then we could see what is the best way to test the remaining functionality, like testing that all features are rendered, or some, or extracting from only 1 text col, or 2 text cols, because there could be a bit of a repetition there, that we might not need. I am not sure what's best.

I would also use pytest parametrize when testing that the transformer raises errors, to test more than 1 invalid input at a time.

Thanks a lot for the great work. I feel that we are almost ready, and we could manage to make a new release after we merge this PR :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants