[ENH] Add TextFeatures transformer for text feature extraction #880
ankitlade12 wants to merge 10 commits into feature-engine:main from
Conversation
ankitlade12
commented
Jan 8, 2026
- Add TextFeatures class to extract features from text columns
- Support for features: char_count, word_count, digit_count, uppercase_count, etc.
- Add comprehensive tests with pytest parametrize
- Add user guide documentation
solegalli
left a comment
Hi @ankitlade12
Thanks a lot!
Function-wise, I'd say this transformer is ready. I made a few suggestions regarding how to optimize the feature creation functions. Let me know if they make sense.
Other than that, we need the various docs files and we'll be good to go :)
Thanks again!
We need to rebase main so the 2 remaining tests pass.
solegalli
left a comment
Hi @ankitlade12
I am very sorry for the delayed review. I am travelling till end of April, so I am a bit slower than usual.
I think, for the first version of the transformer, let's require the user to pass the names of the text variables. They can pass one or more variables if there is more than one text column.
Other than that, we need to add the transformer to the docs/index file, the readme, and docs/api, and adjust the tests and the demo to the new functionality. Then it is good to merge.
Thank you very much for this great addition.
| x.str.lower().str.split().apply(
|     lambda s: len(set(s)) if isinstance(s, list) else 0
| ) / x.str.split().str.len()
| ).fillna(0),
Following the above, would this not be better:
x.str.strip().str.split().str.len() / x.str.lower().str.split().apply(set).str.len()
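For example, on a toy Series (the example data is just illustrative, and I'm ignoring NaN handling here):

import pandas as pd

x = pd.Series(["the the the", "Hello World hello", ""])

# total word count, trimming leading/trailing whitespace first
word_count = x.str.strip().str.split().str.len()               # 3, 3, 0

# unique word count, case-insensitive, without the lambda
unique_count = x.str.lower().str.split().apply(set).str.len()  # 1, 2, 0

# the ratio can then be formed either way round; the empty string gives
# 0 / 0, which pandas returns as NaN, so a final fillna(0) is still needed
lexical_diversity = unique_count / word_count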
| >>> X = tf.transform(X)
| >>> X
|            text  text_char_count  text_word_count  text_has_digits
| 0  Hello World!               12                2                0
We might need to adjust this example after updating the feature calculations
| def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None):
|     """
|     Stores feature names and validates that the specified variables are
|     present and are of string/object type.

Suggested change:
| -     present and are of string/object type.
| def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None):
|     """
|     Stores feature names and validates that the specified variables are

Suggested change:
| -     Stores feature names and validates that the specified variables are
| +     This transformer does not learn any parameters.
| Parameters
| ----------
| X: pandas dataframe of shape = [n_samples, n_features]
|     The training input samples.
Suggested change:
| -     The training input samples.
| +     The training input samples. Can be the entire dataframe, not just the
| +     variables to transform.
| y: pandas Series, or np.array. Defaults to None.
|     It is not needed in this transformer. You can pass y or None.
| Returns
We don't include the Returns section in the docstring of the fit method.
| X = X[self.feature_names_in_]
|
| # Fill NaN with empty string for feature extraction
| X[self.variables_] = X[self.variables_].fillna("")
In the user guide we mention that TextFeatures will raise an error on NA by default, and that to override this we need to set missing_values to ignore. But we don't have this functionality here.
We need to either add it, or remove the part about missing data from the docs.
To follow feature-engine's traditional behaviour, we should probably include it. But I see that handling NA under the hood could be more practical.
Anyhow, if we decide to add it, I would set missing_values to ignore by default and give the user the option to set it to raise if they want an error on NA.
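A rough sketch of what I have in mind (the class and parameter wiring below are illustrative only, not the final implementation):

import pandas as pd


class TextFeaturesSketch:
    """Illustrative skeleton only, to show the missing_values behaviour."""

    def __init__(self, variables, features=None, missing_values="ignore"):
        if missing_values not in ("raise", "ignore"):
            raise ValueError(
                "missing_values takes only values 'raise' or 'ignore'. "
                f"Got {missing_values} instead."
            )
        self.variables = variables
        self.features = features
        self.missing_values = missing_values

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()
        if self.missing_values == "raise" and X[self.variables].isnull().any().any():
            raise ValueError(
                "Some of the text variables contain NaN. Set missing_values to "
                "'ignore' to replace NaN with an empty string instead."
            )
        # with 'ignore' (the default), fill NaN so the string methods don't fail
        X[self.variables] = X[self.variables].fillna("")
        # ... feature extraction as in the current transform ...
        return X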
| assert "text_digit_count" in X_tr.columns | ||
|
|
||
| # Check char_count | ||
| assert X_tr["text_char_count"].tolist() == [12, 10, 2] |
There was a problem hiding this comment.
I think with this logic we do not need the 3 previous asserts that check whether the columns are present.
| from feature_engine.text import TextFeatures
|
|
| def test_default_all_features():
We need to test that all features are created as expected. We only test 3.
I would also take the opportunity to test edge cases, like all caps, all digits, 3 different punctuation marks one after the other, the i.e. example, and so on.
In fact, given the complexity of this transformer, i.e., it makes a lot of calculations, I think it would be much better to have 1 test per feature, just to make sure that the result is the one we expect and we don't get surprises like trailing whitespace.
Ok, going further down the tests file I can see that the individual features are tested below, one way or another.
Could I ask you to refactor this file:
- split the file in sections: first init tests, then fit tests, then transform tests, then other method tests.
- test that each extracted feature returns the expected result in one section, one test after the other, and if possible following the order of TEXT_FEATURES
That would help with review and future maintenance a lot :)
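Something like this for each feature, as a sketch (the expected char_count values assume a plain len() of the raw string; if we end up stripping whitespace first, they would need adjusting):

import pandas as pd

from feature_engine.text import TextFeatures


# ---- transform: one test per extracted feature, in TEXT_FEATURES order ----

def test_char_count():
    # values assume char_count is a plain len() of the raw string
    X = pd.DataFrame({"text": ["Hello World!", "", "123!"]})
    X_tr = TextFeatures(variables=["text"], features=["char_count"]).fit_transform(X)
    assert X_tr["text_char_count"].tolist() == [12, 0, 4]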
| def test_empty_string_handling():
|     """Test handling of empty strings."""
I wonder if instead of having a dedicated test for empty strings and only testing that 2 features are generated correctly, it would be better to just add an empty string to the tests that test every individual feature.
| def test_letter_count():
|     """Test letter count feature."""
|     X = pd.DataFrame({"text": ["Hello 123", "WORLD!", "abc..."]})
Like here, I would add the empty string, maybe even NaN, and maybe an edge case with non-alphanumeric characters. I would also add one example with trailing whitespace, to cover all edge cases.
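For example (the values for the extra rows assume NaN is filled with an empty string, and that whitespace and punctuation don't count as letters):

import numpy as np
import pandas as pd

from feature_engine.text import TextFeatures


def test_letter_count():
    # last three rows are the edge cases: empty string, trailing whitespace, NaN
    X = pd.DataFrame({"text": ["Hello 123", "WORLD!", "abc...", "", "  Hi  ", np.nan]})
    X_tr = TextFeatures(variables=["text"], features=["letter_count"]).fit_transform(X)
    assert X_tr["text_letter_count"].tolist() == [5, 5, 3, 0, 2, 0]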
| assert X_tr["text_letter_count"].tolist() == [5, 5, 3] | ||
|
|
||
|
|
||
| def test_uppercase_features(): |
There was a problem hiding this comment.
I would probably bring the uppercase ratio into this test, so we have all uppercase features together, and add NaN, an empty string, and also combinations of letters, numbers and symbols, just to make sure the extraction works as intended.
| def test_sentence_count():
|     """Test sentence counting."""
|     X = pd.DataFrame({"text": ["Hello. World!", "One sentence", "A? B! C."]})
I would add "???" or ".?!" to test edge cases, and also "i.e., this is wrong" or something along those lines.
| def test_unique_word_features():
|     """Test unique word features."""
|     X = pd.DataFrame({"text": ["the the the", "a b c", "x"]})
I think this covers everything, but in the past, very simple examples didn't let me see errors that only showed up with more complex inputs. So I think we should also include a sentence where all words are different, for a lexical diversity of 1 and the maximum number of unique words, for example. I would combine the use of ?, >, and numbers in the example texts, to make sure everything works fine.
| def test_invalid_feature_raises_error():
|     """Test that invalid feature name raises ValueError."""
|     with pytest.raises(ValueError, match="Invalid features"):
Here I would use pytest parametrize and pass various invalid inputs, like an invalid string, a boolean, or a number.
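Something like this (I dropped the match argument because the wrong-name and wrong-type cases may raise different messages):

import pytest

from feature_engine.text import TextFeatures


@pytest.mark.parametrize(
    "features",
    ["not_a_real_feature", True, 3.14, ["char_count", "not_a_real_feature"]],
)
def test_invalid_features_raise_error(features):
    with pytest.raises(ValueError):
        TextFeatures(variables=["text"], features=features)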
| def test_invalid_variables_raises_error():
|     """Test that invalid variables parameter raises ValueError."""
|     with pytest.raises(ValueError, match="variables must be"):
|         TextFeatures(variables=123)
Same here, I would use pytest parametrize to pass multiple invalid inputs.
| assert hasattr(transformer, "variables_")
| assert hasattr(transformer, "features_")
| assert hasattr(transformer, "feature_names_in_")
| assert hasattr(transformer, "n_features_in_")
It is important to test that the attributes contain the expected values as well.
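For example, assuming a fit on a dataframe with a single text column and features=["char_count"], and assuming feature_names_in_ is stored as a plain list:

import pandas as pd

from feature_engine.text import TextFeatures

X = pd.DataFrame({"text": ["Hello World!", "abc"]})
transformer = TextFeatures(variables=["text"], features=["char_count"]).fit(X)

assert transformer.variables_ == ["text"]
assert transformer.features_ == ["char_count"]
assert transformer.feature_names_in_ == ["text"]   # assuming a list, not an array
assert transformer.n_features_in_ == 1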
| transformer.fit(X)
|
| feature_names = transformer.get_feature_names_out()
| assert "text" in feature_names
I think it would be better to test feature_names against the expected list.
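i.e. something along these lines (the exact list depends on the features requested in that test; I'm assuming variables=["text"] with char_count and word_count, and that the original columns come first in the output):

import pandas as pd

from feature_engine.text import TextFeatures

X = pd.DataFrame({"text": ["Hello World!", "abc"]})
transformer = TextFeatures(
    variables=["text"], features=["char_count", "word_count"]
).fit(X)

assert transformer.get_feature_names_out() == [
    "text",
    "text_char_count",
    "text_word_count",
]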
| )
| transformer.fit(X)
|
| feature_names = transformer.get_feature_names_out()
I think it would be better to test feature_names against the expected list.
| transformer = TextFeatures(variables="text", features=["char_count"])
| X_tr = transformer.fit_transform(X)
|
| assert "text_char_count" in X_tr.columns
For this test, I would test that variables_ is a list containing the string only:
assert transformer.variables_ == ['text']
| def test_invalid_features_type_raises_error():
|     """Test that invalid features type raises ValueError."""
|     with pytest.raises(ValueError, match="features must be"):
|         TextFeatures(variables=["text"], features="char_count")
char_count is so similar to the feature name that at first sight it is difficult to see why this would raise an error. I would use a clearly different string, and also use pytest parametrize to pass more values other than strings, like a boolean.
| )
| X_tr = transformer.fit_transform(X)
|
| assert "a_char_count" in X_tr.columns
I am still not convinced that testing that columns are present is the right approach.
| def test_punctuation_features():
|     """Test punctuation-related features."""
|     X = pd.DataFrame({"text": ["Hello.", "World", "Hi!"]})
It would be great to have proper sentences: for example, one text with 2 sentences, one with punctuation in the middle but not at the end, one text with 2 sentences with punctuation both in the middle and at the end, and also an example with multiple consecutive punctuation marks.
| )
| X_tr = transformer.fit_transform(X)
|
| assert X_tr["text_digit_ratio"].tolist() == [0.5, 0.0]
I would probably move the digit ratio to the digits test and the uppercase ratio to the uppercase test, and rename this one for whitespace.
| def test_avg_word_length():
|     """Test average word length feature."""
|     X = pd.DataFrame({"text": ["ab cd", "a"]})
It would be good to include examples with more than a few letters in addition to these, and proper sentences.
| def test_lowercase_count():
|     """Test lowercase count feature."""
|     X = pd.DataFrame({"text": ["Hello", "WORLD"]})
We should probably include examples with numbers and special characters to cover all angles.
| def test_variables_list_non_strings_raises_error():
|     """Test that a list of non-string variables raises ValueError."""
|     with pytest.raises(ValueError, match="variables must be"):
I would use parametrize to pass a number as well as a list of numbers, and some other invalid inputs.
| TextFeatures(variables=[1, 2])
|
|
| def test_features_list_non_strings_raises_error():
Once we implement parametrize in a test that also covers this input, we won't need this test anymore.
| TextFeatures(variables=["text"], features=[1, 2]) | ||
|
|
||
|
|
||
| def test_more_tags(): |
There was a problem hiding this comment.
We normally don't test the tags. I would remove these 2 tests. They are useful when testing compatibility with sklearn, but since this transformer requires text, I don't imagine that will be possible.
solegalli
left a comment
Hey @ankitlade12
Thank you so much for the changes and apologies for the delayed response. I was without internet till yesterday.
This PR is looking really good.
When reading the user guide, I noted that some of the outputs were not what I expected, and I think that's due to trailing whitespace.
I suggest we remove whitespace before calculating some of the features, because intuitively a user may count the letters themselves, see that the outputs don't match (like I did), and lose trust in the transformer.
After we modify the functionality, we should also update the user guide and the example in the docstring.
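Something as simple as this before the counts should do, as a sketch (variable names are illustrative):

import pandas as pd

X = pd.DataFrame({"text": ["  Hello  ", "Hello World!  "]})

# strip leading/trailing whitespace (and fill NaN) before counting,
# so "  Hello  " gives char_count 5 and word_count 1
x = X["text"].fillna("").str.strip()

char_count = x.str.len()              # 5, 12
word_count = x.str.split().str.len()  # 1, 2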
In addition, it would be great if we could tidy the test file a bit: divide it into sections and put the tests for the different features all together, to smooth the review process and also make maintaining the class easier going forward.
In the tests, I would also try to include as many edge cases as possible, to make sure we have all angles covered and we get the result we expect in all circumstances: all caps, all lower case, all numbers, empty strings, NaN, etc., for all features (whenever relevant).
Once we have a test per feature, we could see what is the best way to test the remaining functionality, like testing that all features are created, or only some, or extracting from only 1 text column, or 2 text columns, because there could be a bit of repetition there that we might not need. I am not sure what's best.
I would also use pytest parametrize when testing that the transformer raises errors, to test more than 1 invalid input at a time.
Thanks a lot for the great work. I feel that we are almost ready, and we could manage to make a new release after we merge this PR :)