python 3.x - Removing strings that match multiple regex patterns from pandas series
I have a pandas DataFrame column containing text that needs to be cleaned of strings that match various regex patterns. My current attempt (given below) loops through each pattern, creating a new column containing the match if one is found, and then loops through the DataFrame, splitting the text column at the found match. Finally, I drop the no-longer-needed matching column 're_match'.
While this works for my current use case, I can't help but think there must be a more efficient, vectorised way of doing this in pandas, without needing to use iterrows() and without creating a new column. My question is: is there a more optimal way of removing strings that match multiple regex patterns from a column?
In my current use case the unwanted strings are always at the end of the text block, hence the use of split(...)[0]. However, it would be great if the unwanted strings could also be extracted from any point in the text.
Also, note that combining the regexes into one long single pattern is not preferable, as there are tens of patterns that change on a regular basis.
import re
import pandas as pd

df = pd.read_csv('data.csv', index_col=0)

patterns = [
    r'( regex1 \d+)',
    r'((?: regex 2)? \d{1,2} )',
    r'( \d{0,2}.?\d{0,2}-?\d{1,2}.?\d{0,2}regex3 )',
]

for p in patterns:
    # Extract the first match of the pattern (if any) into a helper column.
    df['re_match'] = df['text'].str.extract(
        pat=p, flags=re.IGNORECASE, expand=False
    )
    # Placeholder for rows without a match, so the split() below is a no-op there.
    df['re_match'] = df['re_match'].fillna('xxxxxxxxxxxxxxx')
    for index, row in df.iterrows():
        df.loc[index, 'text'] = row['text'].split(row['re_match'])[0]

df = df.drop('re_match', axis=1)
Thanks for your help.
There is indeed: it's called df.applymap(some_function).
Consider the following example:
import re
from pandas import DataFrame

df = DataFrame({'key1': ['1000', '2000'], 'key2': ['3000', 'digits(1234)']})

def cleanitup(val):
    """Multiply purely numeric values by 10, leave everything else alone."""
    rx = re.compile(r'^\d+$')
    if rx.match(val):
        return int(val) * 10
    else:
        return val

# here the magic starts
df.applymap(cleanitup)
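Note that applymap() returns a new DataFrame, so assign the result back if you want to keep it; for illustration, these are the values the example above produces (in recent pandas versions, applymap() has been renamed to DataFrame.map()):

df = df.applymap(cleanitup)
# df['key1'] is now [10000, 20000]
# df['key2'] is now [30000, 'digits(1234)']  -- the non-numeric cell is untouched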
Obviously, this is made up, but every cell that contained only digits before has now been multiplied by 10, and every other value has been left untouched.
With this in mind, you can check and rearrange the values as necessary in the function cleanitup().
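Applied back to the question, a minimal sketch along these lines (assuming a DataFrame with a 'text' column and the patterns list from the question; strip_matches is a hypothetical helper name) would precompile the patterns once and apply one function per cell, with no iterrows() and no temporary 're_match' column:

import re
import pandas as pd

df = pd.read_csv('data.csv', index_col=0)

# Compile once up front; the patterns are the stand-ins from the question.
compiled = [re.compile(p, flags=re.IGNORECASE) for p in [
    r'( regex1 \d+)',
    r'((?: regex 2)? \d{1,2} )',
    r'( \d{0,2}.?\d{0,2}-?\d{1,2}.?\d{0,2}regex3 )',
]]

def strip_matches(text):
    """Truncate the text at the first match of any pattern."""
    for rx in compiled:
        m = rx.search(text)
        if m:
            text = text[:m.start()]
    return text

df['text'] = df['text'].apply(strip_matches)

Here apply() on a single Series plays the role applymap() plays on a whole DataFrame. Truncating at m.start() reproduces split(match)[0] for a match at the end of the text; if the unwanted strings can occur anywhere (as the question mentions), replacing the truncation with text = rx.sub('', text) removes them in place instead.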