Python 3.x - Removing strings that match multiple regex patterns from a pandas Series
I have a pandas DataFrame column containing text that needs to be cleaned of strings that match various regex patterns. My current attempt (given below) loops through each pattern, creating a new column containing the match if one is found, and then loops through the DataFrame, splitting the column at the found match. I then drop the unneeded matching column 're_match'.
While this works for my current use case, I can't help but think that there must be a more efficient, vectorised way of doing this in pandas, without needing to use iterrows() and create a new column. My question is: is there a more optimal way of removing strings that match multiple regex patterns from a column?
In my current use case the unwanted strings are at the end of the text block, hence the use of split(...)[0]. However, it would be great if the unwanted strings could be removed from any point in the text.
Also, note that combining the regexes into one long single pattern is not preferred, as there are tens of patterns, which change on a regular basis.
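    import re
    import pandas as pd

    df = pd.read_csv('data.csv', index_col=0)
    patterns = [
        r'( regex1 \d+)',
        r'((?: regex 2)? \d{1,2} )',
        r'( \d{0,2}.?\d{0,2}-?\d{1,2}.?\d{0,2}regex3 )',
    ]

    for p in patterns:
        # store the first match of this pattern (NaN where nothing matches)
        df['re_match'] = df['text'].str.extract(
            pat=p, flags=re.IGNORECASE, expand=False
        )
        # placeholder that should never occur in the text, so split() is a no-op
        df['re_match'] = df['re_match'].fillna('xxxxxxxxxxxxxxx')

        # keep only the text before the matched string
        for index, row in df.iterrows():
            df.loc[index, 'text'] = row['text'].split(row['re_match'])[0]

    df = df.drop('re_match', axis=1)

Thank you for your help.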
There is indeed, and it is called df.applymap(some_function).
Consider the following example:
    import re
    from pandas import DataFrame

    df = DataFrame({'key1': ['1000', '2000'], 'key2': ['3000', 'digits(1234)']})

    def cleanitup(val):
        """ multiplies digit values """
        rx = re.compile(r'^\d+$')
        if rx.match(val):
            return int(val) * 10
        else:
            return val

    # here the magic starts
    # (in pandas 2.1+, DataFrame.map is the preferred name for applymap)
    df.applymap(cleanitup)

Obviously, this is made up, but in every cell that contained only digits before, the value has been multiplied by 10, while every other value has been left untouched.
With this in mind, you can check and rearrange your values if necessary in the function cleanitup().
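Applied to the question itself, a loop over the patterns using the vectorised Series.str.replace may be enough, with no need for iterrows() or a helper column. The following is only a minimal sketch: the sample data, the two patterns, and the helper name remove_patterns are hypothetical stand-ins, not the asker's real data or regexes.

```python
import re
import pandas as pd

# Hypothetical sample data and patterns, for illustration only.
df = pd.DataFrame({
    'text': [
        'keep this ref 123',
        'nothing to remove',
        'some text page 7 more text',
    ]
})
patterns = [r'\s*ref \d+', r'\s*page \d{1,2}']


def remove_patterns(series, patterns):
    """Remove every match of each pattern from a Series of strings."""
    for p in patterns:
        # str.replace is vectorised over the whole column
        series = series.str.replace(p, '', flags=re.IGNORECASE, regex=True)
    return series


df['text'] = remove_patterns(df['text'], patterns)
print(df['text'].tolist())
```

Because each pattern is still applied separately, the list can keep changing without building one long combined regex. This removes matches anywhere in the text. To mimic the original split(...)[0] behaviour of discarding everything from the match onward, one option would be to append `.*` to each pattern, keeping in mind that `.` does not match newlines unless re.DOTALL is also passed.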