
How to use apply() in pandas with other columns and rows?

csy19920701 (registered member)
2023-01-26 11:00

...doesn't work perfectly, as the counts don't increment when the same letter appears more than once in a word, e.g. for row 5755, f should be 2 because there are two f's in biffy, but it's only 1.

Original answer:


There's actually a really concise way to do exactly what you're trying to do: .str.get_dummies(). It takes a separator string, splits every string in the column on that separator, and builds a one-hot encoded dataframe from the resulting pieces:

>>> df['words'].str.get_dummies('')
      a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
0     0  0  1  0  0  0  0  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0
1     0  0  0  0  1  0  0  1  0  0  0  0  0  0  0  0  0  1  0  1  0  0  0  0  0  0
2     0  0  0  0  1  0  0  1  1  0  0  0  0  0  0  0  0  1  0  1  0  0  0  0  0  0
3     1  1  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  1  1  0  0  0  0  0
4     0  0  0  1  0  0  0  0  0  0  0  1  0  0  1  0  0  0  0  0  1  0  1  0  0  0
...  .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
5752  0  0  0  0  1  0  0  0  1  0  0  0  0  0  1  0  0  1  1  0  0  0  0  0  0  0
5753  0  1  0  0  1  0  0  0  0  0  0  1  0  0  1  0  0  1  0  0  0  0  0  0  0  0
5754  1  1  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  1  0  0  1  0  0  0  0  0
5755  0  1  0  0  0  1  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0
5756  1  0  0  0  0  0  0  0  0  0  0  1  0  0  0  1  0  0  0  0  1  0  0  0  0  0

To add that back to the original dataframe:

tmp_df = df['words'].str.get_dummies('')
df[tmp_df.columns] = tmp_df

Output:

>>> df
      words  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
0     which  0  0  1  0  0  0  0  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0
1     there  0  0  0  0  1  0  0  1  0  0  0  0  0  0  0  0  0  1  0  1  0  0  0  0  0  0
2     their  0  0  0  0  1  0  0  1  1  0  0  0  0  0  0  0  0  1  0  1  0  0  0  0  0  0
3     about  1  1  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  1  1  0  0  0  0  0
4     would  0  0  0  1  0  0  0  0  0  0  0  1  0  0  1  0  0  0  0  0  1  0  1  0  0  0
...     ... .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
5752  osier  0  0  0  0  1  0  0  0  1  0  0  0  0  0  1  0  0  1  1  0  0  0  0  0  0  0
5753  roble  0  1  0  0  1  0  0  0  0  0  0  1  0  0  1  0  0  1  0  0  0  0  0  0  0  0
5754  rumba  1  1  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  1  0  0  1  0  0  0  0  0
5755  biffy  0  1  0  0  0  1  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0
5756  pupal  1  0  0  0  0  0  0  0  0  0  0  1  0  0  0  1  0  0  0  0  1  0  0  0  0  0

[5757 rows x 27 columns]
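
An equivalent alternative to the assignment above, shown here only as a sketch (it assumes pandas is imported as pd, as elsewhere in this thread), is a column-wise pd.concat:

tmp_df = df['words'].str.get_dummies('')
df = pd.concat([df, tmp_df], axis=1)  # dummies appended column-wise, aligned on the index
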
effect611 (registered member)
2023-01-26 11:00

...how I would go about it, avoiding lambda because it is an anti-pattern in pandas:

  1. Add a temporary column holding the individual letters of each word
  2. Explode the df on the column from step 1
  3. Apply pd.get_dummies to the exploded column from step 2
  4. Step 3 gives an indicator for every letter occurrence
  5. groupby the index and sum the result of step 4, which gives a tally of each letter per row
  6. Use DataFrame.join to attach the result of step 5 back to the original df
  7. Drop the temporary column from the df

Code below

words_df['words1'] = list(map(','.join, words_df['words']))  # disintegrate words into letters
s = (words_df.drop(columns='words1')  # drop the newly added column
     .join(  # join the original df to the computed outcome
         pd.get_dummies(  # one indicator column per letter
             words_df.assign(words1=words_df['words1'].str.split(',')).explode('words1')['words1'],
             prefix='', prefix_sep='')  # raw breakdown by letters
         .groupby(level=0).sum().reset_index()  # group by index and sum letters per index
     )
    )
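
To make the intermediate steps concrete, here is a minimal sketch on a tiny made-up frame (the demo name and the two example words are hypothetical, not part of the answer above):

import pandas as pd

demo = pd.DataFrame({'words': ['biffy', 'pupal']})
demo['words1'] = list(map(','.join, demo['words']))      # step 1: 'b,i,f,f,y' and 'p,u,p,a,l'
letters = demo['words1'].str.split(',').explode()        # step 2: one row per letter, index preserved
counts = pd.get_dummies(letters).groupby(level=0).sum()  # steps 3-5: letter tallies per original row
print(counts)  # 'f' counts 2 for row 0 (biffy), 'p' counts 2 for row 1 (pupal)

Joining counts back onto the frame and dropping the temporary column then completes steps 6 and 7.
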
yechuosini (registered member)
2023-01-26 11:00

...other answers are really good solutions. I tested how fast they are using your data. It seems get_dummies+groupby is faster than collections.Counter.

import string
from collections import Counter

def richardec():
    letter_cols = list(string.ascii_lowercase)
    words_df[letter_cols] = pd.DataFrame(words_df['words'].apply(Counter).tolist()).fillna(0).astype(int)[letter_cols]
    return words_df

def wwnde():
    words_df['words1'] = list(map(','.join, words_df['words']))  # disintegrate words into letters
    s = (words_df.drop(columns='words1')  # drop the newly added column
         .join(  # join the original df to the computed outcome
             pd.get_dummies(  # one indicator column per letter
                 words_df.assign(words1=words_df['words1'].str.split(',')).explode('words1')['words1'],
                 prefix='', prefix_sep='')  # raw breakdown by letters
             .groupby(level=0).sum().reset_index()  # group by index and sum letters per index
         )
        )
    return s

def list_Series_get_dummies():
    # same idea as @wwnde's answer; the only difference is that instead of
    # joining the letters of each word with commas and splitting on them again later,
    # it builds lists of letters directly and explodes those
    out = words_df.join(pd.get_dummies(pd.Series(map(list, words_df['words'])).explode()).groupby(level=0).sum())
    return out

%timeit -n 1000 richardec()
60.9 ms ± 4.93 ms per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit -n 1000 wwnde()
56.7 ms ± 2.64 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit -n 1000 list_Series_get_dummies()
50.8 ms ± 5.32 ms per loop (mean ± std. dev. of 7 runs, 1000 loops each)
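
One caveat about these timings, offered only as a sketch: richardec() and wwnde() add columns to the global words_df as a side effect, so later repetitions run against an already-modified frame. Rebuilding words_df from a pristine copy before each call keeps the runs comparable (base and run_richardec below are hypothetical names, not part of the answers above):

import timeit

base = words_df[['words']].copy()   # pristine copy holding only the 'words' column

def run_richardec():
    global words_df
    words_df = base.copy()          # fresh frame for every repetition
    richardec()

print(timeit.timeit(run_richardec, number=100), 'seconds for 100 calls')

The same wrapper pattern applies to wwnde() and list_Series_get_dummies().
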
coolsubo (registered member)
2023-01-26 11:00

...can use collections.Counter from the standard library, which returns a dict-like mapping of the unique elements of a string/list to their counts, then convert it to a dataframe, fill the NaNs, and assign it back to the original dataframe:

from collections import Counter

letter_cols = df.columns[1:]
df[letter_cols] = pd.DataFrame(df['words'].apply(Counter).tolist()).fillna(0).astype(int)[letter_cols]

Output:

>>> df
      words  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
0     which  0  0  1  0  0  0  0  2  1  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0
1     there  0  0  0  0  2  0  0  1  0  0  0  0  0  0  0  0  0  1  0  1  0  0  0  0  0  0
2     their  0  0  0  0  1  0  0  1  1  0  0  0  0  0  0  0  0  1  0  1  0  0  0  0  0  0
3     about  1  1  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  1  1  0  0  0  0  0
4     would  0  0  0  1  0  0  0  0  0  0  0  1  0  0  1  0  0  0  0  0  1  0  1  0  0  0
...     ... .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
5752  osier  0  0  0  0  1  0  0  0  1  0  0  0  0  0  1  0  0  1  1  0  0  0  0  0  0  0
5753  roble  0  1  0  0  1  0  0  0  0  0  0  1  0  0  1  0  0  1  0  0  0  0  0  0  0  0
5754  rumba  1  1  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  1  0  0  1  0  0  0  0  0
5755  biffy  0  1  0  0  0  2  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0
5756  pupal  1  0  0  0  0  0  0  0  0  0  0  1  0  0  0  2  0  0  0  0  1  0  0  0  0  0

[5757 rows x 27 columns]
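
To see why the fillna(0) and astype(int) steps are needed, here is a minimal sketch on a made-up two-word frame (demo and raw are hypothetical names):

from collections import Counter
import pandas as pd

demo = pd.DataFrame({'words': ['biffy', 'pupal']})
raw = pd.DataFrame(demo['words'].apply(Counter).tolist())  # one row per word, one column per letter seen
print(raw)     # letters absent from a word show up as NaN, which also makes those columns float
counts = raw.fillna(0).astype(int)                         # 0 for the absent letters, back to integers
print(counts)  # row 0: f == 2 for 'biffy'; row 1: p == 2 for 'pupal'

In the answer above, the trailing [letter_cols] then reorders these columns to match the existing a-z columns of df, which works as long as every letter occurs in at least one word.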
